Decoding Transformers
LLM Pretraining
Transformers are, at their core, next-word prediction machines.
You give it a phrase like "ayush is a good," and it will try to predict "boy." Depending on how close its prediction is to the actual next word, we calculate a loss and adjust the model's parameters.
The first step is to take your entire training data and convert each word (or token) into numbers using a tokenizer. Typically, about 90% of the words form the training data, and the remaining 10% become the test data.
Batching Mechanics: The get_batch Function
Hyperparameters:
- batch_size=64 (sequences per batch)
- block_size=256 (context window)
Visualization, for batch_size=3, block_size=5:
Original text positions: [0,1,2,3,4,5,6,7,8,...]
Batch 0: positions 2-6  → [2,3,4,5,6]
Batch 1: positions 5-9  → [5,6,7,8,9]
Batch 2: positions 9-13 → [9,10,11,12,13]
Resulting Tensor Shape: x.shape = (batch_size, block_size) → (64, 256)
The get_batch function is very important. It provides random input and target batches for training and testing.
First, you check whether it's training time or testing time and use the corresponding dataset. Suppose batch_size is 3, block_size is 5, and your training data (tokenized) is [8, 111, 21, 23, 43, 54, 36, 57, 68, 39, 110, 911] (length 12).
The function will generate 3 (batch_size) random starting indices between 0 and (length of data - block_size), which is (12 - 5 = 7).
Suppose it chose indices 0, 2, and 5.
Your x input batch (the input sequences) will be:
[[  8, 111,  21,  23,  43],
 [ 21,  23,  43,  54,  36],
 [ 54,  36,  57,  68,  39]]
And your y target batch (the next word for each position in x) will be x shifted by one index:
[[111,  21,  23,  43,  54],
 [ 23,  43,  54,  36,  57],
 [ 36,  57,  68,  39, 110]]
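To make this concrete, here is a minimal sketch of what such a get_batch function could look like (the parameter names, and the assumption that train_data and val_data are 1-D LongTensors of token ids, are my own illustration):

import torch

def get_batch(split, train_data, val_data, batch_size, block_size, device='cpu'):
    # pick the dataset depending on whether we are training or evaluating
    data = train_data if split == 'train' else val_data
    # batch_size random starting indices in [0, len(data) - block_size)
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # (batch_size, block_size)
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # same rows shifted by one position
    return x.to(device), y.to(device)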
Creating Embeddings
PyTorch nn.Embedding - Simple Notes
What is it?
- A layer that maps integer indices to dense vectors.
- Used to convert word or item indices into learnable embeddings.
Syntax
import torch.nn as nn
nn.Embedding(num_embeddings, embedding_dim)
- num_embeddings: total number of unique indices (e.g., vocabulary size)
- embedding_dim: size of each embedding vector
What does num_embeddings = 10 mean?
- It defines that there are 10 distinct items (e.g., words).
- It creates an internal lookup table of shape (10, embedding_dim).
- Valid input indices range from 0 to 9.
Example
embedding = nn.Embedding(10, 3)
creates something like:
Index | Vector (3D) |
---|---|
0 | [0.12, -0.45, 0.78] (example) |
1 | [0.34, 0.91, -0.67] (example) |
... | ... |
9 | [0.01, 0.22, 0.93] (example) |
import torch
import torch.nn as nn
embedding = nn.Embedding(10, 3)
input_indices = torch.LongTensor([1, 2, 4, 5])
output_vectors = embedding(input_indices)
print(output_vectors)
Example output:
tensor([[ 0.6614,  0.2669,  0.0617],
        [ 0.6213, -0.4519, -0.1661],
        [-0.3727, -0.4709,  0.1994],
        [ 0.1008,  0.2113,  0.3170]], grad_fn=<EmbeddingBackward0>)
The output shape is (4, 3) — 4 indices, each mapped to a 3D vector.
Initialization
Embedding vectors are randomly initialized.
Trainable?
Yes. These vectors are learnable and updated via backpropagation.
Let's now use these nn.Embedding layers.
self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
self.position_embedding_table = nn.Embedding(block_size, n_embd)
n_embd is the embedding size we choose; it's commonly around 768 (e.g., for GPT-2 base). A question may arise here: why can't we use one-hot vectors? The reason is that they become very sparse. Suppose you have 45k vocabulary words; then each word would be a 45k-length vector with all entries 0 except a single 1.
vocab_size is the total vocabulary of the tokenizer (generally around 50k for models like GPT-2).
Forward Pass Through Embeddings
In the forward method:
tok_emb = self.token_embedding_table(idx)
pos_emb = self.position_embedding_table(torch.arange(T, device=device))
x = tok_emb + pos_emb
- tok_emb:
  - Input: idx of shape (B, T) → batch of token indices.
  - Output: dense embeddings for each token. Shape: (B, T, C)
  - Example (for B=2, T=2, C=2):
    idx = [[2, 5], [1, 3]]
    tok_emb = [[[0.1, 0.2], [0.3, 0.4]],
               [[0.5, 0.6], [0.7, 0.8]]]
- pos_emb:
  - Input: positions (e.g., torch.arange(T)) for each token in the sequence. Shape: (T)
  - Output: dense embeddings for each position. Shape: (T, C)
  - Example (for T=2, C=2):
    pos_emb = [[0.01, 0.02], [0.03, 0.04]]
- x:
  - Combines token and positional embeddings by adding them element-wise. Positional embeddings (T, C) are broadcast across the batch dimension (B) to match token embeddings (B, T, C). Shape: (B, T, C)
  - Example:
    x = tok_emb + pos_emb = [[[0.11, 0.22], [0.33, 0.44]],
                             [[0.51, 0.62], [0.73, 0.84]]]
The Transformer Block
Let's move on to the code for a Transformer block.
import torch.nn as nn
import torch.nn.functional as F  # F is used later in the attention head

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, head_size)  # multi-head self-attention
        self.ffwd = FeedForward(n_embd)                          # position-wise feedforward network
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # pre-norm, attention, then residual (skip) connection
        x = x + self.ffwd(self.ln2(x))  # pre-norm, feedforward, then residual connection
        return x
An example instantiation:
# block = Block(n_embd=768, n_head=12) # Example for a typical setup
# output = block(x_input_embeddings)
Now, I assume you are at least a little familiar with how object-oriented programming works.
You do not need to care about what the MultiHeadAttention or FeedForward classes are for now. You just need to know that when you run the line block = Block(n_embd_val, n_head_val), the __init__ method:
def __init__(self, n_embd, n_head):
super().__init__()
head_size = n_embd // n_head
self.sa = MultiHeadAttention(n_embd, n_head, head_size)
self.ffwd = FeedForward(n_embd)
self.ln1 = nn.LayerNorm(n_embd)
self.ln2 = nn.LayerNorm(n_embd)
...runs, and all these attributes get initialized. For example, block.ffwd is an object of the FeedForward class.
When the line self.ffwd = FeedForward(n_embd) runs, the __init__ method inside the FeedForward class also runs and creates the attributes of that object (self.ffwd). You do not need to know what that __init__ method looks like for now.
And when you run output = block(x), because the Block class inherits from nn.Module, the forward method runs automatically when this line executes. So:
def forward(self, x):
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
return x
First, the line x = x + self.sa(self.ln1(x)) runs. Let's break this down.
self.ln1(x) means the forward method of the ln1 object (a LayerNorm) runs with input x.
Similarly, for self.sa(self.ln1(x)), the forward method of the self.sa object (MultiHeadAttention) runs with self.ln1(x) as its input. The same logic applies to the line below it for the feedforward network.
Now, some of you might be wondering what nn.LayerNorm(n_embd) does. Basically, it takes your input x and normalizes it across the embedding dimension for each token.
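To see this in action, here is a small sketch (the shapes are illustrative; at initialization LayerNorm's learnable scale and shift are 1 and 0, so the output is exactly normalized):

import torch
import torch.nn as nn

B, T, n_embd = 2, 4, 8
x = torch.randn(B, T, n_embd) * 5 + 3    # arbitrary scale and offset
ln = nn.LayerNorm(n_embd)
out = ln(x)

# each token's embedding vector is normalized independently across its n_embd values
print(out.mean(dim=-1))                  # ≈ 0 for every (batch, position)
print(out.std(dim=-1, unbiased=False))   # ≈ 1 for every (batch, position)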
Multi-Head Self-Attention In-depth
So now let's explore the line x = x + self.sa(self.ln1(x)) in depth, focusing on the MultiHeadAttention part.
Currently, each token has a dimension of n_embd. Now, I want you to remember, just for now, that there is a process called *attention*. One important thing to note about attention is that we pass it a matrix (basically our input x) of dimension (batch size, block size, n_embd), it does some calculations on that matrix, and it outputs a matrix of dimension (batch size, block size, head_size).
Remember, head_size = n_embd // n_head.
So, what multi-head attention does is create n_head such attention processes (or "heads"). You input (batch size, block size, n_embd) and get n_head matrices, each of shape (batch size, block size, head_size).
Suppose the number of heads is 3 and n_embd = 9, so head_size = 3. An input token embedding [2, 4, 3, 5, 8, 6, 7, 3, 1] (9-dim) would conceptually result in 3 vectors (3-dim each, one from each head), for example: [21, 34, 54], [1.2, 5.4, 32], [2.3, 6.1, 0.9].
Then what we do is concatenate all these head outputs: [21, 34, 54, 1.2, 5.4, 32, 2.3, 6.1, 0.9] (back to 9-dim) and usually pass the result through a final linear projection.
In the line x = x + self.sa(self.ln1(x)), the output of self.sa(...) (the multi-head attention mechanism) will have the same dimension n_embd as the input x.
The Attention Mechanism (Single Head)
Now let's move on to this attention thing. What does it actually do?
class Head(nn.Module):
    def __init__(self, n_embd, head_size, block_size, dropout_val):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular matrix used to mask out "future" positions
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout_val)

    def forward(self, x):
        B, T, C_in = x.shape                                           # (batch, time, n_embd)
        k = self.key(x)                                                # (B, T, head_size)
        q = self.query(x)                                              # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5              # (B, T, T) scaled attention scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))   # mask future tokens
        wei = F.softmax(wei, dim=-1)                                   # each row becomes a probability distribution
        wei = self.dropout(wei)
        v = self.value(x)                                              # (B, T, head_size)
        out = wei @ v                                                  # (B, T, head_size) weighted sum of values
        return out
k = self.key(x) and q = self.query(x) both take x (of shape (B, T, n_embd)) and produce the matrices K and Q respectively, each of shape (B, T, head_size), via linear transformations.
Now, in the line wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5, we calculate the scaled dot-product attention scores. q @ k.transpose(-2, -1) performs a matrix multiplication between the queries (Q) and the transposed keys (Kᵀ), resulting in a matrix wei of dimension (B, T, T). If you remember, block_size (T) is the number of tokens we process at once.
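A quick shape check with random tensors (the specific sizes here are just for illustration):

import torch
import torch.nn as nn

B, T, n_embd, head_size = 2, 5, 9, 3
x = torch.randn(B, T, n_embd)

key = nn.Linear(n_embd, head_size, bias=False)
query = nn.Linear(n_embd, head_size, bias=False)

k = key(x)                                       # (2, 5, 3)
q = query(x)                                     # (2, 5, 3)
wei = q @ k.transpose(-2, -1) * head_size**-0.5
print(wei.shape)                                 # torch.Size([2, 5, 5]) -- one (T, T) score matrix per batch item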
So now, let us understand the philosophy behind attention. The meaning of words is different in different contexts. For example, consider these two sentences:
- This bag is light.
- I turned on the light.
The word is the same, "light," but the meaning is totally different. So how do we capture this meaning via the attention mechanism? You see that wei matrix; for one batch item, it has dimensions (T, T), or (block_size, block_size). What it does is calculate the importance of each token with respect to every other token in the sequence.
Conceptually, for a sequence "this is a good boy" (T=5), wei before masking and softmax might look like:
Query Token | Key: this | Key: is | Key: a | Key: good | Key: boy |
---|---|---|---|---|---|
this | 0.98 | 0.62 | 0.47 | 0.33 | 0.21 |
is | 0.61 | 0.97 | 0.56 | 0.45 | 0.29 |
a | 0.44 | 0.59 | 0.96 | 0.52 | 0.34 |
good | 0.32 | 0.49 | 0.53 | 0.95 | 0.68 |
boy | 0.25 | 0.33 | 0.37 | 0.66 | 0.94 |
So here you can see the block_size is 5. Now, one thing to note is how we train this transformer for next-word prediction.
I will first give it the word "this" and ask it to predict the next word, which is "is". Then I will give it "this is" and ask it to predict the next word, which is "a".
Now, for a given token to predict the *next* token, you do not want the model to "see" or know about future tokens in the sequence (e.g., when processing "is", it shouldn't know that "a" comes after). Otherwise, it would be cheating. So, we *mask* them out.
Here is how the masked version of wei (before softmax) looks:
Query Token | Key: this | Key: is | Key: a | Key: good | Key: boy |
---|---|---|---|---|---|
this | 0.98 | -∞ | -∞ | -∞ | -∞ |
is | 0.61 | 0.97 | -∞ | -∞ | -∞ |
a | 0.44 | 0.59 | 0.96 | -∞ | -∞ |
good | 0.32 | 0.49 | 0.53 | 0.95 | -∞ |
boy | 0.25 | 0.33 | 0.37 | 0.66 | 0.94 |
Now you might be wondering why I used negative infinity (-inf) instead of 0. The reason shows up after softmax: exp(-inf) = 0, so masked positions get exactly zero attention weight, whereas a score of 0 would still contribute exp(0) = 1 and receive nonzero weight. So now, as you can see, when the model is processing the token "is" (as a query), it can attend to "this" and "is" (as keys), but its attention to "a", "good", and "boy" is blocked (set to -inf).
Now what we do is apply softmax to this masked wei matrix (F.softmax(wei, dim=-1)). This converts the scores in each row into a probability distribution, meaning the attention weights in each row sum to 1.
So after applying softmax, it looks like (attention weights):
Query Token | Attends to: this | Attends to: is | Attends to: a | Attends to: good | Attends to: boy |
---|---|---|---|---|---|
this | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
is | 0.36 | 0.64 | 0.00 | 0.00 | 0.00 |
a | 0.24 | 0.39 | 0.37 | 0.00 | 0.00 |
good | 0.19 | 0.26 | 0.28 | 0.27 | 0.00 |
boy | 0.15 | 0.19 | 0.20 | 0.25 | 0.21 |
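If you want to convince yourself of the -inf trick, here is a tiny sketch with made-up scores for one row:

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, float('-inf'), float('-inf')])
print(F.softmax(scores, dim=-1))       # ≈ [0.73, 0.27, 0.00, 0.00] -- masked positions get exactly zero weight

scores_zero = torch.tensor([2.0, 1.0, 0.0, 0.0])
print(F.softmax(scores_zero, dim=-1))  # ≈ [0.61, 0.22, 0.08, 0.08] -- masking with 0 would still leak attention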
Now, wei is of dimension (B, T, T), but we need to get an output of dimension (B, T, head_size) for this attention head.
So we multiply it with the value matrix (v = self.value(x), which has shape (B, T, head_size)). The operation is out = wei @ v. This results in out with shape (B, T, head_size), which is a weighted sum of the value vectors.
Now you might be wondering why we use multiple attention heads. What's wrong with using just one? Multiple heads allow the model to capture different types of relationships simultaneously: some heads might capture which words are emotionally more important, while others might capture grammatical relationships, and so on.
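The MultiHeadAttention class itself is not shown in this post, so here is a minimal sketch of how it could be built from the Head class above (the default block_size and dropout_val values are placeholders I chose):

class MultiHeadAttention(nn.Module):
    # run n_head attention heads in parallel, concatenate their outputs, and project back to n_embd
    def __init__(self, n_embd, n_head, head_size, block_size=256, dropout_val=0.1):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size, dropout_val) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_head * head_size, n_embd)  # final linear projection
        self.dropout = nn.Dropout(dropout_val)

    def forward(self, x):
        # each head returns (B, T, head_size); concatenation gives (B, T, n_head * head_size) = (B, T, n_embd)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))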
FeedForward Network (FFN)
So after passing our input x (B, T, n_embd) through this multi-head attention mechanism (which includes concatenating the head outputs and a final linear projection), we get an output x with richer meaning (still of shape (B, T, n_embd) after the residual connection).
Now we pass this through a position-wise feedforward neural network (FFN). The FFN typically consists of two linear transformations with a non-linear activation function in between. A common structure (a minimal sketch follows this list) is:
- Linear Layer 1: Projects the input from dimension n_embd to a higher dimension, often 4 times larger (4 * n_embd).
- Non-linear Activation: Applies a non-linear function like ReLU or GELU.
- Linear Layer 2: Projects the output back to the original dimension n_embd.
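Here is the promised minimal sketch of such a FeedForward module (the 4x expansion factor and the dropout value are common choices, not requirements):

class FeedForward(nn.Module):
    # position-wise FFN: expand to 4 * n_embd, apply a non-linearity, project back to n_embd
    def __init__(self, n_embd, dropout_val=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # Linear Layer 1
            nn.ReLU(),                       # non-linear activation (GELU is also common)
            nn.Linear(4 * n_embd, n_embd),   # Linear Layer 2
            nn.Dropout(dropout_val),
        )

    def forward(self, x):
        # applied independently at every token position: (B, T, n_embd) -> (B, T, n_embd)
        return self.net(x)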
This whole sequence (Multi-Head Attention + Add & Norm -> Feedforward + Add & Norm) is called a Transformer block. And we pass our x input through multiple such blocks, one after another:
x → block 1 → x' → block 2 → x'' .....
Final Output and Loss Calculation
Now let's move ahead to the model's main forward method, which orchestrates these components.
class YourTransformerModel(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer, dropout):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                                    # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = self.dropout(tok_emb + pos_emb)                                          # (B, T, n_embd)
        x = self.blocks(x)   # stack of Transformer blocks
        x = self.ln_f(x)     # final layer normalization
        if targets is not None:
            # training mode: score every position and compare against the targets
            logits = self.lm_head(x)                                                 # (B, T, vocab_size)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference mode: only the prediction after the last token is needed
            logits = self.lm_head(x[:, [-1], :])
            loss = None
        return logits, loss
The initial lines for token and positional embeddings, their summation, and dropout are what we've already discussed. The self.blocks(x) line passes the combined embeddings through the stack of Transformer blocks.
The def forward(self, idx, targets=None): method takes input token indices idx (shape (B, T)) and optional targets.
x = self.ln_f(x) applies a final layer normalization to the output of the Transformer blocks.
The line if targets is not None: means that if we have targets (which we prepared at the start for training), we are in training mode (not inference mode, like when you use ChatGPT).
Then, logits = self.lm_head(x) applies a linear layer (lm_head) that takes the processed x (B, T, n_embd) and projects it to (B, T, vocab_size). These are the raw scores for each possible next token.
Now, if you remember, vocab_size is the set of all possible tokens. So, for each token in the input sequence, logits gives scores for every token in the vocabulary as a potential successor. For an input "ayush is good", for the representation corresponding to "ayush", we want the logit for "is" to be high. The model is trained accordingly.
loss = F.cross_entropy(...) then calculates the cross-entropy loss between these predicted logits and the actual targets (e.g., "is good boy"). I am not covering cross-entropy loss here; you can google it or ask ChatGPT.
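To show where this loss fits in pretraining, here is a minimal training-loop sketch (it assumes the get_batch function and the hyperparameters from earlier; the learning rate and max_iters are placeholder choices):

model = YourTransformerModel(vocab_size, n_embd=768, block_size=256,
                             n_head=12, n_layer=12, dropout=0.1).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(max_iters):
    xb, yb = get_batch('train', train_data, val_data, batch_size, block_size, device)
    logits, loss = model(xb, yb)              # forward pass, returns the cross-entropy loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                           # backpropagation: compute gradients
    optimizer.step()                          # adjust the model's parameters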
The else: block handles inference mode.
logits = self.lm_head(x[:, [-1], :]) means that if you are in inference mode, do not calculate the loss. Instead, just give the scores for the next word after the *last* word in the input sequence. If "ayush is good" is the input (and block_size allows for it), we are asking the model to give scores for what will come after "good".
The expression x[:, [-1], :] works as follows:
- x has shape (B, T, n_embd) — batch size, sequence length, embedding dimension.
- x[:, [-1], :] selects the last token's embedding (at index T-1) in the sequence for each batch item.
- The shape after indexing is (B, 1, n_embd):
  - : → all batches
  - [-1] → last token position (kept as a dimension of size 1)
  - : → all embedding dimensions
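A quick sketch of the shapes (the sizes are illustrative):

import torch

B, T, n_embd = 2, 5, 8
x = torch.randn(B, T, n_embd)

last = x[:, [-1], :]
print(last.shape)                                # torch.Size([2, 1, 8]) -- the position dimension is kept as size 1
print(torch.equal(last[:, 0, :], x[:, -1, :]))   # True: it is the embedding of the last token in each sequence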
Sampling Strategies in Inference Time
1. What Are Sampling Strategies?
When generating text, the model produces a vector of logits (unnormalized scores) for every possible next token. To turn these into an actual token choice, you need a sampling strategy:
- Greedy Decoding: Always pick the token with the highest probability (least random).
- Random Sampling: Pick tokens according to their probability distribution (more creative).
- Temperature & Top-k: Methods to control the balance between randomness and determinism.
2. Temperature
What is it?
A scalar value that controls the "sharpness" or "flatness" of the probability distribution.
How is it used?
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
- Low temperature (<1): Makes the distribution sharper (model is more confident, less random).
- High temperature (>1): Flattens the distribution (model is more random, more creative).
Example:
If logits = [2.0, 1.0, 0.1]:
- With temperature 1.0: softmax is "normal."
- With temperature 0.5: logits become [4.0, 2.0, 0.2] → softmax is sharper.
- With temperature 2.0: logits become [1.0, 0.5, 0.05] → softmax is flatter.
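A small sketch of this effect (the probabilities in the comments are approximate):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

print(F.softmax(logits / 1.0, dim=-1))  # ≈ [0.66, 0.24, 0.10] -- baseline
print(F.softmax(logits / 0.5, dim=-1))  # ≈ [0.86, 0.12, 0.02] -- sharper, more deterministic
print(F.softmax(logits / 2.0, dim=-1))  # ≈ [0.50, 0.30, 0.19] -- flatter, more random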
3. Top-k Sampling
What is it?
A technique to restrict sampling to only the top k most likely tokens.
How is it used?
v, _ = torch.topk(logits, k)
logits[logits < v[:, [-1]]] = -float('Inf')
probs = F.softmax(logits, dim=-1)
- Only the k tokens with the highest logits are kept; all others are set to -inf (zero probability after softmax).
- The next token is sampled only from these top k candidates.
Why use it?
- Prevents the model from picking rare, low-probability tokens that can lead to incoherent or nonsensical text.
- Balances creativity and coherence.
4. Combined Effect
- Temperature controls the overall randomness.
- Top-k controls the candidate set for each prediction.
- Used together, you can finely tune the model’s output: from deterministic and repetitive to creative and surprising.
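Putting both knobs together, a minimal generation loop could look like this (it assumes the model class above; max_new_tokens, temperature, and top_k are values you pick):

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    # idx is a (B, T) tensor of token ids that serves as the prompt
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                  # crop to the context window
        logits, _ = model(idx_cond)                      # inference mode: (B, 1, vocab_size)
        logits = logits[:, -1, :] / temperature          # scores for the next token, temperature-scaled
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')  # keep only the top-k candidates
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # sample one token per sequence
        idx = torch.cat((idx, idx_next), dim=1)          # append and continue
    return idx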
Conclusion
If you have ever heard of "context window," it is the same thing as block_size.
This majorly sums up the Transformer architecture. For queries, hit me up at @goyalayus.