
Decoding Transformers

LLM Pretraining

Transformers, at their core, are basically next-word prediction machines.

You give it a phrase like "ayush is a good," and it will try to predict "boy." Depending on how close its prediction is to the actual next word, we calculate a loss and adjust the model's parameters.

The first step is to take your entire training data and convert each word (or token) into numbers using a tokenizer. Typically, about 90% of the words form the training data, and the remaining 10% become the test data.
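To make this concrete, here is a toy word-level "tokenizer" sketch (the text, names, and 90/10 split below are made-up examples for illustration; real models use subword tokenizers like BPE):

# A made-up word-level "tokenizer" (real models use subword tokenizers like BPE)
text = "ayush is a good boy"
words = text.split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}   # word -> integer id
tokens = [vocab[w] for w in words]                         # e.g. [1, 4, 0, 3, 2]

n = int(0.9 * len(tokens))                                 # ~90% for training
train_data, test_data = tokens[:n], tokens[n:]
print(vocab)
print(train_data, test_data)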

Batching Mechanics: The get_batch Function

Hyperparameters: batch_size = 64 (number of sequences processed in parallel) and block_size = 256 (number of tokens per sequence, i.e. the context length).

Visualization:

For batch_size=3, block_size=5:

Original text positions: [0,1,2,3,4,5,6,7,8,...]
Batch 0: positions 2-6 → [2,3,4,5,6]
Batch 1: positions 5-9 → [5,6,7,8,9]
Batch 2: positions 9-13 → [9,10,11,12,13]
                

With the real hyperparameters, the resulting tensor shape is x.shape = (batch_size, block_size) → (64, 256)

The get_batch function is very important. It provides random input and target batches for training and testing.

First, you check if it's training time or testing time and use the dataset accordingly. Suppose batch_size is 3, block_size is 5, and your training data (tokenized) is [8,111,21,23,43,54,36,57,68,39,110,911] (length 12).

The function will generate 3 (batch_size) random starting indices between 0 and (length of data - block_size), which is (12 - 5 = 7).

Suppose it chose indices 0, 2, and 5.

Your x input batch (input sequences) will be:

[[  8, 111,  21,  23,  43],
 [ 21,  23,  43,  54,  36],
 [ 54,  36,  57,  68,  39]]
                

And your y target batch (next word for each position in x) will be the +1 index shifted version of x:

[[111,  21,  23,  43,  54],
 [ 23,  43,  54,  36,  57],
 [ 36,  57,  68,  39, 110]]
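Here is a minimal sketch of what such a get_batch function could look like (the variable names and the placeholder val_data are assumptions for illustration, not the original code):

import torch

# Hypothetical setup matching the example above
batch_size = 3
block_size = 5
train_data = torch.tensor([8, 111, 21, 23, 43, 54, 36, 57, 68, 39, 110, 911])
val_data = train_data  # placeholder; in practice this is the held-out ~10% split

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # batch_size random starting indices in [0, len(data) - block_size)
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets, shifted by one
    return x, y

xb, yb = get_batch('train')
print(xb.shape, yb.shape)  # torch.Size([3, 5]) torch.Size([3, 5])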
                

Creating Embeddings

PyTorch nn.Embedding - Simple Notes

What is it?

A lookup table that maps each integer index (token id) to a dense, learnable vector.

Syntax


import torch.nn as nn

nn.Embedding(num_embeddings, embedding_dim)
                

What does num_embeddings = 10 mean?

It means the lookup table has 10 rows: one learnable vector for each index from 0 to 9.

Example

embedding = nn.Embedding(10, 3) creates something like:

Index   Vector (3D)
0       [0.12, -0.45, 0.78] (example)
1       [0.34, 0.91, -0.67] (example)
...     ...
9       [0.01, 0.22, 0.93] (example)

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 3)
input_indices = torch.LongTensor([1, 2, 4, 5])
output_vectors = embedding(input_indices)
print(output_vectors)
                

Example output:

tensor([[ 0.6614,  0.2669,  0.0617],
        [ 0.6213, -0.4519, -0.1661],
        [-0.3727, -0.4709,  0.1994],
        [ 0.1008,  0.2113,  0.3170]], grad_fn=<EmbeddingBackward0>)
                

The output shape is (4, 3) — 4 indices, each mapped to a 3D vector.

Initialization

Embedding vectors are randomly initialized.

Trainable?

Yes. These vectors are learnable and updated via backpropagation.

Let's now use these nn.Embeddings.


self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
self.position_embedding_table = nn.Embedding(block_size, n_embd)
                

n_embd is the embedding size we want; it's typically around 768 (e.g. for GPT-2 base). A question may arise here: why can't we use one-hot vectors? The reason is that they become very sparse. Suppose you have 45k vocab words; then each one-hot vector would be a 45k-length array with every entry 0 except for a single 1.

vocab_size is the total number of tokens in the tokenizer's vocabulary (generally around 50k for models like GPT-2).

Forward Pass Through Embeddings

In the forward method:


tok_emb = self.token_embedding_table(idx)  # (B, T) token ids -> (B, T, n_embd)
pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd)
x = tok_emb + pos_emb  # broadcast add -> (B, T, n_embd): token identity + position information
                

The Transformer Block

Let's move forward to a Transformer block's code.


import torch.nn as nn
import torch.nn.functional as F # Assuming F is used later

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, head_size) # Assuming MultiHeadAttention takes n_embd too
        self.ffwd = FeedForward(n_embd) # Assuming FeedForward class definition
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual connection around multi-head self-attention (pre-LayerNorm)
        x = x + self.ffwd(self.ln2(x))  # residual connection around the feedforward network
        return x
                

An example instantiation:


# block = Block(n_embd=768, n_head=12) # Example for a typical setup
# output = block(x_input_embeddings)
                

Now, I assume that you are a little bit familiar with how object-oriented programming works.

You do not need to care about what MultiHeadAttention or FeedForward classes are as of now. You just need to know that when you run this particular line block = Block(n_embd_val, n_head_val), the __init__ method:


    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
                

...runs, and all of these attributes get initialized. For example, block.ffwd is an object of the FeedForward class.

When the line self.ffwd = FeedForward(n_embd) runs, the __init__ method inside the FeedForward class also runs and creates the attributes of that object (self.ffwd). You do not need to know what that __init__ method looks like for now.

And when you run output = block(x), because the Block class inherits from nn.Module, calling the instance automatically runs its forward method. So:


    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
                

First, this line x = x + self.sa(self.ln1(x)) would run. Let's break this down.

self.ln1(x) means the forward method of this ln1 object (LayerNorm) will run with input x.

Similarly, for self.sa(self.ln1(x)), the forward method of the self.sa object (MultiHeadAttention) will run with self.ln1(x) as input. The same logic applies to the line below it for the feedforward network.
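To see this call-runs-forward behaviour in isolation, here is a toy example (the Doubler class is purely hypothetical):

import torch
import torch.nn as nn

class Doubler(nn.Module):
    # toy module just to show the "calling the object runs forward" mechanism
    def forward(self, x):
        return 2 * x

m = Doubler()
print(m(torch.tensor([1.0, 2.0])))  # calling m(...) runs m.forward(...) -> tensor([2., 4.])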

Now, some of you might be wondering what nn.LayerNorm(n_embd) does. Basically, it takes your input x and normalizes it across the embedding dimension for each token.
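Here is a tiny sketch (with made-up numbers) of what nn.LayerNorm does to a single token's embedding:

import torch
import torch.nn as nn

ln = nn.LayerNorm(4)                       # normalize over an embedding dimension of 4
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])   # one token with a 4-dim embedding (made-up values)
out = ln(x)
print(out)  # roughly zero mean and unit variance across the 4 features,
            # then scaled/shifted by LayerNorm's learnable weight and bias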

Multi-Head Self-Attention In-depth

So now let's explore the line x = x + self.sa(self.ln1(x)) in depth, focusing on the MultiHeadAttention part.

Currently, each token has a dimension of n_embd. Now, I want you to remember, and just remember for now, that there is a process called *attention*. One important thing to note about attention is that we pass it a matrix (basically our input x) of dimension (batch_size, block_size, n_embd), it does some calculations on that matrix, and it outputs a matrix of dimension (batch_size, block_size, head_size).

Remember, head_size = n_embd // n_head.

So, what multi-head attention does is run n_head of these attention processes (or "heads") in parallel. You input (batch_size, block_size, n_embd) and get n_head matrices, each of dimension (batch_size, block_size, head_size).

Suppose the number of heads is 3 and n_embd = 9, so head_size = 3. An input token embedding [2,4,3,5,8,6,7,3,1] (9-dim) would conceptually result in 3 output vectors (one 3-dim vector from each head), for example: [21,34,54], [1.2,5.4,32], [2.3,6.1,0.9].

Then what we do is concatenate all these head outputs: [21,34,54,1.2,5.4,32,2.3,6.1,0.9] (back to 9-dim) and usually pass it through a final linear projection.

In the line x = x + self.sa(self.ln1(x)), the output of self.sa(...) (the multi-head attention mechanism) will have the same dimension n_embd as the input x.
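As a rough sketch, a MultiHeadAttention module like the one used above could look like this (it assumes the Head class defined in the next section; the block_size and dropout_val defaults are hypothetical):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, n_head, head_size, block_size=256, dropout_val=0.1):
        super().__init__()
        # n_head independent attention heads, each producing (B, T, head_size)
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size, dropout_val) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_head * head_size, n_embd)  # final linear projection back to n_embd
        self.dropout = nn.Dropout(dropout_val)

    def forward(self, x):
        # concatenate the head outputs along the last dimension: (B, T, n_head * head_size)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))  # (B, T, n_embd)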

The Attention Mechanism (Single Head)

Now let's move to this attention thing, what does this do?


class Head(nn.Module):
    def __init__(self, n_embd, head_size, block_size, dropout_val):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular matrix used for the causal mask (a buffer, not a trainable parameter)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout_val)

    def forward(self, x):
        B, T, C_in = x.shape          # (batch, time, n_embd)
        k = self.key(x)               # (B, T, head_size)
        q = self.query(x)             # (B, T, head_size)

        # scaled dot-product attention scores: (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5

        # causal mask: each position can only attend to itself and earlier positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

        wei = F.softmax(wei, dim=-1)  # attention weights; each row sums to 1
        wei = self.dropout(wei)

        v = self.value(x)             # (B, T, head_size)
        out = wei @ v                 # weighted sum of value vectors: (B, T, head_size)
        return out
                

k = self.key(x) and q = self.query(x) both take x (with dimension B, T, n_embd) and produce matrices K and Q respectively, each with dimension (B, T, head_size) via linear transformations.

Now, in the line wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5, we calculate the scaled dot-product attention scores. q @ k.transpose(-2, -1) performs a matrix multiplication between queries (Q) and transposed keys (KT), resulting in a matrix wei of dimension (B, T, T). If you remember, block_size (T) was the number of tokens we are processing at once.

So now, let us understand the philosophy behind attention. The meaning of words is different in different contexts. For example, consider these two sentences:

  1. This bag is light.
  2. I turned on the light.

The word is the same, "light," but the meaning is totally different. So how do we capture this meaning via the attention mechanism? You see that wei matrix; for one batch item, it has dimensions (T, T) or (block_size, block_size). What this does is calculate the importance of each token with respect to every other token in the sequence.

Conceptually, for a sequence "this is a good boy" (T=5), wei before masking and softmax might look like:

Query token    Key: this   Key: is   Key: a   Key: good   Key: boy
this           0.98        0.62      0.47     0.33        0.21
is             0.61        0.97      0.56     0.45        0.29
a              0.44        0.59      0.96     0.52        0.34
good           0.32        0.49      0.53     0.95        0.68
boy            0.25        0.33      0.37     0.66        0.94

So here you can see the block_size is 5. Now, one thing to note is how we train this transformer for tasks like next-word prediction.

I will first give it the word "this" and ask it to predict the next word, which is "is". Then I will give it "this is" and ask it to predict the next word, which is "a".

Now, for a given token to predict the *next* token, you do not want the model to "see" or know about future tokens in the sequence (e.g., when processing "is", it shouldn't know "a" comes after). Otherwise, it would be cheating. So, we *mask* it.

Here is how the masked version of wei (before softmax) looks:

Query token    Key: this   Key: is   Key: a   Key: good   Key: boy
this           0.98        -∞        -∞       -∞          -∞
is             0.61        0.97      -∞       -∞          -∞
a              0.44        0.59      0.96     -∞          -∞
good           0.32        0.49      0.53     0.95        -∞
boy            0.25        0.33      0.37     0.66        0.94

Now you might be wondering why I used negative infinity (-∞) instead of 0. The reason shows up in the next step: softmax turns -∞ into exactly 0 probability, whereas a raw score of 0 would still get a nonzero weight (exp(0) = 1) after normalization. So now, as you can see, when the model is processing the token "is" (as a query), it can attend to "this" and "is" (as keys), but its attention to "a", "good", and "boy" is blocked (set to -∞).

Now what we do is apply softmax to this masked wei matrix (F.softmax(wei, dim=-1)). This converts the scores in each row into a probability distribution, meaning the sum of attention weights in each row will be 1.

So after applying softmax, it looks like (attention weights):

Query token    Attends to: this   Attends to: is   Attends to: a   Attends to: good   Attends to: boy
this           1.00               0.00             0.00            0.00               0.00
is             0.36               0.64             0.00            0.00               0.00
a              0.24               0.39             0.37            0.00               0.00
good           0.19               0.26             0.28            0.27               0.00
boy            0.15               0.19             0.20            0.25               0.21

Now, wei is of dimension (B, T, T), but we need to get an output of dimension (B, T, head_size) for this attention head.

So we multiply it with the Value matrix (v = self.value(x), which has shape B, T, head_size). The operation is out = wei @ v. This results in out with shape (B, T, head_size), which is a weighted sum of the value vectors.
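A quick shape check with random numbers, just to illustrate the dimensions:

import torch

B, T, head_size = 1, 5, 3
wei = torch.softmax(torch.randn(B, T, T), dim=-1)  # attention weights: (B, T, T)
v = torch.randn(B, T, head_size)                   # value vectors:     (B, T, head_size)
out = wei @ v                                      # weighted sum:      (B, T, head_size)
print(out.shape)  # torch.Size([1, 5, 3])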

Now you might be wondering why we use multiple attention heads. What's wrong with using just one? We use multiple heads to let the model capture different types of relationships simultaneously. Some heads might capture which words are emotionally more important, while other heads might capture grammatical relationships, and so on.

FeedForward Network (FFN)

So after passing our input x (B, T, n_embd) through this multi-head attention mechanism (which includes concatenating the head outputs and a final linear projection), we get an output x with richer, context-aware token representations (still of dimension B, T, n_embd after the residual connection).

Now we pass this through a position-wise feedforward neural network (FFN). The FFN typically consists of two linear transformations with a non-linear activation in between: the first layer expands from n_embd to a larger hidden size (commonly 4 * n_embd), an activation such as ReLU or GELU is applied, and the second layer projects back down to n_embd.
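A minimal sketch of such a FeedForward module (the 4x expansion factor and the dropout value are common choices assumed here, not taken from the original code):

import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, n_embd, dropout_val=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand
            nn.ReLU(),                      # non-linearity (GELU is also common)
            nn.Linear(4 * n_embd, n_embd),  # project back down to n_embd
            nn.Dropout(dropout_val),
        )

    def forward(self, x):
        return self.net(x)  # (B, T, n_embd) -> (B, T, n_embd), applied to each token independently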

This whole sequence (Multi-Head Attention + Add & Norm -> Feedforward + Add & Norm) is called a Transformer block. And we pass our x input through multiple such blocks, one after another:

x → block 1 → x' → block 2 → x'' .....
                

Final Output and Loss Calculation

Now let's move ahead to the model's main forward method, which orchestrates these components.


class YourTransformerModel(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer, dropout):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)              # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)  # projects back to vocabulary scores
        self.dropout = nn.Dropout(dropout)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx)                                    # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = self.dropout(tok_emb + pos_emb)                                          # (B, T, n_embd)

        x = self.blocks(x)  # pass through the stack of Transformer blocks

        x = self.ln_f(x)    # final layer normalization

        if targets is not None:
            logits = self.lm_head(x)  # (B, T, vocab_size)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference: we only need the prediction for the last position
            logits = self.lm_head(x[:, [-1], :])  # (B, 1, vocab_size)
            loss = None

        return logits, loss
                

The initial lines for token and positional embeddings, their summation, and dropout are what we've discussed. The self.blocks(x) line passes the combined embeddings through the stack of Transformer blocks.

The def forward(self, idx, targets=None): method takes input token indices idx (shape B, T) and optional targets.

x = self.ln_f(x) applies a final layer normalization to the output of the Transformer blocks.

The line if targets is not None: means if we have targets (which we calculated at the start for training), we are in training mode (not inference mode like when you use ChatGPT).

Then, logits = self.lm_head(x) applies a linear layer (lm_head) that takes the processed x (B, T, n_embd) and projects it to (B, T, vocab_size). These are the raw scores for each possible next token.

Now, if you remember, vocab_size is the set of all possible tokens. So, for each token in the input sequence, logits gives scores for every token in the vocabulary as a potential successor. For an input "ayush is good", for the representation corresponding to "ayush", we want the logit for "is" to be high. The model is trained accordingly.

loss = F.cross_entropy(...) then calculates the cross-entropy loss between these predicted logits and the actual targets (e.g., "is good boy"). I am not covering cross-entropy loss here; you can google it or ask ChatGPT.

The else: block handles inference mode.

logits = self.lm_head(x[:, [-1], :]) means that if you are in inference mode, do not calculate the loss. Instead, just give the score of the next word for the *last* word in the input sequence. If "ayush is good" is the input (and block_size allows for it), we are asking the model to give scores for what will come after "good".

The expression x[:, [-1], :] works as follows: the first : keeps every sequence in the batch, [-1] (written as a list) selects only the last position while keeping the time dimension, and the final : keeps all embedding channels. The result has shape (B, 1, n_embd) rather than (B, n_embd).
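A tiny demonstration of this slicing with made-up shapes:

import torch

x = torch.randn(2, 5, 8)       # (B=2, T=5, n_embd=8), random values for illustration
print(x[:, [-1], :].shape)     # torch.Size([2, 1, 8])  -- last position, time dimension kept
print(x[:, -1, :].shape)       # torch.Size([2, 8])     -- note the dropped dimension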

Sampling Strategies in Inference Time

1. What Are Sampling Strategies?

When generating text, the model produces a vector of logits (unnormalized scores) for every possible next token. To turn these into an actual token choice, you need a sampling strategy, such as greedy decoding (always pick the highest-scoring token), temperature scaling, or top-k sampling.

2. Temperature

What is it?

A scalar value that controls the "sharpness" or "flatness" of the probability distribution.

How is it used?


logits = logits / temperature
probs = F.softmax(logits, dim=-1)
                

Example:

If logits = [2.0, 1.0, 0.1], then with temperature = 1.0 the softmax probabilities are roughly [0.66, 0.24, 0.10]. A lower temperature (e.g. 0.5) sharpens the distribution toward the top token (roughly [0.86, 0.12, 0.02]), while a higher temperature (e.g. 2.0) flattens it (roughly [0.50, 0.30, 0.19]).
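A quick sketch you can run to see the effect, using the same hypothetical logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(temperature, probs)
# lower temperature -> sharper / more deterministic; higher temperature -> flatter / more random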

3. Top-k Sampling

What is it?

A technique to restrict sampling to only the top k most likely tokens.

How is it used?


v, _ = torch.topk(logits, k)                 # values of the k largest logits in each row
logits[logits < v[:, [-1]]] = -float('Inf')  # mask out everything below the k-th largest
probs = F.softmax(logits, dim=-1)            # masked tokens end up with probability 0
                

Why use it?

It stops the model from sampling very unlikely tokens from the long tail of the distribution, which cuts down on incoherent output while still leaving room for variety among the top candidates.

4. Combined Effect

In practice, temperature and top-k are usually applied together during generation: the logits are first divided by the temperature, then everything outside the top k tokens is masked out, and finally the next token is sampled from the resulting softmax distribution.
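A minimal sketch of such a combined sampling step (the function name and defaults are hypothetical):

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits, temperature=1.0, top_k=50):
    # logits: (B, vocab_size) raw scores for the next token
    logits = logits / temperature                    # sharpen or flatten the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = -float('Inf')  # keep only the top k tokens
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (B, 1) sampled token ids

# usage with made-up logits:
fake_logits = torch.randn(1, 100)
print(sample_next_token(fake_logits, temperature=0.8, top_k=10))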

Conclusion

If you have ever heard of "context window," it is the same as block_size.

This more or less sums up the Transformer architecture. For queries, hit me up at @goyalayus.