How LLM Tokenizers Work: Understanding Byte-Pair Encoding (BPE)
ChatGPT doesn't understand your English words directly; it only understands numbers. So whenever you send it a prompt, it first converts each word (or part of a word) into a number and then processes those numbers.
For example, the sentence "he ate a crocodile." might be converted into a sequence of numbers like [234, 412, 232, 767].
The algorithm responsible for this conversion—turning words into numbers—is called a tokenizer.
Flawed Approaches to Tokenization
If you think about how you might design a tokenizer, a few basic approaches come to mind, but each has significant issues:
- Assign a number to each character: Instead of assigning a number to a whole word, we could assign one to each character (e.g., a → 1, b → 2). For example: "he ate a crocodile" → ['h', 'e', ' ', 'a', 't', 'e', ' ', 'a', ' ', 'c', 'r', 'o', 'c', 'o', 'd', 'i', 'l', 'e'] → [8, 5, 19, 1, 14, 5, 19, 1, 19, 3, 12, 11, 3, 11, 4, 9, 10, 5] (hypothetical mapping). The issue is that input sequences become very long, making processing slow and costly (see the sketch after this list). Also, individual characters often don't carry enough semantic meaning on their own.
- Assign each word in the world a specific number: This is very hard because the number of unique words in any language is vast, practically infinite once you count inflections, typos, and newly coined words. It would lead to an enormous vocabulary and other problems, like handling unknown words.
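To see the length problem concretely, here is a minimal sketch of a character-level tokenizer (the ID mapping is hypothetical and built on the fly from the input):

```python
# Naive character-level tokenizer: one ID per character.
text = "he ate a crocodile"

# Hypothetical mapping: assign IDs to the characters that appear in the text.
char_to_id = {ch: i for i, ch in enumerate(sorted(set(text)))}

ids = [char_to_id[ch] for ch in text]
print(f"{len(text.split())} words -> {len(ids)} tokens")  # 4 words -> 18 tokens
```

Four words already become 18 tokens; a whole paragraph or document blows up accordingly.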
The Optimal Solution: Byte-Pair Encoding (BPE)
So, what's the optimal solution? Many modern LLMs use a technique called Byte-Pair Encoding (BPE).
LLMs are trained on huge amounts of text data from the internet, and the BPE algorithm learns from this same data. Here's the core idea: take the training data and, for a pre-decided number of iterations (e.g., 30,000 to 50,000), repeat the following steps:
- Find the pair of characters (or existing tokens) that appears together most frequently in the data.
- Merge this pair into a single new token.
- Add this new token to your vocabulary.
This way, common words (like "the", "is") or common sub-word units (like "ing", "est", "token") eventually become single tokens in the vocabulary, while rare words can still be represented as a sequence of smaller, known sub-word tokens.
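Here is a minimal Python sketch of that training loop. It is illustrative only: real tokenizers (e.g., GPT's byte-level BPE) work on bytes and add many optimizations, and the `</w>` end-of-word marker is just a common convention used here to keep merges from crossing word boundaries.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (minimal sketch)."""
    # Each word becomes a tuple of symbols: its characters plus an
    # end-of-word marker, so merges never cross word boundaries.
    word_freqs = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # 1. Count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # 2. Pick the most frequent pair.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # 3. Merge it into a single new token everywhere it occurs.
        new_word_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_word_freqs[tuple(merged)] += freq
        word_freqs = new_word_freqs
    return merges
```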
Let's look at a simplified, illustrative example of how BPE might work on a small piece of text. We'll run two iterations on the following sentence, keeping the explanation minimal so the algorithmic steps stand out.
Sentence: "The sun rose quietly over the hills, casting golden light across the valley as birds chirped softly, welcoming the calm beauty of a new day."
Initial Tokens (with an end-of-word marker `</w>`, and assuming basic pre-tokenization where words are the units):
"The", "sun", "rose", "quietly", "over", "the", "hills", "casting", "golden", "light", "across", "the", "valley", "as", "birds", "chirped", "softly", "welcoming", "the", "calm", "beauty", "of", "a", "new", "day"
(Note: BPE often starts from individual characters of the training corpus. For this example, we're simplifying the starting point.)
Initial Vocabulary (unique characters from the sentence, plus the end-of-word marker):
{'T', 'h', 'e', '</w>', 's', 'u', 'n', 'r', 'o', 'q', 'i', 't', 'l', 'y', 'v', 'c', 'a', 'g', 'd', 'p', 'f', 'm', 'b', 'k', 'w', ... (all unique characters)}
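For concreteness, that initial vocabulary could be built like this (a sketch; the punctuation stripping is a simplification):

```python
sentence = ("The sun rose quietly over the hills, casting golden light across "
            "the valley as birds chirped softly, welcoming the calm beauty of a new day.")
words = [w.strip(",.") for w in sentence.split()]

# Initial vocabulary: every unique character in the corpus, plus the end-of-word marker.
initial_vocab = {"</w>"} | set("".join(words))
print(sorted(initial_vocab))
```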
Iteration 1:
- Count Adjacent Pairs: (Example: "T" "h", "h" "e", "e" "</w>", etc., across the whole corpus)
- Most Frequent Pair: Let's say, hypothetically, after counting, the most frequent pair is ("e", "</w>").
- Merge: Create new token "e</w>".
- Updated Tokens (Conceptual): Words ending in "e" would now use this new token. For example, "rose" goes from "r o s e </w>" to "r o s e</w>". The representation in the text changes: "Th e</w>", "sun", "ros e</w>", "quietly", "over", "th e</w>", "hills", ...
- Updated Vocabulary: Now includes "e</w>" as a token.
{..., 'e</w>'}
Iteration 2:
- Count Adjacent Pairs: (Now including "e</w>" as a potential part of a pair) Let's say the most frequent is ("t", "h").
- Merge: Create new token "th".
- Updated Tokens (Conceptual): Sequences of "t" followed by "h" would now use this token: "Th e</w>", "sun", "ros e</w>", "quietly", "over", "th e</w>", "hills", ...
(Note: "The" would only become "Th" + "e</w>" if "Th" were already a token or became one; since the new token here is lowercase "th", it is "the", previously "t h e</w>", that becomes "th e</w>".)
- Updated Vocabulary: Now includes "th" as a token.
{..., 'e</w>', 'th'}
Result: After two iterations, "e</w>" and "th" have been added to the vocabulary based on the hypothetical frequency of pairs. The sentence's tokenization has been updated accordingly, with these new units treated as single tokens wherever they appear.
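To tie this back to the `learn_bpe` sketch above, running it for two merges on the same sentence performs exactly these steps. (The pairs it actually picks depend on the real counts in this one sentence, so they may differ from the hypothetical ("e", "</w>") and ("t", "h") used in the walkthrough.)

```python
# Reusing `learn_bpe` and `words` from the sketches above.
merges = learn_bpe(words, num_merges=2)
print(merges)  # the two most frequent pairs actually found, in merge order
```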
So, after many thousands of such iterations, the vocabulary will contain a mix of individual characters, common sub-words (like "-ing", "pre-"), and full common words ("love", "hello").
When a ChatGPT user types a word like "loveable", the tokenizer looks it up against its vocabulary. If "love" is token 87 and "able" is token 21, then "loveable" might be broken down into [87, 21], and this sequence of numbers is given as input to the GPT model.
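Here is a rough sketch of that lookup step, using a greedy longest-match over a tiny hypothetical vocabulary (the IDs 87 and 21 are invented; a real BPE tokenizer replays its learned merge rules instead, but the effect of splitting a word into known sub-word IDs is the same):

```python
# Hypothetical vocabulary fragment; the IDs are made up for illustration.
vocab = {"love": 87, "able": 21}

def encode(word, vocab):
    """Greedy longest-match split of `word` into known sub-word IDs."""
    ids, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest piece first
            piece = word[start:end]
            if piece in vocab:
                ids.append(vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"no token for {word[start:]!r}")
    return ids

print(encode("loveable", vocab))  # [87, 21]
```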
There are many websites where you can see tokenizers in action for different LLMs and watch how they break down your text.