
Byte Pair Encoding: building the GPT tokenizer with Karpathy

Reading Time: 15 minutes

👉 Useful links: Youtube lecture, minbpe GitHub repo, Colab notebook


The purpose of this post is to put in writing (part of) Andrej Karpathy's latest lecture on LLM Tokenization. I'll specifically try to cover the Byte Pair Encoding (BPE) algorithm, which is at the core of modern tokenizers and hence a foundational layer of LLMs.

What is a tokenizer and why does it matter?

If you ever had anything to do with text in the good old NLP days, you are surely aware that models don't process strings. They like crunching numbers. So, if I want to train a sentiment analysis classifier on top of "this TV show is amazing" and teach it that this sentence's sentiment is positive, then I first have to transform that sentence into a sequence of integers. The omnipresent LLMs are no exception. Both to train a GPT and to run inference on it, we have to turn strings into integers so that the Transformer can understand them. Encoding text into numbers and decoding numbers back into text is the tokenizer's job.

If you followed the previous lecture about how to build GPT from scratch (blog post here), you probably remember that Andrej trained his version of the language transformer on a chunk of Shakespeare's work, and that the atomic unit of text fed to the model was a single Latin character. In that case, the training data had been split into a list of single characters, for a vocabulary of only 65 tokens. Every character was identified by an integer between 0 and 64 ({"a": 0, "b": 1, "c": 2, ...}). That integer located each character within the trainable embedding layer (or lookup table if you are less fancy), providing the position of the vector to pluck out and ingest into the model (well, the embedding is already part of the model technically 🫣).

Visualization of an embedding layer

That character-level tokenization served the purpose of the lecture, but it was very simple and naive. Modern LLMs need more powerful techniques to split text into tokens. BPE is the most common of those and we'll delve into it in detail in a bit. But let's start from where byte-level BPE made its entrance into the GPT world: the GPT2 paper from 2019, aka "Language Models are Unsupervised Multitask Learners". The sections of interest to us are 2.2 (Input Representation) and 2.3 (Model). Here are some interesting excerpts

Current large scale LMs include pre-processing steps such as lowercasing, tokenization, and out-of-vocabulary tokens which restrict the space of model-able strings. While processing Unicode strings as a sequence of UTF-8 bytes elegantly fulfills this requirement […], current byte-level LMs are not competitive with word-level LMs on large scale datasets […]. Byte Pair Encoding (BPE) […] is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences. Despite its name, reference BPE implementations often operate on Unicode code points and not byte sequences. […] In contrast, a byte-level version of BPE only requires a base vocabulary of size 256. […] The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens […]

Language Models are Unsupervised Multitask Learners

Don't worry if you don't understand this paragraph. We'll learn what the above means as we go deeper. Regardless, a few key takeaways stand out to me here:

  • BPE operates on the UTF-8 byte representation of Unicode strings.
  • BPE starts from a base vocabulary of only 256 tokens and (in the GPT2 case) expands it to more than 50k (how?).
  • The context size of the Transformer doubles from 512 to 1024 tokens (is this somehow related to a richer vocabulary?).

Before answering these questions and understanding what this whole Unicode, BPE, UTF-8 madness is all about, let’s take another step back and motivate why we need to care about the tokenizer in the first place. Splitting text into single characters as we did for Shakespeare kinda worked all right. Is it worth bothering? Andrej answers by providing the following list of problems most of us (LLM users) have stumbled upon at some point. The root cause of all of them is tokenization. Better figure out how that works then.

  • Why can’t LLM spell words? Tokenization.
  • Why can’t LLM do super simple string processing tasks like reversing a string? Tokenization.
  • Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.
  • Why is LLM bad at simple arithmetic? Tokenization.
  • Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.
  • Why does my LLM abruptly halt when it sees the string “<|endoftext|>”? Tokenization.
  • What is this weird warning I get about a “trailing whitespace”? Tokenization.
  • Why does the LLM break if I ask it about “SolidGoldMagikarp”? Tokenization.
  • Why should I prefer to use YAML over JSON with LLMs? Tokenization.
Karpathy’s Colab notebook

Tiktokenizer

To get a sense of the complexity of the tokenization task, take a look at tiktokenizer. This app runs tokenization in JavaScript directly in the browser in real time. It’s the most effective way to understand how a piece of text gets “seen” by an LM. For instance, here’s the difference between the GPT2 and the GPT4 tokenizers encoding the same piece of text.

Left: GPT2 tokenizer. Right: GPT4 (cl100k_base) tokenizer. Use the slider to see the difference. Notice how GPT2 encodes the sentence in 48 tokens, whereas GPT4 in 39.

And here is how Andrej explains it (up until timestamp 14m:50s).

If you watched until the end of the section, you might have had a 💡 moment when Karpathy mentions the following. The GPT4 tokenizer (cl100k_base) uses roughly half the tokens to encode the input text compared to GPT2 (188 vs 300). Assuming a fixed context window for both models, GPT4 can attend to roughly twice as much text as GPT2. Put differently, to predict the next token, GPT4 bases its decision on about twice as much context as GPT2, simply because it is ~2x more efficient at compressing the input text. See how “I am Francesco, I am Italian ” requires 12 tokens for GPT2 and only 8 for GPT4, thanks to the whitespaces at the end being grouped together.

This is interesting. What is this “grouping multiple smaller tokens into a bigger one”? That’s Byte Pair Encoding (+ some tricks on top) at work.

What is text in Python: Unicode and UTF-8

Sorry, but I have to take yet another little detour first. We have been talking about text for a while, but what exactly is "text"? Or better, what is "text" for Python (the programming language we are using)? It's a string, or more formally 👇

Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points

https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str

What are Unicode code points? I asked ChatGPT to explain them to me in very simple terms and this is what it came up with

Sure, I’d be happy to explain Unicode code points in simple terms.

Think of Unicode as a big, universal library of all the characters from every language in the world, along with symbols, emojis, and more. Every character, whether it’s a letter in the English alphabet like “A,” a Chinese character like “汉,” or an emoji like “😊,” is stored in this library.

Now, every book in a library has a unique code that helps you find it. Similarly, in the Unicode library, every character has a unique “code point,” which is just a special code that identifies it. These code points are usually written in a format that looks like “U+0041” for the letter “A”. “U+” is just a way of saying “Unicode,” and the numbers and letters that follow are the unique identifier for the character.

So, when you’re using text in Python, and it talks about Unicode code points, it’s referring to these unique codes for each character in your text. Because Python uses Unicode, it means you can represent and work with text from any language or set of symbols uniformly.

ChatGPT from OpenAI

In Python, we can use the ord function to access the Unicode code point of a character. Those are the code points for 안녕하세요 👋 (hello in Korean!)
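
For the curious, here is a minimal snippet reproducing that (the integers are simply what ord returns for each character; ord("A") is thrown in for comparison with plain ASCII):

text = "안녕하세요 👋 (hello in Korean!)"
print([ord(ch) for ch in text][:5])
# [50504, 45397, 54616, 49464, 50836] -> the code points of the five Korean syllables
print(ord("A"))
# 65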

Why can't we use these Unicode integers directly to encode a string? Because the vocabulary would be very big (~150k code points) and still pretty much alive. An ever-changing beast.

Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text written in all of the world's major writing systems. Version 15.1 of the standard defines 149,813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts.

https://en.wikipedia.org/wiki/Unicode

We need something better. To find it we turn to encodings.

Unicode text is processed and stored as binary data using one of several encodings, which define how to translate the standard's abstracted codes for characters into sequences of bytes. The Unicode Standard itself defines three encodings: UTF-8, UTF-16, and UTF-32, though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with ASCII.

https://en.wikipedia.org/wiki/Unicode

So UTF-8 is the most common by far. It converts each Unicode code point into a sequence of 1 to 4 bytes. The way to encode a string in UTF-8 in Python is "string".encode("utf-8"). We get the raw bytes of the string's representation in that encoding by calling list() on the encoded text. Like this.
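
If you are not following along in the notebook, here is a tiny reproduction of that step (the 👋 emoji is only there to show the multi-byte case):

text = "hello 👋"
encoded = text.encode("utf-8")  # a bytes object
print(list(encoded))            # the raw byte values, each between 0 and 255
# [104, 101, 108, 108, 111, 32, 240, 159, 145, 139]
# "hello " -> 1 byte per character; the 👋 emoji -> 4 bytes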

But if we stuck to raw UTF-8, we'd have to work with byte streams, i.e. numbers between 0 and 255. This is too small a vocabulary. Tokenizing text with it would result in very long, stretched-out sequences of bytes: simple ASCII characters correspond to a single byte, whereas more exotic Unicode code points (🅤🅝🅘🅒🅞🅓🅔) translate to multiple bytes each. From the Transformer's perspective, we'd have a tiny embedding layer, an equally small final layer, and a super long stretched text in between. Given the finite context length, the model would not be able to attend to enough text to come up with a good prediction. So we want to keep UTF-8 bytes as the foundation, but we also want a larger vocabulary size, one we can tune as a hyper-parameter. How can we get there? Meet the Byte Pair Encoding (BPE) algorithm.

Byte Pair Encoding (BPE)

High-level principles

Let’s see how the algo works on the dummy string used on the BPE Wikipedia page: aaabdaaabac. The core of the strategy is to iteratively identify pairs of consecutive characters that appear the most often, define new tokens based on those, and replace them in the string. The result is that the text will progressively get shorter and the vocabulary larger.
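
To make this concrete, here is a tiny sketch of those merge rounds on the Wikipedia string, using str.replace as a stand-in for the real pair-merge and the same placeholder symbols (Z, Y, X) the Wikipedia page uses:

s = "aaabdaaabac"
s = s.replace("aa", "Z")  # round 1: "aa" is the most frequent pair -> new token Z => "ZabdZabac"
s = s.replace("ab", "Y")  # round 2: "ab" is now the most frequent pair -> Y => "ZYdZYac"
s = s.replace("ZY", "X")  # round 3: "ZY" is the most frequent pair -> X => "XdXac"
print(s)
# XdXac: 11 characters compressed down to 5, at the cost of 3 extra vocabulary entries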

BPE from scratch

👉 You can follow along in Karpathy’s Colab notebook. I have played around with some of Andrej’s functions, as a couple were not super readable to me. So the screenshots I paste below might not show exactly the original code.

Let's do the same on a longer dataset, starting from our UTF-8 byte-based vocabulary of 256 tokens. We'll end up with a compressed version of the text and a larger vocabulary, together with routines to encode arbitrary text into tokens and decode tokens back into text.

Here’s what’s needed to come up with a functioning tokenizer:

  1. Train the tokenizer: this means applying BPE to an arbitrarily large corpus of data. The goal is to obtain byte-merging rules and a richer vocabulary than the 0-255 one that plain bytes offer. The vocabulary we get at this stage should be flexible and large enough to effectively encode whatever piece of text the LLM will receive at training and inference time. If we train the tokenizer on English text only, no matter how big and varied the corpus, we cannot expect the LLM to do well on Korean inputs. And that would be the tokenizer's fault, as it would not split the Korean text effectively.
  2. Write 2 functions: an encoder, to encode text into tokens given the merging rules, and a decoder, to decode tokens back to text, using the vocabulary enriched with the merging rules.

That’s about it.

The tokenizer is an independent entity with respect to the LLM

Training the tokenizer

To get started we grab the first paragraph of A Programmer’s Introduction to Unicode. Here it is

Unicode! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.

A Programmer’s Introduction to Unicode

It is 533 characters long. When it is encoded in UTF-8 though, the length increases to 616. That’s because more complicated code points, such as 🅤🅝🅘🅒🅞🅓🅔, require more than 1 byte to be represented.

The first thing to do is to iterate over the tokens and count the pairs of consecutive bytes appearing most often. It turns out the pair (101, 32) appears 20 times. That makes sense, given it translates to e followed by a space (think of how often English words ending in e, like "the", are followed by a space). Quite a common pattern.
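
In code, the counting step looks roughly like the get_stats function from the notebook (here text is assumed to hold the paragraph above):

def get_stats(ids):
    # count how often each pair of consecutive tokens appears
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

tokens = list(text.encode("utf-8"))   # the 616 UTF-8 bytes of the paragraph
stats = get_stats(tokens)
top_pair = max(stats, key=stats.get)  # (101, 32), i.e. "e" followed by a space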

Now that we have the top pairs sorted in descending order of frequency, we can iteratively replace each of them with a new token. The logic is commented in the screenshot 👇. Let's check that the merge function works by applying it to the list [5, 6, 6, 7, 9, 1], replacing the pair (6, 7) with the new token 99. The output is as expected: [5, 6, 99, 9, 1].
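
For reference, a version of the merge function close to the one in the notebook:

def merge(ids, pair, idx):
    # replace every consecutive occurrence of `pair` in `ids` with the new token `idx`
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))
# [5, 6, 99, 9, 1]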

Now we can apply the merge function to the original list of tokens, replacing the top_pair (101, 32) with 256 (remember our vocabulary stretched up to 255, so 256 is indeed a new token). We end up with 596 tokens instead of the original 616. Exactly 20 fewer (the top_pair's frequency). So the vocabulary grows by 1 and the text shrinks by 20 tokens. Good start.

Now that we have the ball rolling, let's be more ambitious and use the entire blog post A Programmer's Introduction to Unicode. It will give us better statistics than just the first paragraph. This is 👇 24597 bytes long (after UTF-8 encoding). We want to reach a vocabulary of size 276, which means we have to perform 20 merges in total, given that we start from a vocabulary of 256 elements. We do the following (a code sketch of the loop follows the list):

  1. Compute with get_stats the pairs of consecutive characters appearing the most often
  2. Get the top pair by frequency
  3. Replace it with a new token
  4. Repeat 20 times
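
Putting it together, the training loop is just a few lines. A sketch, reusing get_stats and merge from above and assuming tokens holds the UTF-8 byte stream of the full blog post:

vocab_size = 276               # the target vocabulary size we picked
num_merges = vocab_size - 256  # 20 merges on top of the 256 raw byte tokens

ids = list(tokens)  # copy, so we don't destroy the original byte stream
merges = {}         # (int, int) -> int, the learned merging rules

for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most frequent pair at this step
    idx = 256 + i                     # id of the new token
    ids = merge(ids, pair, idx)
    merges[pair] = idx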

The first new token is 256 (256 + 0), then 257 (256 + 1), and so on. 🚨 Notice how tokens 256 and 259, created by the 1st and 4th merges, also get merged together into a new token at the very end: 275. The order of the merging steps matters, as the tokens created by each replacement become candidates for merging themselves!

We end up with a vocabulary of 276 tokens (from 256) and a text encoded in 19438 tokens (from 24597). A compression ratio of 1.27x. Not bad 😎.

We have a vocabulary and a set of merging rules. We can move to encoding and decoding.

The encoder

This is easy. We already have the code. We turn the text we want to encode into UTF-8 bytes and apply the merging rules in the order we created them during training (🚨 this works only because Python dicts preserve insertion order, so looping over merges.items() returns the rules in the order they were added). Like so 👇. Notice how, when encoding the gpts are cool, we end up with several of the "custom" tokens we created during training: 256 (the merge of e and whitespace) is there, for instance.
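
A sketch of the encoder along these lines, leaning on the merge function and the merges dict built during training (the notebook implements the same idea slightly differently, by repeatedly merging the pair with the lowest merge index):

def encode(text):
    tokens = list(text.encode("utf-8"))
    # apply the merging rules in the exact order they were learned
    for pair, idx in merges.items():
        tokens = merge(tokens, pair, idx)
    return tokens

print(encode("the gpts are cool"))  # contains custom tokens such as 256 ("e" + whitespace)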

The decoder

This is relatively simple as well. We first have to enrich our vocabulary: we have the merging rules, but we never updated the original 0-255 byte vocabulary with the new tokens. Once that is done, it is just a matter of iterating over the tokens, mapping them to their corresponding byte sequences, and decoding those bytes back into Unicode characters. As 👇.
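
A sketch of the decoder, again assuming the merges dict from training:

# enrich the base 0-255 byte vocabulary with the merged tokens,
# in the order the merges were created
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    # map token ids to their byte sequences, then decode the bytes as UTF-8 text
    raw = b"".join(vocab[idx] for idx in ids)
    return raw.decode("utf-8", errors="replace")

print(decode(encode("the gpts are cool")))  # round-trips back to the original string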

There is a catch here. What does errors="replace" mean in the UTF-8 decode call? Sometimes, when we map tokens back to bytes, we end up with byte sequences that are not valid UTF-8 and that Python cannot decode. errors="replace" avoids throwing an exception and returns a default replacement character instead: �. Here is how Andrej explains it (up until timestamp 48m:21s).

Not all text was born equal: regex patterns

Is that all? Have we just implemented a GPT-level tokenizer? Not really. The culprit is explained in this section of the GPT2 paper

However, directly applying BPE to the byte sequence results in suboptimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence.

Language Models are Unsupervised Multitask Learners

How does OpenAI "prevent BPE from merging across character categories"? By using a regex pattern: they split the input text into a list of separate chunks, run BPE separately on each of them, and then concatenate the results (as Andrej explains here, this is not entirely accurate, but it gets pretty close). The pattern in question is this one

# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

Here's how it looks in action

import regex as re
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pat, "Hello've world123 how's     are you!!!?   "))

#output
#['Hello', "'ve", ' world', '123', ' how', "'s", '    ', ' are', ' you', '!!!?', '   ']

I asked ChatGPT to explain it to me and here is what it came up with (post edited for better clarity):

  1. 's|'t|'re|'ve|'m|'ll|'d: This part matches specific contractions that follow words. These are common English contractions like 's (is or has), 't (not), 're (are), 've (have), 'm (am), 'll (will), and 'd (had or would).
    • This is what captures 've and 's
  2.  ?\p{L}+: Matches one or more letters (\p{L}+), optionally preceded by a space ( ?). \p{L} represents any kind of letter from any language. This part captures words, allowing for a leading space.
    • This is what captures 'Hello', ' world', ' how', ' are', ' you'. Notice how the leading whitespace is incorporated within the consecutive sequence of characters.
  3.  ?\p{N}+: Similar to the previous one, this matches one or more numeric characters (\p{N}+), optionally preceded by a space ( ?). \p{N} represents any kind of numeric digit. This part captures numbers, again allowing for a leading space.
    • This is what captures '123' and splits it from ' world'.
  4.  ?[^\s\p{L}\p{N}]+: Matches one or more characters that are not spaces (\s), letters (\p{L}), or numeric digits (\p{N}), optionally preceded by a space ( ?). This is used for capturing symbols and punctuation marks, potentially with a leading space.
    • This is what matches the punctuation !!!?.
  5. \s+(?!\S): Matches one or more spaces (\s+), but only if they are not followed by a non-space character. (?!\S) is a negative lookahead that ensures there is no non-space character immediately ahead. This captures runs of whitespace up to (but excluding) the space that belongs to the next token, as well as trailing whitespace at the end of the string.
    • This is what matches the 4 whitespaces between 's and ' are' (and the 3 spaces at the very end). Note how the single whitespace preceding ' are' is kept separate.
  6. \s+: Matches one or more spaces. This is a catch-all for any remaining whitespace the previous alternatives do not cover.
    • In this particular example everything is already handled by the previous rules, but this catch-all matters for inputs with other kinds of whitespace (e.g. newlines sitting right before a word).

Notice also how the regex pattern used in the GPT2 tokenizer is far from the most robust. It's case-sensitive, which causes some serious headaches (as acknowledged here). It's very English-centric, and the apostrophes it hardcodes are not even all the possible ones Unicode supports.

Is it better for GPT4?

Yes. We can check the different behaviors of the two tokenizers using OpenAI's tiktoken library (the tiktokenizer app works too, of course). Here's the GPT4 regex pattern compared to the GPT2 one: it looks different, but in practice it is a refined version of the regex we explored above. With GPT4, they seem to have addressed some of the issues previously highlighted, e.g. contractions are now matched case-insensitively. Also, keep in mind that the GPT4 tokenizer's vocabulary is roughly twice the size of GPT2's (~100k vs ~50k tokens).
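
If you want to poke at this yourself, a quick comparison with tiktoken looks something like this (the whitespace-heavy string is just a convenient example):

import tiktoken  # pip install tiktoken

gpt2_enc = tiktoken.get_encoding("gpt2")         # GPT2 tokenizer (~50k vocabulary)
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT4 tokenizer (~100k vocabulary)

text = "    hello world!!!"
print(gpt2_enc.encode(text))  # GPT2 tends to spend one token per leading space
print(gpt4_enc.encode(text))  # GPT4 groups the run of spaces into fewer tokens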

Special tokens

It's interesting to see what happens when we start typing '<|endoftext|>' in tiktokenizer using the GPT4 tokenizer. Up until the very last character, the tokenizer treats the string as we would expect, splitting it into the tokens assigned by BPE during training. When we type the final character though, the whole string collapses into a single token: 100257, which lives outside the regular BPE vocabulary (cl100k_base's ordinary tokens stop at 100255). What is this? It's the string used to separate unrelated documents in the training corpus. E.g. if the corpus contains Taylor Swift's Wikipedia page followed by a recipe for pancakes, we cannot just stitch them together as if the pancake recipe were a natural continuation of Taylor Swift's life. We have to let the model know: "Hey! Pay attention here! This is something different. Erase your memory and start over." This is logical, but it also means the tokenizer needs some special rules baked in to make sure BPE splits are never applied across these tokens, which are, well, special. '<|endoftext|>' is far from the only special token out there, but you get the idea.
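
Outside of the app, you can reproduce this with tiktoken. Note that by default it refuses to encode special tokens found in the input, so you have to allow them explicitly:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# enc.encode("<|endoftext|>")  # raises an error by default, as a safety measure
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
# [100257]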

Quirks of LLM tokenization

This post is already quite long, so I am not going to put in writing the last section of Karpathy’s lecture, about quirks of LLM tokenization. It’s the natural conclusion of this journey, as we are now equipped to understand a bunch of weird behaviors of these models. All of them attributable to tokenization. I encourage you to enjoy it directly from Andrej ☝️. It’s quite hilarious and very instructive.

I am mentioning explicitly the <|endoftext|> one as we just saw special tokens, but I advise checking out SolidGoldMagikarp 🐟 🤔 to have a good laugh.

Why did my LLM abruptly halt when it sees the string “<|endoftext|>”?

What happens when we ask GPT4 to print the special token <|endoftext|>? Nothing. We break it completely. All my attempts caused the LLM to melt down. See how I tried prompting it again, but I am not even marked as USER anymore. I was able to type under the ASSISTANT text box, which is where GPT4 should have replied to me 🤯. The entire UI is down 😂.

I broke GPT4
