Add simpler BPE, and make previous BPE better (#870)

* Add simpler BPE, and make previous BPE better

* update

* Update README.md
Sebastian Raschka 2025-10-08 22:22:34 -05:00 committed by GitHub
parent 1164cb3e8f
commit fecfdd16ff
6 changed files with 1223 additions and 122 deletions

.gitignore

@@ -85,6 +85,8 @@ Qwen3-0.6B/
tokenizer-base.json
tokenizer-reasoning.json
tokenizer.json
config.json
bpe_merges.txt
# Datasets
the-verdict.txt

README.md

@@ -158,7 +158,7 @@ Several folders contain optional materials as a bonus for interested readers:
- [Installing Python Packages and Libraries Used In This Book](setup/02_installing-python-libraries)
- [Docker Environment Setup Guide](setup/03_optional-docker-environment)
- **Chapter 2: Working with text data**
- [Byte Pair Encoding (BPE) Tokenizer From Scratch](ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb)
- [Byte Pair Encoding (BPE) Tokenizer From Scratch](ch02/05_bpe-from-scratch/bpe-from-scratch-simple.ipynb)
- [Comparing Various Byte Pair Encoding (BPE) Implementations](ch02/02_bonus_bytepair-encoder)
- [Understanding the Difference Between Embedding Layers and Linear Layers](ch02/03_bonus_embedding-vs-matmul)
- [Dataloader Intuition with Simple Numbers](ch02/04_bonus_dataloader-intuition)

ch02/05_bpe-from-scratch/README.md

@@ -1,3 +1,5 @@
# Byte Pair Encoding (BPE) Tokenizer From Scratch
- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood.
- [bpe-from-scratch-simple.ipynb](bpe-from-scratch-simple.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood; it is geared toward simplicity and readability.
- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) implements a more sophisticated (and much more complicated) BPE tokenizer that behaves similarly to tiktoken with respect to all the edge cases; it also has additional functionality for loading the official GPT-2 vocab.

ch02/05_bpe-from-scratch/bpe-from-scratch-simple.ipynb

@@ -0,0 +1,970 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9dec0dfb-3d60-41d0-a63a-b010dce67e32",
"metadata": {},
"source": [
"<table style=\"width:100%\">\n",
"<tr>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<font size=\"2\">\n",
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
"</font>\n",
"</td>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
"</td>\n",
"</tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "5e475425-8300-43f2-a5e8-6b5d2de59925",
"metadata": {},
"source": [
"# Byte Pair Encoding (BPE) Tokenizer From Scratch -- Simple"
]
},
{
"cell_type": "markdown",
"id": "a1bfc3f3-8ec1-4fd3-b378-d9a3d7807a54",
"metadata": {},
"source": [
"- This is a standalone notebook implementing the popular byte pair encoding (BPE) tokenization algorithm, which is used in models like GPT-2 to GPT-4, Llama 3, etc., from scratch for educational purposes\n",
"- For more details about the purpose of tokenization, please refer to [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb); this code here is bonus material explaining the BPE algorithm\n",
"- The original BPE tokenizer that OpenAI implemented for training the original GPT models can be found [here](https://github.com/openai/gpt-2/blob/master/src/encoder.py)\n",
"- The BPE algorithm was originally described in 1994: \"[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)\" by Philip Gage\n",
"- Most projects, including Llama 3, nowadays use OpenAI's open-source [tiktoken library](https://github.com/openai/tiktoken) due to its computational performance; it allows loading pretrained GPT-2 and GPT-4 tokenizers, for example (the Llama 3 models were trained using the GPT-4 tokenizer as well)\n",
"- The difference between the implementations above and my implementation in this notebook, besides it being is that it also includes a function for training the tokenizer (for educational purposes)\n",
"- There's also an implementation called [minBPE](https://github.com/karpathy/minbpe) with training support, which is maybe more performant (my implementation here is focused on educational purposes); in contrast to `minbpe` my implementation additionally allows loading the original OpenAI tokenizer vocabulary and merges"
]
},
{
"cell_type": "markdown",
"id": "910acd61-8947-4cfa-962f-16f4c733f2db",
"metadata": {},
"source": [
"**This is a very naive implementation for educational purposes. The [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) notebook contains a more sophisticated (but much harder to read) implementation that matches the behavior in tiktoken.**"
]
},
{
"cell_type": "markdown",
"id": "f62336db-f45c-4894-9167-7583095dbdf1",
"metadata": {},
"source": [
"&nbsp;\n",
"# 1. The main idea behind byte pair encoding (BPE)"
]
},
{
"cell_type": "markdown",
"id": "cd3f1231-bd42-41b5-a017-974b8c660a44",
"metadata": {},
"source": [
"- The main idea in BPE is to convert text into an integer representation (token IDs) for LLM training (see [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb))\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/bpe-overview.webp\" width=\"600px\">"
]
},
{
"cell_type": "markdown",
"id": "760c625d-26a1-4896-98a2-0fdcd1591256",
"metadata": {},
"source": [
"&nbsp;\n",
"## 1.1 Bits and bytes"
]
},
{
"cell_type": "markdown",
"id": "d4ddaa35-0ed7-4012-827e-911de11c266c",
"metadata": {},
"source": [
"- Before getting to the BPE algorithm, let's introduce the notion of bytes\n",
"- Consider converting text into a byte array (BPE stands for \"byte\" pair encoding after all):"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8c9bc9e4-120f-4bac-8fa6-6523c568d12e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bytearray(b'This is some text')\n"
]
}
],
"source": [
"text = \"This is some text\"\n",
"byte_ary = bytearray(text, \"utf-8\")\n",
"print(byte_ary)"
]
},
{
"cell_type": "markdown",
"id": "dbd92a2a-9d74-4dc7-bb53-ac33d6cf2fab",
"metadata": {},
"source": [
"- When we call `list()` on a `bytearray` object, each byte is treated as an individual element, and the result is a list of integers corresponding to the byte values:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6c586945-d459-4f9a-855d-bf73438ef0e3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116]\n"
]
}
],
"source": [
"ids = list(byte_ary)\n",
"print(ids)"
]
},
{
"cell_type": "markdown",
"id": "71efea37-f4c3-4cb8-bfa5-9299175faf9a",
"metadata": {},
"source": [
"- This would be a valid way to convert text into a token ID representation that we need for the embedding layer of an LLM\n",
"- However, the downside of this approach is that it is creating one ID for each character (that's a lot of IDs for a short text!)\n",
"- I.e., this means for a 17-character input text, we have to use 17 token IDs as input to the LLM:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0d5b61d9-79a0-48b4-9b3e-64ab595c5b01",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of characters: 17\n",
"Number of token IDs: 17\n"
]
}
],
"source": [
"print(\"Number of characters:\", len(text))\n",
"print(\"Number of token IDs:\", len(ids))"
]
},
{
"cell_type": "markdown",
"id": "68cc833a-c0d4-4d46-9180-c0042fd6addc",
"metadata": {},
"source": [
"- If you have worked with LLMs before, you may know that the BPE tokenizers have a vocabulary where we have a token ID for whole words or subwords instead of each character\n",
"- For example, the GPT-2 tokenizer tokenizes the same text (\"This is some text\") into only 4 instead of 17 tokens: `1212, 318, 617, 2420`\n",
"- You can double-check this using the interactive [tiktoken app](https://tiktokenizer.vercel.app/?model=gpt2) or the [tiktoken library](https://github.com/openai/tiktoken):\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/tiktokenizer.webp\" width=\"600px\">\n",
"\n",
"```python\n",
"import tiktoken\n",
"\n",
"gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"gpt2_tokenizer.encode(\"This is some text\")\n",
"# prints [1212, 318, 617, 2420]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "425b99de-cbfc-441c-8b3e-296a5dd7bb27",
"metadata": {},
"source": [
"- Since a byte consists of 8 bits, there are 2<sup>8</sup> = 256 possible values that a single byte can represent, ranging from 0 to 255\n",
"- You can confirm this by executing the code `bytearray(range(0, 257))`, which will warn you that `ValueError: byte must be in range(0, 256)`)\n",
"- A BPE tokenizer usually uses these 256 values as its first 256 single-character tokens; one could visually check this by running the following code:\n",
"\n",
"```python\n",
"import tiktoken\n",
"gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"\n",
"for i in range(300):\n",
" decoded = gpt2_tokenizer.decode([i])\n",
" print(f\"{i}: {decoded}\")\n",
"\"\"\"\n",
"prints:\n",
"0: !\n",
"1: \"\n",
"2: #\n",
"...\n",
"255: <20> # <---- single character tokens up to here\n",
"256: t\n",
"257: a\n",
"...\n",
"298: ent\n",
"299: n\n",
"\"\"\"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "97ff0207-7f8e-44fa-9381-2a4bd83daab3",
"metadata": {},
"source": [
"- Above, note that entries 256 and 257 are not single-character values but double-character values (a whitespace + a letter), which is a little shortcoming of the original GPT-2 BPE Tokenizer (this has been improved in the GPT-4 tokenizer)"
]
},
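{
"cell_type": "markdown",
"id": "3a1f2b3c-1e2d-4f5a-9b8c-277000000001",
"metadata": {},
"source": [
"- A quick way to make those leading whitespaces visible (assuming you have tiktoken installed) is to decode IDs 256 and 257 with `repr()`:\n",
"\n",
"```python\n",
"import tiktoken\n",
"\n",
"gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"print(repr(gpt2_tokenizer.decode([256])))  # ' t' -- a space plus a letter\n",
"print(repr(gpt2_tokenizer.decode([257])))  # ' a'\n",
"```"
]
},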
{
"cell_type": "markdown",
"id": "8241c23a-d487-488d-bded-cdf054e24920",
"metadata": {},
"source": [
"&nbsp;\n",
"## 1.2 Building the vocabulary"
]
},
{
"cell_type": "markdown",
"id": "d7c2ceb7-0b3f-4a62-8dcc-07810cd8886e",
"metadata": {},
"source": [
"- The goal of the BPE tokenization algorithm is to build a vocabulary of commonly occurring subwords like `298: ent` (which can be found in *entangle, entertain, enter, entrance, entity, ...*, for example), or even complete words like \n",
"\n",
"```\n",
"318: is\n",
"617: some\n",
"1212: This\n",
"2420: text\n",
"```"
]
},
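{
"cell_type": "markdown",
"id": "3a1f2b3c-1e2d-4f5a-9b8c-301000000002",
"metadata": {},
"source": [
"- You can verify these entries by decoding them with tiktoken; note that GPT-2 stores a leading space as part of a token, so all but the first entry decode with a space in front (omitted in the listing above):\n",
"\n",
"```python\n",
"import tiktoken\n",
"\n",
"gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"for token_id in [1212, 318, 617, 2420]:\n",
"    print(f\"{token_id}: {gpt2_tokenizer.decode([token_id])!r}\")\n",
"# prints:\n",
"# 1212: 'This'\n",
"# 318: ' is'\n",
"# 617: ' some'\n",
"# 2420: ' text'\n",
"```"
]
},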
{
"cell_type": "markdown",
"id": "8c0d4420-a4c7-4813-916a-06f4f46bc3f0",
"metadata": {},
"source": [
"- The BPE algorithm was originally described in 1994: \"[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)\" by Philip Gage\n",
"- Before we get to the actual code implementation, the form that is used for LLM tokenizers today can be summarized as follows:"
]
},
{
"cell_type": "markdown",
"id": "ebc71db9-b070-48c4-8412-81f45b308ab3",
"metadata": {},
"source": [
"&nbsp;\n",
"## 1.3 BPE algorithm outline\n",
"\n",
"**1. Identify frequent pairs**\n",
"- In each iteration, scan the text to find the most commonly occurring pair of bytes (or characters)\n",
"\n",
"**2. Replace and record**\n",
"\n",
"- Replace that pair with a new placeholder ID (one not already in use, e.g., if we start with 0...255, the first placeholder would be 256)\n",
"- Record this mapping in a lookup table\n",
"- The size of the lookup table is a hyperparameter, also called \"vocabulary size\" (for GPT-2, that's\n",
"50,257)\n",
"\n",
"**3. Repeat until no gains**\n",
"\n",
"- Keep repeating steps 1 and 2, continually merging the most frequent pairs\n",
"- Stop when no further compression is possible (e.g., no pair occurs more than once)\n",
"\n",
"**Decompression (decoding)**\n",
"\n",
"- To restore the original text, reverse the process by substituting each ID with its corresponding pair, using the lookup table\n",
"\n"
]
},
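{
"cell_type": "markdown",
"id": "3a1f2b3c-1e2d-4f5a-9b8c-339000000003",
"metadata": {},
"source": [
"- As a minimal code sketch of steps 1 and 2 above (the helper name `merge_once` is made up here for illustration; the full `train()` method follows in section 2):\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def merge_once(ids, new_id):\n",
"    # Step 1: count all adjacent pairs and pick the most frequent one\n",
"    pairs = Counter(zip(ids, ids[1:]))\n",
"    if not pairs or max(pairs.values()) < 2:\n",
"        return ids, None  # step 3: stop when no pair occurs more than once\n",
"    pair = max(pairs, key=pairs.get)\n",
"    # Step 2: replace every occurrence of that pair with the new placeholder ID\n",
"    out, i = [], 0\n",
"    while i < len(ids):\n",
"        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:\n",
"            out.append(new_id)\n",
"            i += 2\n",
"        else:\n",
"            out.append(ids[i])\n",
"            i += 1\n",
"    return out, pair\n",
"\n",
"ids = list(\"the cat in the hat\".encode(\"utf-8\"))\n",
"ids, pair = merge_once(ids, new_id=256)\n",
"print(pair)  # (116, 104), i.e., the bytes of \"th\"\n",
"print(ids)   # both \"th\" occurrences replaced by 256\n",
"```"
]
},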
{
"cell_type": "markdown",
"id": "e9f5ac9a-3528-4186-9468-8420c7b2ac00",
"metadata": {},
"source": [
"&nbsp;\n",
"## 1.4 BPE algorithm example\n",
"\n",
"### 1.4.1 Concrete example of the encoding part (steps 1 & 2)\n",
"\n",
"- Suppose we have the text (training dataset) `the cat in the hat` from which we want to build the vocabulary for a BPE tokenizer\n",
"\n",
"**Iteration 1**\n",
"\n",
"1. Identify frequent pairs\n",
" - In this text, \"th\" appears twice (at the beginning and before the second \"e\")\n",
"\n",
"2. Replace and record\n",
" - replace \"th\" with a new token ID that is not already in use, e.g., 256\n",
" - the new text is: `<256>e cat in <256>e hat`\n",
" - the new vocabulary is\n",
"\n",
"```\n",
" 0: ...\n",
" ...\n",
" 256: \"th\"\n",
"```\n",
"\n",
"**Iteration 2**\n",
"\n",
"1. **Identify frequent pairs** \n",
" - In the text `<256>e cat in <256>e hat`, the pair `<256>e` appears twice\n",
"\n",
"2. **Replace and record** \n",
" - replace `<256>e` with a new token ID that is not already in use, for example, `257`. \n",
" - The new text is:\n",
" ```\n",
" <257> cat in <257> hat\n",
" ```\n",
" - The updated vocabulary is:\n",
" ```\n",
" 0: ...\n",
" ...\n",
" 256: \"th\"\n",
" 257: \"<256>e\"\n",
" ```\n",
"\n",
"**Iteration 3**\n",
"\n",
"1. **Identify frequent pairs** \n",
" - In the text `<257> cat in <257> hat`, the pair `<257> ` appears twice (once at the beginning and once before “hat”).\n",
"\n",
"2. **Replace and record** \n",
" - replace `<257> ` with a new token ID that is not already in use, for example, `258`. \n",
" - the new text is:\n",
" ```\n",
" <258>cat in <258>hat\n",
" ```\n",
" - The updated vocabulary is:\n",
" ```\n",
" 0: ...\n",
" ...\n",
" 256: \"th\"\n",
" 257: \"<256>e\"\n",
" 258: \"<257> \"\n",
" ```\n",
" \n",
"- and so forth\n",
"\n",
"&nbsp;\n",
"### 1.4.2 Concrete example of the decoding part (steps 3)\n",
"\n",
"- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair in the reverse order they were introduced\n",
"- Start with the final compressed text: `<258>cat in <258>hat`\n",
"- Substitute `<258>` → `<257> `: `<257> cat in <257> hat` \n",
"- Substitute `<257>` → `<256>e`: `<256>e cat in <256>e hat`\n",
"- Substitute `<256>` → \"th\": `the cat in the hat`"
]
},
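{
"cell_type": "markdown",
"id": "3a1f2b3c-1e2d-4f5a-9b8c-418000000004",
"metadata": {},
"source": [
"- The sketch below replays the three merges from the example above and then reverses them; the `merges` dictionary is written out by hand to mirror the example rather than produced by a training run:\n",
"\n",
"```python\n",
"text = \"the cat in the hat\"\n",
"merges = {(\"t\", \"h\"): 256, (256, \"e\"): 257, (257, \" \"): 258}  # iterations 1-3\n",
"\n",
"# Encoding: apply the merges in the order they were learned\n",
"tokens = list(text)\n",
"for pair, new_id in merges.items():\n",
"    out, i = [], 0\n",
"    while i < len(tokens):\n",
"        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:\n",
"            out.append(new_id)\n",
"            i += 2\n",
"        else:\n",
"            out.append(tokens[i])\n",
"            i += 1\n",
"    tokens = out\n",
"print(tokens)  # [258, 'c', 'a', 't', ' ', 'i', 'n', ' ', 258, 'h', 'a', 't']\n",
"\n",
"# Decoding: substitute each ID back in the reverse order it was introduced\n",
"vocab = {new_id: pair for pair, new_id in merges.items()}\n",
"for new_id in sorted(vocab, reverse=True):\n",
"    a, b = vocab[new_id]\n",
"    tokens = [x for t in tokens for x in ((a, b) if t == new_id else (t,))]\n",
"print(\"\".join(tokens))  # the cat in the hat\n",
"```"
]
},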
{
"cell_type": "markdown",
"id": "a2324948-ddd0-45d1-8ba8-e8eda9fc6677",
"metadata": {},
"source": [
"&nbsp;\n",
"## 2. A simple BPE implementation"
]
},
{
"cell_type": "markdown",
"id": "429ca709-40d7-4e3d-bf3e-4f5687a2e19b",
"metadata": {},
"source": [
"- Below is an implementation of this algorithm described above as a Python class that mimics the `tiktoken` Python user interface\n",
"- Note that the encoding part above describes the original training step via `train()`; however, the `encode()` method works similarly (although it looks a bit more complicated because of the special token handling):\n",
"\n",
"1. Split the input text into individual bytes\n",
"2. Repeatedly find & replace (merge) adjacent tokens (pairs) when they match any pair in the learned BPE merges (from highest to lowest \"rank,\" i.e., in the order they were learned)\n",
"3. Continue merging until no more merges can be applied\n",
"4. The final list of token IDs is the encoded output"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3e4a15ec-2667-4f56-b7c1-34e8071b621d",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter, deque\n",
"from functools import lru_cache\n",
"\n",
"\n",
"class BPETokenizerSimple:\n",
" def __init__(self):\n",
" # Maps token_id to token_str (e.g., {11246: \"some\"})\n",
" self.vocab = {}\n",
" # Maps token_str to token_id (e.g., {\"some\": 11246})\n",
" self.inverse_vocab = {}\n",
" # Dictionary of BPE merges: {(token_id1, token_id2): merged_token_id}\n",
" self.bpe_merges = {}\n",
"\n",
" def train(self, text, vocab_size, allowed_special={\"<|endoftext|>\"}):\n",
" \"\"\"\n",
" Train the BPE tokenizer from scratch.\n",
"\n",
" Args:\n",
" text (str): The training text.\n",
" vocab_size (int): The desired vocabulary size.\n",
" allowed_special (set): A set of special tokens to include.\n",
" \"\"\"\n",
"\n",
" # Preprocess: Replace spaces with 'Ġ'\n",
" # Note that Ġ is a particularity of the GPT-2 BPE implementation\n",
" # E.g., \"Hello world\" might be tokenized as [\"Hello\", \"Ġworld\"]\n",
" # (GPT-4 BPE would tokenize it as [\"Hello\", \" world\"])\n",
" processed_text = []\n",
" for i, char in enumerate(text):\n",
" if char == \" \" and i != 0:\n",
" processed_text.append(\"Ġ\")\n",
" if char != \" \":\n",
" processed_text.append(char)\n",
" processed_text = \"\".join(processed_text)\n",
"\n",
" # Initialize vocab with unique characters, including 'Ġ' if present\n",
" # Start with the first 256 ASCII characters\n",
" unique_chars = [chr(i) for i in range(256)]\n",
"\n",
" # Extend unique_chars with characters from processed_text that are not already included\n",
" unique_chars.extend(char for char in sorted(set(processed_text)) if char not in unique_chars)\n",
"\n",
" # Optionally, ensure 'Ġ' is included if it is relevant to your text processing\n",
" if 'Ġ' not in unique_chars:\n",
" unique_chars.append('Ġ')\n",
"\n",
" # Now create the vocab and inverse vocab dictionaries\n",
" self.vocab = {i: char for i, char in enumerate(unique_chars)}\n",
" self.inverse_vocab = {char: i for i, char in self.vocab.items()}\n",
"\n",
" # Add allowed special tokens\n",
" if allowed_special:\n",
" for token in allowed_special:\n",
" if token not in self.inverse_vocab:\n",
" new_id = len(self.vocab)\n",
" self.vocab[new_id] = token\n",
" self.inverse_vocab[token] = new_id\n",
"\n",
" # Tokenize the processed_text into token IDs\n",
" token_ids = [self.inverse_vocab[char] for char in processed_text]\n",
"\n",
" # BPE steps 1-3: Repeatedly find and replace frequent pairs\n",
" for new_id in range(len(self.vocab), vocab_size):\n",
" pair_id = self.find_freq_pair(token_ids, mode=\"most\")\n",
" if pair_id is None: # No more pairs to merge. Stopping training.\n",
" break\n",
" token_ids = self.replace_pair(token_ids, pair_id, new_id)\n",
" self.bpe_merges[pair_id] = new_id\n",
"\n",
" # Build the vocabulary with merged tokens\n",
" for (p0, p1), new_id in self.bpe_merges.items():\n",
" merged_token = self.vocab[p0] + self.vocab[p1]\n",
" self.vocab[new_id] = merged_token\n",
" self.inverse_vocab[merged_token] = new_id\n",
"\n",
" def encode(self, text):\n",
" \"\"\"\n",
" Encode the input text into a list of token IDs.\n",
"\n",
" Args:\n",
" text (str): The text to encode.\n",
"\n",
" Returns:\n",
" List[int]: The list of token IDs.\n",
" \"\"\"\n",
" tokens = []\n",
" # Split text into tokens, keeping newlines intact\n",
" words = text.replace(\"\\n\", \" \\n \").split() # Ensure '\\n' is treated as a separate token\n",
"\n",
" for i, word in enumerate(words):\n",
" if i > 0 and not word.startswith(\"\\n\"):\n",
" tokens.append(\"Ġ\" + word) # Add 'Ġ' to words that follow a space or newline\n",
" else:\n",
" tokens.append(word) # Handle first word or standalone '\\n'\n",
"\n",
" token_ids = []\n",
" for token in tokens:\n",
" if token in self.inverse_vocab:\n",
" # token is contained in the vocabulary as is\n",
" token_id = self.inverse_vocab[token]\n",
" token_ids.append(token_id)\n",
" else:\n",
" # Attempt to handle subword tokenization via BPE\n",
" sub_token_ids = self.tokenize_with_bpe(token)\n",
" token_ids.extend(sub_token_ids)\n",
"\n",
" return token_ids\n",
"\n",
" def tokenize_with_bpe(self, token):\n",
" \"\"\"\n",
" Tokenize a single token using BPE merges.\n",
"\n",
" Args:\n",
" token (str): The token to tokenize.\n",
"\n",
" Returns:\n",
" List[int]: The list of token IDs after applying BPE.\n",
" \"\"\"\n",
" # Tokenize the token into individual characters (as initial token IDs)\n",
" token_ids = [self.inverse_vocab.get(char, None) for char in token]\n",
" if None in token_ids:\n",
" missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]\n",
" raise ValueError(f\"Characters not found in vocab: {missing_chars}\")\n",
"\n",
" can_merge = True\n",
" while can_merge and len(token_ids) > 1:\n",
" can_merge = False\n",
" new_tokens = []\n",
" i = 0\n",
" while i < len(token_ids) - 1:\n",
" pair = (token_ids[i], token_ids[i + 1])\n",
" if pair in self.bpe_merges:\n",
" merged_token_id = self.bpe_merges[pair]\n",
" new_tokens.append(merged_token_id)\n",
" # Uncomment for educational purposes:\n",
" # print(f\"Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')\")\n",
" i += 2 # Skip the next token as it's merged\n",
" can_merge = True\n",
" else:\n",
" new_tokens.append(token_ids[i])\n",
" i += 1\n",
" if i < len(token_ids):\n",
" new_tokens.append(token_ids[i])\n",
" token_ids = new_tokens\n",
"\n",
" return token_ids\n",
"\n",
" def decode(self, token_ids):\n",
" \"\"\"\n",
" Decode a list of token IDs back into a string.\n",
"\n",
" Args:\n",
" token_ids (List[int]): The list of token IDs to decode.\n",
"\n",
" Returns:\n",
" str: The decoded string.\n",
" \"\"\"\n",
" decoded_string = \"\"\n",
" for token_id in token_ids:\n",
" if token_id not in self.vocab:\n",
" raise ValueError(f\"Token ID {token_id} not found in vocab.\")\n",
" token = self.vocab[token_id]\n",
" if token.startswith(\"Ġ\"):\n",
" # Replace 'Ġ' with a space\n",
" decoded_string += \" \" + token[1:]\n",
" else:\n",
" decoded_string += token\n",
" return decoded_string\n",
"\n",
" @lru_cache(maxsize=None)\n",
" def get_special_token_id(self, token):\n",
" return self.inverse_vocab.get(token, None)\n",
"\n",
" @staticmethod\n",
" def find_freq_pair(token_ids, mode=\"most\"):\n",
" pairs = Counter(zip(token_ids, token_ids[1:]))\n",
"\n",
" if mode == \"most\":\n",
" return max(pairs.items(), key=lambda x: x[1])[0]\n",
" elif mode == \"least\":\n",
" return min(pairs.items(), key=lambda x: x[1])[0]\n",
" else:\n",
" raise ValueError(\"Invalid mode. Choose 'most' or 'least'.\")\n",
"\n",
" @staticmethod\n",
" def replace_pair(token_ids, pair_id, new_id):\n",
" dq = deque(token_ids)\n",
" replaced = []\n",
"\n",
" while dq:\n",
" current = dq.popleft()\n",
" if dq and (current, dq[0]) == pair_id:\n",
" replaced.append(new_id)\n",
" # Remove the 2nd token of the pair, 1st was already removed\n",
" dq.popleft()\n",
" else:\n",
" replaced.append(current)\n",
"\n",
" return replaced\n"
]
},
{
"cell_type": "markdown",
"id": "46db7310-79c7-4ee0-b5fa-d760c6e1aa67",
"metadata": {},
"source": [
"- There is a lot of code in the `BPETokenizerSimple` class above, and discussing it in detail is out of scope for this notebook, but the next section offers a short overview of the usage to understand the class methods a bit better"
]
},
{
"cell_type": "markdown",
"id": "8ffe1836-eed4-40dc-860b-2d23074d067e",
"metadata": {},
"source": [
"## 3. BPE implementation walkthrough"
]
},
{
"cell_type": "markdown",
"id": "3c7c996c-fd34-484f-a877-13d977214cf7",
"metadata": {},
"source": [
"- In practice, I highly recommend using [tiktoken](https://github.com/openai/tiktoken) as my implementation above focuses on readability and educational purposes, not on performance\n",
"- However, the usage is more or less similar to tiktoken, except that tiktoken does not have a training method\n",
"- Let's see how my `BPETokenizerSimple` Python code above works by looking at some examples below (a detailed code discussion is out of scope for this notebook)"
]
},
{
"cell_type": "markdown",
"id": "e82acaf6-7ed5-4d3b-81c0-ae4d3559d2c7",
"metadata": {},
"source": [
"### 3.1 Training, encoding, and decoding"
]
},
{
"cell_type": "markdown",
"id": "962bf037-903e-4555-b09c-206e1a410278",
"metadata": {},
"source": [
"- First, let's consider some sample text as our training dataset:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4d197cad-ed10-4a42-b01c-a763859781fb",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import urllib.request\n",
"\n",
"if not os.path.exists(\"../01_main-chapter-code/the-verdict.txt\"):\n",
" url = (\"https://raw.githubusercontent.com/rasbt/\"\n",
" \"LLMs-from-scratch/main/ch02/01_main-chapter-code/\"\n",
" \"the-verdict.txt\")\n",
" file_path = \"../01_main-chapter-code/the-verdict.txt\"\n",
" urllib.request.urlretrieve(url, file_path)\n",
"\n",
"with open(\"../01_main-chapter-code/the-verdict.txt\", \"r\", encoding=\"utf-8\") as f: # added ../01_main-chapter-code/\n",
" text = f.read()"
]
},
{
"cell_type": "markdown",
"id": "04d1b6ac-71d3-4817-956a-9bc7e463a84a",
"metadata": {},
"source": [
"- Next, let's initialize and train the BPE tokenizer with a vocabulary size of 1,000\n",
"- Note that the vocabulary size is already 255 by default due to the byte values discussed earlier, so we are only \"learning\" 745 vocabulary entries \n",
"- For comparison, the GPT-2 vocabulary is 50,257 tokens, the GPT-4 vocabulary is 100,256 tokens (`cl100k_base` in tiktoken), and GPT-4o uses 199,997 tokens (`o200k_base` in tiktoken); they have all much bigger training sets compared to our simple example text above"
]
},
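{
"cell_type": "markdown",
"id": "3a1f2b3c-1e2d-4f5a-9b8c-722000000005",
"metadata": {},
"source": [
"- For reference, the GPT-2 vocabulary size can be confirmed via tiktoken's `n_vocab` attribute:\n",
"\n",
"```python\n",
"import tiktoken\n",
"\n",
"print(tiktoken.get_encoding(\"gpt2\").n_vocab)  # prints 50257\n",
"```"
]
},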
{
"cell_type": "code",
"execution_count": 6,
"id": "027348fd-d52f-4396-93dd-38eed142df9b",
"metadata": {},
"outputs": [],
"source": [
"tokenizer = BPETokenizerSimple()\n",
"tokenizer.train(text, vocab_size=1000, allowed_special={\"<|endoftext|>\"})"
]
},
{
"cell_type": "markdown",
"id": "2474ff05-5629-4f13-9e03-a47b1e713850",
"metadata": {},
"source": [
"- You may want to inspect the vocabulary contents (but note it will create a long list)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f705a283-355e-4460-b940-06bbc2ae4e61",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000\n"
]
}
],
"source": [
"# print(tokenizer.vocab)\n",
"print(len(tokenizer.vocab))"
]
},
{
"cell_type": "markdown",
"id": "36c9da0f-8a18-41cd-91ea-9ccc2bb5febb",
"metadata": {},
"source": [
"- This vocabulary is created by merging 742 times (~ `1000 - len(range(0, 256))`)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3da42d1c-f75c-4ba7-a6c5-4cb8543d4a44",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"742\n"
]
}
],
"source": [
"print(len(tokenizer.bpe_merges))"
]
},
{
"cell_type": "markdown",
"id": "5dac69c9-8413-482a-8148-6b2afbf1fb89",
"metadata": {},
"source": [
"- This means that the first 256 entries are single-character tokens"
]
},
{
"cell_type": "markdown",
"id": "451a4108-7c8b-4b98-9c67-d622e9cdf250",
"metadata": {},
"source": [
"- Next, let's use the created merges via the `encode` method to encode some text:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e1db5cce-e015-412b-ad56-060b8b638078",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]\n"
]
}
],
"source": [
"input_text = \"Jack embraced beauty through art and life.\"\n",
"token_ids = tokenizer.encode(input_text)\n",
"print(token_ids)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1ed1b344-f7d4-4e9e-ac34-2a04b5c5b7a8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of characters: 42\n",
"Number of token IDs: 20\n"
]
}
],
"source": [
"print(\"Number of characters:\", len(input_text))\n",
"print(\"Number of token IDs:\", len(token_ids))"
]
},
{
"cell_type": "markdown",
"id": "50c1cfb9-402a-4e1e-9678-0b7547406248",
"metadata": {},
"source": [
"- From the lengths above, we can see that a 42-character sentence was encoded into 20 token IDs, effectively cutting the input length roughly in half compared to a character-byte-based encoding"
]
},
{
"cell_type": "markdown",
"id": "252693ee-e806-4dac-ab76-2c69086360f4",
"metadata": {},
"source": [
"- Note that the vocabulary itself is used in the `decode()` method, which allows us to map the token IDs back into text:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "da0e1faf-1933-43d9-b681-916c282a8f86",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]\n"
]
}
],
"source": [
"print(token_ids)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8b690e83-5d6b-409a-804e-321c287c24a4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jack embraced beauty through art and life.\n"
]
}
],
"source": [
"print(tokenizer.decode(token_ids))"
]
},
{
"cell_type": "markdown",
"id": "adea5d09-e5ef-4721-994b-b9b25662fa0a",
"metadata": {},
"source": [
"- Iterating over each token ID can give us a better understanding of how the token IDs are decoded via the vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2b9e6289-92cb-4d88-b3c8-e836d7c8095f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"424 -> Jack\n",
"256 -> \n",
"654 -> em\n",
"531 -> br\n",
"302 -> ac\n",
"311 -> ed\n",
"256 -> \n",
"296 -> be\n",
"97 -> a\n",
"465 -> ut\n",
"121 -> y\n",
"595 -> through\n",
"841 -> ar\n",
"116 -> t\n",
"287 -> a\n",
"466 -> nd\n",
"256 -> \n",
"326 -> li\n",
"972 -> fe\n",
"46 -> .\n"
]
}
],
"source": [
"for token_id in token_ids:\n",
" print(f\"{token_id} -> {tokenizer.decode([token_id])}\")"
]
},
{
"cell_type": "markdown",
"id": "5ea41c6c-5538-4fd5-8b5f-195960853b71",
"metadata": {},
"source": [
"- As we can see, most token IDs represent 2-character subwords; that's because the training data text is very short with not that many repetitive words, and because we used a relatively small vocabulary size"
]
},
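{
"cell_type": "markdown",
"id": "3a1f2b3c-1e2d-4f5a-9b8c-948000000006",
"metadata": {},
"source": [
"- To see which longer subwords were learned anyway, we can sort the trained vocabulary by token length (this reuses the `tokenizer` object trained above):\n",
"\n",
"```python\n",
"# Show the 5 longest non-special tokens learned from the training text\n",
"learned = [tok for tok in tokenizer.vocab.values() if len(tok) > 1 and tok != \"<|endoftext|>\"]\n",
"for tok in sorted(learned, key=len, reverse=True)[:5]:\n",
"    print(repr(tok))\n",
"```"
]
},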
{
"cell_type": "markdown",
"id": "600055a3-7ec8-4abf-b88a-c4186fb71463",
"metadata": {},
"source": [
"- As a summary, calling `decode(encode())` should be able to reproduce arbitrary input texts:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c7056cb1-a9a3-4cf6-8364-29fb493ae240",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'This is some text.'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer.decode(tokenizer.encode(\"This is some text.\"))"
]
},
{
"cell_type": "markdown",
"id": "3558af04-483c-4f6b-88f5-a534f37316cd",
"metadata": {},
"source": [
"&nbsp;\n",
"# 4. Conclusion"
]
},
{
"cell_type": "markdown",
"id": "410ed0e6-ad06-4bb3-bb39-6b8110c1caa4",
"metadata": {},
"source": [
"- That's it! That's how BPE works in a nutshell, complete with a training method for creating new tokenizers \n",
"- I hope you found this brief tutorial useful for educational purposes; if you have any questions, please feel free to open a new Discussion [here](https://github.com/rasbt/LLMs-from-scratch/discussions/categories/q-a)\n",
"\n",
"\n",
"**This is a very naive implementation for educational purposes. The [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) notebook contains a more sophisticated (but much harder to read) implementation that matches the behavior in tiktoken.**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

@@ -81,7 +81,7 @@
},
{
"cell_type": "code",
"execution_count": 39,
"execution_count": 1,
"id": "8c9bc9e4-120f-4bac-8fa6-6523c568d12e",
"metadata": {},
"outputs": [
@@ -109,7 +109,7 @@
},
{
"cell_type": "code",
"execution_count": 40,
"execution_count": 2,
"id": "6c586945-d459-4f9a-855d-bf73438ef0e3",
"metadata": {},
"outputs": [
@@ -138,7 +138,7 @@
},
{
"cell_type": "code",
"execution_count": 41,
"execution_count": 3,
"id": "0d5b61d9-79a0-48b4-9b3e-64ab595c5b01",
"metadata": {},
"outputs": [
@@ -382,13 +382,14 @@
},
{
"cell_type": "code",
"execution_count": 42,
"execution_count": 4,
"id": "3e4a15ec-2667-4f56-b7c1-34e8071b621d",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter, deque\n",
"from functools import lru_cache\n",
"import re\n",
"import json\n",
"\n",
"\n",
@@ -476,42 +477,49 @@
" # Load vocabulary\n",
" with open(vocab_path, \"r\", encoding=\"utf-8\") as file:\n",
" loaded_vocab = json.load(file)\n",
" # Convert loaded vocabulary to correct format\n",
" # encoder.json is {token_str: id}; we want id->str and str->id\n",
" self.vocab = {int(v): k for k, v in loaded_vocab.items()}\n",
" self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}\n",
"\n",
" # Handle newline character without adding a new token\n",
" \n",
" # Must have GPT-2's printable newline character 'Ċ' (U+010A) at id 198\n",
" if \"Ċ\" not in self.inverse_vocab or self.inverse_vocab[\"Ċ\"] != 198:\n",
" raise KeyError(\"Vocabulary missing GPT-2 newline glyph 'Ċ' at id 198.\")\n",
" \n",
" # Must have <|endoftext|> at 50256\n",
" if \"<|endoftext|>\" not in self.inverse_vocab or self.inverse_vocab[\"<|endoftext|>\"] != 50256:\n",
" raise KeyError(\"Vocabulary missing <|endoftext|> at id 50256.\")\n",
" \n",
" # Provide a convenience alias for '\\n' -> 198\n",
" # Keep printable character 'Ċ' in vocab so BPE merges keep working\n",
" if \"\\n\" not in self.inverse_vocab:\n",
" # Use an existing token ID as a placeholder for '\\n'\n",
" # Preferentially use \"<|endoftext|>\" if available\n",
" fallback_token = next((token for token in [\"<|endoftext|>\", \"Ġ\", \"\"] if token in self.inverse_vocab), None)\n",
" if fallback_token is not None:\n",
" newline_token_id = self.inverse_vocab[fallback_token]\n",
" self.inverse_vocab[\"\\n\"] = self.inverse_vocab[\"Ċ\"]\n",
"\n",
" if \"\\r\" not in self.inverse_vocab:\n",
" if 201 in self.vocab:\n",
" self.inverse_vocab[\"\\r\"] = 201\n",
" else:\n",
" # If no fallback token is available, raise an error\n",
" raise KeyError(\"No suitable token found in vocabulary to map '\\\\n'.\")\n",
" raise KeyError(\"Vocabulary missing carriage return token at id 201.\")\n",
"\n",
" self.inverse_vocab[\"\\n\"] = newline_token_id\n",
" self.vocab[newline_token_id] = \"\\n\"\n",
"\n",
" # Load GPT-2 merges and store them with an assigned \"rank\"\n",
" self.bpe_ranks = {} # reset ranks\n",
" # Load GPT-2 merges and store ranks\n",
" self.bpe_ranks = {}\n",
" with open(bpe_merges_path, \"r\", encoding=\"utf-8\") as file:\n",
" lines = file.readlines()\n",
" if lines and lines[0].startswith(\"#\"):\n",
" lines = lines[1:]\n",
"\n",
" \n",
" rank = 0\n",
" for line in lines:\n",
" pair = tuple(line.strip().split())\n",
" if len(pair) == 2:\n",
" token1, token2 = pair\n",
" # If token1 or token2 not in vocab, skip\n",
" if token1 in self.inverse_vocab and token2 in self.inverse_vocab:\n",
" self.bpe_ranks[(token1, token2)] = rank\n",
" rank += 1\n",
" else:\n",
" print(f\"Skipping pair {pair} as one token is not in the vocabulary.\")\n",
" token1, *rest = line.strip().split()\n",
" if len(rest) != 1:\n",
" continue\n",
" token2 = rest[0]\n",
" if token1 in self.inverse_vocab and token2 in self.inverse_vocab:\n",
" self.bpe_ranks[(token1, token2)] = rank\n",
" rank += 1\n",
" else:\n",
" # Safe to skip pairs whose symbols are not in vocab\n",
" pass\n",
"\n",
"\n",
" def encode(self, text, allowed_special=None):\n",
" \"\"\"\n",
@@ -524,21 +532,35 @@
" Returns:\n",
" List of token IDs.\n",
" \"\"\"\n",
" import re\n",
" \n",
" # ---- This section is to mimic tiktoken in terms of allowed special tokens ----\n",
" specials_in_vocab = [\n",
" tok for tok in self.inverse_vocab\n",
" if tok.startswith(\"<|\") and tok.endswith(\"|>\")\n",
" ]\n",
" if allowed_special is None:\n",
" # Nothing is allowed\n",
" disallowed = [tok for tok in specials_in_vocab if tok in text]\n",
" if disallowed:\n",
" raise ValueError(f\"Disallowed special tokens encountered in text: {disallowed}\")\n",
" else:\n",
" # Some spefic tokens are allowed (e.g., we use this for <|endoftext|>)\n",
" disallowed = [tok for tok in specials_in_vocab if tok in text and tok not in allowed_special]\n",
" if disallowed:\n",
" raise ValueError(f\"Disallowed special tokens encountered in text: {disallowed}\")\n",
" # -----------------------------------------------------------------------------\n",
"\n",
" token_ids = []\n",
" \n",
" # If special token handling is enabled\n",
" # If some specials are allowed, split around them and passthrough those ids\n",
" if allowed_special is not None and len(allowed_special) > 0:\n",
" # Build regex to match allowed special tokens\n",
" special_pattern = (\n",
" \"(\" + \"|\".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + \")\"\n",
" )\n",
" special_pattern = \"(\" + \"|\".join(\n",
" re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)\n",
" ) + \")\"\n",
" \n",
" last_index = 0\n",
" for match in re.finditer(special_pattern, text):\n",
" prefix = text[last_index:match.start()]\n",
" token_ids.extend(self.encode(prefix, allowed_special=None)) # Encode prefix without special handling\n",
" token_ids.extend(self.encode(prefix, allowed_special=None)) # encode prefix normally\n",
" \n",
" special_token = match.group(0)\n",
" if special_token in self.inverse_vocab:\n",
@@ -547,36 +569,63 @@
" raise ValueError(f\"Special token {special_token} not found in vocabulary.\")\n",
" last_index = match.end()\n",
" \n",
" text = text[last_index:] # Remaining part to process normally\n",
" text = text[last_index:] # remainder to process normally\n",
" \n",
" # Check if any disallowed special tokens are in the remainder\n",
" # Extra guard for any other special literals left over\n",
" disallowed = [\n",
" tok for tok in self.inverse_vocab\n",
" if tok.startswith(\"<|\") and tok.endswith(\"|>\") and tok in text and tok not in allowed_special\n",
" ]\n",
" if disallowed:\n",
" raise ValueError(f\"Disallowed special tokens encountered in text: {disallowed}\")\n",
"\n",
" \n",
" # If no special tokens, or remaining text after special token split:\n",
" # ---- Newline and carriage return handling ----\n",
" tokens = []\n",
" lines = text.split(\"\\n\")\n",
" for i, line in enumerate(lines):\n",
" if i > 0:\n",
" parts = re.split(r'(\\r\\n|\\r|\\n)', text)\n",
" for part in parts:\n",
" if part == \"\":\n",
" continue\n",
" if part == \"\\r\\n\":\n",
" tokens.append(\"\\r\")\n",
" tokens.append(\"\\n\")\n",
" words = line.split()\n",
" for j, word in enumerate(words):\n",
" if j == 0 and i > 0:\n",
" tokens.append(\"Ġ\" + word)\n",
" elif j == 0:\n",
" tokens.append(word)\n",
" else:\n",
" tokens.append(\"Ġ\" + word)\n",
" continue\n",
" if part == \"\\r\":\n",
" tokens.append(\"\\r\")\n",
" continue\n",
" if part == \"\\n\":\n",
" tokens.append(\"\\n\")\n",
" continue\n",
" \n",
" for token in tokens:\n",
" if token in self.inverse_vocab:\n",
" token_ids.append(self.inverse_vocab[token])\n",
" # Normal chunk without line breaks:\n",
" # - If spaces precede a word, prefix the first word with 'Ġ' and\n",
" # add standalone 'Ġ' for additional spaces\n",
" # - If spaces trail the chunk (e.g., before a newline) add\n",
" # standalone 'Ġ' tokens (tiktoken produces id 220 for 'Ġ')\n",
" pending_spaces = 0\n",
" for m in re.finditer(r'( +)|(\\S+)', part):\n",
" if m.group(1) is not None:\n",
" pending_spaces += len(m.group(1))\n",
" else:\n",
" word = m.group(2)\n",
" if pending_spaces > 0:\n",
" tokens.append(\"Ġ\" + word) # one leading space\n",
" for _ in range(pending_spaces - 1):\n",
" tokens.append(\"Ġ\") # remaining spaces as standalone\n",
" pending_spaces = 0\n",
" else:\n",
" tokens.append(word)\n",
" # Trailing spaces (no following word): add standalone 'Ġ' tokens\n",
" for _ in range(pending_spaces):\n",
" tokens.append(\"Ġ\")\n",
" # ---------------------------------------------------------------\n",
" \n",
" # Map tokens -> ids (BPE if needed)\n",
" for tok in tokens:\n",
" if tok in self.inverse_vocab:\n",
" token_ids.append(self.inverse_vocab[tok])\n",
" else:\n",
" token_ids.extend(self.tokenize_with_bpe(token))\n",
" token_ids.extend(self.tokenize_with_bpe(tok))\n",
" \n",
" return token_ids\n",
"\n",
@@ -675,20 +724,22 @@
" Returns:\n",
" str: The decoded string.\n",
" \"\"\"\n",
" decoded_string = \"\"\n",
" for i, token_id in enumerate(token_ids):\n",
" if token_id not in self.vocab:\n",
" raise ValueError(f\"Token ID {token_id} not found in vocab.\")\n",
" token = self.vocab[token_id]\n",
" if token == \"\\n\":\n",
" if decoded_string and not decoded_string.endswith(\" \"):\n",
" decoded_string += \" \" # Add space if not present before a newline\n",
" decoded_string += token\n",
" elif token.startswith(\"Ġ\"):\n",
" decoded_string += \" \" + token[1:]\n",
" out = []\n",
" for tid in token_ids:\n",
" if tid not in self.vocab:\n",
" raise ValueError(f\"Token ID {tid} not found in vocab.\")\n",
" tok = self.vocab[tid]\n",
"\n",
" # Map GPT-2 special chars back to real chars\n",
" if tid == 198 or tok == \"\\n\":\n",
" out.append(\"\\n\")\n",
" elif tid == 201 or tok == \"\\r\":\n",
" out.append(\"\\r\")\n",
" elif tok.startswith(\"Ġ\"):\n",
" out.append(\" \" + tok[1:])\n",
" else:\n",
" decoded_string += token\n",
" return decoded_string\n",
" out.append(tok)\n",
" return \"\".join(out)\n",
"\n",
" def save_vocab_and_merges(self, vocab_path, bpe_merges_path):\n",
" \"\"\"\n",
@@ -809,7 +860,7 @@
},
{
"cell_type": "code",
"execution_count": 71,
"execution_count": 5,
"id": "51872c08-e01b-40c3-a8a0-e8d6a773e3df",
"metadata": {},
"outputs": [
@@ -873,7 +924,7 @@
},
{
"cell_type": "code",
"execution_count": 46,
"execution_count": 6,
"id": "027348fd-d52f-4396-93dd-38eed142df9b",
"metadata": {},
"outputs": [],
@@ -892,7 +943,7 @@
},
{
"cell_type": "code",
"execution_count": 47,
"execution_count": 7,
"id": "f705a283-355e-4460-b940-06bbc2ae4e61",
"metadata": {},
"outputs": [
@@ -919,7 +970,7 @@
},
{
"cell_type": "code",
"execution_count": 48,
"execution_count": 8,
"id": "3da42d1c-f75c-4ba7-a6c5-4cb8543d4a44",
"metadata": {},
"outputs": [
@@ -953,7 +1004,7 @@
},
{
"cell_type": "code",
"execution_count": 49,
"execution_count": 9,
"id": "e1db5cce-e015-412b-ad56-060b8b638078",
"metadata": {},
"outputs": [
@@ -973,27 +1024,7 @@
},
{
"cell_type": "code",
"execution_count": 50,
"id": "78249752-38d7-47b9-b259-912bcc093dc4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 60, 124, 271, 683, 102, 116, 461, 116, 124, 62]\n"
]
}
],
"source": [
"input_text = \"Jack embraced beauty through art and life.<|endoftext|> \"\n",
"token_ids = tokenizer.encode(input_text)\n",
"print(token_ids)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"execution_count": 10,
"id": "0331d37d-49a3-44f7-9aa9-9834e0938741",
"metadata": {},
"outputs": [
@@ -1001,7 +1032,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]\n"
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257, 256]\n"
]
}
],
@@ -1013,7 +1044,7 @@
},
{
"cell_type": "code",
"execution_count": 52,
"execution_count": 11,
"id": "1ed1b344-f7d4-4e9e-ac34-2a04b5c5b7a8",
"metadata": {},
"outputs": [
@@ -1022,7 +1053,7 @@
"output_type": "stream",
"text": [
"Number of characters: 56\n",
"Number of token IDs: 21\n"
"Number of token IDs: 22\n"
]
}
],
@@ -1049,7 +1080,7 @@
},
{
"cell_type": "code",
"execution_count": 53,
"execution_count": 12,
"id": "da0e1faf-1933-43d9-b681-916c282a8f86",
"metadata": {},
"outputs": [
@@ -1057,7 +1088,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]\n"
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257, 256]\n"
]
}
],
@@ -1067,7 +1098,7 @@
},
{
"cell_type": "code",
"execution_count": 54,
"execution_count": 13,
"id": "8b690e83-5d6b-409a-804e-321c287c24a4",
"metadata": {},
"outputs": [
@@ -1075,7 +1106,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Jack embraced beauty through art and life.<|endoftext|>\n"
"Jack embraced beauty through art and life.<|endoftext|> \n"
]
}
],
@@ -1093,7 +1124,7 @@
},
{
"cell_type": "code",
"execution_count": 55,
"execution_count": 14,
"id": "2b9e6289-92cb-4d88-b3c8-e836d7c8095f",
"metadata": {},
"outputs": [
@@ -1121,7 +1152,8 @@
"326 -> li\n",
"972 -> fe\n",
"46 -> .\n",
"257 -> <|endoftext|>\n"
"257 -> <|endoftext|>\n",
"256 -> \n"
]
}
],
@@ -1148,7 +1180,7 @@
},
{
"cell_type": "code",
"execution_count": 56,
"execution_count": 15,
"id": "c7056cb1-a9a3-4cf6-8364-29fb493ae240",
"metadata": {},
"outputs": [
@@ -1158,7 +1190,7 @@
"'This is some text.'"
]
},
"execution_count": 56,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -1171,7 +1203,7 @@
},
{
"cell_type": "code",
"execution_count": 57,
"execution_count": 16,
"id": "37bc6753-8f35-4ec7-b23e-df4a12103cb4",
"metadata": {},
"outputs": [
@@ -1181,7 +1213,7 @@
"'This is some text with \\n newline characters.'"
]
},
"execution_count": 57,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -1210,7 +1242,7 @@
},
{
"cell_type": "code",
"execution_count": 58,
"execution_count": 17,
"id": "955181cb-0910-4c6a-9c22-d8292a3ec1fc",
"metadata": {},
"outputs": [],
@@ -1221,7 +1253,7 @@
},
{
"cell_type": "code",
"execution_count": 59,
"execution_count": 18,
"id": "6e5ccfe7-ac67-42f3-b727-87886a8867f1",
"metadata": {},
"outputs": [],
@@ -1241,7 +1273,7 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 19,
"id": "00d9bf8f-756f-48bf-81b8-b890e2c2ef13",
"metadata": {},
"outputs": [
@@ -1249,7 +1281,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Jack embraced beauty through art and life.<|endoftext|>\n"
"Jack embraced beauty through art and life.<|endoftext|> \n"
]
}
],
@@ -1259,7 +1291,7 @@
},
{
"cell_type": "code",
"execution_count": 61,
"execution_count": 20,
"id": "e7addb64-2892-4e1c-85dd-4f5152740099",
"metadata": {},
"outputs": [
@@ -1269,7 +1301,7 @@
"'This is some text with \\n newline characters.'"
]
},
"execution_count": 61,
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
@@ -1299,7 +1331,7 @@
},
{
"cell_type": "code",
"execution_count": 72,
"execution_count": 21,
"id": "b45b4366-2c2b-4309-9a14-febf3add8512",
"metadata": {},
"outputs": [
@@ -1339,7 +1371,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 22,
"id": "74306e6c-47d3-45a3-9e0f-93f7303ef601",
"metadata": {},
"outputs": [],
@@ -1360,7 +1392,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 23,
"id": "2bb722b4-dbf5-4a0c-9120-efda3293f132",
"metadata": {},
"outputs": [
@@ -1370,7 +1402,7 @@
"50257"
]
},
"execution_count": 24,
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
@@ -1389,7 +1421,7 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 24,
"id": "e4866de7-fb32-4dd6-a878-469ec734641c",
"metadata": {},
"outputs": [
@@ -1409,7 +1441,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 25,
"id": "3da8d9b2-af55-4b09-95d7-fabd983e919e",
"metadata": {},
"outputs": [
@@ -1461,6 +1493,14 @@
"- I hope you found this brief tutorial useful for educational purposes; if you have any questions, please feel free to open a new Discussion [here](https://github.com/rasbt/LLMs-from-scratch/discussions/categories/q-a)\n",
"- For a performance comparison with other tokenizer implementations, please see [this notebook](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a477962-ba00-429b-8be7-755a90543de7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -1479,7 +1519,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
"version": "3.13.5"
}
},
"nbformat": 4,

tests.py

@@ -152,3 +152,90 @@ def test_gpt2_tokenizer_openai_edgecases(imported_module, gpt2_files):
    if errors:
        pytest.fail("\n".join(errors))


def test_gpt2_newline_and_eot_ids(imported_module, gpt2_files):
    BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
    tok = BPETokenizerSimple()
    tok.load_vocab_and_merges_from_openai(
        vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
    )

    assert "Ċ" in tok.inverse_vocab, "Missing GPT-2 newline glyph 'Ċ' in inverse_vocab"
    assert "<|endoftext|>" in tok.inverse_vocab, "Missing EOT in inverse_vocab"
    assert tok.inverse_vocab["Ċ"] == 198, "Ċ must map to id 198"
    assert tok.inverse_vocab["<|endoftext|>"] == 50256, "EOT must be 50256"

    if "\n" not in tok.inverse_vocab:
        tok.inverse_vocab["\n"] = tok.inverse_vocab["Ċ"]
    assert tok.inverse_vocab["\n"] == 198, r"'\n' must map to 198 via Ċ"

    assert tok.vocab[198] == "Ċ", "Don't overwrite vocab[198]; keep it 'Ċ'"
    assert tok.vocab[50256] == "<|endoftext|>", "Don't map <|endoftext|> to anything else"


def test_no_eot_aliasing_and_disallowed_logic(imported_module, gpt2_files):
    BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
    tok = BPETokenizerSimple()
    tok.load_vocab_and_merges_from_openai(
        vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
    )
    tik = tiktoken.get_encoding("gpt2")

    text = "Hello<|endoftext|>\nworld"

    # When not allowed, our encode should raise ValueError like tiktoken
    with pytest.raises(ValueError):
        tok.encode(text)

    # When allowed, both tokenizers should match
    ids_ours = tok.encode(text, allowed_special={"<|endoftext|>"})
    ids_tik = tik.encode(text, allowed_special={"<|endoftext|>"})
    assert ids_ours == ids_tik, "Mismatch vs tiktoken with EOT allowed"


@pytest.mark.parametrize(
    "text",
    [
        "a\nb",
        "a\n\nb",
        "\nHello",
        "Hello\n",
        "a\r\nb",
    ],
)
def test_newline_roundtrip_and_equivalence(imported_module, gpt2_files, text):
    BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
    tok = BPETokenizerSimple()
    tok.load_vocab_and_merges_from_openai(
        vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
    )
    tik = tiktoken.get_encoding("gpt2")

    ids_ours = tok.encode(text)
    ids_tik = tik.encode(text)
    assert ids_ours == ids_tik, f"Mismatch vs tiktoken for: {repr(text)}"

    # Each "\n" should correspond to id 198
    expected_lf_count = text.count("\n")
    assert ids_ours.count(198) == expected_lf_count

    dec = tok.decode(ids_ours)
    assert dec == text


def test_space_newline_space_patterns(imported_module, gpt2_files):
    BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
    tok = BPETokenizerSimple()
    tok.load_vocab_and_merges_from_openai(
        vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
    )
    tik = tiktoken.get_encoding("gpt2")

    samples = [
        "Hello \nworld",
        "Hello\n world",
    ]
    for s in samples:
        assert tok.encode(s) == tik.encode(s), f"Mismatch vs tiktoken: {repr(s)}"