Mirror of https://github.com/rasbt/LLMs-from-scratch.git (synced 2025-11-23 13:36:40 +00:00)
Add simpler BPE, and make previous BPE better (#870)
* Add simpler BPE, and make previous BPE better * update * Update README.md
This commit is contained in:
parent 1164cb3e8f
commit fecfdd16ff
2
.gitignore
vendored
@ -85,6 +85,8 @@ Qwen3-0.6B/
|
||||
tokenizer-base.json
|
||||
tokenizer-reasoning.json
|
||||
tokenizer.json
|
||||
config.json
|
||||
bpe_merges.txt
|
||||
|
||||
# Datasets
|
||||
the-verdict.txt
|
||||
|
||||
@ -158,7 +158,7 @@ Several folders contain optional materials as a bonus for interested readers:
|
||||
- [Installing Python Packages and Libraries Used In This Book](setup/02_installing-python-libraries)
|
||||
- [Docker Environment Setup Guide](setup/03_optional-docker-environment)
|
||||
- **Chapter 2: Working with text data**
|
||||
- [Byte Pair Encoding (BPE) Tokenizer From Scratch](ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb)
|
||||
- [Byte Pair Encoding (BPE) Tokenizer From Scratch](ch02/05_bpe-from-scratch/bpe-from-scratch-simple.ipynb)
|
||||
- [Comparing Various Byte Pair Encoding (BPE) Implementations](ch02/02_bonus_bytepair-encoder)
|
||||
- [Understanding the Difference Between Embedding Layers and Linear Layers](ch02/03_bonus_embedding-vs-matmul)
|
||||
- [Dataloader Intuition with Simple Numbers](ch02/04_bonus_dataloader-intuition)
|
||||
|
||||
@ -1,3 +1,5 @@
|
||||
# Byte Pair Encoding (BPE) Tokenizer From Scratch
|
||||
|
||||
- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood.
|
||||
- [bpe-from-scratch-simple.ipynb](bpe-from-scratch-simple.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood; it is geared toward simplicity and readability.
|
||||
|
||||
- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) implements a more sophisticated (and much more complicated) BPE tokenizer that behaves similarly to tiktoken with respect to all the edge cases; it also has additional functionality for loading the official GPT-2 vocab.
|
||||
970
ch02/05_bpe-from-scratch/bpe-from-scratch-simple.ipynb
Normal file
@ -0,0 +1,970 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9dec0dfb-3d60-41d0-a63a-b010dce67e32",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5e475425-8300-43f2-a5e8-6b5d2de59925",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Byte Pair Encoding (BPE) Tokenizer From Scratch -- Simple"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a1bfc3f3-8ec1-4fd3-b378-d9a3d7807a54",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This is a standalone notebook implementing the popular byte pair encoding (BPE) tokenization algorithm, which is used in models like GPT-2 to GPT-4, Llama 3, etc., from scratch for educational purposes\n",
|
||||
"- For more details about the purpose of tokenization, please refer to [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb); this code here is bonus material explaining the BPE algorithm\n",
|
||||
"- The original BPE tokenizer that OpenAI implemented for training the original GPT models can be found [here](https://github.com/openai/gpt-2/blob/master/src/encoder.py)\n",
|
||||
"- The BPE algorithm was originally described in 1994: \"[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)\" by Philip Gage\n",
|
||||
"- Most projects, including Llama 3, nowadays use OpenAI's open-source [tiktoken library](https://github.com/openai/tiktoken) due to its computational performance; it allows loading pretrained GPT-2 and GPT-4 tokenizers, for example (the Llama 3 models were trained using the GPT-4 tokenizer as well)\n",
|
||||
"- The difference between the implementations above and my implementation in this notebook, besides it being is that it also includes a function for training the tokenizer (for educational purposes)\n",
|
||||
"- There's also an implementation called [minBPE](https://github.com/karpathy/minbpe) with training support, which is maybe more performant (my implementation here is focused on educational purposes); in contrast to `minbpe` my implementation additionally allows loading the original OpenAI tokenizer vocabulary and merges"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "910acd61-8947-4cfa-962f-16f4c733f2db",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**This is a very naive implementation for educational purposes. The [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) notebook contains a more sophisticated (but much harder to read) implementation that matches the behavior in tiktoken.**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f62336db-f45c-4894-9167-7583095dbdf1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" \n",
|
||||
"# 1. The main idea behind byte pair encoding (BPE)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cd3f1231-bd42-41b5-a017-974b8c660a44",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- The main idea in BPE is to convert text into an integer representation (token IDs) for LLM training (see [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb))\n",
|
||||
"\n",
|
||||
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/bpe-overview.webp\" width=\"600px\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "760c625d-26a1-4896-98a2-0fdcd1591256",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" \n",
|
||||
"## 1.1 Bits and bytes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d4ddaa35-0ed7-4012-827e-911de11c266c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Before getting to the BPE algorithm, let's introduce the notion of bytes\n",
|
||||
"- Consider converting text into a byte array (BPE stands for \"byte\" pair encoding after all):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "8c9bc9e4-120f-4bac-8fa6-6523c568d12e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"bytearray(b'This is some text')\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"text = \"This is some text\"\n",
|
||||
"byte_ary = bytearray(text, \"utf-8\")\n",
|
||||
"print(byte_ary)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dbd92a2a-9d74-4dc7-bb53-ac33d6cf2fab",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- When we call `list()` on a `bytearray` object, each byte is treated as an individual element, and the result is a list of integers corresponding to the byte values:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "6c586945-d459-4f9a-855d-bf73438ef0e3",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"ids = list(byte_ary)\n",
|
||||
"print(ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "71efea37-f4c3-4cb8-bfa5-9299175faf9a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This would be a valid way to convert text into a token ID representation that we need for the embedding layer of an LLM\n",
|
||||
"- However, the downside of this approach is that it is creating one ID for each character (that's a lot of IDs for a short text!)\n",
|
||||
"- I.e., this means for a 17-character input text, we have to use 17 token IDs as input to the LLM:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "0d5b61d9-79a0-48b4-9b3e-64ab595c5b01",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of characters: 17\n",
|
||||
"Number of token IDs: 17\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(\"Number of characters:\", len(text))\n",
|
||||
"print(\"Number of token IDs:\", len(ids))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "68cc833a-c0d4-4d46-9180-c0042fd6addc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- If you have worked with LLMs before, you may know that the BPE tokenizers have a vocabulary where we have a token ID for whole words or subwords instead of each character\n",
|
||||
"- For example, the GPT-2 tokenizer tokenizes the same text (\"This is some text\") into only 4 instead of 17 tokens: `1212, 318, 617, 2420`\n",
|
||||
"- You can double-check this using the interactive [tiktoken app](https://tiktokenizer.vercel.app/?model=gpt2) or the [tiktoken library](https://github.com/openai/tiktoken):\n",
|
||||
"\n",
|
||||
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/bpe-from-scratch/tiktokenizer.webp\" width=\"600px\">\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"import tiktoken\n",
|
||||
"\n",
|
||||
"gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
|
||||
"gpt2_tokenizer.encode(\"This is some text\")\n",
|
||||
"# prints [1212, 318, 617, 2420]\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "425b99de-cbfc-441c-8b3e-296a5dd7bb27",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Since a byte consists of 8 bits, there are 2<sup>8</sup> = 256 possible values that a single byte can represent, ranging from 0 to 255\n",
|
||||
"- You can confirm this by executing the code `bytearray(range(0, 257))`, which will warn you that `ValueError: byte must be in range(0, 256)`)\n",
|
||||
"- A BPE tokenizer usually uses these 256 values as its first 256 single-character tokens; one could visually check this by running the following code:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"import tiktoken\n",
|
||||
"gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
|
||||
"\n",
|
||||
"for i in range(300):\n",
|
||||
" decoded = gpt2_tokenizer.decode([i])\n",
|
||||
" print(f\"{i}: {decoded}\")\n",
|
||||
"\"\"\"\n",
|
||||
"prints:\n",
|
||||
"0: !\n",
|
||||
"1: \"\n",
|
||||
"2: #\n",
|
||||
"...\n",
|
||||
"255: <20> # <---- single character tokens up to here\n",
|
||||
"256: t\n",
|
||||
"257: a\n",
|
||||
"...\n",
|
||||
"298: ent\n",
|
||||
"299: n\n",
|
||||
"\"\"\"\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "97ff0207-7f8e-44fa-9381-2a4bd83daab3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Above, note that entries 256 and 257 are not single-character values but double-character values (a whitespace + a letter), which is a little shortcoming of the original GPT-2 BPE Tokenizer (this has been improved in the GPT-4 tokenizer)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8241c23a-d487-488d-bded-cdf054e24920",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" \n",
|
||||
"## 1.2 Building the vocabulary"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d7c2ceb7-0b3f-4a62-8dcc-07810cd8886e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- The goal of the BPE tokenization algorithm is to build a vocabulary of commonly occurring subwords like `298: ent` (which can be found in *entangle, entertain, enter, entrance, entity, ...*, for example), or even complete words like \n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"318: is\n",
|
||||
"617: some\n",
|
||||
"1212: This\n",
|
||||
"2420: text\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8c0d4420-a4c7-4813-916a-06f4f46bc3f0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- The BPE algorithm was originally described in 1994: \"[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)\" by Philip Gage\n",
|
||||
"- Before we get to the actual code implementation, the form that is used for LLM tokenizers today can be summarized as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ebc71db9-b070-48c4-8412-81f45b308ab3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" \n",
|
||||
"## 1.3 BPE algorithm outline\n",
|
||||
"\n",
|
||||
"**1. Identify frequent pairs**\n",
|
||||
"- In each iteration, scan the text to find the most commonly occurring pair of bytes (or characters)\n",
|
||||
"\n",
|
||||
"**2. Replace and record**\n",
|
||||
"\n",
|
||||
"- Replace that pair with a new placeholder ID (one not already in use, e.g., if we start with 0...255, the first placeholder would be 256)\n",
|
||||
"- Record this mapping in a lookup table\n",
|
||||
"- The size of the lookup table is a hyperparameter, also called \"vocabulary size\" (for GPT-2, that's\n",
|
||||
"50,257)\n",
|
||||
"\n",
|
||||
"**3. Repeat until no gains**\n",
|
||||
"\n",
|
||||
"- Keep repeating steps 1 and 2, continually merging the most frequent pairs\n",
|
||||
"- Stop when no further compression is possible (e.g., no pair occurs more than once)\n",
|
||||
"\n",
|
||||
"**Decompression (decoding)**\n",
|
||||
"\n",
|
||||
"- To restore the original text, reverse the process by substituting each ID with its corresponding pair, using the lookup table\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
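{
"cell_type": "markdown",
"id": "1d3c5a7e-92b4-4f6a-8c1d-0a2b4c6d8e0f",
"metadata": {},
"source": [
"- Before looking at the full class in section 2, below is a minimal sketch (for illustration only) of steps 1 and 2, using the same example text as in section 1.4 below; unlike the class later in this notebook, it works directly on raw UTF-8 byte values and skips the GPT-2-style `Ġ` whitespace handling, and the names `ids`, `merges`, and `next_id` are purely illustrative:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"text = \"the cat in the hat\"\n",
"ids = list(text.encode(\"utf-8\"))  # start from raw byte values (0-255)\n",
"merges = {}                       # (id1, id2) -> new_id, i.e., the lookup table\n",
"next_id = 256\n",
"\n",
"for _ in range(3):                # a few merge iterations for illustration\n",
"    pairs = Counter(zip(ids, ids[1:]))\n",
"    (a, b), count = pairs.most_common(1)[0]\n",
"    if count < 2:                 # step 3: stop when no pair repeats\n",
"        break\n",
"    merges[(a, b)] = next_id      # step 2: record the new placeholder ID\n",
"    new_ids, i = [], 0\n",
"    while i < len(ids):           # step 2: replace every occurrence of the pair\n",
"        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == (a, b):\n",
"            new_ids.append(next_id)\n",
"            i += 2\n",
"        else:\n",
"            new_ids.append(ids[i])\n",
"            i += 1\n",
"    ids = new_ids\n",
"    next_id += 1\n",
"\n",
"print(merges)  # e.g., {(116, 104): 256, (256, 101): 257, (257, 32): 258}\n",
"print(ids)\n",
"```"
]
},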
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e9f5ac9a-3528-4186-9468-8420c7b2ac00",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" \n",
|
||||
"## 1.4 BPE algorithm example\n",
|
||||
"\n",
|
||||
"### 1.4.1 Concrete example of the encoding part (steps 1 & 2)\n",
|
||||
"\n",
|
||||
"- Suppose we have the text (training dataset) `the cat in the hat` from which we want to build the vocabulary for a BPE tokenizer\n",
|
||||
"\n",
|
||||
"**Iteration 1**\n",
|
||||
"\n",
|
||||
"1. Identify frequent pairs\n",
|
||||
" - In this text, \"th\" appears twice (at the beginning and before the second \"e\")\n",
|
||||
"\n",
|
||||
"2. Replace and record\n",
|
||||
" - replace \"th\" with a new token ID that is not already in use, e.g., 256\n",
|
||||
" - the new text is: `<256>e cat in <256>e hat`\n",
|
||||
" - the new vocabulary is\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
" 0: ...\n",
|
||||
" ...\n",
|
||||
" 256: \"th\"\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"**Iteration 2**\n",
|
||||
"\n",
|
||||
"1. **Identify frequent pairs** \n",
|
||||
" - In the text `<256>e cat in <256>e hat`, the pair `<256>e` appears twice\n",
|
||||
"\n",
|
||||
"2. **Replace and record** \n",
|
||||
" - replace `<256>e` with a new token ID that is not already in use, for example, `257`. \n",
|
||||
" - The new text is:\n",
|
||||
" ```\n",
|
||||
" <257> cat in <257> hat\n",
|
||||
" ```\n",
|
||||
" - The updated vocabulary is:\n",
|
||||
" ```\n",
|
||||
" 0: ...\n",
|
||||
" ...\n",
|
||||
" 256: \"th\"\n",
|
||||
" 257: \"<256>e\"\n",
|
||||
" ```\n",
|
||||
"\n",
|
||||
"**Iteration 3**\n",
|
||||
"\n",
|
||||
"1. **Identify frequent pairs** \n",
|
||||
" - In the text `<257> cat in <257> hat`, the pair `<257> ` appears twice (once at the beginning and once before “hat”).\n",
|
||||
"\n",
|
||||
"2. **Replace and record** \n",
|
||||
" - replace `<257> ` with a new token ID that is not already in use, for example, `258`. \n",
|
||||
" - the new text is:\n",
|
||||
" ```\n",
|
||||
" <258>cat in <258>hat\n",
|
||||
" ```\n",
|
||||
" - The updated vocabulary is:\n",
|
||||
" ```\n",
|
||||
" 0: ...\n",
|
||||
" ...\n",
|
||||
" 256: \"th\"\n",
|
||||
" 257: \"<256>e\"\n",
|
||||
" 258: \"<257> \"\n",
|
||||
" ```\n",
|
||||
" \n",
|
||||
"- and so forth\n",
|
||||
"\n",
|
||||
" \n",
|
||||
"### 1.4.2 Concrete example of the decoding part (steps 3)\n",
|
||||
"\n",
|
||||
"- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair in the reverse order they were introduced\n",
|
||||
"- Start with the final compressed text: `<258>cat in <258>hat`\n",
|
||||
"- Substitute `<258>` → `<257> `: `<257> cat in <257> hat` \n",
|
||||
"- Substitute `<257>` → `<256>e`: `<256>e cat in <256>e hat`\n",
|
||||
"- Substitute `<256>` → \"th\": `the cat in the hat`"
|
||||
]
|
||||
},
|
||||
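{
"cell_type": "markdown",
"id": "4b8e2f6a-17c3-4d9e-b5a0-3f7c9e1d2a4b",
"metadata": {},
"source": [
"- The decoding direction can also be sketched in a few lines of Python; the (purely illustrative) helper below reuses the hypothetical `merges` dictionary from the training sketch above and keeps expanding merged IDs into their underlying pairs until only raw byte values remain:\n",
"\n",
"```python\n",
"def decode_bytes(ids, merges):\n",
"    # Invert the lookup table: new_id -> (id1, id2)\n",
"    inverse = {new_id: pair for pair, new_id in merges.items()}\n",
"    # Expand merged IDs back into their pairs until only byte values (<256) remain\n",
"    while any(i in inverse for i in ids):\n",
"        expanded = []\n",
"        for i in ids:\n",
"            if i in inverse:\n",
"                expanded.extend(inverse[i])\n",
"            else:\n",
"                expanded.append(i)\n",
"        ids = expanded\n",
"    return bytes(ids).decode(\"utf-8\")\n",
"\n",
"# Continuing the example above, `<258>cat in <258>hat` corresponds to\n",
"# [258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116], and\n",
"# decode_bytes([258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116], merges)\n",
"# returns 'the cat in the hat'\n",
"```"
]
},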
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a2324948-ddd0-45d1-8ba8-e8eda9fc6677",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" \n",
|
||||
"## 2. A simple BPE implementation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "429ca709-40d7-4e3d-bf3e-4f5687a2e19b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Below is an implementation of this algorithm described above as a Python class that mimics the `tiktoken` Python user interface\n",
|
||||
"- Note that the encoding part above describes the original training step via `train()`; however, the `encode()` method works similarly (although it looks a bit more complicated because of the special token handling):\n",
|
||||
"\n",
|
||||
"1. Split the input text into individual bytes\n",
|
||||
"2. Repeatedly find & replace (merge) adjacent tokens (pairs) when they match any pair in the learned BPE merges (from highest to lowest \"rank,\" i.e., in the order they were learned)\n",
|
||||
"3. Continue merging until no more merges can be applied\n",
|
||||
"4. The final list of token IDs is the encoded output"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "3e4a15ec-2667-4f56-b7c1-34e8071b621d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter, deque\n",
|
||||
"from functools import lru_cache\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"class BPETokenizerSimple:\n",
|
||||
" def __init__(self):\n",
|
||||
" # Maps token_id to token_str (e.g., {11246: \"some\"})\n",
|
||||
" self.vocab = {}\n",
|
||||
" # Maps token_str to token_id (e.g., {\"some\": 11246})\n",
|
||||
" self.inverse_vocab = {}\n",
|
||||
" # Dictionary of BPE merges: {(token_id1, token_id2): merged_token_id}\n",
|
||||
" self.bpe_merges = {}\n",
|
||||
"\n",
|
||||
" def train(self, text, vocab_size, allowed_special={\"<|endoftext|>\"}):\n",
|
||||
" \"\"\"\n",
|
||||
" Train the BPE tokenizer from scratch.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" text (str): The training text.\n",
|
||||
" vocab_size (int): The desired vocabulary size.\n",
|
||||
" allowed_special (set): A set of special tokens to include.\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" # Preprocess: Replace spaces with 'Ġ'\n",
|
||||
" # Note that Ġ is a particularity of the GPT-2 BPE implementation\n",
|
||||
" # E.g., \"Hello world\" might be tokenized as [\"Hello\", \"Ġworld\"]\n",
|
||||
" # (GPT-4 BPE would tokenize it as [\"Hello\", \" world\"])\n",
|
||||
" processed_text = []\n",
|
||||
" for i, char in enumerate(text):\n",
|
||||
" if char == \" \" and i != 0:\n",
|
||||
" processed_text.append(\"Ġ\")\n",
|
||||
" if char != \" \":\n",
|
||||
" processed_text.append(char)\n",
|
||||
" processed_text = \"\".join(processed_text)\n",
|
||||
"\n",
|
||||
" # Initialize vocab with unique characters, including 'Ġ' if present\n",
|
||||
" # Start with the first 256 ASCII characters\n",
|
||||
" unique_chars = [chr(i) for i in range(256)]\n",
|
||||
"\n",
|
||||
" # Extend unique_chars with characters from processed_text that are not already included\n",
|
||||
" unique_chars.extend(char for char in sorted(set(processed_text)) if char not in unique_chars)\n",
|
||||
"\n",
|
||||
" # Optionally, ensure 'Ġ' is included if it is relevant to your text processing\n",
|
||||
" if 'Ġ' not in unique_chars:\n",
|
||||
" unique_chars.append('Ġ')\n",
|
||||
"\n",
|
||||
" # Now create the vocab and inverse vocab dictionaries\n",
|
||||
" self.vocab = {i: char for i, char in enumerate(unique_chars)}\n",
|
||||
" self.inverse_vocab = {char: i for i, char in self.vocab.items()}\n",
|
||||
"\n",
|
||||
" # Add allowed special tokens\n",
|
||||
" if allowed_special:\n",
|
||||
" for token in allowed_special:\n",
|
||||
" if token not in self.inverse_vocab:\n",
|
||||
" new_id = len(self.vocab)\n",
|
||||
" self.vocab[new_id] = token\n",
|
||||
" self.inverse_vocab[token] = new_id\n",
|
||||
"\n",
|
||||
" # Tokenize the processed_text into token IDs\n",
|
||||
" token_ids = [self.inverse_vocab[char] for char in processed_text]\n",
|
||||
"\n",
|
||||
" # BPE steps 1-3: Repeatedly find and replace frequent pairs\n",
|
||||
" for new_id in range(len(self.vocab), vocab_size):\n",
|
||||
" pair_id = self.find_freq_pair(token_ids, mode=\"most\")\n",
|
||||
" if pair_id is None: # No more pairs to merge. Stopping training.\n",
|
||||
" break\n",
|
||||
" token_ids = self.replace_pair(token_ids, pair_id, new_id)\n",
|
||||
" self.bpe_merges[pair_id] = new_id\n",
|
||||
"\n",
|
||||
" # Build the vocabulary with merged tokens\n",
|
||||
" for (p0, p1), new_id in self.bpe_merges.items():\n",
|
||||
" merged_token = self.vocab[p0] + self.vocab[p1]\n",
|
||||
" self.vocab[new_id] = merged_token\n",
|
||||
" self.inverse_vocab[merged_token] = new_id\n",
|
||||
"\n",
|
||||
" def encode(self, text):\n",
|
||||
" \"\"\"\n",
|
||||
" Encode the input text into a list of token IDs.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" text (str): The text to encode.\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" List[int]: The list of token IDs.\n",
|
||||
" \"\"\"\n",
|
||||
" tokens = []\n",
|
||||
" # Split text into tokens, keeping newlines intact\n",
|
||||
" words = text.replace(\"\\n\", \" \\n \").split() # Ensure '\\n' is treated as a separate token\n",
|
||||
"\n",
|
||||
" for i, word in enumerate(words):\n",
|
||||
" if i > 0 and not word.startswith(\"\\n\"):\n",
|
||||
" tokens.append(\"Ġ\" + word) # Add 'Ġ' to words that follow a space or newline\n",
|
||||
" else:\n",
|
||||
" tokens.append(word) # Handle first word or standalone '\\n'\n",
|
||||
"\n",
|
||||
" token_ids = []\n",
|
||||
" for token in tokens:\n",
|
||||
" if token in self.inverse_vocab:\n",
|
||||
" # token is contained in the vocabulary as is\n",
|
||||
" token_id = self.inverse_vocab[token]\n",
|
||||
" token_ids.append(token_id)\n",
|
||||
" else:\n",
|
||||
" # Attempt to handle subword tokenization via BPE\n",
|
||||
" sub_token_ids = self.tokenize_with_bpe(token)\n",
|
||||
" token_ids.extend(sub_token_ids)\n",
|
||||
"\n",
|
||||
" return token_ids\n",
|
||||
"\n",
|
||||
" def tokenize_with_bpe(self, token):\n",
|
||||
" \"\"\"\n",
|
||||
" Tokenize a single token using BPE merges.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" token (str): The token to tokenize.\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" List[int]: The list of token IDs after applying BPE.\n",
|
||||
" \"\"\"\n",
|
||||
" # Tokenize the token into individual characters (as initial token IDs)\n",
|
||||
" token_ids = [self.inverse_vocab.get(char, None) for char in token]\n",
|
||||
" if None in token_ids:\n",
|
||||
" missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]\n",
|
||||
" raise ValueError(f\"Characters not found in vocab: {missing_chars}\")\n",
|
||||
"\n",
|
||||
" can_merge = True\n",
|
||||
" while can_merge and len(token_ids) > 1:\n",
|
||||
" can_merge = False\n",
|
||||
" new_tokens = []\n",
|
||||
" i = 0\n",
|
||||
" while i < len(token_ids) - 1:\n",
|
||||
" pair = (token_ids[i], token_ids[i + 1])\n",
|
||||
" if pair in self.bpe_merges:\n",
|
||||
" merged_token_id = self.bpe_merges[pair]\n",
|
||||
" new_tokens.append(merged_token_id)\n",
|
||||
" # Uncomment for educational purposes:\n",
|
||||
" # print(f\"Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')\")\n",
|
||||
" i += 2 # Skip the next token as it's merged\n",
|
||||
" can_merge = True\n",
|
||||
" else:\n",
|
||||
" new_tokens.append(token_ids[i])\n",
|
||||
" i += 1\n",
|
||||
" if i < len(token_ids):\n",
|
||||
" new_tokens.append(token_ids[i])\n",
|
||||
" token_ids = new_tokens\n",
|
||||
"\n",
|
||||
" return token_ids\n",
|
||||
"\n",
|
||||
" def decode(self, token_ids):\n",
|
||||
" \"\"\"\n",
|
||||
" Decode a list of token IDs back into a string.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" token_ids (List[int]): The list of token IDs to decode.\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" str: The decoded string.\n",
|
||||
" \"\"\"\n",
|
||||
" decoded_string = \"\"\n",
|
||||
" for token_id in token_ids:\n",
|
||||
" if token_id not in self.vocab:\n",
|
||||
" raise ValueError(f\"Token ID {token_id} not found in vocab.\")\n",
|
||||
" token = self.vocab[token_id]\n",
|
||||
" if token.startswith(\"Ġ\"):\n",
|
||||
" # Replace 'Ġ' with a space\n",
|
||||
" decoded_string += \" \" + token[1:]\n",
|
||||
" else:\n",
|
||||
" decoded_string += token\n",
|
||||
" return decoded_string\n",
|
||||
"\n",
|
||||
" @lru_cache(maxsize=None)\n",
|
||||
" def get_special_token_id(self, token):\n",
|
||||
" return self.inverse_vocab.get(token, None)\n",
|
||||
"\n",
|
||||
" @staticmethod\n",
|
||||
" def find_freq_pair(token_ids, mode=\"most\"):\n",
|
||||
" pairs = Counter(zip(token_ids, token_ids[1:]))\n",
|
||||
"\n",
|
||||
" if mode == \"most\":\n",
|
||||
" return max(pairs.items(), key=lambda x: x[1])[0]\n",
|
||||
" elif mode == \"least\":\n",
|
||||
" return min(pairs.items(), key=lambda x: x[1])[0]\n",
|
||||
" else:\n",
|
||||
" raise ValueError(\"Invalid mode. Choose 'most' or 'least'.\")\n",
|
||||
"\n",
|
||||
" @staticmethod\n",
|
||||
" def replace_pair(token_ids, pair_id, new_id):\n",
|
||||
" dq = deque(token_ids)\n",
|
||||
" replaced = []\n",
|
||||
"\n",
|
||||
" while dq:\n",
|
||||
" current = dq.popleft()\n",
|
||||
" if dq and (current, dq[0]) == pair_id:\n",
|
||||
" replaced.append(new_id)\n",
|
||||
" # Remove the 2nd token of the pair, 1st was already removed\n",
|
||||
" dq.popleft()\n",
|
||||
" else:\n",
|
||||
" replaced.append(current)\n",
|
||||
"\n",
|
||||
" return replaced\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "46db7310-79c7-4ee0-b5fa-d760c6e1aa67",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- There is a lot of code in the `BPETokenizerSimple` class above, and discussing it in detail is out of scope for this notebook, but the next section offers a short overview of the usage to understand the class methods a bit better"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8ffe1836-eed4-40dc-860b-2d23074d067e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. BPE implementation walkthrough"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3c7c996c-fd34-484f-a877-13d977214cf7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- In practice, I highly recommend using [tiktoken](https://github.com/openai/tiktoken) as my implementation above focuses on readability and educational purposes, not on performance\n",
|
||||
"- However, the usage is more or less similar to tiktoken, except that tiktoken does not have a training method\n",
|
||||
"- Let's see how my `BPETokenizerSimple` Python code above works by looking at some examples below (a detailed code discussion is out of scope for this notebook)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e82acaf6-7ed5-4d3b-81c0-ae4d3559d2c7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 3.1 Training, encoding, and decoding"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "962bf037-903e-4555-b09c-206e1a410278",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's consider some sample text as our training dataset:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "4d197cad-ed10-4a42-b01c-a763859781fb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import urllib.request\n",
|
||||
"\n",
|
||||
"if not os.path.exists(\"../01_main-chapter-code/the-verdict.txt\"):\n",
|
||||
" url = (\"https://raw.githubusercontent.com/rasbt/\"\n",
|
||||
" \"LLMs-from-scratch/main/ch02/01_main-chapter-code/\"\n",
|
||||
" \"the-verdict.txt\")\n",
|
||||
" file_path = \"../01_main-chapter-code/the-verdict.txt\"\n",
|
||||
" urllib.request.urlretrieve(url, file_path)\n",
|
||||
"\n",
|
||||
"with open(\"../01_main-chapter-code/the-verdict.txt\", \"r\", encoding=\"utf-8\") as f: # added ../01_main-chapter-code/\n",
|
||||
" text = f.read()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "04d1b6ac-71d3-4817-956a-9bc7e463a84a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Next, let's initialize and train the BPE tokenizer with a vocabulary size of 1,000\n",
|
||||
"- Note that the vocabulary size is already 255 by default due to the byte values discussed earlier, so we are only \"learning\" 745 vocabulary entries \n",
|
||||
"- For comparison, the GPT-2 vocabulary is 50,257 tokens, the GPT-4 vocabulary is 100,256 tokens (`cl100k_base` in tiktoken), and GPT-4o uses 199,997 tokens (`o200k_base` in tiktoken); they have all much bigger training sets compared to our simple example text above"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "027348fd-d52f-4396-93dd-38eed142df9b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tokenizer = BPETokenizerSimple()\n",
|
||||
"tokenizer.train(text, vocab_size=1000, allowed_special={\"<|endoftext|>\"})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2474ff05-5629-4f13-9e03-a47b1e713850",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- You may want to inspect the vocabulary contents (but note it will create a long list)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "f705a283-355e-4460-b940-06bbc2ae4e61",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"1000\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# print(tokenizer.vocab)\n",
|
||||
"print(len(tokenizer.vocab))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "36c9da0f-8a18-41cd-91ea-9ccc2bb5febb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This vocabulary is created by merging 742 times (~ `1000 - len(range(0, 256))`)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "3da42d1c-f75c-4ba7-a6c5-4cb8543d4a44",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"742\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(len(tokenizer.bpe_merges))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5dac69c9-8413-482a-8148-6b2afbf1fb89",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This means that the first 256 entries are single-character tokens"
|
||||
]
|
||||
},
|
||||
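{
"cell_type": "markdown",
"id": "9c5d1e3f-6a2b-4c8d-9e0f-1a3b5c7d9e1f",
"metadata": {},
"source": [
"- If you are curious, a quick (optional) way to see that boundary is to print a few vocabulary entries around it; in this particular training run, entry 256 is the `Ġ` whitespace marker, 257 is the `<|endoftext|>` special token, and the learned merges start at 258:\n",
"\n",
"```python\n",
"# Inspect the boundary between the byte-level tokens and the learned entries\n",
"for token_id in range(254, 261):\n",
"    print(token_id, repr(tokenizer.vocab[token_id]))\n",
"```"
]
},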
{
|
||||
"cell_type": "markdown",
|
||||
"id": "451a4108-7c8b-4b98-9c67-d622e9cdf250",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Next, let's use the created merges via the `encode` method to encode some text:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "e1db5cce-e015-412b-ad56-060b8b638078",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"input_text = \"Jack embraced beauty through art and life.\"\n",
|
||||
"token_ids = tokenizer.encode(input_text)\n",
|
||||
"print(token_ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "1ed1b344-f7d4-4e9e-ac34-2a04b5c5b7a8",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of characters: 42\n",
|
||||
"Number of token IDs: 20\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(\"Number of characters:\", len(input_text))\n",
|
||||
"print(\"Number of token IDs:\", len(token_ids))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "50c1cfb9-402a-4e1e-9678-0b7547406248",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- From the lengths above, we can see that a 42-character sentence was encoded into 20 token IDs, effectively cutting the input length roughly in half compared to a character-byte-based encoding"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "252693ee-e806-4dac-ab76-2c69086360f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Note that the vocabulary itself is used in the `decode()` method, which allows us to map the token IDs back into text:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "da0e1faf-1933-43d9-b681-916c282a8f86",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(token_ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "8b690e83-5d6b-409a-804e-321c287c24a4",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Jack embraced beauty through art and life.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(tokenizer.decode(token_ids))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "adea5d09-e5ef-4721-994b-b9b25662fa0a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Iterating over each token ID can give us a better understanding of how the token IDs are decoded via the vocabulary:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "2b9e6289-92cb-4d88-b3c8-e836d7c8095f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"424 -> Jack\n",
|
||||
"256 -> \n",
|
||||
"654 -> em\n",
|
||||
"531 -> br\n",
|
||||
"302 -> ac\n",
|
||||
"311 -> ed\n",
|
||||
"256 -> \n",
|
||||
"296 -> be\n",
|
||||
"97 -> a\n",
|
||||
"465 -> ut\n",
|
||||
"121 -> y\n",
|
||||
"595 -> through\n",
|
||||
"841 -> ar\n",
|
||||
"116 -> t\n",
|
||||
"287 -> a\n",
|
||||
"466 -> nd\n",
|
||||
"256 -> \n",
|
||||
"326 -> li\n",
|
||||
"972 -> fe\n",
|
||||
"46 -> .\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for token_id in token_ids:\n",
|
||||
" print(f\"{token_id} -> {tokenizer.decode([token_id])}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5ea41c6c-5538-4fd5-8b5f-195960853b71",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- As we can see, most token IDs represent 2-character subwords; that's because the training data text is very short with not that many repetitive words, and because we used a relatively small vocabulary size"
|
||||
]
|
||||
},
|
||||
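{
"cell_type": "markdown",
"id": "6e0a2c4d-8f1b-4a3c-9d5e-7b9d1f3a5c7e",
"metadata": {},
"source": [
"- If you want to check which longer subwords (or whole words, such as `through` above) were learned anyway, one optional way is to sort the learned entries by length; the snippet below assumes, as in this training run, that the learned merges start at token ID 258 (after the 256 byte-level tokens, `Ġ`, and `<|endoftext|>`):\n",
"\n",
"```python\n",
"# Show the ten longest tokens added by the merges in this training run\n",
"learned = [tok for tok_id, tok in tokenizer.vocab.items() if tok_id >= 258]\n",
"for tok in sorted(learned, key=len, reverse=True)[:10]:\n",
"    print(len(tok), repr(tok))\n",
"```"
]
},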
{
|
||||
"cell_type": "markdown",
|
||||
"id": "600055a3-7ec8-4abf-b88a-c4186fb71463",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- As a summary, calling `decode(encode())` should be able to reproduce arbitrary input texts:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "c7056cb1-a9a3-4cf6-8364-29fb493ae240",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'This is some text.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"tokenizer.decode(tokenizer.encode(\"This is some text.\"))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3558af04-483c-4f6b-88f5-a534f37316cd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" \n",
|
||||
"# 4. Conclusion"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "410ed0e6-ad06-4bb3-bb39-6b8110c1caa4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- That's it! That's how BPE works in a nutshell, complete with a training method for creating new tokenizers \n",
|
||||
"- I hope you found this brief tutorial useful for educational purposes; if you have any questions, please feel free to open a new Discussion [here](https://github.com/rasbt/LLMs-from-scratch/discussions/categories/q-a)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**This is a very naive implementation for educational purposes. The [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) notebook contains a more sophisticated (but much harder to read) implementation that matches the behavior in tiktoken.**"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.13.5"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@ -81,7 +81,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 39,
|
||||
"execution_count": 1,
|
||||
"id": "8c9bc9e4-120f-4bac-8fa6-6523c568d12e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -109,7 +109,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 40,
|
||||
"execution_count": 2,
|
||||
"id": "6c586945-d459-4f9a-855d-bf73438ef0e3",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -138,7 +138,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 41,
|
||||
"execution_count": 3,
|
||||
"id": "0d5b61d9-79a0-48b4-9b3e-64ab595c5b01",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -382,13 +382,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 42,
|
||||
"execution_count": 4,
|
||||
"id": "3e4a15ec-2667-4f56-b7c1-34e8071b621d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter, deque\n",
|
||||
"from functools import lru_cache\n",
|
||||
"import re\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"\n",
|
||||
@ -476,42 +477,49 @@
|
||||
" # Load vocabulary\n",
|
||||
" with open(vocab_path, \"r\", encoding=\"utf-8\") as file:\n",
|
||||
" loaded_vocab = json.load(file)\n",
|
||||
" # Convert loaded vocabulary to correct format\n",
|
||||
" # encoder.json is {token_str: id}; we want id->str and str->id\n",
|
||||
" self.vocab = {int(v): k for k, v in loaded_vocab.items()}\n",
|
||||
" self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}\n",
|
||||
"\n",
|
||||
" # Handle newline character without adding a new token\n",
|
||||
" \n",
|
||||
" # Must have GPT-2's printable newline character 'Ċ' (U+010A) at id 198\n",
|
||||
" if \"Ċ\" not in self.inverse_vocab or self.inverse_vocab[\"Ċ\"] != 198:\n",
|
||||
" raise KeyError(\"Vocabulary missing GPT-2 newline glyph 'Ċ' at id 198.\")\n",
|
||||
" \n",
|
||||
" # Must have <|endoftext|> at 50256\n",
|
||||
" if \"<|endoftext|>\" not in self.inverse_vocab or self.inverse_vocab[\"<|endoftext|>\"] != 50256:\n",
|
||||
" raise KeyError(\"Vocabulary missing <|endoftext|> at id 50256.\")\n",
|
||||
" \n",
|
||||
" # Provide a convenience alias for '\\n' -> 198\n",
|
||||
" # Keep printable character 'Ċ' in vocab so BPE merges keep working\n",
|
||||
" if \"\\n\" not in self.inverse_vocab:\n",
|
||||
" # Use an existing token ID as a placeholder for '\\n'\n",
|
||||
" # Preferentially use \"<|endoftext|>\" if available\n",
|
||||
" fallback_token = next((token for token in [\"<|endoftext|>\", \"Ġ\", \"\"] if token in self.inverse_vocab), None)\n",
|
||||
" if fallback_token is not None:\n",
|
||||
" newline_token_id = self.inverse_vocab[fallback_token]\n",
|
||||
" self.inverse_vocab[\"\\n\"] = self.inverse_vocab[\"Ċ\"]\n",
|
||||
"\n",
|
||||
" if \"\\r\" not in self.inverse_vocab:\n",
|
||||
" if 201 in self.vocab:\n",
|
||||
" self.inverse_vocab[\"\\r\"] = 201\n",
|
||||
" else:\n",
|
||||
" # If no fallback token is available, raise an error\n",
|
||||
" raise KeyError(\"No suitable token found in vocabulary to map '\\\\n'.\")\n",
|
||||
" raise KeyError(\"Vocabulary missing carriage return token at id 201.\")\n",
|
||||
"\n",
|
||||
" self.inverse_vocab[\"\\n\"] = newline_token_id\n",
|
||||
" self.vocab[newline_token_id] = \"\\n\"\n",
|
||||
"\n",
|
||||
" # Load GPT-2 merges and store them with an assigned \"rank\"\n",
|
||||
" self.bpe_ranks = {} # reset ranks\n",
|
||||
" # Load GPT-2 merges and store ranks\n",
|
||||
" self.bpe_ranks = {}\n",
|
||||
" with open(bpe_merges_path, \"r\", encoding=\"utf-8\") as file:\n",
|
||||
" lines = file.readlines()\n",
|
||||
" if lines and lines[0].startswith(\"#\"):\n",
|
||||
" lines = lines[1:]\n",
|
||||
"\n",
|
||||
" \n",
|
||||
" rank = 0\n",
|
||||
" for line in lines:\n",
|
||||
" pair = tuple(line.strip().split())\n",
|
||||
" if len(pair) == 2:\n",
|
||||
" token1, token2 = pair\n",
|
||||
" # If token1 or token2 not in vocab, skip\n",
|
||||
" if token1 in self.inverse_vocab and token2 in self.inverse_vocab:\n",
|
||||
" self.bpe_ranks[(token1, token2)] = rank\n",
|
||||
" rank += 1\n",
|
||||
" else:\n",
|
||||
" print(f\"Skipping pair {pair} as one token is not in the vocabulary.\")\n",
|
||||
" token1, *rest = line.strip().split()\n",
|
||||
" if len(rest) != 1:\n",
|
||||
" continue\n",
|
||||
" token2 = rest[0]\n",
|
||||
" if token1 in self.inverse_vocab and token2 in self.inverse_vocab:\n",
|
||||
" self.bpe_ranks[(token1, token2)] = rank\n",
|
||||
" rank += 1\n",
|
||||
" else:\n",
|
||||
" # Safe to skip pairs whose symbols are not in vocab\n",
|
||||
" pass\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" def encode(self, text, allowed_special=None):\n",
|
||||
" \"\"\"\n",
|
||||
@ -524,21 +532,35 @@
|
||||
" Returns:\n",
|
||||
" List of token IDs.\n",
|
||||
" \"\"\"\n",
|
||||
" import re\n",
|
||||
" \n",
|
||||
" # ---- This section is to mimic tiktoken in terms of allowed special tokens ----\n",
|
||||
" specials_in_vocab = [\n",
|
||||
" tok for tok in self.inverse_vocab\n",
|
||||
" if tok.startswith(\"<|\") and tok.endswith(\"|>\")\n",
|
||||
" ]\n",
|
||||
" if allowed_special is None:\n",
|
||||
" # Nothing is allowed\n",
|
||||
" disallowed = [tok for tok in specials_in_vocab if tok in text]\n",
|
||||
" if disallowed:\n",
|
||||
" raise ValueError(f\"Disallowed special tokens encountered in text: {disallowed}\")\n",
|
||||
" else:\n",
|
||||
" # Some spefic tokens are allowed (e.g., we use this for <|endoftext|>)\n",
|
||||
" disallowed = [tok for tok in specials_in_vocab if tok in text and tok not in allowed_special]\n",
|
||||
" if disallowed:\n",
|
||||
" raise ValueError(f\"Disallowed special tokens encountered in text: {disallowed}\")\n",
|
||||
" # -----------------------------------------------------------------------------\n",
|
||||
"\n",
|
||||
" token_ids = []\n",
|
||||
" \n",
|
||||
" # If special token handling is enabled\n",
|
||||
" # If some specials are allowed, split around them and passthrough those ids\n",
|
||||
" if allowed_special is not None and len(allowed_special) > 0:\n",
|
||||
" # Build regex to match allowed special tokens\n",
|
||||
" special_pattern = (\n",
|
||||
" \"(\" + \"|\".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + \")\"\n",
|
||||
" )\n",
|
||||
" special_pattern = \"(\" + \"|\".join(\n",
|
||||
" re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)\n",
|
||||
" ) + \")\"\n",
|
||||
" \n",
|
||||
" last_index = 0\n",
|
||||
" for match in re.finditer(special_pattern, text):\n",
|
||||
" prefix = text[last_index:match.start()]\n",
|
||||
" token_ids.extend(self.encode(prefix, allowed_special=None)) # Encode prefix without special handling\n",
|
||||
" token_ids.extend(self.encode(prefix, allowed_special=None)) # encode prefix normally\n",
|
||||
" \n",
|
||||
" special_token = match.group(0)\n",
|
||||
" if special_token in self.inverse_vocab:\n",
|
||||
@ -547,36 +569,63 @@
|
||||
" raise ValueError(f\"Special token {special_token} not found in vocabulary.\")\n",
|
||||
" last_index = match.end()\n",
|
||||
" \n",
|
||||
" text = text[last_index:] # Remaining part to process normally\n",
|
||||
" text = text[last_index:] # remainder to process normally\n",
|
||||
" \n",
|
||||
" # Check if any disallowed special tokens are in the remainder\n",
|
||||
" # Extra guard for any other special literals left over\n",
|
||||
" disallowed = [\n",
|
||||
" tok for tok in self.inverse_vocab\n",
|
||||
" if tok.startswith(\"<|\") and tok.endswith(\"|>\") and tok in text and tok not in allowed_special\n",
|
||||
" ]\n",
|
||||
" if disallowed:\n",
|
||||
" raise ValueError(f\"Disallowed special tokens encountered in text: {disallowed}\")\n",
|
||||
"\n",
|
||||
" \n",
|
||||
" # If no special tokens, or remaining text after special token split:\n",
|
||||
" # ---- Newline and carriage return handling ----\n",
|
||||
" tokens = []\n",
|
||||
" lines = text.split(\"\\n\")\n",
|
||||
" for i, line in enumerate(lines):\n",
|
||||
" if i > 0:\n",
|
||||
" parts = re.split(r'(\\r\\n|\\r|\\n)', text)\n",
|
||||
" for part in parts:\n",
|
||||
" if part == \"\":\n",
|
||||
" continue\n",
|
||||
" if part == \"\\r\\n\":\n",
|
||||
" tokens.append(\"\\r\")\n",
|
||||
" tokens.append(\"\\n\")\n",
|
||||
" words = line.split()\n",
|
||||
" for j, word in enumerate(words):\n",
|
||||
" if j == 0 and i > 0:\n",
|
||||
" tokens.append(\"Ġ\" + word)\n",
|
||||
" elif j == 0:\n",
|
||||
" tokens.append(word)\n",
|
||||
" else:\n",
|
||||
" tokens.append(\"Ġ\" + word)\n",
|
||||
" continue\n",
|
||||
" if part == \"\\r\":\n",
|
||||
" tokens.append(\"\\r\")\n",
|
||||
" continue\n",
|
||||
" if part == \"\\n\":\n",
|
||||
" tokens.append(\"\\n\")\n",
|
||||
" continue\n",
|
||||
" \n",
|
||||
" for token in tokens:\n",
|
||||
" if token in self.inverse_vocab:\n",
|
||||
" token_ids.append(self.inverse_vocab[token])\n",
|
||||
" # Normal chunk without line breaks:\n",
|
||||
" # - If spaces precede a word, prefix the first word with 'Ġ' and\n",
|
||||
" # add standalone 'Ġ' for additional spaces\n",
|
||||
" # - If spaces trail the chunk (e.g., before a newline) add\n",
|
||||
" # standalone 'Ġ' tokens (tiktoken produces id 220 for 'Ġ')\n",
|
||||
" pending_spaces = 0\n",
|
||||
" for m in re.finditer(r'( +)|(\\S+)', part):\n",
|
||||
" if m.group(1) is not None:\n",
|
||||
" pending_spaces += len(m.group(1))\n",
|
||||
" else:\n",
|
||||
" word = m.group(2)\n",
|
||||
" if pending_spaces > 0:\n",
|
||||
" tokens.append(\"Ġ\" + word) # one leading space\n",
|
||||
" for _ in range(pending_spaces - 1):\n",
|
||||
" tokens.append(\"Ġ\") # remaining spaces as standalone\n",
|
||||
" pending_spaces = 0\n",
|
||||
" else:\n",
|
||||
" tokens.append(word)\n",
|
||||
" # Trailing spaces (no following word): add standalone 'Ġ' tokens\n",
|
||||
" for _ in range(pending_spaces):\n",
|
||||
" tokens.append(\"Ġ\")\n",
|
||||
" # ---------------------------------------------------------------\n",
|
||||
" \n",
|
||||
" # Map tokens -> ids (BPE if needed)\n",
|
||||
" for tok in tokens:\n",
|
||||
" if tok in self.inverse_vocab:\n",
|
||||
" token_ids.append(self.inverse_vocab[tok])\n",
|
||||
" else:\n",
|
||||
" token_ids.extend(self.tokenize_with_bpe(token))\n",
|
||||
" token_ids.extend(self.tokenize_with_bpe(tok))\n",
|
||||
" \n",
|
||||
" return token_ids\n",
|
||||
"\n",
|
||||
@ -675,20 +724,22 @@
|
||||
" Returns:\n",
|
||||
" str: The decoded string.\n",
|
||||
" \"\"\"\n",
|
||||
" decoded_string = \"\"\n",
|
||||
" for i, token_id in enumerate(token_ids):\n",
|
||||
" if token_id not in self.vocab:\n",
|
||||
" raise ValueError(f\"Token ID {token_id} not found in vocab.\")\n",
|
||||
" token = self.vocab[token_id]\n",
|
||||
" if token == \"\\n\":\n",
|
||||
" if decoded_string and not decoded_string.endswith(\" \"):\n",
|
||||
" decoded_string += \" \" # Add space if not present before a newline\n",
|
||||
" decoded_string += token\n",
|
||||
" elif token.startswith(\"Ġ\"):\n",
|
||||
" decoded_string += \" \" + token[1:]\n",
|
||||
" out = []\n",
|
||||
" for tid in token_ids:\n",
|
||||
" if tid not in self.vocab:\n",
|
||||
" raise ValueError(f\"Token ID {tid} not found in vocab.\")\n",
|
||||
" tok = self.vocab[tid]\n",
|
||||
"\n",
|
||||
" # Map GPT-2 special chars back to real chars\n",
|
||||
" if tid == 198 or tok == \"\\n\":\n",
|
||||
" out.append(\"\\n\")\n",
|
||||
" elif tid == 201 or tok == \"\\r\":\n",
|
||||
" out.append(\"\\r\")\n",
|
||||
" elif tok.startswith(\"Ġ\"):\n",
|
||||
" out.append(\" \" + tok[1:])\n",
|
||||
" else:\n",
|
||||
" decoded_string += token\n",
|
||||
" return decoded_string\n",
|
||||
" out.append(tok)\n",
|
||||
" return \"\".join(out)\n",
|
||||
"\n",
|
||||
" def save_vocab_and_merges(self, vocab_path, bpe_merges_path):\n",
|
||||
" \"\"\"\n",
|
||||
@ -809,7 +860,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 71,
|
||||
"execution_count": 5,
|
||||
"id": "51872c08-e01b-40c3-a8a0-e8d6a773e3df",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -873,7 +924,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 46,
|
||||
"execution_count": 6,
|
||||
"id": "027348fd-d52f-4396-93dd-38eed142df9b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@ -892,7 +943,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 47,
|
||||
"execution_count": 7,
|
||||
"id": "f705a283-355e-4460-b940-06bbc2ae4e61",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -919,7 +970,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 48,
|
||||
"execution_count": 8,
|
||||
"id": "3da42d1c-f75c-4ba7-a6c5-4cb8543d4a44",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -953,7 +1004,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 49,
|
||||
"execution_count": 9,
|
||||
"id": "e1db5cce-e015-412b-ad56-060b8b638078",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -973,27 +1024,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 50,
|
||||
"id": "78249752-38d7-47b9-b259-912bcc093dc4",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 60, 124, 271, 683, 102, 116, 461, 116, 124, 62]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"input_text = \"Jack embraced beauty through art and life.<|endoftext|> \"\n",
|
||||
"token_ids = tokenizer.encode(input_text)\n",
|
||||
"print(token_ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 51,
|
||||
"execution_count": 10,
|
||||
"id": "0331d37d-49a3-44f7-9aa9-9834e0938741",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1001,7 +1032,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]\n"
|
||||
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257, 256]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -1013,7 +1044,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 52,
|
||||
"execution_count": 11,
|
||||
"id": "1ed1b344-f7d4-4e9e-ac34-2a04b5c5b7a8",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1022,7 +1053,7 @@
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of characters: 56\n",
|
||||
"Number of token IDs: 21\n"
|
||||
"Number of token IDs: 22\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -1049,7 +1080,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 53,
|
||||
"execution_count": 12,
|
||||
"id": "da0e1faf-1933-43d9-b681-916c282a8f86",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1057,7 +1088,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257]\n"
|
||||
"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 257, 256]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -1067,7 +1098,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 54,
|
||||
"execution_count": 13,
|
||||
"id": "8b690e83-5d6b-409a-804e-321c287c24a4",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1075,7 +1106,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Jack embraced beauty through art and life.<|endoftext|>\n"
|
||||
"Jack embraced beauty through art and life.<|endoftext|> \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -1093,7 +1124,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 55,
|
||||
"execution_count": 14,
|
||||
"id": "2b9e6289-92cb-4d88-b3c8-e836d7c8095f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1121,7 +1152,8 @@
|
||||
"326 -> li\n",
|
||||
"972 -> fe\n",
|
||||
"46 -> .\n",
|
||||
"257 -> <|endoftext|>\n"
|
||||
"257 -> <|endoftext|>\n",
|
||||
"256 -> \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -1148,7 +1180,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 56,
|
||||
"execution_count": 15,
|
||||
"id": "c7056cb1-a9a3-4cf6-8364-29fb493ae240",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1158,7 +1190,7 @@
|
||||
"'This is some text.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 56,
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@ -1171,7 +1203,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 57,
|
||||
"execution_count": 16,
|
||||
"id": "37bc6753-8f35-4ec7-b23e-df4a12103cb4",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1181,7 +1213,7 @@
|
||||
"'This is some text with \\n newline characters.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 57,
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@ -1210,7 +1242,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 58,
|
||||
"execution_count": 17,
|
||||
"id": "955181cb-0910-4c6a-9c22-d8292a3ec1fc",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@ -1221,7 +1253,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 59,
|
||||
"execution_count": 18,
|
||||
"id": "6e5ccfe7-ac67-42f3-b727-87886a8867f1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@ -1241,7 +1273,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 60,
|
||||
"execution_count": 19,
|
||||
"id": "00d9bf8f-756f-48bf-81b8-b890e2c2ef13",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1249,7 +1281,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Jack embraced beauty through art and life.<|endoftext|>\n"
|
||||
"Jack embraced beauty through art and life.<|endoftext|> \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -1259,7 +1291,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 61,
|
||||
"execution_count": 20,
|
||||
"id": "e7addb64-2892-4e1c-85dd-4f5152740099",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1269,7 +1301,7 @@
|
||||
"'This is some text with \\n newline characters.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 61,
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@ -1299,7 +1331,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 72,
|
||||
"execution_count": 21,
|
||||
"id": "b45b4366-2c2b-4309-9a14-febf3add8512",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1339,7 +1371,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"execution_count": 22,
|
||||
"id": "74306e6c-47d3-45a3-9e0f-93f7303ef601",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
@ -1360,7 +1392,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"execution_count": 23,
|
||||
"id": "2bb722b4-dbf5-4a0c-9120-efda3293f132",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1370,7 +1402,7 @@
|
||||
"50257"
|
||||
]
|
||||
},
|
||||
"execution_count": 24,
|
||||
"execution_count": 23,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@ -1389,7 +1421,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"execution_count": 24,
|
||||
"id": "e4866de7-fb32-4dd6-a878-469ec734641c",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1409,7 +1441,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"execution_count": 25,
|
||||
"id": "3da8d9b2-af55-4b09-95d7-fabd983e919e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -1461,6 +1493,14 @@
|
||||
"- I hope you found this brief tutorial useful for educational purposes; if you have any questions, please feel free to open a new Discussion [here](https://github.com/rasbt/LLMs-from-scratch/discussions/categories/q-a)\n",
|
||||
"- For a performance comparison with other tokenizer implementations, please see [this notebook](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "4a477962-ba00-429b-8be7-755a90543de7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@ -1479,7 +1519,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.16"
|
||||
"version": "3.13.5"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@ -152,3 +152,90 @@ def test_gpt2_tokenizer_openai_edgecases(imported_module, gpt2_files):
|
||||
|
||||
if errors:
|
||||
pytest.fail("\n".join(errors))
|
||||
|
||||
|
||||
def test_gpt2_newline_and_eot_ids(imported_module, gpt2_files):
|
||||
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
|
||||
|
||||
tok = BPETokenizerSimple()
|
||||
tok.load_vocab_and_merges_from_openai(
|
||||
vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
|
||||
)
|
||||
|
||||
assert "Ċ" in tok.inverse_vocab, "Missing GPT-2 newline glyph 'Ċ' in inverse_vocab"
|
||||
assert "<|endoftext|>" in tok.inverse_vocab, "Missing EOT in inverse_vocab"
|
||||
|
||||
assert tok.inverse_vocab["Ċ"] == 198, "Ċ must map to id 198"
|
||||
assert tok.inverse_vocab["<|endoftext|>"] == 50256, "EOT must be 50256"
|
||||
|
||||
if "\n" not in tok.inverse_vocab:
|
||||
tok.inverse_vocab["\n"] = tok.inverse_vocab["Ċ"]
|
||||
assert tok.inverse_vocab["\n"] == 198, r"'\n' must map to 198 via Ċ"
|
||||
|
||||
assert tok.vocab[198] == "Ċ", "Don't overwrite vocab[198]; keep it 'Ċ'"
|
||||
assert tok.vocab[50256] == "<|endoftext|>", "Don't map <|endoftext|> to anything else"
|
||||
|
||||
|
||||
def test_no_eot_aliasing_and_disallowed_logic(imported_module, gpt2_files):
|
||||
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
|
||||
tok = BPETokenizerSimple()
|
||||
tok.load_vocab_and_merges_from_openai(
|
||||
vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
|
||||
)
|
||||
tik = tiktoken.get_encoding("gpt2")
|
||||
|
||||
text = "Hello<|endoftext|>\nworld"
|
||||
# When not allowed, our encode should raise ValueError like tiktoken
|
||||
with pytest.raises(ValueError):
|
||||
tok.encode(text)
|
||||
|
||||
# When allowed, both tokenizers should match
|
||||
ids_ours = tok.encode(text, allowed_special={"<|endoftext|>"})
|
||||
ids_tik = tik.encode(text, allowed_special={"<|endoftext|>"})
|
||||
assert ids_ours == ids_tik, "Mismatch vs tiktoken with EOT allowed"
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text",
|
||||
[
|
||||
"a\nb",
|
||||
"a\n\nb",
|
||||
"\nHello",
|
||||
"Hello\n",
|
||||
"a\r\nb",
|
||||
],
|
||||
)
|
||||
def test_newline_roundtrip_and_equivalence(imported_module, gpt2_files, text):
|
||||
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
|
||||
tok = BPETokenizerSimple()
|
||||
tok.load_vocab_and_merges_from_openai(
|
||||
vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
|
||||
)
|
||||
tik = tiktoken.get_encoding("gpt2")
|
||||
|
||||
ids_ours = tok.encode(text)
|
||||
ids_tik = tik.encode(text)
|
||||
|
||||
assert ids_ours == ids_tik, f"Mismatch vs tiktoken for: {repr(text)}"
|
||||
# Each "\n" should correspond to id 198
|
||||
expected_lf_count = text.count("\n")
|
||||
assert ids_ours.count(198) == expected_lf_count
|
||||
|
||||
dec = tok.decode(ids_ours)
|
||||
assert dec == text
|
||||
|
||||
|
||||
def test_space_newline_space_patterns(imported_module, gpt2_files):
|
||||
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)
|
||||
tok = BPETokenizerSimple()
|
||||
tok.load_vocab_and_merges_from_openai(
|
||||
vocab_path=gpt2_files["encoder.json"], bpe_merges_path=gpt2_files["vocab.bpe"]
|
||||
)
|
||||
tik = tiktoken.get_encoding("gpt2")
|
||||
|
||||
samples = [
|
||||
"Hello \nworld",
|
||||
"Hello\n world",
|
||||
]
|
||||
for s in samples:
|
||||
assert tok.encode(s) == tik.encode(s), f"Mismatch vs tiktoken: {repr(s)}"
|
||||