"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"# Extending the Tiktoken BPE Tokenizer with New Tokens"
]
},
{
"cell_type": "markdown",
"id": "bcd624b1-2060-49af-bbf6-40517a58c128",
"metadata": {},
"source": [
"- This notebook explains how we can extend an existing BPE tokenizer; specifically, we will focus on how to do it for the popular [tiktoken](https://github.com/openai/tiktoken) implementation\n",
"- For a general introduction to tokenization, please refer to [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb) and the BPE from Scratch [link] tutorial\n",
"- For example, suppose we have a GPT-2 tokenizer and want to encode the following text:"
"- As we can see above, the `\"MyNewToken_1\"` is broken down into 5 individual subword tokens -- this is normal behavior for BPE when handling unknown words\n",
"- However, suppose that it's a special token that we want to encode as a single token, similar to some of the other words or `\"<|endoftext|>\"`; this notebook explains how"
]
},
{
"cell_type": "markdown",
"id": "65f62ab6-df96-4f88-ab9a-37702cd30f5f",
"metadata": {},
"source": [
" \n",
"## 1. Adding special tokens"
]
},
{
"cell_type": "markdown",
"id": "c4379fdb-57ba-4a75-9183-0aee0836c391",
"metadata": {},
"source": [
"- Note that we have to add new tokens as special tokens; the reason is that we don't have the \"merges\" for the new tokens that are created during the tokenizer training process -- even if we had them, it would be very challenging to incorporate them without breaking the existing tokenization scheme (see the BPE from scratch notebook [link] to understand the \"merges\")\n",
"- As we can see above, we have successfully updated the tokenizer\n",
"- However, to use it with a pretrained LLM, we also have to update the embedding and output layers of the LLM, which is discussed in the next section"
]
},
{
"cell_type": "markdown",
"id": "8ec7f98d-8f09-4386-83f0-9bec68ef7f66",
"metadata": {},
"source": [
" \n",
"## 2. Updating a pretrained LLM"
]
},
{
"cell_type": "markdown",
"id": "b8a4f68b-04e9-4524-8df4-8718c7b566f2",
"metadata": {},
"source": [
"- In this section, we will take a look at how we have to update an existing pretrained LLM after updating the tokenizer\n",
"- For this, we are using the original pretrained GPT-2 model that is used in the main book"
" out = gpt(torch.tensor([original_token_ids]))\n",
"\n",
"print(out)"
]
},
{
"cell_type": "markdown",
"id": "082c7a78-35a8-473e-a08d-b099a6348a74",
"metadata": {},
"source": [
"- As we can see above, this works without problems (note that the code shows the raw output without converting the outputs back into text for simplicity; for more details on that, please check out the `generate` function in Chapter 5 [link] section 5.3.3"
]
},
{
"cell_type": "markdown",
"id": "628265b5-3dde-44e7-bde2-8fc594a2547d",
"metadata": {},
"source": [
"- What happens if we try the same on the token IDs generated by the updated tokenizer now?"
]
},
{
"cell_type": "markdown",
"id": "9796ad09-787c-4c25-a7f5-6d1dfe048ac3",
"metadata": {},
"source": [
"```python\n",
"with torch.no_grad():\n",
" gpt(torch.tensor([new_token_ids]))\n",
"\n",
"print(out)\n",
"\n",
"...\n",
"# IndexError: index out of range in self\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "77d00244-7e40-4de0-942e-e15cdd8e3b18",
"metadata": {},
"source": [
"- As we can see, this results in an index error\n",
"- The reason is that the GPT model expects a fixed vocabulary size via its input embedding layer and its output layer:\n",
"# Replace the old embedding layer with the new one in the model\n",
"gpt.tok_emb = new_embedding\n",
"\n",
"print(gpt.tok_emb)"
]
},
{
"cell_type": "markdown",
"id": "63954928-31a5-4e7e-9688-2e0c156b7302",
"metadata": {},
"source": [
"- As we can see above, we now have an increased embedding layer"
]
},
{
"cell_type": "markdown",
"id": "6e68bea5-255b-47bb-b352-09ea9539bc25",
"metadata": {},
"source": [
" \n",
"### 2.4 Updating the output layer"
]
},
{
"cell_type": "markdown",
"id": "90a4a519-bf0f-4502-912d-ef0ac7a9deab",
"metadata": {},
"source": [
"- Next, we have to extend the output layer, which has 50,257 output features corresponding to the vocabulary size similar to the embedding layer (by the way, you may find the bonus material, which discusses the similarity between Linear and Embedding layers in PyTorch, useful)"
"- As we can see, the model works on the extended token set\n",
"- In practice, we want to now finetune (or continually pretrain) the model (specifically the new embedding and output layers) on data containing the new tokens"
]
},
{
"cell_type": "markdown",
"id": "6de573ad-0338-40d9-9dad-de60ae349c4f",
"metadata": {},
"source": [
"**A note about weight tying**\n",
"\n",
"- If the model uses weight tying, which means that the embedding layer and output layer share the same weights, similar to Llama 3 [link], updating the output layer is much simpler\n",
"- In this case, we can simply copy over the weights from the embedding layer:"