LLMs-from-scratch/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb
2024-05-24 07:20:37 -05:00

{
"cells": [
{
"cell_type": "markdown",
"id": "c503e5ef-6bb4-45c3-ac49-0e016cedd8d0",
"metadata": {},
"source": [
"<table style=\"width:100%\">\n",
"<tr>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<font size=\"2\">\n",
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
"</font>\n",
"</td>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
"</td>\n",
"</tr>\n",
"</table>\n"
]
},
{
"cell_type": "markdown",
"id": "8a9e554f-58e3-4787-832d-d149add1b857",
"metadata": {},
"source": [
"- Install the additional package requirements for this bonus notebook by uncommenting and running the following cell:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d70bae22-b540-4a13-ab01-e748cb9d55c9",
"metadata": {},
"outputs": [],
"source": [
"# pip install -r requirements-extra.txt"
]
},
{
"cell_type": "markdown",
"id": "737c59bb-5922-46fc-a787-1369d70925b4",
"metadata": {},
"source": [
"# Comparing Various Byte Pair Encoding (BPE) Implementations"
]
},
{
"cell_type": "markdown",
"id": "a9adc3bf-353c-411e-a471-0e92786e7103",
"metadata": {},
"source": [
"<br>\n",
"&nbsp;\n",
"\n",
"## Using BPE from `tiktoken`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1c490fca-a48a-47fa-a299-322d1a08ad17",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tiktoken version: 0.5.1\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"print(\"tiktoken version:\", version(\"tiktoken\"))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0952667c-ce84-4f21-87db-59f52b44cec4",
"metadata": {},
"outputs": [],
"source": [
"import tiktoken\n",
"\n",
"tik_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"\n",
"text = \"Hello, world. Is this-- a test?\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b039c350-18ad-48fb-8e6a-085702dfc330",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]\n"
]
}
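],
"source": [
"# allowed_special lets <|endoftext|> be encoded as a special token; by default, tiktoken raises an error if it appears in the text\n",
"integers = tik_tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\n",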
"\n",
"print(integers)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7b152ba4-04d3-41cc-849f-adedcfb8cabb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hello, world. Is this-- a test?\n"
]
}
],
"source": [
"strings = tik_tokenizer.decode(integers)\n",
"\n",
"print(strings)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cf148a1a-316b-43ec-b7ba-1b6d409ce837",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"50257\n"
]
}
],
"source": [
"print(tik_tokenizer.n_vocab)"
]
},
{
"cell_type": "markdown",
"id": "6a0b5d4f-2af9-40de-828c-063c4243e771",
"metadata": {},
"source": [
"<br>\n",
"&nbsp;\n",
"\n",
"## Using the original BPE implementation from GPT-2"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0903108c-65cb-4ae1-967a-2155e25349c2",
"metadata": {},
"outputs": [],
"source": [
"from bpe_openai_gpt2 import get_encoder, download_vocab"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "35dd8d7c-8c12-4b68-941a-0fd05882dd45",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Fetching encoder.json: 1.04Mit [00:00, 3.14Mit/s] \n",
"Fetching vocab.bpe: 457kit [00:00, 1.67Mit/s] \n"
]
}
],
"source": [
"download_vocab()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1888a7a9-9c40-4fe0-99b4-ebd20aa1ffd0",
"metadata": {},
"outputs": [],
"source": [
"# Load encoder.json and vocab.bpe (fetched by download_vocab() above) from ./gpt2_model\n",
"orig_tokenizer = get_encoder(model_name=\"gpt2_model\", models_dir=\".\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2740510c-a78a-4fba-ae18-2b156ba2dfef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]\n"
]
}
],
"source": [
"integers = orig_tokenizer.encode(text)\n",
"\n",
"print(integers)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "434d115e-990d-42ad-88dd-31323a96e10f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hello, world. Is this-- a test?\n"
]
}
],
"source": [
"strings = orig_tokenizer.decode(integers)\n",
"\n",
"print(strings)"
]
},
{
"cell_type": "markdown",
"id": "4f63e8c6-707c-4d66-bcf8-dd790647cc86",
"metadata": {},
"source": [
"<br>\n",
"&nbsp;\n",
"\n",
"## Using BPE via Hugging Face `transformers`"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e9077bf4-f91f-42ad-ab76-f3d89128510e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'4.34.0'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import transformers\n",
"\n",
"transformers.__version__"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a9839137-b8ea-4a2c-85fc-9a63064cf8c8",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e4df871bb797435787143a3abe6b0231",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading tokenizer_config.json: 0%| | 0.00/26.0 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f11b27a4aabf43af9bf57f929683def6",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading vocab.json: 0%| | 0.00/1.04M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d3aa9a24aacc43108ef2ed72e7bacd33",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f9341bc23b594bb68dcf8954bff6d9bd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c5f55f2f1dbc4152acc9b2061167ee0a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading config.json: 0%| | 0.00/665 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from transformers import GPT2Tokenizer\n",
"\n",
"hf_tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "222cbd69-6a3d-4868-9c1f-421ffc9d5fe1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hf_tokenizer(text)[\"input_ids\"]"
]
},
{
"cell_type": "markdown",
"id": "907a1ade-3401-4f2e-9017-7f58a60cbd98",
"metadata": {},
"source": [
"<br>\n",
"&nbsp;\n",
"\n",
"## A quick performance benchmark"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "a61bb445-b151-4a2f-8180-d4004c503754",
"metadata": {},
"outputs": [],
"source": [
"with open('../01_main-chapter-code/the-verdict.txt', 'r', encoding='utf-8') as f:\n",
"    raw_text = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "57f7c0a3-c1fd-4313-af34-68e78eb33653",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4.29 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit orig_tokenizer.encode(raw_text)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "036dd628-3591-46c9-a5ce-b20b105a8062",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.4 ms ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
]
}
],
"source": [
"%timeit tik_tokenizer.encode(raw_text)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "b9c85b58-bfbc-465e-9a7e-477e53d55c90",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"8.46 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit hf_tokenizer(raw_text)[\"input_ids\"]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "7117107f-22a6-46b4-a442-712d50b3ac7a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"8.36 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
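],
"source": [
"# Pass max_length explicitly (5145 tokens, per the warning above) so the length warning is not triggered\n",
"%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)[\"input_ids\"]"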
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}