diff --git a/README.md b/README.md index 7cb3bcc..8045fd6 100644 --- a/README.md +++ b/README.md @@ -58,7 +58,7 @@ Alternatively, you can view this and other files on GitHub at [https://github.co | Appendix B: References and Further Reading | No code | - | | Appendix C: Exercise Solutions | No code | - | | Appendix D: Adding Bells and Whistles to the Training Loop | - [appendix-D.ipynb](appendix-D/01_main-chapter-code/appendix-D.ipynb) | [./appendix-D](./appendix-D) | -| Appendix E: Parameter-efficient Finetuning with LoRA | - Q2 2024 | ... | +| Appendix E: Parameter-efficient Finetuning with LoRA | - [appendix-E.ipynb](appendix-E/01_main-chapter-code/appendix-E.ipynb) | [./appendix-E](./appendix-E) | diff --git a/appendix-E/01_main-chapter-code/appendix-E.ipynb b/appendix-E/01_main-chapter-code/appendix-E.ipynb new file mode 100644 index 0000000..13d546d --- /dev/null +++ b/appendix-E/01_main-chapter-code/appendix-E.ipynb @@ -0,0 +1,1408 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b", + "metadata": { + "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b" + }, + "source": [ + "\n", + "Supplementary code for \"Build a Large Language Model From Scratch\": https://www.manning.com/books/build-a-large-language-model-from-scratch by Sebastian Raschka
\n", + "Code repository: https://github.com/rasbt/LLMs-from-scratch\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "58b8c870-fb72-490e-8916-d8129bd5d1ff", + "metadata": {}, + "source": [ + "# Appendix E: Parameter-efficient Finetuning with LoRA" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90", + "outputId": "9495f150-9d79-4910-d6e7-6c0d9aae4a41" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "matplotlib version: 3.7.2\n", + "numpy version: 1.25.2\n", + "tiktoken version: 0.5.1\n", + "torch version: 2.2.2\n", + "tensorflow version: 2.15.0\n", + "pandas version: 2.0.3\n" + ] + } + ], + "source": [ + "from importlib.metadata import version\n", + "\n", + "pkgs = [\"matplotlib\",\n", + " \"numpy\",\n", + " \"tiktoken\",\n", + " \"torch\",\n", + " \"tensorflow\", # For OpenAI's pretrained weights\n", + " \"pandas\" # Dataset loading\n", + " ]\n", + "for p in pkgs:\n", + " print(f\"{p} version: {version(p)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "21532056-0ef4-4c98-82c7-e91f61c6485e", + "metadata": {}, + "source": [ + "## E.1 Introduction to LoRA" + ] + }, + { + "cell_type": "markdown", + "id": "66edc999-3d91-4a1c-a157-9d056392e8d8", + "metadata": {}, + "source": [ + "- No code in this section\n", + "- Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters\n", + "- This approach is important because it allows for efficient finetuning of large models on task-specific data, significantly reducing the computational cost and time required for finetuning" + ] + }, + { + "cell_type": "markdown", + "id": "5bb75b5d-d59c-4948-821a-1594a5883dc1", + "metadata": {}, + "source": [ + "- Suppose we have a large weight matrix $W$ for a given layer\n", + "- During 
backpropagation, we learn a $\\Delta W$ matrix, which contains information on how much we want to update the original weights to minimize the loss function during training\n", + "- In regular training and finetuning, the weight update is defined as follows:\n", + "\n", + "$$W_{\\text{updated}} = W + \\Delta W$$\n", + "\n", + "- The LoRA method proposed by [Hu et al.](https://arxiv.org/abs/2106.09685) offers a more efficient alternative to computing the weight updates $\\Delta W$ by learning an approximation of it, $\\Delta W \\approx AB$\n", + "- In other words, in LoRA, we have the following, where $A$ and $B$ are two small weight matrices:\n", + "\n", + "$$W_{\\text{updated}} = W + AB$$\n", + "\n", + "- The figure below illustrates these formulas for full finetuning and LoRA side by side\n", + "\n", + "\n", + "\n", + "- If you paid close attention, the full finetuning and LoRA depictions in the figure above look slightly different from the formulas shown earlier\n", + "- That's due to the distributive law of matrix multiplication: we don't have to add the weight update to the original weights but can keep the two separate\n", + "- For instance, if $x$ is the input data, then we can write the following for regular finetuning:\n", + "\n", + "$$x (W+\\Delta W) = x W + x \\Delta W$$\n", + "\n", + "- Similarly, we can write the following for LoRA:\n", + "\n", + "$$x (W+A B) = x W + x A B$$\n", + "\n", + "- The fact that we can keep the LoRA weight matrices separate makes LoRA especially attractive\n", + "- In practice, this means that we don't have to modify the weights of the pretrained model at all, as we can apply the LoRA matrices on the fly\n", + "- After setting up the dataset and loading the model, we will implement LoRA in code to make these concepts less abstract" + ] + }, + { + "cell_type": "markdown", + "id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf", + "metadata": { + "id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf" + }, + "source": [ + "## E.2 Preparing the 
dataset" + ] + }, + { + "cell_type": "markdown", + "id": "669c64df-4431-4d27-834d-2bb38a01fc02", + "metadata": {}, + "source": [ + "- This section repeats the code from chapter 6 to load and prepare the dataset\n", + "- Instead of repeating this code, one could copy & paste the LoRA code from section E.4 at the end of the chapter 6 notebook\n", + "- (The LoRA code was originally the last section of chapter 6 but was moved to the appendix due to the length of chapter 6)\n", + "- In similar fashion, we could also apply LoRA to the models in chapter 7 for instruction finetuning" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "def7c09b-af9c-4216-90ce-5e67aed1065c", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "def7c09b-af9c-4216-90ce-5e67aed1065c", + "outputId": "424e4423-f623-443c-ab9e-656f9e867559" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sms_spam_collection/SMSSpamCollection.tsv already exists. 
Skipping download and extraction.\n" + ] + } + ], + "source": [ + "from pathlib import Path\n", + "import pandas as pd\n", + "from previous_chapters import (\n", + " download_and_unzip,\n", + " create_balanced_dataset,\n", + " random_split\n", + ")\n", + "\n", + "\n", + "url = \"https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip\"\n", + "zip_path = \"sms_spam_collection.zip\"\n", + "extracted_path = \"sms_spam_collection\"\n", + "data_file_path = Path(extracted_path) / \"SMSSpamCollection.tsv\"\n", + "\n", + "download_and_unzip(url, zip_path, extracted_path, data_file_path)\n", + "\n", + "df = pd.read_csv(data_file_path, sep=\"\\t\", header=None, names=[\"Label\", \"Text\"])\n", + "balanced_df = create_balanced_dataset(df)\n", + "balanced_df[\"Label\"] = balanced_df[\"Label\"].map({\"ham\": 0, \"spam\": 1})\n", + "\n", + "train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)\n", + "train_df.to_csv(\"train.csv\", index=None)\n", + "validation_df.to_csv(\"validation.csv\", index=None)\n", + "test_df.to_csv(\"test.csv\", index=None)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7", + "outputId": "b5b48439-32c8-4b37-cca2-c9dc8fa86563" + }, + "outputs": [], + "source": [ + "import torch\n", + "from torch.utils.data import Dataset\n", + "import tiktoken\n", + "from previous_chapters import SpamDataset\n", + "\n", + "\n", + "tokenizer = tiktoken.get_encoding(\"gpt2\")\n", + "train_dataset = SpamDataset(\"train.csv\", max_length=None, tokenizer=tokenizer)\n", + "val_dataset = SpamDataset(\"validation.csv\", max_length=train_dataset.max_length, tokenizer=tokenizer)\n", + "test_dataset = SpamDataset(\"test.csv\", max_length=train_dataset.max_length, tokenizer=tokenizer)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": 
"8681adc0-6f02-4e75-b01a-a6ab75d05542", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8681adc0-6f02-4e75-b01a-a6ab75d05542", + "outputId": "3266c410-4fdb-4a8c-a142-7f707e2525ab" + }, + "outputs": [], + "source": [ + "from torch.utils.data import DataLoader\n", + "\n", + "num_workers = 0\n", + "batch_size = 8\n", + "\n", + "torch.manual_seed(123)\n", + "\n", + "train_loader = DataLoader(\n", + " dataset=train_dataset,\n", + " batch_size=batch_size,\n", + " shuffle=True,\n", + " num_workers=num_workers,\n", + " drop_last=True,\n", + ")\n", + "\n", + "val_loader = DataLoader(\n", + " dataset=val_dataset,\n", + " batch_size=batch_size,\n", + " num_workers=num_workers,\n", + " drop_last=False,\n", + ")\n", + "\n", + "test_loader = DataLoader(\n", + " dataset=test_dataset,\n", + " batch_size=batch_size,\n", + " num_workers=num_workers,\n", + " drop_last=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "ab7335db-e0bb-4e27-80c5-eea11e593a57", + "metadata": {}, + "source": [ + "- As a verification step, we iterate through the data loaders and check that the batches contain 8 training examples each, where each training example consists of 120 tokens" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "4dee6882-4c3a-4964-af15-fa31f86ad047", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train loader:\n", + "Input batch dimensions: torch.Size([8, 120])\n", + "Label batch dimensions torch.Size([8])\n" + ] + } + ], + "source": [ + "print(\"Train loader:\")\n", + "for input_batch, target_batch in train_loader:\n", + " pass\n", + "\n", + "print(\"Input batch dimensions:\", input_batch.shape)\n", + "print(\"Label batch dimensions\", target_batch.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1", + "metadata": {}, + "source": [ + "- Lastly, let's print the total number of batches in each dataset" + ] + }, + { + 
"cell_type": "code", + "execution_count": 6, + "id": "IZfw-TYD2zTj", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IZfw-TYD2zTj", + "outputId": "6934bbf2-9797-4fbe-d26b-1a246e18c2fb" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "130 training batches\n", + "19 validation batches\n", + "38 test batches\n" + ] + } + ], + "source": [ + "print(f\"{len(train_loader)} training batches\")\n", + "print(f\"{len(val_loader)} validation batches\")\n", + "print(f\"{len(test_loader)} test batches\")" + ] + }, + { + "cell_type": "markdown", + "id": "dec9aa4a-ffd2-4d9f-a835-cce1059fe604", + "metadata": {}, + "source": [ + "## E.3 Initializing the model" + ] + }, + { + "cell_type": "markdown", + "id": "f36ebdaf-810e-46a2-9ad9-e017a04051b1", + "metadata": {}, + "source": [ + "- This section repeats the code from chapter 6 to load and prepare the model" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "02b3a506-3879-4258-82b5-93a5b6bafa74", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File already exists and is up-to-date: gpt2/124M/checkpoint\n", + "File already exists and is up-to-date: gpt2/124M/encoder.json\n", + "File already exists and is up-to-date: gpt2/124M/hparams.json\n", + "File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001\n", + "File already exists and is up-to-date: gpt2/124M/model.ckpt.index\n", + "File already exists and is up-to-date: gpt2/124M/model.ckpt.meta\n", + "File already exists and is up-to-date: gpt2/124M/vocab.bpe\n" + ] + } + ], + "source": [ + "from gpt_download import download_and_load_gpt2\n", + "from previous_chapters import GPTModel, load_weights_into_gpt\n", + "\n", + "\n", + "CHOOSE_MODEL = \"gpt2-small (124M)\"\n", + "INPUT_PROMPT = \"Every effort moves\"\n", + "\n", + "BASE_CONFIG = {\n", + " \"vocab_size\": 50257, # Vocabulary size\n", + " \"context_length\": 1024, # 
Context length\n", + " \"drop_rate\": 0.0, # Dropout rate\n", + " \"qkv_bias\": True # Query-key-value bias\n", + "}\n", + "\n", + "model_configs = {\n", + " \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n", + " \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", + " \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", + " \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", + "}\n", + "\n", + "BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n", + "\n", + "model_size = CHOOSE_MODEL.split(\" \")[-1].lstrip(\"(\").rstrip(\")\")\n", + "settings, params = download_and_load_gpt2(model_size=model_size, models_dir=\"gpt2\")\n", + "\n", + "model = GPTModel(BASE_CONFIG)\n", + "load_weights_into_gpt(model, params)\n", + "model.eval();" + ] + }, + { + "cell_type": "markdown", + "id": "252614cd-7ce6-4908-83e6-3761f519904e", + "metadata": {}, + "source": [ + "- To ensure that the model was loaded correctly, let's double-check that it generates coherent text" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "8b6ce20c-0700-4783-8be0-4cf17c200a7f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Every effort moves you forward.\n", + "\n", + "The first step is to understand the importance of your work\n" + ] + } + ], + "source": [ + "from previous_chapters import (\n", + " generate_text_simple,\n", + " text_to_token_ids,\n", + " token_ids_to_text\n", + ")\n", + "\n", + "\n", + "text_1 = \"Every effort moves you\"\n", + "\n", + "token_ids = generate_text_simple(\n", + " model=model,\n", + " idx=text_to_token_ids(text_1, tokenizer),\n", + " max_new_tokens=15,\n", + " context_size=BASE_CONFIG[\"context_length\"]\n", + ")\n", + "\n", + "print(token_ids_to_text(token_ids, tokenizer))" + ] + }, + { + "cell_type": "markdown", + "id": "8174b31b-1ab5-4115-b01c-245369da5af3", + "metadata": {}, + 
"source": [ + "- Then, we prepare the model for classification finetuning similar to chapter 6, where we replace the output layer" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e255ce91-d73a-4854-90a4-95804928eb16", + "metadata": {}, + "outputs": [], + "source": [ + "torch.manual_seed(123)\n", + "\n", + "num_classes = 2\n", + "model.out_head = torch.nn.Linear(in_features=768, out_features=num_classes)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "02e6f057-1383-4ece-8444-0a88e71ac75d", + "metadata": {}, + "outputs": [], + "source": [ + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "model.to(device); # no assignment model = model.to(device) necessary for nn.Module classes" + ] + }, + { + "cell_type": "markdown", + "id": "8e951cd6-5e42-44d2-b21f-895cb61004fe", + "metadata": {}, + "source": [ + "- Lastly, let's calcuate the initial classification accuracy of the non-finetuning model (we expect this to be around 50%, which means that the model is not able to reliably distinguish between spam and non-spam messages, yet)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "fc7dd72c-73a2-4881-ade0-0a9605f1ab8c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training accuracy: 46.25%\n", + "Validation accuracy: 45.00%\n", + "Test accuracy: 48.75%\n" + ] + } + ], + "source": [ + "from previous_chapters import calc_accuracy_loader\n", + "\n", + "\n", + "torch.manual_seed(123)\n", + "train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)\n", + "val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)\n", + "test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)\n", + "\n", + "print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n", + "print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n", + "print(f\"Test accuracy: 
{test_accuracy*100:.2f}%\")" + ] + }, + { + "cell_type": "markdown", + "id": "398a1ec9-e2a1-43d6-bf9f-12ee54b46a7b", + "metadata": { + "id": "398a1ec9-e2a1-43d6-bf9f-12ee54b46a7b" + }, + "source": [ + "## E.4 Parameter-efficient finetuning with LoRA" + ] + }, + { + "cell_type": "markdown", + "id": "652a4a82-61ef-4d0a-9858-8988e844f12c", + "metadata": {}, + "source": [ + "- We begin by initializing a `LoRALayer` that creates the matrices $A$ and $B$, along with the `alpha` scaling hyperparameter and the `rank` ($r$) hyperparameter\n", + "- This layer can accept an input and compute the corresponding output, as illustrated in the figure below\n", + "\n", + "\n", + "\n", + "In code, the LoRA layer depicted in the figure above looks as follows" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "2ds9ywjMwvIW", + "metadata": { + "id": "2ds9ywjMwvIW" + }, + "outputs": [], + "source": [ + "class LoRALayer(torch.nn.Module):\n", + " def __init__(self, in_dim, out_dim, rank, alpha):\n", + " super().__init__()\n", + " std_dev = 1 / torch.sqrt(torch.tensor(rank).float())\n", + " self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)\n", + " self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))\n", + " self.alpha = alpha\n", + "\n", + " def forward(self, x):\n", + " x = self.alpha * (x @ self.A @ self.B)\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "id": "ad21faa8-0614-4257-93cd-68952193e14a", + "metadata": {}, + "source": [ + "- In the code above, `rank` is a hyperparameter that controls the inner dimension of the matrices $A$ and $B$\n", + "- In other words, this parameter controls the number of additional parameters introduced by LoRA and is a key factor in determining the balance between model adaptability and parameter efficiency\n", + "- The second hyperparameter, `alpha`, is a scaling hyperparameter applied to the output of the low-rank adaptation\n", + "- It essentially controls the extent to which the adapted layer's 
output is allowed to influence the original output of the layer being adapted\n", + "- This can be seen as a way to regulate the impact of the low-rank adaptation on the layer's output\n", + "- So far, the `LoRALayer` class we implemented above allows us to transform the layer inputs $x$\n", + "- However, in LoRA, we are usually interested in replacing existing `Linear` layers so that the weight update is applied to the existing pretrained weights, as shown in the figure below\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "3e6d5da0-dfce-4808-b89b-29ff333f563f", + "metadata": {}, + "source": [ + "- To incorporate the original `Linear` layer weights as shown in the figure above, we implement a `LinearWithLoRA` layer below that uses the previously implemented `LoRALayer` and can be used to replace existing `Linear` layers in a neural network, for example, in the self-attention or feed-forward modules of an LLM" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "127d3a64-8359-4b21-b056-78d58cc75fe8", + "metadata": {}, + "outputs": [], + "source": [ + "class LinearWithLoRA(torch.nn.Module):\n", + " def __init__(self, linear, rank, alpha):\n", + " super().__init__()\n", + " self.linear = linear\n", + " self.lora = LoRALayer(\n", + " linear.in_features, linear.out_features, rank, alpha\n", + " )\n", + "\n", + " def forward(self, x):\n", + " return self.linear(x) + self.lora(x)" + ] + }, + { + "cell_type": "markdown", + "id": "e1145a90-35ff-462c-820b-15483fa5b051", + "metadata": {}, + "source": [ + "- Note that since we initialize the weight matrix $B$ (`self.B` in `LoRALayer`) with zero values in the LoRA layer, the matrix multiplication between $A$ and $B$ results in a matrix consisting of zeros and doesn't affect the original weights (since adding 0 to the original weights does not modify them)" + ] + }, + { + "cell_type": "markdown", + "id": "e98a6d36-7bc9-434c-a7f1-533f26aff06d", + "metadata": { + "id": "4D21Jk7Vw3nG" + }, + 
"source": [ + "- To try LoRA on the GPT model we defined earlier, we define a `replace_linear_with_lora` function to replace all `Linear` layers in the model with the new `LinearWithLoRA` layers" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "WlQZ8ygqzN_g", + "metadata": { + "id": "WlQZ8ygqzN_g" + }, + "outputs": [], + "source": [ + "def replace_linear_with_lora(model, rank, alpha):\n", + " for name, module in model.named_children():\n", + " if isinstance(module, torch.nn.Linear):\n", + " # Replace the Linear layer with LinearWithLoRA\n", + " setattr(model, name, LinearWithLoRA(module, rank, alpha))\n", + " else:\n", + " # Recursively apply the same function to child modules\n", + " replace_linear_with_lora(module, rank, alpha)" + ] + }, + { + "cell_type": "markdown", + "id": "8c172164-cdde-4489-b7d7-aaed9cc2f5f2", + "metadata": {}, + "source": [ + "- We then freeze the original model parameter and use the `replace_linear_with_lora` to replace the said `Linear` layers below" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "dbe15350-4da9-4829-9d23-98bbd3d0b1a1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total trainable parameters before: 124,441,346\n", + "Total trainable parameters after: 0\n" + ] + } + ], + "source": [ + "total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", + "print(f\"Total trainable parameters before: {total_params:,}\")\n", + "\n", + "for param in model.parameters():\n", + " param.requires_grad = False\n", + "\n", + "total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", + "print(f\"Total trainable parameters after: {total_params:,}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "mLk_fPq0yz_u", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mLk_fPq0yz_u", + "outputId": "7ba89607-ca75-4718-e8dc-9cdc44c3e410" + }, + "outputs": [ + { 
+ "name": "stdout", + "output_type": "stream", + "text": [ + "Total trainable LoRA parameters: 1,333,264\n" + ] + } + ], + "source": [ + "replace_linear_with_lora(model, rank=8, alpha=8)\n", + "\n", + "total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", + "print(f\"Total trainable LoRA parameters: {total_params:,}\")" + ] + }, + { + "cell_type": "markdown", + "id": "b8b6819e-ef7a-4f0d-841a-1b467496bef9", + "metadata": {}, + "source": [ + "- As we can see, we reduced the number of trainable parameters by almost 100x when using LoRA\n", + "- Let's now double-check whether the layers have been modified as intended by printing the model architecture" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "1711be61-bb2c-466f-9b5b-24f4aa5ccd9c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GPTModel(\n", + " (tok_emb): Embedding(50257, 768)\n", + " (pos_emb): Embedding(1024, 768)\n", + " (drop_emb): Dropout(p=0.0, inplace=False)\n", + " (trf_blocks): Sequential(\n", + " (0): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): 
LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (1): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (2): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + 
" (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (3): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (4): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, 
out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (5): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): 
LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (6): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (7): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + 
" (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (8): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (9): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, 
out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (10): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): 
LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (11): TransformerBlock(\n", + " (att): MultiHeadAttention(\n", + " (W_query): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_key): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (W_value): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (out_proj): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (dropout): Dropout(p=0.0, inplace=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (layers): Sequential(\n", + " (0): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=3072, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (1): GELU()\n", + " (2): LinearWithLoRA(\n", + " (linear): Linear(in_features=3072, out_features=768, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm()\n", + " (norm2): LayerNorm()\n", + " (drop_resid): Dropout(p=0.0, inplace=False)\n", + " )\n", + " )\n", + " (final_norm): LayerNorm()\n", + " (out_head): LinearWithLoRA(\n", + " (linear): Linear(in_features=768, out_features=2, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + ")\n" + ] + } + ], + "source": [ + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "model.to(device)\n", + "\n", + "print(model)" + ] + }, + { + "cell_type": "markdown", + "id": "c4bbc9d7-65ec-4675-bab8-2e56eb0cfb55", + "metadata": {}, + "source": [ + "- Based on the model architecture above, we can see that 
the model now contains our new `LinearWithLoRA` layers\n", + "- Also, since we initialized matrix $B$ with zeros, we expect the initial model performance to be unchanged compared to before" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "DAlrb_I00VEU", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "DAlrb_I00VEU", + "outputId": "3dae5ff0-316d-408e-c8dc-2b8c60f9b994" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training accuracy: 46.25%\n", + "Validation accuracy: 45.00%\n", + "Test accuracy: 48.75%\n" + ] + } + ], + "source": [ + "torch.manual_seed(123)\n", + "train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)\n", + "val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)\n", + "test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)\n", + "\n", + "print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n", + "print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n", + "print(f\"Test accuracy: {test_accuracy*100:.2f}%\")" + ] + }, + { + "cell_type": "markdown", + "id": "13735b3e-f0c3-4dba-ae3d-4141b2878101", + "metadata": {}, + "source": [ + "- Let's now get to the interesting part and finetune the model by reusing the training function from chapter 6\n", + "- The training takes about 15 minutes on an M3 MacBook Air laptop computer and less than half a minute on a V100 or A100 GPU" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "wCParRvr0eff", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wCParRvr0eff", + "outputId": "b86fd5f4-1527-4549-e0b0-9dff37836f0a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Ep 1 (Step 000000): Train loss 2.849, Val loss 2.565\n", + "Ep 1 (Step 000050): Train loss 0.515, Val loss 0.465\n", + "Ep 1 (Step 000100): Train loss 0.191, Val loss 0.423\n", + "Training accuracy: 
97.50% | Validation accuracy: 97.50%\n", + "Ep 2 (Step 000150): Train loss 0.170, Val loss 0.072\n", + "Ep 2 (Step 000200): Train loss 0.014, Val loss 0.087\n", + "Ep 2 (Step 000250): Train loss 0.027, Val loss 0.197\n", + "Training accuracy: 100.00% | Validation accuracy: 92.50%\n", + "Ep 3 (Step 000300): Train loss 0.014, Val loss 0.321\n", + "Ep 3 (Step 000350): Train loss 0.015, Val loss 0.146\n", + "Training accuracy: 100.00% | Validation accuracy: 97.50%\n", + "Ep 4 (Step 000400): Train loss 0.008, Val loss 0.103\n", + "Ep 4 (Step 000450): Train loss 0.010, Val loss 0.178\n", + "Ep 4 (Step 000500): Train loss 0.097, Val loss 0.056\n", + "Training accuracy: 100.00% | Validation accuracy: 97.50%\n", + "Ep 5 (Step 000550): Train loss 0.032, Val loss 0.091\n", + "Ep 5 (Step 000600): Train loss 0.002, Val loss 0.058\n", + "Training accuracy: 100.00% | Validation accuracy: 100.00%\n", + "Ep 6 (Step 000650): Train loss 0.001, Val loss 0.009\n", + "Ep 6 (Step 000700): Train loss 0.001, Val loss 0.039\n", + "Ep 6 (Step 000750): Train loss 0.000, Val loss 0.038\n", + "Training accuracy: 100.00% | Validation accuracy: 95.00%\n", + "Training completed in 13.70 minutes.\n" + ] + } + ], + "source": [ + "import time\n", + "from previous_chapters import train_classifier_simple\n", + "\n", + "\n", + "start_time = time.time()\n", + "\n", + "torch.manual_seed(123)\n", + "\n", + "optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)\n", + "\n", + "num_epochs = 6\n", + "train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(\n", + " model, train_loader, val_loader, optimizer, device,\n", + " num_epochs=num_epochs, eval_freq=50, eval_iter=5,\n", + " tokenizer=tokenizer\n", + ")\n", + "\n", + "end_time = time.time()\n", + "execution_time_minutes = (end_time - start_time) / 60\n", + "print(f\"Training completed in {execution_time_minutes:.2f} minutes.\")" + ] + }, + { + "cell_type": "markdown", + "id": 
"d0c89e82-3aa8-44c6-b046-0b16200b8e6c", + "metadata": {}, + "source": [ + "- Finally, let's evaluate the model" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "bawWGijA0iF3", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 307 + }, + "id": "bawWGijA0iF3", + "outputId": "4b05b245-ffac-4d36-881b-8306a4da6b75" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from previous_chapters import plot_values\n", + "\n", + "epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))\n", + "examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))\n", + "\n", + "plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses, label=\"loss\")" + ] + }, + { + "cell_type": "markdown", + "id": "aa074723-e3f7-4f7e-a267-855531a037dc", + "metadata": {}, + "source": [ + "- Note that we previously calculated the accuracy values on 10 batches only; below we calculate the accuracies on the full dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "1D2awlEq0gZi", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1D2awlEq0gZi", + "outputId": "b482af19-5ebd-45b9-a9f0-99f621203ef9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training accuracy: 100.00%\n", + "Validation accuracy: 96.64%\n", + "Test accuracy: 98.00%\n" + ] + } + ], + "source": [ + "from previous_chapters import calc_accuracy_loader\n", + "\n", + "train_accuracy = calc_accuracy_loader(train_loader, model, device)\n", + "val_accuracy = calc_accuracy_loader(val_loader, model, device)\n", + "test_accuracy = calc_accuracy_loader(test_loader, model, device)\n", + "\n", + "print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n", + "print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n", + "print(f\"Test accuracy: {test_accuracy*100:.2f}%\")" + ] + }, + { + "cell_type": "markdown", + "id": "1f87f5e6-339e-4fcf-900b-6d845d3c713d", + "metadata": {}, + "source": [ + "- As we can based on the relatively high accuracy values above, the LoRA finetuning was successful" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "V100", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": 
"python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/appendix-E/01_main-chapter-code/gpt_download.py b/appendix-E/01_main-chapter-code/gpt_download.py new file mode 100644 index 0000000..0d695d2 --- /dev/null +++ b/appendix-E/01_main-chapter-code/gpt_download.py @@ -0,0 +1,99 @@ +# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). +# Source for "Build a Large Language Model From Scratch" +# - https://www.manning.com/books/build-a-large-language-model-from-scratch +# Code: https://github.com/rasbt/LLMs-from-scratch + + +import os +import requests +import json +import numpy as np +import tensorflow as tf +from tqdm import tqdm + + +def download_and_load_gpt2(model_size, models_dir): + # Validate model size + allowed_sizes = ("124M", "355M", "774M", "1558M") + if model_size not in allowed_sizes: + raise ValueError(f"Model size not in {allowed_sizes}") + + # Define paths + model_dir = os.path.join(models_dir, model_size) + base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" + filenames = [ + "checkpoint", "encoder.json", "hparams.json", + "model.ckpt.data-00000-of-00001", "model.ckpt.index", + "model.ckpt.meta", "vocab.bpe" + ] + + # Download files + os.makedirs(model_dir, exist_ok=True) + for filename in filenames: + file_url = os.path.join(base_url, model_size, filename) + file_path = os.path.join(model_dir, filename) + download_file(file_url, file_path) + + # Load settings and params + tf_ckpt_path = tf.train.latest_checkpoint(model_dir) + settings = json.load(open(os.path.join(model_dir, "hparams.json"))) + params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings) + + return settings, params + + +def download_file(url, destination): + # Send a 
GET request to download the file in streaming mode + response = requests.get(url, stream=True) + + # Get the total file size from headers, defaulting to 0 if not present + file_size = int(response.headers.get("content-length", 0)) + + # Check if file exists and has the same size + if os.path.exists(destination): + file_size_local = os.path.getsize(destination) + if file_size == file_size_local: + print(f"File already exists and is up-to-date: {destination}") + return + + # Define the block size for reading the file + block_size = 1024 # 1 Kilobyte + + # Initialize the progress bar with total file size + progress_bar_description = url.split("/")[-1] # Extract filename from URL + with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: + # Open the destination file in binary write mode + with open(destination, "wb") as file: + # Iterate over the file data in chunks + for chunk in response.iter_content(block_size): + progress_bar.update(len(chunk)) # Update progress bar + file.write(chunk) # Write the chunk to the file + + +def load_gpt2_params_from_tf_ckpt(ckpt_path, settings): + # Initialize parameters dictionary with empty blocks for each layer + params = {"blocks": [{} for _ in range(settings["n_layer"])]} + + # Iterate over each variable in the checkpoint + for name, _ in tf.train.list_variables(ckpt_path): + # Load the variable and remove singleton dimensions + variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name)) + + # Process the variable name to extract relevant parts + variable_name_parts = name.split("/")[1:] # Skip the 'model/' prefix + + # Identify the target dictionary for the variable + target_dict = params + if variable_name_parts[0].startswith("h"): + layer_number = int(variable_name_parts[0][1:]) + target_dict = params["blocks"][layer_number] + + # Recursively access or create nested dictionaries + for key in variable_name_parts[1:-1]: + target_dict = target_dict.setdefault(key, {}) + + # 
Assign the variable array to the last key + last_key = variable_name_parts[-1] + target_dict[last_key] = variable_array + + return params diff --git a/appendix-E/01_main-chapter-code/previous_chapters.py b/appendix-E/01_main-chapter-code/previous_chapters.py new file mode 100644 index 0000000..b6fca51 --- /dev/null +++ b/appendix-E/01_main-chapter-code/previous_chapters.py @@ -0,0 +1,542 @@ +# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). +# Source for "Build a Large Language Model From Scratch" +# - https://www.manning.com/books/build-a-large-language-model-from-scratch +# Code: https://github.com/rasbt/LLMs-from-scratch +# +# This file collects all the relevant code that we covered thus far +# throughout Chapters 2-6. +# This file can be run as a standalone script. + +import os +from pathlib import Path +import urllib.request +import zipfile + +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import tiktoken +import torch +import torch.nn as nn +from torch.utils.data import Dataset, DataLoader + + +##################################### +# Chapter 2 +##################################### + + +class GPTDatasetV1(Dataset): + def __init__(self, txt, tokenizer, max_length, stride): + self.tokenizer = tokenizer + self.input_ids = [] + self.target_ids = [] + + # Tokenize the entire text + token_ids = tokenizer.encode(txt) + + # Use a sliding window to chunk the book into overlapping sequences of max_length + for i in range(0, len(token_ids) - max_length, stride): + input_chunk = token_ids[i:i + max_length] + target_chunk = token_ids[i + 1: i + max_length + 1] + self.input_ids.append(torch.tensor(input_chunk)) + self.target_ids.append(torch.tensor(target_chunk)) + + def __len__(self): + return len(self.input_ids) + + def __getitem__(self, idx): + return self.input_ids[idx], self.target_ids[idx] + + +def create_dataloader_v1(txt, batch_size=4, max_length=256, + stride=128, shuffle=True, drop_last=True): + # Initialize the 
tokenizer + tokenizer = tiktoken.get_encoding("gpt2") + + # Create dataset + dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) + + # Create dataloader + dataloader = DataLoader( + dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last) + + return dataloader + + +##################################### +# Chapter 3 +##################################### +class MultiHeadAttention(nn.Module): + def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False): + super().__init__() + assert d_out % num_heads == 0, "d_out must be divisible by num_heads" + + self.d_out = d_out + self.num_heads = num_heads + self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim + + self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) + self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) + self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) + self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs + self.dropout = nn.Dropout(dropout) + self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) + + def forward(self, x): + b, num_tokens, d_in = x.shape + + keys = self.W_key(x) # Shape: (b, num_tokens, d_out) + queries = self.W_query(x) + values = self.W_value(x) + + # We implicitly split the matrix by adding a `num_heads` dimension + # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) + keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) + values = values.view(b, num_tokens, self.num_heads, self.head_dim) + queries = queries.view(b, num_tokens, self.num_heads, self.head_dim) + + # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim) + keys = keys.transpose(1, 2) + queries = queries.transpose(1, 2) + values = values.transpose(1, 2) + + # Compute scaled dot-product attention (aka self-attention) with a causal mask + attn_scores = queries @ keys.transpose(2, 3) # Dot product for each 
head + + # Original mask truncated to the number of tokens and converted to boolean + mask_bool = self.mask.bool()[:num_tokens, :num_tokens] + + # Use the mask to fill attention scores + attn_scores.masked_fill_(mask_bool, -torch.inf) + + attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) + attn_weights = self.dropout(attn_weights) + + # Shape: (b, num_tokens, num_heads, head_dim) + context_vec = (attn_weights @ values).transpose(1, 2) + + # Combine heads, where self.d_out = self.num_heads * self.head_dim + context_vec = context_vec.reshape(b, num_tokens, self.d_out) + context_vec = self.out_proj(context_vec) # optional projection + + return context_vec + + +##################################### +# Chapter 4 +##################################### +class LayerNorm(nn.Module): + def __init__(self, emb_dim): + super().__init__() + self.eps = 1e-5 + self.scale = nn.Parameter(torch.ones(emb_dim)) + self.shift = nn.Parameter(torch.zeros(emb_dim)) + + def forward(self, x): + mean = x.mean(dim=-1, keepdim=True) + var = x.var(dim=-1, keepdim=True, unbiased=False) + norm_x = (x - mean) / torch.sqrt(var + self.eps) + return self.scale * norm_x + self.shift + + +class GELU(nn.Module): + def __init__(self): + super().__init__() + + def forward(self, x): + return 0.5 * x * (1 + torch.tanh( + torch.sqrt(torch.tensor(2.0 / torch.pi)) * + (x + 0.044715 * torch.pow(x, 3)) + )) + + +class FeedForward(nn.Module): + def __init__(self, cfg): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]), + GELU(), + nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]), + ) + + def forward(self, x): + return self.layers(x) + + +class TransformerBlock(nn.Module): + def __init__(self, cfg): + super().__init__() + self.att = MultiHeadAttention( + d_in=cfg["emb_dim"], + d_out=cfg["emb_dim"], + context_length=cfg["context_length"], + num_heads=cfg["n_heads"], + dropout=cfg["drop_rate"], + qkv_bias=cfg["qkv_bias"]) + self.ff = FeedForward(cfg) 
+ self.norm1 = LayerNorm(cfg["emb_dim"]) + self.norm2 = LayerNorm(cfg["emb_dim"]) + self.drop_resid = nn.Dropout(cfg["drop_rate"]) + + def forward(self, x): + # Shortcut connection for attention block + shortcut = x + x = self.norm1(x) + x = self.att(x) # Shape [batch_size, num_tokens, emb_size] + x = self.drop_resid(x) + x = x + shortcut # Add the original input back + + # Shortcut connection for feed-forward block + shortcut = x + x = self.norm2(x) + x = self.ff(x) + x = self.drop_resid(x) + x = x + shortcut # Add the original input back + + return x + + +class GPTModel(nn.Module): + def __init__(self, cfg): + super().__init__() + self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"]) + self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"]) + self.drop_emb = nn.Dropout(cfg["drop_rate"]) + + self.trf_blocks = nn.Sequential( + *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]) + + self.final_norm = LayerNorm(cfg["emb_dim"]) + self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False) + + def forward(self, in_idx): + batch_size, seq_len = in_idx.shape + tok_embeds = self.tok_emb(in_idx) + pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) + x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size] + x = self.drop_emb(x) + x = self.trf_blocks(x) + x = self.final_norm(x) + logits = self.out_head(x) + return logits + + +def generate_text_simple(model, idx, max_new_tokens, context_size): + # idx is (B, T) array of indices in the current context + for _ in range(max_new_tokens): + + # Crop current context if it exceeds the supported context size + # E.g., if LLM supports only 5 tokens, and the context size is 10 + # then only the last 5 tokens are used as context + idx_cond = idx[:, -context_size:] + + # Get the predictions + with torch.no_grad(): + logits = model(idx_cond) + + # Focus only on the last time step + # (batch, n_token, vocab_size) becomes (batch, vocab_size) + logits = logits[:, -1, 
:] + + # Get the idx of the vocab entry with the highest logits value + idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch, 1) + + # Append sampled index to the running sequence + idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1) + + return idx + + +##################################### +# Chapter 5 +##################################### +def assign(left, right): + if left.shape != right.shape: + raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") + return torch.nn.Parameter(torch.tensor(right)) + + +def load_weights_into_gpt(gpt, params): + gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) + gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) + + for b in range(len(params["blocks"])): + q_w, k_w, v_w = np.split( + (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) + gpt.trf_blocks[b].att.W_query.weight = assign( + gpt.trf_blocks[b].att.W_query.weight, q_w.T) + gpt.trf_blocks[b].att.W_key.weight = assign( + gpt.trf_blocks[b].att.W_key.weight, k_w.T) + gpt.trf_blocks[b].att.W_value.weight = assign( + gpt.trf_blocks[b].att.W_value.weight, v_w.T) + + q_b, k_b, v_b = np.split( + (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) + gpt.trf_blocks[b].att.W_query.bias = assign( + gpt.trf_blocks[b].att.W_query.bias, q_b) + gpt.trf_blocks[b].att.W_key.bias = assign( + gpt.trf_blocks[b].att.W_key.bias, k_b) + gpt.trf_blocks[b].att.W_value.bias = assign( + gpt.trf_blocks[b].att.W_value.bias, v_b) + + gpt.trf_blocks[b].att.out_proj.weight = assign( + gpt.trf_blocks[b].att.out_proj.weight, + params["blocks"][b]["attn"]["c_proj"]["w"].T) + gpt.trf_blocks[b].att.out_proj.bias = assign( + gpt.trf_blocks[b].att.out_proj.bias, + params["blocks"][b]["attn"]["c_proj"]["b"]) + + gpt.trf_blocks[b].ff.layers[0].weight = assign( + gpt.trf_blocks[b].ff.layers[0].weight, + params["blocks"][b]["mlp"]["c_fc"]["w"].T) + gpt.trf_blocks[b].ff.layers[0].bias = assign( + gpt.trf_blocks[b].ff.layers[0].bias, + 
params["blocks"][b]["mlp"]["c_fc"]["b"]) + gpt.trf_blocks[b].ff.layers[2].weight = assign( + gpt.trf_blocks[b].ff.layers[2].weight, + params["blocks"][b]["mlp"]["c_proj"]["w"].T) + gpt.trf_blocks[b].ff.layers[2].bias = assign( + gpt.trf_blocks[b].ff.layers[2].bias, + params["blocks"][b]["mlp"]["c_proj"]["b"]) + + gpt.trf_blocks[b].norm1.scale = assign( + gpt.trf_blocks[b].norm1.scale, + params["blocks"][b]["ln_1"]["g"]) + gpt.trf_blocks[b].norm1.shift = assign( + gpt.trf_blocks[b].norm1.shift, + params["blocks"][b]["ln_1"]["b"]) + gpt.trf_blocks[b].norm2.scale = assign( + gpt.trf_blocks[b].norm2.scale, + params["blocks"][b]["ln_2"]["g"]) + gpt.trf_blocks[b].norm2.shift = assign( + gpt.trf_blocks[b].norm2.shift, + params["blocks"][b]["ln_2"]["b"]) + + gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) + gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) + gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) + + +def text_to_token_ids(text, tokenizer): + encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'}) + encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension + return encoded_tensor + + +def token_ids_to_text(token_ids, tokenizer): + flat = token_ids.squeeze(0) # remove batch dimension + return tokenizer.decode(flat.tolist()) + + +def calc_loss_loader(data_loader, model, device, num_batches=None): + total_loss = 0. 
+ if len(data_loader) == 0: + return float("nan") + elif num_batches is None: + num_batches = len(data_loader) + else: + # Reduce the number of batches to match the total number of batches in the data loader + # if num_batches exceeds the number of batches in the data loader + num_batches = min(num_batches, len(data_loader)) + for i, (input_batch, target_batch) in enumerate(data_loader): + if i < num_batches: + loss = calc_loss_batch(input_batch, target_batch, model, device) + total_loss += loss.item() + else: + break + return total_loss / num_batches + + +def evaluate_model(model, train_loader, val_loader, device, eval_iter): + model.eval() + with torch.no_grad(): + train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter) + val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter) + model.train() + return train_loss, val_loss + + +##################################### +# Chapter 6 +##################################### + + +def download_and_unzip(url, zip_path, extracted_path, data_file_path): + if data_file_path.exists(): + print(f"{data_file_path} already exists. 
Skipping download and extraction.") + return + + # Downloading the file + with urllib.request.urlopen(url) as response: + with open(zip_path, "wb") as out_file: + out_file.write(response.read()) + + # Unzipping the file + with zipfile.ZipFile(zip_path, "r") as zip_ref: + zip_ref.extractall(extracted_path) + + # Add .tsv file extension + original_file_path = Path(extracted_path) / "SMSSpamCollection" + os.rename(original_file_path, data_file_path) + print(f"File downloaded and saved as {data_file_path}") + + +def create_balanced_dataset(df): + + # Count the instances of "spam" + num_spam = df[df["Label"] == "spam"].shape[0] + + # Randomly sample "ham" instances to match the number of "spam" instances + ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123) + + # Combine the "ham" subset with "spam" + balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]]) + + return balanced_df + + +def random_split(df, train_frac, validation_frac): + # Shuffle the entire DataFrame + df = df.sample(frac=1, random_state=123).reset_index(drop=True) + + # Calculate split indices + train_end = int(len(df) * train_frac) + validation_end = train_end + int(len(df) * validation_frac) + + # Split the DataFrame + train_df = df[:train_end] + validation_df = df[train_end:validation_end] + test_df = df[validation_end:] + + return train_df, validation_df, test_df + + +class SpamDataset(Dataset): + def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): + self.data = pd.read_csv(csv_file) + + # Pre-tokenize texts + self.encoded_texts = [ + tokenizer.encode(text) for text in self.data["Text"] + ] + + if max_length is None: + self.max_length = self._longest_encoded_length() + else: + self.max_length = max_length + # Truncate sequences if they are longer than max_length + self.encoded_texts = [ + encoded_text[:self.max_length] + for encoded_text in self.encoded_texts + ] + + # Pad sequences to the longest sequence + self.encoded_texts = [ + encoded_text + 
[pad_token_id] * (self.max_length - len(encoded_text)) + for encoded_text in self.encoded_texts + ] + + def __getitem__(self, index): + encoded = self.encoded_texts[index] + label = self.data.iloc[index]["Label"] + return torch.tensor(encoded, dtype=torch.long), torch.tensor(label, dtype=torch.long) + + def __len__(self): + return len(self.data) + + def _longest_encoded_length(self): + max_length = 0 + for encoded_text in self.encoded_texts: + encoded_length = len(encoded_text) + if encoded_length > max_length: + max_length = encoded_length + return max_length + + +@torch.no_grad() # Disable gradient tracking for efficiency +def calc_accuracy_loader(data_loader, model, device, num_batches=None): + model.eval() + correct_predictions, num_examples = 0, 0 + + if num_batches is None: + num_batches = len(data_loader) + else: + num_batches = min(num_batches, len(data_loader)) + for i, (input_batch, target_batch) in enumerate(data_loader): + if i < num_batches: + input_batch, target_batch = input_batch.to(device), target_batch.to(device) + logits = model(input_batch)[:, -1, :] # Logits of last output token + predicted_labels = torch.argmax(logits, dim=-1) + + num_examples += predicted_labels.shape[0] + correct_predictions += (predicted_labels == target_batch).sum().item() + else: + break + return correct_predictions / num_examples + + +def calc_loss_batch(input_batch, target_batch, model, device): + input_batch, target_batch = input_batch.to(device), target_batch.to(device) + logits = model(input_batch)[:, -1, :] # Logits of last output token + loss = torch.nn.functional.cross_entropy(logits, target_batch) + return loss + + +# Overall the same as `train_model_simple` in chapter 5 +def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs, + eval_freq, eval_iter, tokenizer): + # Initialize lists to track losses and examples seen + train_losses, val_losses, train_accs, val_accs = [], [], [], [] + examples_seen, global_step = 0, -1 + + # Main 
training loop + for epoch in range(num_epochs): + model.train() # Set model to training mode + + for input_batch, target_batch in train_loader: + optimizer.zero_grad() # Reset loss gradients from previous batch iteration + loss = calc_loss_batch(input_batch, target_batch, model, device) + loss.backward() # Calculate loss gradients + optimizer.step() # Update model weights using loss gradients + examples_seen += input_batch.shape[0] # New: track examples instead of tokens + global_step += 1 + + # Optional evaluation step + if global_step % eval_freq == 0: + train_loss, val_loss = evaluate_model( + model, train_loader, val_loader, device, eval_iter) + train_losses.append(train_loss) + val_losses.append(val_loss) + print(f"Ep {epoch+1} (Step {global_step:06d}): " + f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}") + + # Calculate accuracy after each epoch + train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=eval_iter) + val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter) + print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="") + print(f"Validation accuracy: {val_accuracy*100:.2f}%") + train_accs.append(train_accuracy) + val_accs.append(val_accuracy) + + return train_losses, val_losses, train_accs, val_accs, examples_seen + + +def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"): + fig, ax1 = plt.subplots(figsize=(5, 3)) + + # Plot training and validation values against epochs + ax1.plot(epochs_seen, train_values, label=f"Training {label}") + ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}") + ax1.set_xlabel("Epochs") + ax1.set_ylabel(label.capitalize()) + ax1.legend() + + # Create a second x-axis for examples seen + ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis + ax2.plot(examples_seen, train_values, alpha=0) # Invisible plot for aligning ticks + ax2.set_xlabel("Examples seen") + + fig.tight_layout() # Adjust layout to
make room + plt.savefig(f"{label}-plot.pdf") + plt.show() diff --git a/appendix-E/README.md b/appendix-E/README.md new file mode 100644 index 0000000..a07d712 --- /dev/null +++ b/appendix-E/README.md @@ -0,0 +1,3 @@ +# Appendix E: Parameter-efficient Finetuning with LoRA + +- [01_main-chapter-code](01_main-chapter-code) contains the main chapter code. \ No newline at end of file
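
The truncate-then-pad step in `SpamDataset.__init__` above can be tried in isolation, without a tokenizer or CSV file. The sketch below is only an illustration under stated assumptions: `pad_or_truncate` is a hypothetical helper that is not part of this diff, and the token IDs are made up. It mirrors the logic in the class: fall back to the longest sequence when `max_length` is `None`, cut longer sequences, then right-pad with `pad_token_id` (50256, the default used by `SpamDataset`).

```python
def pad_or_truncate(encoded_texts, max_length=None, pad_token_id=50256):
    """Hypothetical stand-alone version of the padding logic in SpamDataset."""
    # With no explicit max_length, use the longest sequence (no truncation occurs)
    if max_length is None:
        max_length = max(len(tokens) for tokens in encoded_texts)

    # Truncate sequences that are longer than max_length
    truncated = [tokens[:max_length] for tokens in encoded_texts]

    # Right-pad every sequence so all rows have exactly max_length entries
    return [
        tokens + [pad_token_id] * (max_length - len(tokens))
        for tokens in truncated
    ]


# Made-up token IDs, used only to show the resulting shapes
toy_batch = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]

fixed = pad_or_truncate(toy_batch, max_length=4)
auto = pad_or_truncate(toy_batch)

print(fixed)
print(auto)
```

Because every row ends up with the same length, the rows can be stacked into a single tensor per batch, which is what lets `__getitem__` return fixed-size `torch.long` tensors.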