
{
"cells": [
{
"cell_type": "markdown",
"id": "c024bfa4-1a7a-4751-b5a1-827225a3478b",
"metadata": {
"id": "c024bfa4-1a7a-4751-b5a1-827225a3478b"
},
"source": [
"<font size=\"1\">\n",
"Supplementary code for \"Build a Large Language Model From Scratch\": <a href=\"https://www.manning.com/books/build-a-large-language-model-from-scratch\">https://www.manning.com/books/build-a-large-language-model-from-scratch</a> by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
"</font>"
]
},
{
"cell_type": "markdown",
"id": "58b8c870-fb72-490e-8916-d8129bd5d1ff",
"metadata": {},
"source": [
"# Appendix E: Parameter-efficient Finetuning with LoRA"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90",
"outputId": "9495f150-9d79-4910-d6e7-6c0d9aae4a41"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"matplotlib version: 3.7.2\n",
"numpy version: 1.25.2\n",
"tiktoken version: 0.5.1\n",
"torch version: 2.2.2\n",
"tensorflow version: 2.15.0\n",
"pandas version: 2.0.3\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"pkgs = [\"matplotlib\",\n",
" \"numpy\",\n",
" \"tiktoken\",\n",
" \"torch\",\n",
" \"tensorflow\", # For OpenAI's pretrained weights\n",
" \"pandas\" # Dataset loading\n",
" ]\n",
"for p in pkgs:\n",
" print(f\"{p} version: {version(p)}\")"
]
},
{
"cell_type": "markdown",
"id": "21532056-0ef4-4c98-82c7-e91f61c6485e",
"metadata": {},
"source": [
"## E.1 Introduction to LoRA"
]
},
{
"cell_type": "markdown",
"id": "66edc999-3d91-4a1c-a157-9d056392e8d8",
"metadata": {},
"source": [
"- No code in this section\n",
"- Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters\n",
"- This approach is important because it allows for efficient finetuning of large models on task-specific data, significantly reducing the computational cost and time required for finetuning"
]
},
{
"cell_type": "markdown",
"id": "5bb75b5d-d59c-4948-821a-1594a5883dc1",
"metadata": {},
"source": [
"- Suppose we have a large weight matrix $W$ for a given layer\n",
"- During backpropagation, we learn a $\\Delta W$ matrix, which contains information on how much we want to update the original weights to minimize the loss function during training\n",
"- In regular training and finetuning, the weight update is defined as follows:\n",
"\n",
"$$W_{\\text{updated}} = W + \\Delta W$$\n",
"\n",
"- The LoRA method proposed by [Hu et al.](https://arxiv.org/abs/2106.09685) offers a more efficient alternative to computing the weight updates $\\Delta W$ by learning an approximation of it, $\\Delta W \\approx AB$.\n",
"- In other words, in LoRA, we have the following, where $A$ and $B$ are two small weight matrices:\n",
"\n",
"$$W_{\\text{updated}} = W + AB$$\n",
"\n",
"- The figure below illustrates these formulas for full finetuning and LoRA side by side"
]
},
{
"cell_type": "markdown",
"id": "a8a7419d-cae9-4525-bb44-1641f6ef4f3b",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-e_compressed/lora-1.webp\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "4edd43c9-8ec5-48e6-b3fc-5fb3c16037cc",
"metadata": {},
"source": [
"- If you paid close attention, the full finetuning and LoRA depictions in the figure above look slightly different from the formulas I have shown earlier\n",
"- That's due to the distributive law of matrix multiplication: we don't have to add the weights with the updated weights but can keep them separate\n",
"- For instance, if $x$ is the input data, then we can write the following for regular finetuning:\n",
"\n",
"$$x (W+\\Delta W) = x W + x \\Delta W$$\n",
"\n",
"- Similarly, we can write the following for LoRA:\n",
"\n",
"$$x (W+A B) = x W + x A B$$\n",
"\n",
"- The fact that we can keep the LoRA weight matrices separate makes LoRA especially attractive\n",
"- In practice, this means that we don't have to modify the weights of the pretrained model at all, as we can apply the LoRA matrices on the fly\n",
"- After setting up the dataset and loading the model, we will implement LoRA in the code to make these concepts less abstract"
]
},
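{
"cell_type": "markdown",
"id": "lora-distributive-check-md",
"metadata": {},
"source": [
"- As a quick optional sanity check (a minimal sketch with arbitrary small dimensions, not part of the original chapter code), we can verify numerically that applying the LoRA matrices on the fly, $x W + x A B$, gives the same result as merging the update into the weights first, $x (W + A B)$"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lora-distributive-check",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"torch.manual_seed(123)\n",
"\n",
"# Arbitrary small example dimensions; the rank is much smaller than in_dim and out_dim\n",
"in_dim, out_dim, rank = 6, 4, 2\n",
"x = torch.randn(1, in_dim)\n",
"W = torch.randn(in_dim, out_dim)\n",
"A = torch.randn(in_dim, rank)\n",
"B = torch.randn(rank, out_dim)\n",
"\n",
"separate = x @ W + x @ A @ B  # pretrained weights kept untouched\n",
"merged = x @ (W + A @ B)      # LoRA update merged into the weights\n",
"\n",
"print(torch.allclose(separate, merged, atol=1e-6))"
]
},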
{
"cell_type": "markdown",
"id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf",
"metadata": {
"id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf"
},
"source": [
"## E.2 Preparing the dataset"
]
},
{
"cell_type": "markdown",
"id": "669c64df-4431-4d27-834d-2bb38a01fc02",
"metadata": {},
"source": [
"- This section repeats the code from chapter 6 to load and prepare the dataset\n",
"- Instead of repeating this code, one could open and run the chapter 6 notebook and then insert the LoRA code from section E.4 there\n",
"- (The LoRA code was originally the last section of chapter 6 but was moved to the appendix due to the length of chapter 6)\n",
"- In a similar fashion, we could also apply LoRA to the models in chapter 7 for instruction finetuning"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "def7c09b-af9c-4216-90ce-5e67aed1065c",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "def7c09b-af9c-4216-90ce-5e67aed1065c",
"outputId": "424e4423-f623-443c-ab9e-656f9e867559"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sms_spam_collection/SMSSpamCollection.tsv already exists. Skipping download and extraction.\n"
]
}
],
"source": [
"from pathlib import Path\n",
"import pandas as pd\n",
"from previous_chapters import (\n",
" download_and_unzip,\n",
" create_balanced_dataset,\n",
" random_split\n",
")\n",
"\n",
"\n",
"url = \"https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip\"\n",
"zip_path = \"sms_spam_collection.zip\"\n",
"extracted_path = \"sms_spam_collection\"\n",
"data_file_path = Path(extracted_path) / \"SMSSpamCollection.tsv\"\n",
"\n",
"download_and_unzip(url, zip_path, extracted_path, data_file_path)\n",
"\n",
"df = pd.read_csv(data_file_path, sep=\"\\t\", header=None, names=[\"Label\", \"Text\"])\n",
"balanced_df = create_balanced_dataset(df)\n",
"balanced_df[\"Label\"] = balanced_df[\"Label\"].map({\"ham\": 0, \"spam\": 1})\n",
"\n",
"train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)\n",
"train_df.to_csv(\"train.csv\", index=None)\n",
"validation_df.to_csv(\"validation.csv\", index=None)\n",
"test_df.to_csv(\"test.csv\", index=None)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7",
"outputId": "b5b48439-32c8-4b37-cca2-c9dc8fa86563"
},
"outputs": [],
"source": [
"import torch\n",
"from torch.utils.data import Dataset\n",
"import tiktoken\n",
"from previous_chapters import SpamDataset\n",
"\n",
"\n",
"tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"train_dataset = SpamDataset(\"train.csv\", max_length=None, tokenizer=tokenizer)\n",
"val_dataset = SpamDataset(\"validation.csv\", max_length=train_dataset.max_length, tokenizer=tokenizer)\n",
"test_dataset = SpamDataset(\"test.csv\", max_length=train_dataset.max_length, tokenizer=tokenizer)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8681adc0-6f02-4e75-b01a-a6ab75d05542",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8681adc0-6f02-4e75-b01a-a6ab75d05542",
"outputId": "3266c410-4fdb-4a8c-a142-7f707e2525ab"
},
"outputs": [],
"source": [
"from torch.utils.data import DataLoader\n",
"\n",
"num_workers = 0\n",
"batch_size = 8\n",
"\n",
"torch.manual_seed(123)\n",
"\n",
"train_loader = DataLoader(\n",
" dataset=train_dataset,\n",
" batch_size=batch_size,\n",
" shuffle=True,\n",
" num_workers=num_workers,\n",
" drop_last=True,\n",
")\n",
"\n",
"val_loader = DataLoader(\n",
" dataset=val_dataset,\n",
" batch_size=batch_size,\n",
" num_workers=num_workers,\n",
" drop_last=False,\n",
")\n",
"\n",
"test_loader = DataLoader(\n",
" dataset=test_dataset,\n",
" batch_size=batch_size,\n",
" num_workers=num_workers,\n",
" drop_last=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "ab7335db-e0bb-4e27-80c5-eea11e593a57",
"metadata": {},
"source": [
"- As a verification step, we iterate through the data loaders and check that the batches contain 8 training examples each, where each training example consists of 120 tokens"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4dee6882-4c3a-4964-af15-fa31f86ad047",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train loader:\n",
"Input batch dimensions: torch.Size([8, 120])\n",
"Label batch dimensions torch.Size([8])\n"
]
}
],
"source": [
"print(\"Train loader:\")\n",
"for input_batch, target_batch in train_loader:\n",
" pass\n",
"\n",
"print(\"Input batch dimensions:\", input_batch.shape)\n",
"print(\"Label batch dimensions\", target_batch.shape)"
]
},
{
"cell_type": "markdown",
"id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1",
"metadata": {},
"source": [
"- Lastly, let's print the total number of batches in each dataset"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "IZfw-TYD2zTj",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "IZfw-TYD2zTj",
"outputId": "6934bbf2-9797-4fbe-d26b-1a246e18c2fb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"130 training batches\n",
"19 validation batches\n",
"38 test batches\n"
]
}
],
"source": [
"print(f\"{len(train_loader)} training batches\")\n",
"print(f\"{len(val_loader)} validation batches\")\n",
"print(f\"{len(test_loader)} test batches\")"
]
},
{
"cell_type": "markdown",
"id": "dec9aa4a-ffd2-4d9f-a835-cce1059fe604",
"metadata": {},
"source": [
"## E.3 Initializing the model"
]
},
{
"cell_type": "markdown",
"id": "f36ebdaf-810e-46a2-9ad9-e017a04051b1",
"metadata": {},
"source": [
"- This section repeats the code from chapter 6 to load and prepare the model"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "02b3a506-3879-4258-82b5-93a5b6bafa74",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File already exists and is up-to-date: gpt2/124M/checkpoint\n",
"File already exists and is up-to-date: gpt2/124M/encoder.json\n",
"File already exists and is up-to-date: gpt2/124M/hparams.json\n",
"File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001\n",
"File already exists and is up-to-date: gpt2/124M/model.ckpt.index\n",
"File already exists and is up-to-date: gpt2/124M/model.ckpt.meta\n",
"File already exists and is up-to-date: gpt2/124M/vocab.bpe\n"
]
}
],
"source": [
"from gpt_download import download_and_load_gpt2\n",
"from previous_chapters import GPTModel, load_weights_into_gpt\n",
"\n",
"\n",
"CHOOSE_MODEL = \"gpt2-small (124M)\"\n",
"INPUT_PROMPT = \"Every effort moves\"\n",
"\n",
"BASE_CONFIG = {\n",
" \"vocab_size\": 50257, # Vocabulary size\n",
" \"context_length\": 1024, # Context length\n",
" \"drop_rate\": 0.0, # Dropout rate\n",
" \"qkv_bias\": True # Query-key-value bias\n",
"}\n",
"\n",
"model_configs = {\n",
" \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n",
" \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n",
" \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n",
" \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n",
"}\n",
"\n",
"BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n",
"\n",
"model_size = CHOOSE_MODEL.split(\" \")[-1].lstrip(\"(\").rstrip(\")\")\n",
"settings, params = download_and_load_gpt2(model_size=model_size, models_dir=\"gpt2\")\n",
"\n",
"model = GPTModel(BASE_CONFIG)\n",
"load_weights_into_gpt(model, params)\n",
"model.eval();"
]
},
{
"cell_type": "markdown",
"id": "252614cd-7ce6-4908-83e6-3761f519904e",
"metadata": {},
"source": [
"- To ensure that the model was loaded corrected, let's double-check that it generates coherent text"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8b6ce20c-0700-4783-8be0-4cf17c200a7f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Every effort moves you forward.\n",
"\n",
"The first step is to understand the importance of your work\n"
]
}
],
"source": [
"from previous_chapters import (\n",
" generate_text_simple,\n",
" text_to_token_ids,\n",
" token_ids_to_text\n",
")\n",
"\n",
"\n",
"text_1 = \"Every effort moves you\"\n",
"\n",
"token_ids = generate_text_simple(\n",
" model=model,\n",
" idx=text_to_token_ids(text_1, tokenizer),\n",
" max_new_tokens=15,\n",
" context_size=BASE_CONFIG[\"context_length\"]\n",
")\n",
"\n",
"print(token_ids_to_text(token_ids, tokenizer))"
]
},
{
"cell_type": "markdown",
"id": "8174b31b-1ab5-4115-b01c-245369da5af3",
"metadata": {},
"source": [
"- Then, we prepare the model for classification finetuning similar to chapter 6, where we replace the output layer"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e255ce91-d73a-4854-90a4-95804928eb16",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(123)\n",
"\n",
"num_classes = 2\n",
"model.out_head = torch.nn.Linear(in_features=768, out_features=num_classes)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "02e6f057-1383-4ece-8444-0a88e71ac75d",
"metadata": {},
"outputs": [],
"source": [
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"model.to(device); # no assignment model = model.to(device) necessary for nn.Module classes"
]
},
{
"cell_type": "markdown",
"id": "8e951cd6-5e42-44d2-b21f-895cb61004fe",
"metadata": {},
"source": [
"- Lastly, let's calculate the initial classification accuracy of the non-finetuned model (we expect this to be around 50%, which means that the model is not able to distinguish between spam and non-spam messages yet reliably)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fc7dd72c-73a2-4881-ade0-0a9605f1ab8c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training accuracy: 46.25%\n",
"Validation accuracy: 45.00%\n",
"Test accuracy: 48.75%\n"
]
}
],
"source": [
"from previous_chapters import calc_accuracy_loader\n",
"\n",
"\n",
"torch.manual_seed(123)\n",
"train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)\n",
"val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)\n",
"test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)\n",
"\n",
"print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n",
"print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n",
"print(f\"Test accuracy: {test_accuracy*100:.2f}%\")"
]
},
{
"cell_type": "markdown",
"id": "398a1ec9-e2a1-43d6-bf9f-12ee54b46a7b",
"metadata": {
"id": "398a1ec9-e2a1-43d6-bf9f-12ee54b46a7b"
},
"source": [
"## E.4 Parameter-efficient finetuning with LoRA"
]
},
{
"cell_type": "markdown",
"id": "652a4a82-61ef-4d0a-9858-8988e844f12c",
"metadata": {},
"source": [
"- We begin by initializing a LoRALayer that creates the matrices $A$ and $B$, along with the `alpha` scaling hyperparameter and the `rank` ($r$) hyperparameters\n",
"- This layer can accept an input and compute the corresponding output, as illustrated in the figure below\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-e_compressed/lora-2.webp\" width=\"200px\">\n",
"\n",
"In code, this LoRA layer depicted in the figure above looks like as follows"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "2ds9ywjMwvIW",
"metadata": {
"id": "2ds9ywjMwvIW"
},
"outputs": [],
"source": [
"class LoRALayer(torch.nn.Module):\n",
" def __init__(self, in_dim, out_dim, rank, alpha):\n",
" super().__init__()\n",
" std_dev = 1 / torch.sqrt(torch.tensor(rank).float())\n",
" self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)\n",
" self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))\n",
" self.alpha = alpha\n",
"\n",
" def forward(self, x):\n",
" x = self.alpha * (x @ self.A @ self.B)\n",
" return x"
]
},
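{
"cell_type": "markdown",
"id": "loralayer-demo-md",
"metadata": {},
"source": [
"- As a quick optional check (a small sketch with arbitrary input sizes, not part of the original chapter code), we can pass a random input through a `LoRALayer` and count its parameters; because $B$ is initialized with zeros, the output is all zeros before any training"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "loralayer-demo",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(123)\n",
"\n",
"layer = LoRALayer(in_dim=768, out_dim=768, rank=8, alpha=8)\n",
"x = torch.randn(2, 5, 768)  # (batch_size, num_tokens, in_dim)\n",
"\n",
"out = layer(x)\n",
"num_params = sum(p.numel() for p in layer.parameters())\n",
"\n",
"print(\"Output shape:\", out.shape)\n",
"print(\"All zeros at initialization:\", torch.all(out == 0).item())\n",
"print(f\"Parameters: {num_params:,}\")  # rank*(in_dim + out_dim) = 8*(768+768) = 12,288"
]
},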
{
"cell_type": "markdown",
"id": "ad21faa8-0614-4257-93cd-68952193e14a",
"metadata": {},
"source": [
"- In the code above, `rank` is a hyperparameter that controls the inner dimension of the matrices $A$ and $B$\n",
"- In other words, this parameter controls the number of additional parameters introduced by LoRA and is a key factor in determining the balance between model adaptability and parameter efficiency\n",
"- The second hyperparameter, alpha, is a scaling hyperparameter applied to the output of the low-rank adaptation\n",
"- It essentially controls the extent to which the adapted layer's output is allowed to influence the original output of the layer being adapted\n",
"- This can be seen as a way to regulate the impact of the low-rank adaptation on the layer's output\n",
"- So far, the `LoRALayer` class we implemented above allows us to transform the layer inputs $x$\n",
"- However, in LoRA, we are usually interested in replacing existing `Linear` layers so that the weight update is applied to the existing pretrained weights, as shown in the figure below\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-e_compressed/lora-3.webp\" width=\"200px\">"
]
},
{
"cell_type": "markdown",
"id": "3e6d5da0-dfce-4808-b89b-29ff333f563f",
"metadata": {},
"source": [
"- To incorporate the original `Linear` layer weights as shown in the figure above, we implement a `LinearWithLoRA` layer below that uses the previously implemented LoRALayer and can be used to replace existing `Linear` layers in a neural network, for example, the self-attention module or feed forward modules in an LLM"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "127d3a64-8359-4b21-b056-78d58cc75fe8",
"metadata": {},
"outputs": [],
"source": [
"class LinearWithLoRA(torch.nn.Module):\n",
" def __init__(self, linear, rank, alpha):\n",
" super().__init__()\n",
" self.linear = linear\n",
" self.lora = LoRALayer(\n",
" linear.in_features, linear.out_features, rank, alpha\n",
" )\n",
"\n",
" def forward(self, x):\n",
" return self.linear(x) + self.lora(x)"
]
},
{
"cell_type": "markdown",
"id": "e1145a90-35ff-462c-820b-15483fa5b051",
"metadata": {},
"source": [
"- Note that since we initialize the weight matrix $B$ (`self.B` in `LoraLayer`) with zero values in the LoRA layer, the matrix multiplication between $A$ and $B$ results in a matrix consisting of 0's and doesn't affect the original weights (since adding 0 to the original weights does not modify them)"
]
},
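{
"cell_type": "markdown",
"id": "linearwithlora-check-md",
"metadata": {},
"source": [
"- The following optional check (a small sketch using an arbitrary `Linear` layer, not part of the original chapter code) confirms this: at initialization, a `LinearWithLoRA` layer produces exactly the same outputs as the `Linear` layer it wraps"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "linearwithlora-check",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(123)\n",
"\n",
"linear = torch.nn.Linear(in_features=10, out_features=5)\n",
"linear_with_lora = LinearWithLoRA(linear, rank=4, alpha=8)\n",
"\n",
"x = torch.randn(3, 10)\n",
"# True, since B is all zeros and the LoRA branch contributes nothing yet\n",
"print(torch.allclose(linear(x), linear_with_lora(x)))"
]
},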
{
"cell_type": "markdown",
"id": "e98a6d36-7bc9-434c-a7f1-533f26aff06d",
"metadata": {
"id": "4D21Jk7Vw3nG"
},
"source": [
"- To try LoRA on the GPT model we defined earlier, we define a `replace_linear_with_lora` function to replace all `Linear` layers in the model with the new `LinearWithLoRA` layers"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "WlQZ8ygqzN_g",
"metadata": {
"id": "WlQZ8ygqzN_g"
},
"outputs": [],
"source": [
"def replace_linear_with_lora(model, rank, alpha):\n",
" for name, module in model.named_children():\n",
" if isinstance(module, torch.nn.Linear):\n",
" # Replace the Linear layer with LinearWithLoRA\n",
" setattr(model, name, LinearWithLoRA(module, rank, alpha))\n",
" else:\n",
" # Recursively apply the same function to child modules\n",
" replace_linear_with_lora(module, rank, alpha)"
]
},
{
"cell_type": "markdown",
"id": "8c172164-cdde-4489-b7d7-aaed9cc2f5f2",
"metadata": {},
"source": [
"- We then freeze the original model parameter and use the `replace_linear_with_lora` to replace the said `Linear` layers below"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "dbe15350-4da9-4829-9d23-98bbd3d0b1a1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total trainable parameters before: 124,441,346\n",
"Total trainable parameters after: 0\n"
]
}
],
"source": [
"total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"print(f\"Total trainable parameters before: {total_params:,}\")\n",
"\n",
"for param in model.parameters():\n",
" param.requires_grad = False\n",
"\n",
"total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"print(f\"Total trainable parameters after: {total_params:,}\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "mLk_fPq0yz_u",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "mLk_fPq0yz_u",
"outputId": "7ba89607-ca75-4718-e8dc-9cdc44c3e410"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total trainable LoRA parameters: 1,333,264\n"
]
}
],
"source": [
"replace_linear_with_lora(model, rank=8, alpha=8)\n",
"\n",
"total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"print(f\"Total trainable LoRA parameters: {total_params:,}\")"
]
},
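{
"cell_type": "markdown",
"id": "lora-param-count-md",
"metadata": {},
"source": [
"- The 1,333,264 figure can also be derived by hand: each `LinearWithLoRA` layer adds $r \\times (d_{\\text{in}} + d_{\\text{out}})$ parameters via its $A$ and $B$ matrices; the optional sketch below (using the GPT-2 small dimensions from `BASE_CONFIG`) tallies these up"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lora-param-count",
"metadata": {},
"outputs": [],
"source": [
"rank = 8\n",
"emb_dim, ff_dim, n_layers, num_classes = 768, 4 * 768, 12, 2\n",
"\n",
"\n",
"def lora_params(d_in, d_out, r=rank):\n",
"    # Each LoRALayer adds an A (d_in x r) and a B (r x d_out) matrix\n",
"    return r * (d_in + d_out)\n",
"\n",
"\n",
"per_block = (\n",
"    4 * lora_params(emb_dim, emb_dim)  # W_query, W_key, W_value, out_proj\n",
"    + lora_params(emb_dim, ff_dim)     # feed forward expansion layer\n",
"    + lora_params(ff_dim, emb_dim)     # feed forward contraction layer\n",
")\n",
"total = n_layers * per_block + lora_params(emb_dim, num_classes)  # plus out_head\n",
"\n",
"print(f\"{total:,}\")  # 1,333,264, matching the count above"
]
},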
{
"cell_type": "markdown",
"id": "b8b6819e-ef7a-4f0d-841a-1b467496bef9",
"metadata": {},
"source": [
"- As we can see, we reduced the number of trainable parameters by almost 100x when using LoRA\n",
"- Let's now double-check whether the layers have been modified as intended by printing the model architecture"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "1711be61-bb2c-466f-9b5b-24f4aa5ccd9c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPTModel(\n",
" (tok_emb): Embedding(50257, 768)\n",
" (pos_emb): Embedding(1024, 768)\n",
" (drop_emb): Dropout(p=0.0, inplace=False)\n",
" (trf_blocks): Sequential(\n",
" (0): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (1): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (2): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (3): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (4): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (5): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (6): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (7): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (8): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (9): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (10): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (11): TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_key): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (W_value): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (out_proj): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (dropout): Dropout(p=0.0, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=3072, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" (1): GELU()\n",
" (2): LinearWithLoRA(\n",
" (linear): Linear(in_features=3072, out_features=768, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_resid): Dropout(p=0.0, inplace=False)\n",
" )\n",
" )\n",
" (final_norm): LayerNorm()\n",
" (out_head): LinearWithLoRA(\n",
" (linear): Linear(in_features=768, out_features=2, bias=True)\n",
" (lora): LoRALayer()\n",
" )\n",
")\n"
]
}
],
"source": [
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"model.to(device)\n",
"\n",
"print(model)"
]
},
{
"cell_type": "markdown",
"id": "c4bbc9d7-65ec-4675-bab8-2e56eb0cfb55",
"metadata": {},
"source": [
"- Based on the model architecture above, we can see that the model now contains our new `LinearWithLoRA` layers\n",
"- Also, since we initialized matrix $B$ with 0's, we expect the initial model performance to be unchanged compared to before"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "DAlrb_I00VEU",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DAlrb_I00VEU",
"outputId": "3dae5ff0-316d-408e-c8dc-2b8c60f9b994"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training accuracy: 46.25%\n",
"Validation accuracy: 45.00%\n",
"Test accuracy: 48.75%\n"
]
}
],
"source": [
"torch.manual_seed(123)\n",
"train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)\n",
"val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)\n",
"test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)\n",
"\n",
"print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n",
"print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n",
"print(f\"Test accuracy: {test_accuracy*100:.2f}%\")"
]
},
{
"cell_type": "markdown",
"id": "13735b3e-f0c3-4dba-ae3d-4141b2878101",
"metadata": {},
"source": [
"- Let's now get to the interesting part and finetune the model by reusing the training function from chapter 6\n",
"- The training takes about 15 minutes on a M3 MacBook Air laptop computer and less than half a minute on a V100 or A100 GPU"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "wCParRvr0eff",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wCParRvr0eff",
"outputId": "b86fd5f4-1527-4549-e0b0-9dff37836f0a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Ep 1 (Step 000000): Train loss 2.849, Val loss 2.565\n",
"Ep 1 (Step 000050): Train loss 0.515, Val loss 0.465\n",
"Ep 1 (Step 000100): Train loss 0.191, Val loss 0.423\n",
"Training accuracy: 97.50% | Validation accuracy: 97.50%\n",
"Ep 2 (Step 000150): Train loss 0.170, Val loss 0.072\n",
"Ep 2 (Step 000200): Train loss 0.014, Val loss 0.087\n",
"Ep 2 (Step 000250): Train loss 0.027, Val loss 0.197\n",
"Training accuracy: 100.00% | Validation accuracy: 92.50%\n",
"Ep 3 (Step 000300): Train loss 0.014, Val loss 0.321\n",
"Ep 3 (Step 000350): Train loss 0.015, Val loss 0.146\n",
"Training accuracy: 100.00% | Validation accuracy: 97.50%\n",
"Ep 4 (Step 000400): Train loss 0.008, Val loss 0.103\n",
"Ep 4 (Step 000450): Train loss 0.010, Val loss 0.178\n",
"Ep 4 (Step 000500): Train loss 0.097, Val loss 0.056\n",
"Training accuracy: 100.00% | Validation accuracy: 97.50%\n",
"Ep 5 (Step 000550): Train loss 0.032, Val loss 0.091\n",
"Ep 5 (Step 000600): Train loss 0.002, Val loss 0.058\n",
"Training accuracy: 100.00% | Validation accuracy: 100.00%\n",
"Ep 6 (Step 000650): Train loss 0.001, Val loss 0.009\n",
"Ep 6 (Step 000700): Train loss 0.001, Val loss 0.039\n",
"Ep 6 (Step 000750): Train loss 0.000, Val loss 0.038\n",
"Training accuracy: 100.00% | Validation accuracy: 95.00%\n",
"Training completed in 13.70 minutes.\n"
]
}
],
"source": [
"import time\n",
"from previous_chapters import train_classifier_simple\n",
"\n",
"\n",
"start_time = time.time()\n",
"\n",
"torch.manual_seed(123)\n",
"\n",
"optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)\n",
"\n",
"num_epochs = 6\n",
"train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(\n",
" model, train_loader, val_loader, optimizer, device,\n",
" num_epochs=num_epochs, eval_freq=50, eval_iter=5,\n",
" tokenizer=tokenizer\n",
")\n",
"\n",
"end_time = time.time()\n",
"execution_time_minutes = (end_time - start_time) / 60\n",
"print(f\"Training completed in {execution_time_minutes:.2f} minutes.\")"
]
},
{
"cell_type": "markdown",
"id": "d0c89e82-3aa8-44c6-b046-0b16200b8e6c",
"metadata": {},
"source": [
"- Finally, let's evaluate the model"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "bawWGijA0iF3",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 307
},
"id": "bawWGijA0iF3",
"outputId": "4b05b245-ffac-4d36-881b-8306a4da6b75"
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAdwAAAEiCAYAAABTO2OcAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAABPd0lEQVR4nO3dd3hUVfrA8e9Mkpn03ikBJECAEEJdRBAlUlRWsMCyqEFRFw0iIoqsCog/DXYsLAoq6FpiAxcVqVIU6b2E0AKhpICQSurM+f1xk0mGACaQzKS8n+e5T+aWufc9Icw759xzz9EppRRCCCGEqFV6ewcghBBCNAaScIUQQggbkIQrhBBC2IAkXCGEEMIGJOEKIYQQNiAJVwghhLABSbhCCCGEDUjCFUIIIWxAEq4QQghhA5JwhWiE+vXrx4QJE+wdhhCNiiRcIa7C6NGj0el0lZZBgwbZOzQhRB3laO8AhKivBg0axPz58622GY1GO0UjhKjrpIYrxFUyGo0EBwdbLT4+PgCsWbMGg8HAb7/9Zjn+tddeIzAwkPT0dACWLl3KDTfcgLe3N35+ftx+++0cOXLEcvyxY8fQ6XR888039OnTBxcXF7p3787BgwfZsmUL3bp1w93dncGDB3PmzBnL+0aPHs3QoUN58cUXCQgIwNPTk7Fjx1JUVHTZshQWFjJp0iSaNGmCm5sbPXv2ZM2aNZb9x48fZ8iQIfj4+ODm5kaHDh1YsmTJZc/3n//8h/DwcJydnQkKCuLuu++27DObzcTHx9OyZUtcXFyIioriu+++s3r/3r17GTx4MO7u7gQFBXHfffdx9uxZy/5+/foxfvx4nnnmGXx9fQkODmb69OmXjUeIukASrhC1oOwe6X333UdWVhY7duzghRde4KOPPiIoKAiAvLw8Jk6cyNatW1m1ahV6vZ5hw4ZhNputzjVt2jSef/55tm/fjqOjI//85z955plneOedd/jtt984fPgwU6dOtXrPqlWrSExMZM2aNXz11VcsXLiQF1988bLxjhs3jg0bNpCQkMDu3bu55557GDRoEIcOHQIgLi6OwsJC1q1bx549e3j11Vdxd3e/5Lm2bt3K+PHjmTFjBklJSSxdupS+ffta9sfHx/PZZ5/xwQcfsG/fPp588knuvfde1q5dC0BmZiY333wz0dHRbN26laVLl5Kens7w4cOtrvPpp5/i5ubGpk2beO2115gxYwYrVqyo4r+QEHaghBDVFhsbqxwcHJSbm5vV8vLLL1uOKSwsVJ07d1bDhw9X7du3Vw8//PAVz3nmzBkFqD179iillEpOTlaA+uijjyzHfPXVVwpQq1atsmyLj49Xbdu2tYrN19dX5eXlWbbNmTNHubu7K5PJpJRS6sYbb1RPPPGEUkqp48ePKwcHB3Xq1CmrePr376+mTJmilFIqMjJSTZ8+vUq/m++//155enqq7OzsSvsKCgqUq6ur+uOPP6y2jxkzRo0cOVIppdRLL72kBgwYYLX/xIkTClBJSUmW+G+44QarY7p3764mT55cpRiFsAe5hyvEVbrpppuYM2eO1TZfX1/La4PBwBdffEGnTp0ICwvj7bfftjr20KFDTJ06lU2bNnH27FlLzTYlJYWOHTtajuvUqZPldVntODIy0mpbRkaG1bmjoqJwdXW1rPfq1Yvc3FxOnDhBWFiY1bF79uzBZDLRpk0bq+2FhYX4+fkBMH78eB599FGWL19OTEwMd911l1VcFd1yyy2EhYXRqlUrBg0axKBBgxg2bBiurq4cPnyYCxcucMstt1i9p6ioiOjoaAB27drF6tWrL1mDPnLkiCXOi68fEhJS6fcgRF0iCVeIq+Tm5kbr1q2veMwff/wBwLlz5zh37hxubm6WfUOGDCEsLIx58+YRGhqK2WymY8eOle61Ojk5WV7rdLpLbru4Gbo6cnNzcXBwYNu2bTg4OFjtK0t6Dz30EAMHDuTnn39m+fLlxMfH8+abb/L4449XOp+Hhwfbt29nzZo1LF++nKlTpzJ9+nS2bNlCbm4uAD///DNNmjSxel9Zh7Pc3FyGDBnCq6++WuncISEhltcVfwdw7b8HIWqbJFwhasmRI0d48sknmTdvHl9//TWxsbGsXLkSvV7Pn3/+SVJSEvPmzaNPnz4A/P777zV27V27dpGfn4+LiwsAGzduxN3dnWbNmlU6Njo6GpPJREZGhiWWS2nWrBljx45l7NixTJkyhXnz5l0y4QI4OjoSExNDTEwM06ZNw9vbm19//ZVbbrkFo9FISkoKN9544yXf26VLF77//ntatGiBo6N8RImGQ/6ahbhKhYWFpKWlWW1zdHTE398fk8nEvffey8CBA3nggQcYNGgQkZGRvPnmmzz99NP4+Pjg5+fH3LlzCQkJISUlhWeffbbGYisqKmLMmDE8//zzHDt2jGnTpjFu3Dj0+sr9JNu0acOoUaO4//77efPNN4mOjubMmTOsWrWKTp06cdtttzFhwgQGDx5MmzZtOH/+PKtXryYiIuKS1/7pp584evQoffv2xcfHhyVLlmA2m2nbti0eHh5MmjSJJ598ErPZzA033EBWVhbr16/H09OT2NhY4uLimDdvHiNHjrT0Qj58+DAJCQl89NFHlWrhQtQXknCFuEpLly61auIEaNu2LQcOHODll1/m+PHj/PTTT4DWFDp37lxGjhzJgAEDiIqKIiEhgfHjx9OxY0fatm3Lu+++S79+/Woktv79+xMeHk7fvn0pLCxk5MiRV3xsZv78+fzf//0fTz31FKdOncLf35+//e1v3H777QCYTCbi4uI4efIknp6eDBo0qNI96TLe3t4sXLiQ6dOnU1BQQHh4OF999RUdOnQA4KWXXiIgIID4+HiOHj2Kt7c3Xbp04d///jcAoaGhrF+/nsmTJzNgwAAKCwsJCwtj0KBBl/zCIER9oVNKKXsHIYSoOaNHjyYzM5MffvjB3qEIISqQr4tCCCGEDUjCFUIIIWxAmpSFEEIIG5AarhBCCGEDknCFEEIIG5CEK4QQQtiAJNxSs2fPpkWLFjg7O9OzZ082b95s75CqZN26dQwZMoTQ0FB0Ol2lR0GUUkydOpWQkBBcXFyIiYmxzABT5ty5c4waNQpPT0+8vb0ZM2aMZQi+Mrt376ZPnz44OzvTrFkzXnvttdou2mXFx8fTvXt3PDw8CAwMZOjQoSQlJVkdU1BQQFxcHH5+fri7u3PXXXdZpsUrk5KSwm233YarqyuBgYE8/fTTlJSUWB2zZs0aunTpgtFopHXr1ixYsKC2i3dJc+bMoVOnTnh6euLp6UmvXr345ZdfLPsbWnkvNnPmTHQ6HRMmTLBsa4hlnj59Ojqdzmpp166dZX9DLDPAqVOnuPfee/Hz88PFxYXIyEi2bt1q2d9gPsfsOXNCXZGQkKAMBoP65JNP1L59+9TDDz+svL29VXp6ur1D+0tLlixRzz33nFq4cKEC1KJFi6z2z5w5U3l5eakffvhB7dq1S/39739XLVu2VPn5+ZZjBg0apKKiotTGjRvVb7/9plq3bm2Zu
UUppbKyslRQUJAaNWqU2rt3r/rqq6+Ui4uL+vDDD21VTCsDBw5U8+fPV3v37lU7d+5Ut956q2revLnKzc21HDN27FjVrFkztWrVKrV161b1t7/9TV1//fWW/SUlJapjx44qJiZG7dixQy1ZskT5+/tbZsdRSqmjR48qV1dXNXHiRLV//3713nvvKQcHB7V06VKbllcppRYvXqx+/vlndfDgQZWUlKT+/e9/KycnJ7V3794GWd6KNm/erFq0aKE6depkmeFIqYZZ5mnTpqkOHTqo1NRUy3LmzBnL/oZY5nPnzqmwsDA1evRotWnTJnX06FG1bNkydfjwYcsxDeVzTBKuUqpHjx4qLi7Osm4ymVRoaKiKj4+3Y1TVd3HCNZvNKjg4WL3++uuWbZmZmcpoNKqvvvpKKaXU/v37FaC2bNliOeaXX35ROp3OMl3bf/7zH+Xj46MKCwstx0yePNlqSjh7ysjIUIBau3atUkoro5OTk/r2228txyQmJipAbdiwQSmlfVHR6/UqLS3NcsycOXO
"text/plain": [
"<Figure size 500x300 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from previous_chapters import plot_values\n",
"\n",
"epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))\n",
"examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))\n",
"\n",
"plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses, label=\"loss\")"
]
},
{
"cell_type": "markdown",
"id": "aa074723-e3f7-4f7e-a267-855531a037dc",
"metadata": {},
"source": [
"- Note that we previously calculated the accuracy values on 10 batches only; below, we calculate the accuracies on the full dataset"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "1D2awlEq0gZi",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1D2awlEq0gZi",
"outputId": "b482af19-5ebd-45b9-a9f0-99f621203ef9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training accuracy: 100.00%\n",
"Validation accuracy: 96.64%\n",
"Test accuracy: 98.00%\n"
]
}
],
"source": [
"from previous_chapters import calc_accuracy_loader\n",
"\n",
"train_accuracy = calc_accuracy_loader(train_loader, model, device)\n",
"val_accuracy = calc_accuracy_loader(val_loader, model, device)\n",
"test_accuracy = calc_accuracy_loader(test_loader, model, device)\n",
"\n",
"print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n",
"print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n",
"print(f\"Test accuracy: {test_accuracy*100:.2f}%\")"
]
},
{
"cell_type": "markdown",
"id": "1f87f5e6-339e-4fcf-900b-6d845d3c713d",
"metadata": {},
"source": [
"- As we can based on the relatively high accuracy values above, the LoRA finetuning was successful"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "V100",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}