diff --git a/appendix-E/01_main-chapter-code/appendix-E.ipynb b/appendix-E/01_main-chapter-code/appendix-E.ipynb
index d9da9ca..b9f02c9 100644
--- a/appendix-E/01_main-chapter-code/appendix-E.ipynb
+++ b/appendix-E/01_main-chapter-code/appendix-E.ipynb
@@ -16,7 +16,9 @@
{
"cell_type": "markdown",
"id": "58b8c870-fb72-490e-8916-d8129bd5d1ff",
- "metadata": {},
+ "metadata": {
+ "id": "58b8c870-fb72-490e-8916-d8129bd5d1ff"
+ },
"source": [
"# Appendix E: Parameter-efficient Finetuning with LoRA"
]
@@ -30,7 +32,7 @@
"base_uri": "https://localhost:8080/"
},
"id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90",
- "outputId": "9495f150-9d79-4910-d6e7-6c0d9aae4a41"
+ "outputId": "316166b4-027a-4756-e9b4-fe88ae75dd4f"
},
"outputs": [
{
@@ -63,7 +65,9 @@
{
"cell_type": "markdown",
"id": "21532056-0ef4-4c98-82c7-e91f61c6485e",
- "metadata": {},
+ "metadata": {
+ "id": "21532056-0ef4-4c98-82c7-e91f61c6485e"
+ },
"source": [
"## E.1 Introduction to LoRA"
]
@@ -71,7 +75,9 @@
{
"cell_type": "markdown",
"id": "66edc999-3d91-4a1c-a157-9d056392e8d8",
- "metadata": {},
+ "metadata": {
+ "id": "66edc999-3d91-4a1c-a157-9d056392e8d8"
+ },
"source": [
"- No code in this section\n",
"- Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters\n",
@@ -81,7 +87,9 @@
{
"cell_type": "markdown",
"id": "5bb75b5d-d59c-4948-821a-1594a5883dc1",
- "metadata": {},
+ "metadata": {
+ "id": "5bb75b5d-d59c-4948-821a-1594a5883dc1"
+ },
"source": [
"- Suppose we have a large weight matrix $W$ for a given layer\n",
"- During backpropagation, we learn a $\\Delta W$ matrix, which contains information on how much we want to update the original weights to minimize the loss function during training\n",
@@ -100,7 +108,9 @@
{
"cell_type": "markdown",
"id": "a8a7419d-cae9-4525-bb44-1641f6ef4f3b",
- "metadata": {},
+ "metadata": {
+ "id": "a8a7419d-cae9-4525-bb44-1641f6ef4f3b"
+ },
"source": [
""
]
@@ -108,7 +118,9 @@
{
"cell_type": "markdown",
"id": "4edd43c9-8ec5-48e6-b3fc-5fb3c16037cc",
- "metadata": {},
+ "metadata": {
+ "id": "4edd43c9-8ec5-48e6-b3fc-5fb3c16037cc"
+ },
"source": [
"- If you paid close attention, the full finetuning and LoRA depictions in the figure above look slightly different from the formulas I have shown earlier\n",
"- That's due to the distributive law of matrix multiplication: we don't have to add the weights with the updated weights but can keep them separate\n",
@@ -138,7 +150,9 @@
{
"cell_type": "markdown",
"id": "669c64df-4431-4d27-834d-2bb38a01fc02",
- "metadata": {},
+ "metadata": {
+ "id": "669c64df-4431-4d27-834d-2bb38a01fc02"
+ },
"source": [
"- This section repeats the code from chapter 6 to load and prepare the dataset\n",
"- Instead of repeating this code, one could open and run the chapter 6 notebook and then insert the LoRA code from section E.4 there\n",
@@ -155,14 +169,14 @@
"base_uri": "https://localhost:8080/"
},
"id": "def7c09b-af9c-4216-90ce-5e67aed1065c",
- "outputId": "424e4423-f623-443c-ab9e-656f9e867559"
+ "outputId": "a67a7afe-b401-4463-c731-87025d20f72d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "sms_spam_collection/SMSSpamCollection.tsv already exists. Skipping download and extraction.\n"
+ "File downloaded and saved as sms_spam_collection/SMSSpamCollection.tsv\n"
]
}
],
@@ -198,11 +212,7 @@
"execution_count": 3,
"id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7",
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7",
- "outputId": "b5b48439-32c8-4b37-cca2-c9dc8fa86563"
+ "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7"
},
"outputs": [],
"source": [
@@ -223,11 +233,7 @@
"execution_count": 4,
"id": "8681adc0-6f02-4e75-b01a-a6ab75d05542",
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "8681adc0-6f02-4e75-b01a-a6ab75d05542",
- "outputId": "3266c410-4fdb-4a8c-a142-7f707e2525ab"
+ "id": "8681adc0-6f02-4e75-b01a-a6ab75d05542"
},
"outputs": [],
"source": [
@@ -264,7 +270,9 @@
{
"cell_type": "markdown",
"id": "ab7335db-e0bb-4e27-80c5-eea11e593a57",
- "metadata": {},
+ "metadata": {
+ "id": "ab7335db-e0bb-4e27-80c5-eea11e593a57"
+ },
"source": [
"- As a verification step, we iterate through the data loaders and check that the batches contain 8 training examples each, where each training example consists of 120 tokens"
]
@@ -273,7 +281,13 @@
"cell_type": "code",
"execution_count": 5,
"id": "4dee6882-4c3a-4964-af15-fa31f86ad047",
- "metadata": {},
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "4dee6882-4c3a-4964-af15-fa31f86ad047",
+ "outputId": "2ae34de1-dd01-4f99-d2c8-ba4dca400754"
+ },
"outputs": [
{
"name": "stdout",
@@ -297,7 +311,9 @@
{
"cell_type": "markdown",
"id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1",
- "metadata": {},
+ "metadata": {
+ "id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1"
+ },
"source": [
"- Lastly, let's print the total number of batches in each dataset"
]
@@ -311,7 +327,7 @@
"base_uri": "https://localhost:8080/"
},
"id": "IZfw-TYD2zTj",
- "outputId": "6934bbf2-9797-4fbe-d26b-1a246e18c2fb"
+ "outputId": "4d19ed61-cf7a-4ec4-b822-c847dd1c5d77"
},
"outputs": [
{
@@ -333,7 +349,9 @@
{
"cell_type": "markdown",
"id": "dec9aa4a-ffd2-4d9f-a835-cce1059fe604",
- "metadata": {},
+ "metadata": {
+ "id": "dec9aa4a-ffd2-4d9f-a835-cce1059fe604"
+ },
"source": [
"## E.3 Initializing the model"
]
@@ -341,7 +359,9 @@
{
"cell_type": "markdown",
"id": "f36ebdaf-810e-46a2-9ad9-e017a04051b1",
- "metadata": {},
+ "metadata": {
+ "id": "f36ebdaf-810e-46a2-9ad9-e017a04051b1"
+ },
"source": [
"- This section repeats the code from chapter 6 to load and prepare the model"
]
@@ -350,19 +370,25 @@
"cell_type": "code",
"execution_count": 7,
"id": "02b3a506-3879-4258-82b5-93a5b6bafa74",
- "metadata": {},
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "02b3a506-3879-4258-82b5-93a5b6bafa74",
+ "outputId": "b8c9b125-bb52-45d3-8071-fa5054dbf5a9"
+ },
"outputs": [
{
- "name": "stdout",
+ "name": "stderr",
"output_type": "stream",
"text": [
- "File already exists and is up-to-date: gpt2/124M/checkpoint\n",
- "File already exists and is up-to-date: gpt2/124M/encoder.json\n",
- "File already exists and is up-to-date: gpt2/124M/hparams.json\n",
- "File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001\n",
- "File already exists and is up-to-date: gpt2/124M/model.ckpt.index\n",
- "File already exists and is up-to-date: gpt2/124M/model.ckpt.meta\n",
- "File already exists and is up-to-date: gpt2/124M/vocab.bpe\n"
+ "checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 45.0kiB/s]\n",
+ "encoder.json: 100%|███████████████████████| 1.04M/1.04M [00:00<00:00, 2.15MiB/s]\n",
+ "hparams.json: 100%|█████████████████████████| 90.0/90.0 [00:00<00:00, 54.5kiB/s]\n",
+ "model.ckpt.data-00000-of-00001: 100%|███████| 498M/498M [01:12<00:00, 6.86MiB/s]\n",
+ "model.ckpt.index: 100%|███████████████████| 5.21k/5.21k [00:00<00:00, 2.99MiB/s]\n",
+ "model.ckpt.meta: 100%|██████████████████████| 471k/471k [00:00<00:00, 1.32MiB/s]\n",
+ "vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 1.48MiB/s]\n"
]
}
],
@@ -401,7 +427,9 @@
{
"cell_type": "markdown",
"id": "252614cd-7ce6-4908-83e6-3761f519904e",
- "metadata": {},
+ "metadata": {
+ "id": "252614cd-7ce6-4908-83e6-3761f519904e"
+ },
"source": [
"- To ensure that the model was loaded corrected, let's double-check that it generates coherent text"
]
@@ -410,7 +438,13 @@
"cell_type": "code",
"execution_count": 8,
"id": "8b6ce20c-0700-4783-8be0-4cf17c200a7f",
- "metadata": {},
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "8b6ce20c-0700-4783-8be0-4cf17c200a7f",
+ "outputId": "28ccbca5-8de9-41a0-c093-da00fcbaa91c"
+ },
"outputs": [
{
"name": "stdout",
@@ -445,7 +479,9 @@
{
"cell_type": "markdown",
"id": "8174b31b-1ab5-4115-b01c-245369da5af3",
- "metadata": {},
+ "metadata": {
+ "id": "8174b31b-1ab5-4115-b01c-245369da5af3"
+ },
"source": [
"- Then, we prepare the model for classification finetuning similar to chapter 6, where we replace the output layer"
]
@@ -454,7 +490,9 @@
"cell_type": "code",
"execution_count": 9,
"id": "e255ce91-d73a-4854-90a4-95804928eb16",
- "metadata": {},
+ "metadata": {
+ "id": "e255ce91-d73a-4854-90a4-95804928eb16"
+ },
"outputs": [],
"source": [
"torch.manual_seed(123)\n",
@@ -467,7 +505,9 @@
"cell_type": "code",
"execution_count": 10,
"id": "02e6f057-1383-4ece-8444-0a88e71ac75d",
- "metadata": {},
+ "metadata": {
+ "id": "02e6f057-1383-4ece-8444-0a88e71ac75d"
+ },
"outputs": [],
"source": [
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
@@ -477,7 +517,9 @@
{
"cell_type": "markdown",
"id": "8e951cd6-5e42-44d2-b21f-895cb61004fe",
- "metadata": {},
+ "metadata": {
+ "id": "8e951cd6-5e42-44d2-b21f-895cb61004fe"
+ },
"source": [
"- Lastly, let's calculate the initial classification accuracy of the non-finetuned model (we expect this to be around 50%, which means that the model is not able to distinguish between spam and non-spam messages yet reliably)"
]
@@ -486,7 +528,13 @@
"cell_type": "code",
"execution_count": 11,
"id": "fc7dd72c-73a2-4881-ade0-0a9605f1ab8c",
- "metadata": {},
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "fc7dd72c-73a2-4881-ade0-0a9605f1ab8c",
+ "outputId": "74848515-5a49-4125-fecb-9f4bac23f812"
+ },
"outputs": [
{
"name": "stdout",
@@ -525,7 +573,9 @@
{
"cell_type": "markdown",
"id": "652a4a82-61ef-4d0a-9858-8988e844f12c",
- "metadata": {},
+ "metadata": {
+ "id": "652a4a82-61ef-4d0a-9858-8988e844f12c"
+ },
"source": [
"- We begin by initializing a LoRALayer that creates the matrices $A$ and $B$, along with the `alpha` scaling hyperparameter and the `rank` ($r$) hyperparameters\n",
"- This layer can accept an input and compute the corresponding output, as illustrated in the figure below\n",
@@ -544,11 +594,13 @@
},
"outputs": [],
"source": [
+ "import math\n",
+ "\n",
"class LoRALayer(torch.nn.Module):\n",
" def __init__(self, in_dim, out_dim, rank, alpha):\n",
" super().__init__()\n",
- " std_dev = 1 / torch.sqrt(torch.tensor(rank).float())\n",
- " self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)\n",
+ " self.A = torch.nn.Parameter(torch.empty(in_dim, rank))\n",
+ " torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5)) # similar to standard weight initialization\n",
" self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))\n",
" self.alpha = alpha\n",
"\n",
@@ -560,11 +612,13 @@
{
"cell_type": "markdown",
"id": "ad21faa8-0614-4257-93cd-68952193e14a",
- "metadata": {},
+ "metadata": {
+ "id": "ad21faa8-0614-4257-93cd-68952193e14a"
+ },
"source": [
"- In the code above, `rank` is a hyperparameter that controls the inner dimension of the matrices $A$ and $B$\n",
"- In other words, this parameter controls the number of additional parameters introduced by LoRA and is a key factor in determining the balance between model adaptability and parameter efficiency\n",
- "- The second hyperparameter, alpha, is a scaling hyperparameter applied to the output of the low-rank adaptation\n",
+ "- The second hyperparameter, `alpha`, is a scaling hyperparameter applied to the output of the low-rank adaptation\n",
"- It essentially controls the extent to which the adapted layer's output is allowed to influence the original output of the layer being adapted\n",
"- This can be seen as a way to regulate the impact of the low-rank adaptation on the layer's output\n",
"- So far, the `LoRALayer` class we implemented above allows us to transform the layer inputs $x$\n",
@@ -576,7 +630,9 @@
{
"cell_type": "markdown",
"id": "3e6d5da0-dfce-4808-b89b-29ff333f563f",
- "metadata": {},
+ "metadata": {
+ "id": "3e6d5da0-dfce-4808-b89b-29ff333f563f"
+ },
"source": [
"- To incorporate the original `Linear` layer weights as shown in the figure above, we implement a `LinearWithLoRA` layer below that uses the previously implemented LoRALayer and can be used to replace existing `Linear` layers in a neural network, for example, the self-attention module or feed forward modules in an LLM"
]
@@ -585,7 +641,9 @@
"cell_type": "code",
"execution_count": 13,
"id": "127d3a64-8359-4b21-b056-78d58cc75fe8",
- "metadata": {},
+ "metadata": {
+ "id": "127d3a64-8359-4b21-b056-78d58cc75fe8"
+ },
"outputs": [],
"source": [
"class LinearWithLoRA(torch.nn.Module):\n",
@@ -603,7 +661,9 @@
{
"cell_type": "markdown",
"id": "e1145a90-35ff-462c-820b-15483fa5b051",
- "metadata": {},
+ "metadata": {
+ "id": "e1145a90-35ff-462c-820b-15483fa5b051"
+ },
"source": [
"- Note that since we initialize the weight matrix $B$ (`self.B` in `LoRALayer`) with zero values in the LoRA layer, the matrix multiplication between $A$ and $B$ results in a matrix consisting of 0's and doesn't affect the original weights (since adding 0 to the original weights does not modify them)"
]
@@ -612,7 +672,7 @@
"cell_type": "markdown",
"id": "e98a6d36-7bc9-434c-a7f1-533f26aff06d",
"metadata": {
- "id": "4D21Jk7Vw3nG"
+ "id": "e98a6d36-7bc9-434c-a7f1-533f26aff06d"
},
"source": [
"- To try LoRA on the GPT model we defined earlier, we define a `replace_linear_with_lora` function to replace all `Linear` layers in the model with the new `LinearWithLoRA` layers\n",
@@ -642,7 +702,9 @@
{
"cell_type": "markdown",
"id": "8c172164-cdde-4489-b7d7-aaed9cc2f5f2",
- "metadata": {},
+ "metadata": {
+ "id": "8c172164-cdde-4489-b7d7-aaed9cc2f5f2"
+ },
"source": [
"- We then freeze the original model parameter and use the `replace_linear_with_lora` to replace the said `Linear` layers using the code below\n",
"- This will replace the `Linear` layers in the LLM with `LinearWithLoRA` layers"
@@ -652,7 +714,13 @@
"cell_type": "code",
"execution_count": 15,
"id": "dbe15350-4da9-4829-9d23-98bbd3d0b1a1",
- "metadata": {},
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "dbe15350-4da9-4829-9d23-98bbd3d0b1a1",
+ "outputId": "fd4c208f-854a-4701-d9d3-9d73af733364"
+ },
"outputs": [
{
"name": "stdout",
@@ -683,19 +751,19 @@
"base_uri": "https://localhost:8080/"
},
"id": "mLk_fPq0yz_u",
- "outputId": "7ba89607-ca75-4718-e8dc-9cdc44c3e410"
+ "outputId": "0a93b8fc-05d7-4ace-ee47-e2fc6bdd7d75"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Total trainable LoRA parameters: 1,333,264\n"
+ "Total trainable LoRA parameters: 2,666,528\n"
]
}
],
"source": [
- "replace_linear_with_lora(model, rank=8, alpha=8)\n",
+ "replace_linear_with_lora(model, rank=16, alpha=16)\n",
"\n",
"total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"print(f\"Total trainable LoRA parameters: {total_params:,}\")"
@@ -704,17 +772,25 @@
{
"cell_type": "markdown",
"id": "b8b6819e-ef7a-4f0d-841a-1b467496bef9",
- "metadata": {},
+ "metadata": {
+ "id": "b8b6819e-ef7a-4f0d-841a-1b467496bef9"
+ },
"source": [
- "- As we can see, we reduced the number of trainable parameters by almost 100x when using LoRA\n",
+ "- As we can see, we reduced the number of trainable parameters by almost 50x when using LoRA\n",
"- Let's now double-check whether the layers have been modified as intended by printing the model architecture"
]
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 17,
"id": "1711be61-bb2c-466f-9b5b-24f4aa5ccd9c",
- "metadata": {},
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "1711be61-bb2c-466f-9b5b-24f4aa5ccd9c",
+ "outputId": "acff8eca-3775-45a2-b62d-032a986ef037"
+ },
"outputs": [
{
"name": "stdout",
@@ -1189,7 +1265,9 @@
{
"cell_type": "markdown",
"id": "c4bbc9d7-65ec-4675-bab8-2e56eb0cfb55",
- "metadata": {},
+ "metadata": {
+ "id": "c4bbc9d7-65ec-4675-bab8-2e56eb0cfb55"
+ },
"source": [
"- Based on the model architecture above, we can see that the model now contains our new `LinearWithLoRA` layers\n",
"- Also, since we initialized matrix $B$ with 0's, we expect the initial model performance to be unchanged compared to before"
@@ -1197,14 +1275,14 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 18,
"id": "DAlrb_I00VEU",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DAlrb_I00VEU",
- "outputId": "3dae5ff0-316d-408e-c8dc-2b8c60f9b994"
+ "outputId": "3da44ac4-230b-4358-d996-30b63f0d962a"
},
"outputs": [
{
@@ -1231,7 +1309,9 @@
{
"cell_type": "markdown",
"id": "13735b3e-f0c3-4dba-ae3d-4141b2878101",
- "metadata": {},
+ "metadata": {
+ "id": "13735b3e-f0c3-4dba-ae3d-4141b2878101"
+ },
"source": [
"- Let's now get to the interesting part and finetune the model by reusing the training function from chapter 6\n",
"- The training takes about 15 minutes on a M3 MacBook Air laptop computer and less than half a minute on a V100 or A100 GPU"
@@ -1239,43 +1319,39 @@
},
{
"cell_type": "code",
- "execution_count": 20,
+ "execution_count": 19,
"id": "wCParRvr0eff",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wCParRvr0eff",
- "outputId": "b86fd5f4-1527-4549-e0b0-9dff37836f0a"
+ "outputId": "ce910a9c-ee89-48bb-bfa6-49c6aee1e450"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Ep 1 (Step 000000): Train loss 2.849, Val loss 2.565\n",
- "Ep 1 (Step 000050): Train loss 0.515, Val loss 0.465\n",
- "Ep 1 (Step 000100): Train loss 0.191, Val loss 0.423\n",
+ "Ep 1 (Step 000000): Train loss 3.820, Val loss 3.462\n",
+ "Ep 1 (Step 000050): Train loss 0.396, Val loss 0.364\n",
+ "Ep 1 (Step 000100): Train loss 0.111, Val loss 0.229\n",
+ "Training accuracy: 97.50% | Validation accuracy: 95.00%\n",
+ "Ep 2 (Step 000150): Train loss 0.135, Val loss 0.073\n",
+ "Ep 2 (Step 000200): Train loss 0.008, Val loss 0.052\n",
+ "Ep 2 (Step 000250): Train loss 0.021, Val loss 0.179\n",
"Training accuracy: 97.50% | Validation accuracy: 97.50%\n",
- "Ep 2 (Step 000150): Train loss 0.170, Val loss 0.072\n",
- "Ep 2 (Step 000200): Train loss 0.014, Val loss 0.087\n",
- "Ep 2 (Step 000250): Train loss 0.027, Val loss 0.197\n",
- "Training accuracy: 100.00% | Validation accuracy: 92.50%\n",
- "Ep 3 (Step 000300): Train loss 0.014, Val loss 0.321\n",
- "Ep 3 (Step 000350): Train loss 0.015, Val loss 0.146\n",
+ "Ep 3 (Step 000300): Train loss 0.096, Val loss 0.080\n",
+ "Ep 3 (Step 000350): Train loss 0.010, Val loss 0.116\n",
+ "Training accuracy: 97.50% | Validation accuracy: 95.00%\n",
+ "Ep 4 (Step 000400): Train loss 0.003, Val loss 0.151\n",
+ "Ep 4 (Step 000450): Train loss 0.008, Val loss 0.077\n",
+ "Ep 4 (Step 000500): Train loss 0.001, Val loss 0.147\n",
"Training accuracy: 100.00% | Validation accuracy: 97.50%\n",
- "Ep 4 (Step 000400): Train loss 0.008, Val loss 0.103\n",
- "Ep 4 (Step 000450): Train loss 0.010, Val loss 0.178\n",
- "Ep 4 (Step 000500): Train loss 0.097, Val loss 0.056\n",
+ "Ep 5 (Step 000550): Train loss 0.007, Val loss 0.094\n",
+ "Ep 5 (Step 000600): Train loss 0.000, Val loss 0.056\n",
"Training accuracy: 100.00% | Validation accuracy: 97.50%\n",
- "Ep 5 (Step 000550): Train loss 0.032, Val loss 0.091\n",
- "Ep 5 (Step 000600): Train loss 0.002, Val loss 0.058\n",
- "Training accuracy: 100.00% | Validation accuracy: 100.00%\n",
- "Ep 6 (Step 000650): Train loss 0.001, Val loss 0.009\n",
- "Ep 6 (Step 000700): Train loss 0.001, Val loss 0.039\n",
- "Ep 6 (Step 000750): Train loss 0.000, Val loss 0.038\n",
- "Training accuracy: 100.00% | Validation accuracy: 95.00%\n",
- "Training completed in 13.70 minutes.\n"
+ "Training completed in 12.10 minutes.\n"
]
}
],
@@ -1290,7 +1366,7 @@
"\n",
"optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)\n",
"\n",
- "num_epochs = 6\n",
+ "num_epochs = 5\n",
"train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(\n",
" model, train_loader, val_loader, optimizer, device,\n",
" num_epochs=num_epochs, eval_freq=50, eval_iter=5,\n",
@@ -1305,27 +1381,29 @@
{
"cell_type": "markdown",
"id": "d0c89e82-3aa8-44c6-b046-0b16200b8e6c",
- "metadata": {},
+ "metadata": {
+ "id": "d0c89e82-3aa8-44c6-b046-0b16200b8e6c"
+ },
"source": [
"- Finally, let's evaluate the model"
]
},
{
"cell_type": "code",
- "execution_count": 21,
+ "execution_count": 20,
"id": "bawWGijA0iF3",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
- "height": 307
+ "height": 308
},
"id": "bawWGijA0iF3",
- "outputId": "4b05b245-ffac-4d36-881b-8306a4da6b75"
+ "outputId": "af70782a-d605-4376-fa6c-d33b38979cfa"
},
"outputs": [
{
"data": {
- "image/png": "",
+ "image/png": "",
"text/plain": [
""
]
@@ -1346,21 +1424,23 @@
{
"cell_type": "markdown",
"id": "aa074723-e3f7-4f7e-a267-855531a037dc",
- "metadata": {},
+ "metadata": {
+ "id": "aa074723-e3f7-4f7e-a267-855531a037dc"
+ },
"source": [
"- Note that we previously calculated the accuracy values on 5 batches only via the `eval_iter=5` setting; below, we calculate the accuracies on the full dataset"
]
},
{
"cell_type": "code",
- "execution_count": 22,
+ "execution_count": 21,
"id": "1D2awlEq0gZi",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1D2awlEq0gZi",
- "outputId": "b482af19-5ebd-45b9-a9f0-99f621203ef9"
+ "outputId": "d603eda1-d912-43eb-ec9c-af6a622510a0"
},
"outputs": [
{
@@ -1369,7 +1449,7 @@
"text": [
"Training accuracy: 100.00%\n",
"Validation accuracy: 96.64%\n",
- "Test accuracy: 98.00%\n"
+ "Test accuracy: 97.33%\n"
]
}
],
@@ -1388,7 +1468,9 @@
{
"cell_type": "markdown",
"id": "1f87f5e6-339e-4fcf-900b-6d845d3c713d",
- "metadata": {},
+ "metadata": {
+ "id": "1f87f5e6-339e-4fcf-900b-6d845d3c713d"
+ },
"source": [
"- As we can see based on the relatively high accuracy values above, the LoRA finetuning was successful"
]
@@ -1415,7 +1497,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.11"
+ "version": "3.11.4"
}
},
"nbformat": 4,
diff --git a/ch06/02_bonus_additional-experiments/README.md b/ch06/02_bonus_additional-experiments/README.md
index 7e011f6..61f6e07 100644
--- a/ch06/02_bonus_additional-experiments/README.md
+++ b/ch06/02_bonus_additional-experiments/README.md
@@ -19,7 +19,7 @@ For example,
| 6 | gpt2-large (774M) | pretrained | last | last_block | longest train ex. (120) | 99.52% | 98.66% | 96.67% | 1.50 min | A100 |
| 7 | gpt2-xl (1558M) | pretrained | last | last_block | longest train ex. (120) | 99.81% | 99.33% | 98.33% | 2.83 min | A100 |
| 8 | gpt2-small (124M) | random | last | all | longest train ex. (120) | 100% | 96.64% | 93.67% | 0.69 min | A100 |
-| 9 | gpt2-small (124M) | pretrained | last | LoRA | longest train ex. (120) | 99.52% | 97.99% | 97.67% | 0.75 min | A100 |
+| 9 | gpt2-small (124M) | pretrained | last | LoRA | longest train ex. (120) | 100.00% | 97.32% | 96.67% | 0.75 min | A100 |
| 10 | gpt2-small (124M) | pretrained | last | last_block | context length (1024) | 83.08% | 87.92% | 78.33% | 2.46 min | A100 |
| 11 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 1) | 100.00% | 98.66% | 98.00% | 1.75 min | A100 |
| 12 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 8) | 99.33% | 98.66% | 98.33% | 1.70 min | A100 |
@@ -41,7 +41,7 @@ You can use the following code to reproduce the experiments:
- Row 6: `python additional-experiments.py --model_size "gpt2-large (774M)"`
- Row 7: `python additional-experiments.py --model_size "gpt2-xl (1558M)"`
- Row 8: `python additional-experiments.py --weights random --trainable_layers all`
-- Row 9: `python additional-experiments.py --trainable_layers lora --lora_rank 16 --lora_alpha 8`
+- Row 9: `python additional-experiments.py --trainable_layers lora --lora_rank 16 --lora_alpha 16`
- Row 10: `python additional-experiments.py --context_length "model_context_length"`
- Row 11: `python additional-experiments.py --no_padding --batch_size 1`
- Row 12: `python additional-experiments.py --no_padding --batch_size 1 --accumulation_steps 8`
@@ -59,7 +59,7 @@ I've kept the LLM and dataset small on purpose, so you can run the training on a
3. **Training All Layers vs. Last Transformer Block (Row 1 vs. 4)**: Training all layers shows a modest improvement of ~2% over just training the last transformer block, but it requires almost three times longer in terms of training duration.
4. **Using Larger Pretrained Models (Row 1 vs 5, and Row 1 vs. 6 and 7)**: Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as was anticipated. Similarly, the 12x larger model improves the predictive performance even further. (The medium model was perhaps not well pretrained or the particular finetuning configuration works not as well for this model.)
5. **Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 8)**: Utilizing a model with random weights yields results that are only slightly worse by 1.3% compared to using pretrained weights.
-6. **Using LoRA (Low-Rank Adaptation) vs Training All Layers (Row 9 vs. 4)**: Keeping the model frozen and adding trainable LoRA layers (see [Appendix E](../../appendix-E/01_main-chapter-code/appendix-E.ipynb) for details) is a viable alternative to training all model parameters and even improves the performance by 1% point. As it can be seen by the 1% lower gap between the training and validation accuracy when using LoRA, this is likely due to less overfitting. Moreover, using LoRA is also slightly faster because fewer parameters have to be updated.
+6. **Using LoRA (Low-Rank Adaptation) vs Training All Layers (Row 9 vs. 4)**: Keeping the model frozen and adding trainable LoRA layers (see [Appendix E](../../appendix-E/01_main-chapter-code/appendix-E.ipynb) for details) is a viable alternative to training all model parameters and even improves the performance by 1 percentage point. As can be seen from the ~1% lower gap between the training and validation accuracy when using LoRA, this is likely due to less overfitting. Moreover, using LoRA is also slightly faster because fewer parameters have to be updated.
7. **Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 10)**: Padding the input to the full supported context length results is significantly worse.
8. **Padding vs no padding (Row 1 vs. 11 and 12)**: The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in a better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to achieve the same batch size as in the other experiments, which helps reduce overfitting and slightly boost the test set accuracy.
9. **Disabling the causal attention mask (Row 1 vs. 13)**: Disables the causal attention mask used in the multi-head attention module. This means all tokens can attend all other tokens. The model accuracy is slightly improved compared to the GPT model with causal mask.
diff --git a/ch06/02_bonus_additional-experiments/additional-experiments.py b/ch06/02_bonus_additional-experiments/additional-experiments.py
index a3dd719..8228778 100644
--- a/ch06/02_bonus_additional-experiments/additional-experiments.py
+++ b/ch06/02_bonus_additional-experiments/additional-experiments.py
@@ -4,6 +4,7 @@
# Code: https://github.com/rasbt/LLMs-from-scratch
import argparse
+import math
import os
from pathlib import Path
import time
@@ -23,8 +24,8 @@ from previous_chapters import GPTModel, load_weights_into_gpt
class LoRALayer(torch.nn.Module):
def __init__(self, in_dim, out_dim, rank, alpha):
super().__init__()
- std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
- self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)
+ self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
+ torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
self.alpha = alpha