From d3201f5aad7d967eb62d68068a3b053d86f9bff5 Mon Sep 17 00:00:00 2001
From: Sebastian Raschka
Date: Sun, 5 May 2024 07:10:04 -0500
Subject: [PATCH] Add figures for ch06 (#141)

---
 ch06/01_main-chapter-code/ch06.ipynb | 296 ++++++++++++++++++++++-----
 1 file changed, 246 insertions(+), 50 deletions(-)

diff --git a/ch06/01_main-chapter-code/ch06.ipynb b/ch06/01_main-chapter-code/ch06.ipynb
index 985e382..feba076 100644
--- a/ch06/01_main-chapter-code/ch06.ipynb
+++ b/ch06/01_main-chapter-code/ch06.ipynb
@@ -25,7 +25,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 2,
+ "execution_count": 1,
 "id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90",
 "metadata": {
 "colab": {
@@ -62,6 +62,14 @@
 " print(f\"{p} version: {version(p)}\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "a445828a-ff10-4efa-9f60-a2e2aed4c87d",
+ "metadata": {},
+ "source": [
+ ""
+ ]
+ },
 {
 "cell_type": "markdown",
 "id": "3a84cf35-b37f-4c15-8972-dfafc9fadc1c",
 "metadata": {
@@ -82,6 +90,42 @@
 "- No code in this section"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "ac45579d-d485-47dc-829e-43be7f4db57b",
+ "metadata": {},
+ "source": [
+ "- The most common ways to finetune language models are instruction-finetuning and classification finetuning\n",
+ "- Instruction-finetuning, depicted below, is the topic of the next chapter"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6c29ef42-46d9-43d4-8bb4-94974e1665e4",
+ "metadata": {},
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a7f60321-95b8-46a9-97bf-1d07fda2c3dd",
+ "metadata": {},
+ "source": [
+ "- Classification finetuning, the topic of this chapter, is a procedure you may already be familiar with if you have a background in machine learning -- it's similar to training a convolutional network to classify handwritten digits, for example\n",
+ "- In classification finetuning, we have a specific number of class labels (for example, \"spam\" and \"not spam\") that the model can output\n",
+ "- A classification-finetuned model can only predict classes it has seen during training (for example, \"spam\" or \"not spam\"), whereas an instruction-finetuned model can usually perform many tasks\n",
+ "- We can think of a classification-finetuned model as a very specialized model; in practice, it is much easier to create a specialized model than a generalist model that performs well on many different tasks"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0b37a0c4-0bb1-4061-b1fe-eaa4416d52c3",
+ "metadata": {},
+ "source": [
+ ""
+ ]
+ },
 {
 "cell_type": "markdown",
 "id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf",
 "metadata": {
@@ -92,6 +136,14 @@
 "## 6.2 Preparing the dataset"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "5f628975-d2e8-4f7f-ab38-92bb868b7067",
+ "metadata": {},
+ "source": [
+ ""
+ ]
+ },
 {
 "cell_type": "markdown",
 "id": "9fbd459f-63fa-4d8c-8499-e23103156c7d",
 "metadata": {
@@ -106,7 +158,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 2,
 "id": "def7c09b-af9c-4216-90ce-5e67aed1065c",
 "metadata": {
 "colab": {
@@ -169,7 +221,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 3,
 "id": "da0ed4da-ac31-4e4d-8bdd-2153be4656a4",
 "metadata": {
 "colab": {
@@ -283,7 +335,7 @@
 "[5572 rows x 2 columns]"
 ]
 },
- "execution_count": 4,
+ "execution_count": 3,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
@@ -307,7 +359,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 5,
+ "execution_count": 4,
 "id": "495a5280-9d7c-41d4-9719-64ab99056d4c",
 "metadata": {
 "colab": {
@@ -345,7 +397,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 5,
"id": "7be4a0a2-9704-4a96-b38f-240339818688", "metadata": { "colab": { @@ -396,7 +448,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "id": "c1b10c3d-5d57-42d0-8de8-cf80a06f5ffd", "metadata": { "id": "c1b10c3d-5d57-42d0-8de8-cf80a06f5ffd" @@ -418,7 +470,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "id": "uQl0Psdmx15D", "metadata": { "id": "uQl0Psdmx15D" @@ -448,6 +500,14 @@ "test_df.to_csv(\"test.csv\", index=None)" ] }, + { + "cell_type": "markdown", + "id": "a8d7a0c5-1d5f-458a-b685-3f49520b0094", + "metadata": {}, + "source": [ + "## 6.3 Creating data loaders" + ] + }, { "cell_type": "markdown", "id": "7126108a-75e7-4862-b0fb-cbf59a18bb6c", @@ -465,7 +525,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7", "metadata": { "colab": { @@ -490,6 +550,27 @@ "print(tokenizer.encode(\"<|endoftext|>\", allowed_special={\"<|endoftext|>\"}))" ] }, + { + "cell_type": "code", + "execution_count": 9, + "id": "0ff0f6b2-376b-4740-8858-55b60784be73", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[42, 13, 314, 481, 1908, 340, 757]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tokenizer.encode(\"K. I will sent it again\")" + ] + }, { "cell_type": "markdown", "id": "04f582ff-68bf-450e-bd87-5fb61afe431c", @@ -500,6 +581,14 @@ "- The `SpamDataset` class below identifies the longest sequence in the training dataset and adds the padding token to the others to match that sequence length" ] }, + { + "cell_type": "markdown", + "id": "0829f33f-1428-4f22-9886-7fee633b3666", + "metadata": {}, + "source": [ + "" + ] + }, { "cell_type": "code", "execution_count": 10, @@ -611,6 +700,14 @@ "- Next, we use the dataset to instantiate the data loaders, which is similar to creating the data loaders in previous chapters:" ] }, + { + "cell_type": "markdown", + "id": "64bcc349-205f-48f8-9655-95ff21f5e72f", + "metadata": {}, + "source": [ + "" + ] + }, { "cell_type": "code", "execution_count": 13, @@ -730,7 +827,7 @@ "id": "d1c4f61a-5f5d-4b3b-97cf-151b617d1d6c" }, "source": [ - "## 6.3 Initializing a model with pretrained weights" + "## 6.4 Initializing a model with pretrained weights" ] }, { @@ -738,7 +835,9 @@ "id": "97e1af8b-8bd1-4b44-8b8b-dc031496e208", "metadata": {}, "source": [ - "- In this section, we initialize the pretrained model we worked with in the previous chapter" + "- In this section, we initialize the pretrained model we worked with in the previous chapter\n", + "\n", + "" ] }, { @@ -819,43 +918,86 @@ { "cell_type": "code", "execution_count": 18, - "id": "fe4af171-5dce-4f6e-9b63-1e4e16e8b94c", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "fe4af171-5dce-4f6e-9b63-1e4e16e8b94c", - "outputId": "8ff3ec54-1dc3-4930-9be6-8eeaf560f8d4" - }, + "id": "d8ac25ff-74b1-4149-8dc5-4c429d464330", + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Output text: Every effort moves you forward.\n", + "Every effort moves you forward.\n", "\n", "The first step is to understand the importance of your work\n" ] } ], "source": [ - "from previous_chapters import generate_text_simple\n", + "from previous_chapters import (\n", + " generate_text_simple,\n", + " text_to_token_ids,\n", + " token_ids_to_text\n", + ")\n", "\n", - "start_context = \"Every effort moves you\"\n", "\n", - "tokenizer = tiktoken.get_encoding(\"gpt2\")\n", - "encoded = 
tokenizer.encode(start_context)\n",
- "encoded_tensor = torch.tensor(encoded).unsqueeze(0)\n",
+ "text_1 = \"Every effort moves you\"\n",
 "\n",
- "out = generate_text_simple(\n",
+ "token_ids = generate_text_simple(\n",
 " model=model,\n",
- " idx=encoded_tensor,\n",
+ " idx=text_to_token_ids(text_1, tokenizer),\n",
 " max_new_tokens=15,\n",
 " context_size=BASE_CONFIG[\"context_length\"]\n",
 ")\n",
- "decoded_text = tokenizer.decode(out.squeeze(0).tolist())\n",
 "\n",
- "print(\"Output text:\", decoded_text)"
+ "print(token_ids_to_text(token_ids, tokenizer))"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "69162550-6a02-4ece-8db1-06c71d61946f",
+ "metadata": {},
+ "source": [
+ "- Before we finetune the model as a classifier, let's see if the model can perhaps already classify spam messages via prompting"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "94224aa9-c95a-4f8a-a420-76d01e3a800c",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Is the following text 'spam'? Answer with 'yes' or 'no': 'You are a winner you have been specially selected to receive $1000 cash or a $2000 award.' Answer with 'yes' or 'no'. Answer with 'yes' or 'no'. Answer with 'yes' or 'no'. Answer with 'yes'\n"
+ ]
+ }
+ ],
+ "source": [
+ "text_2 = (\n",
+ " \"Is the following text 'spam'? Answer with 'yes' or 'no':\"\n",
+ " \" 'You are a winner you have been specially\"\n",
+ " \" selected to receive $1000 cash or a $2000 award.'\"\n",
+ " \" Answer with 'yes' or 'no'.\"\n",
+ ")\n",
+ "\n",
+ "token_ids = generate_text_simple(\n",
+ " model=model,\n",
+ " idx=text_to_token_ids(text_2, tokenizer),\n",
+ " max_new_tokens=23,\n",
+ " context_size=BASE_CONFIG[\"context_length\"]\n",
+ ")\n",
+ "\n",
+ "print(token_ids_to_text(token_ids, tokenizer))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1ce39ed0-2c77-410d-8392-dd15d4b22016",
+ "metadata": {},
+ "source": [
+ "- As we can see, the model is not very good at following instructions\n",
+ "- This is expected, since it has only been pretrained and not instruction-finetuned (instruction finetuning will be covered in the next chapter)"
+ ]
+ },
 {
 "cell_type": "markdown",
@@ -865,7 +1007,15 @@
 "id": "4c9ae440-32f9-412f-96cf-fd52cc3e2522"
 },
 "source": [
- "## 6.4 Adding a classification head"
+ "## 6.5 Adding a classification head"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "d6e9d66f-76b2-40fc-9ec5-3f972a8db9c0",
+ "metadata": {},
+ "source": [
+ ""
+ ]
+ },
 {
 "cell_type": "markdown",
@@ -879,7 +1029,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 19,
+ "execution_count": 20,
 "id": "b23aff91-6bd0-48da-88f6-353657e6c981",
 "metadata": {
 "colab": {
@@ -1149,7 +1299,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 20,
+ "execution_count": 21,
 "id": "fkMWFl-0etea",
 "metadata": {
 "id": "fkMWFl-0etea"
 },
@@ -1171,7 +1321,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 21,
+ "execution_count": 22,
 "id": "7e759fa0-0f69-41be-b576-17e5f20e04cb",
 "metadata": {},
 "outputs": [],
@@ -1192,9 +1342,17 @@
 "- So, we are also making the last transformer block and the final `LayerNorm` module connecting the last transformer block to the output layer trainable"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "0be7c1eb-c46c-4065-8525-eea1b8c66d10",
+ "metadata": {},
+ "source": [
+ ""
+ ]
+ },
 {
 "cell_type": "code",
- "execution_count": 22,
+ "execution_count": 23,
 "id": "2aedc120-5ee3-48f6-92f2-ad9304ebcdc7",
 "metadata": {
 "id": "2aedc120-5ee3-48f6-92f2-ad9304ebcdc7"
 },
@@ -1219,7 +1377,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 23,
+ "execution_count": 24,
"id": "f645c06a-7df6-451c-ad3f-eafb18224ebc", "metadata": { "colab": { @@ -1233,13 +1391,13 @@ "name": "stdout", "output_type": "stream", "text": [ - "Inputs: tensor([[ 40, 1107, 8288, 428, 3807, 13]])\n", - "Inputs dimensions: torch.Size([1, 6])\n" + "Inputs: tensor([[5211, 345, 423, 640]])\n", + "Inputs dimensions: torch.Size([1, 4])\n" ] } ], "source": [ - "inputs = tokenizer.encode(\"I really liked this movie.\")\n", + "inputs = tokenizer.encode(\"Do you have time\")\n", "inputs = torch.tensor(inputs).unsqueeze(0)\n", "print(\"Inputs:\", inputs)\n", "print(\"Inputs dimensions:\", inputs.shape) # shape: (batch_size, num_tokens)" @@ -1255,7 +1413,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 25, "id": "48dc84f1-85cc-4609-9cee-94ff539f00f4", "metadata": { "colab": { @@ -1270,13 +1428,11 @@ "output_type": "stream", "text": [ "Outputs:\n", - " tensor([[[-1.9044, 1.5321],\n", - " [-4.9851, 8.5136],\n", - " [-1.6985, 4.6314],\n", - " [-2.3820, 5.7547],\n", - " [-3.8736, 4.4867],\n", - " [-5.7543, 5.3615]]])\n", - "Outputs dimensions: torch.Size([1, 6, 2])\n" + " tensor([[[-1.5854, 0.9904],\n", + " [-3.7235, 7.4548],\n", + " [-2.2661, 6.6049],\n", + " [-3.5983, 3.9902]]])\n", + "Outputs dimensions: torch.Size([1, 4, 2])\n" ] } ], @@ -1288,6 +1444,14 @@ "print(\"Outputs dimensions:\", outputs.shape) # shape: (batch_size, num_tokens, num_classes)" ] }, + { + "cell_type": "markdown", + "id": "7df9144f-6817-4be4-8d4b-5d4dadfe4a9b", + "metadata": {}, + "source": [ + "" + ] + }, { "cell_type": "markdown", "id": "e3bb8616-c791-4f5c-bac0-5302f663e46a", @@ -1325,12 +1489,28 @@ "print(\"Last output token:\", outputs[:, -1, :])" ] }, + { + "cell_type": "markdown", + "id": "8df08ae0-e664-4670-b7c5-8a2280d9b41b", + "metadata": {}, + "source": [ + "" + ] + }, { "cell_type": "markdown", "id": "32aa4aef-e1e9-491b-9adf-5aa973e59b8c", "metadata": {}, "source": [ - "## 6.5 Calculating the classification loss and accuracy" + "## 6.6 Calculating the classification loss and accuracy" + ] + }, + { + "cell_type": "markdown", + "id": "669e1fd1-ace8-44b4-b438-185ed0ba8b33", + "metadata": {}, + "source": [ + "" ] }, { @@ -1545,7 +1725,7 @@ "id": "456ae0fd-6261-42b4-ab6a-d24289953083" }, "source": [ - "## 6.6 Finetuning the model on supervised data" + "## 6.7 Finetuning the model on supervised data" ] }, { @@ -1560,6 +1740,14 @@ " 2. calculate the accuracy after each epoch instead of printing a sample text after each epoch" ] }, + { + "cell_type": "markdown", + "id": "979b6222-1dc2-4530-9d01-b6b04fe3de12", + "metadata": {}, + "source": [ + "" + ] + }, { "cell_type": "code", "execution_count": 31, @@ -1868,7 +2056,15 @@ "id": "a74d9ad7-3ec1-450e-8c9f-4fc46d3d5bb0", "metadata": {}, "source": [ - "## 6.7 Using the LLM as a SPAM classifier" + "## 6.8 Using the LLM as a SPAM classifier" + ] + }, + { + "cell_type": "markdown", + "id": "72ebcfa2-479e-408b-9cf0-7421f6144855", + "metadata": {}, + "source": [ + "" ] }, { @@ -2069,7 +2265,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.6" + "version": "3.11.4" } }, "nbformat": 4,