LLMs-from-scratch/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "62129596-d10f-45b1-a1af-ee10f358f773",
   "metadata": {
    "id": "62129596-d10f-45b1-a1af-ee10f358f773"
   },
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0bd2379-ed2f-4c77-8b71-f1f0242b9ff9",
   "metadata": {
    "id": "b0bd2379-ed2f-4c77-8b71-f1f0242b9ff9"
   },
   "source": [
    "# Direct Preference Optimization (DPO) for LLM Alignment (From Scratch)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d04cb2b8-d87b-4c6b-a225-c630d758f68e",
   "metadata": {
    "id": "d04cb2b8-d87b-4c6b-a225-c630d758f68e"
   },
   "source": [
    "- This code notebook implements Direct Preference Optimization (DPO) from scratch and applies it to a large language model (LLM) to enhance its ability to generate responses that align more closely with user preferences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "pxMGAf3bnVwn",
   "metadata": {
    "id": "pxMGAf3bnVwn"
   },
   "outputs": [],
   "source": [
    "# !pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "edb3e145-fbaa-4bb3-9e95-186b4145087f",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "edb3e145-fbaa-4bb3-9e95-186b4145087f",
    "outputId": "3d449525-76cc-4124-ab30-a93c6a9623ee"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tiktoken version: 0.7.0\n",
      "torch version: 2.3.1+cu121\n"
     ]
    }
   ],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "pkgs = [\n",
    "    \"tiktoken\",    # Tokenizer\n",
    "    \"torch\",       # Deep learning library\n",
    "]\n",
    "for p in pkgs:\n",
    "    print(f\"{p} version: {version(p)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49ec20a3-a26c-4f9b-8a33-bfd3d67860e2",
   "metadata": {
    "id": "49ec20a3-a26c-4f9b-8a33-bfd3d67860e2"
   },
   "source": [
    "&nbsp;\n",
    "# 1) A brief introduction to DPO"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17804afd-786b-4600-bad0-f5805454e3d6",
   "metadata": {
    "id": "17804afd-786b-4600-bad0-f5805454e3d6"
   },
   "source": [
    "- DPO, proposed in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290), is an alternative to reinforcement learning from human feedback (RLHF) used in finetuning large language models (LLMs)\n",
    "- DPO can be used to finetune (or align) the model to generate responses that better align with user expectations and instructions\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/dpo/1.webp\" width=500px>\n",
    "\n",
    "- In instruction finetuning, we train the LLM to generate correct answers given a prompt\n",
    "- However, in practice, there are multiple ways to give a correct answer, and correct answers can differ in style; for example, consider a technical and a more user-friendly response when asking an LLM to give recommendations when buying a laptop, as shown in the figure below\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/dpo/2.webp\" width=700px>\n",
    "\n",
    "- RLHF and DPO are methods that can be used to teach the LLM to prefer one answer style over the other, that is, aligning better with user preferences\n",
    "- The RLHF process, which requires training a separate reward model, is outlined below\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/dpo/4.webp\" width=600px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9073622f-d537-42bf-8778-43c2adaa2191",
   "metadata": {
    "id": "9073622f-d537-42bf-8778-43c2adaa2191"
   },
   "source": [
    "- Compared to RLHF, DPO aims to simplify the process by optimizing models directly for user preferences without the need for complex reward modeling and policy optimization\n",
    "- In other words, DPO focuses on directly optimizing the model's output to align with human preferences or specific objectives\n",
    "- Shown below is the main idea as an overview of how DPO works\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/dpo/5.webp\" width=600px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c894134a-315c-453e-bbc1-387794b3f4d6",
   "metadata": {
    "id": "c894134a-315c-453e-bbc1-387794b3f4d6"
   },
   "source": [
    "- The concrete equation to implement the DPO loss is shown below; we will revisit the equation when we implement it in Python further down in this code notebook\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/dpo/3.webp\" width=600px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd7491b5-f619-4501-ad39-2942de57c115",
   "metadata": {
    "id": "dd7491b5-f619-4501-ad39-2942de57c115"
   },
   "source": [
    "- In the equation above,\n",
    " - \"expected value\" $\\mathbb{E}$ is statistics jargon and stands for the average or mean value of the random variable (the expression inside the brackets)\n",
    " - The $\\pi_{\\theta}$ variable is the so-called policy (a term borrowed from reinforcement learning) and represents the LLM we want to optimize; $\\pi_{ref}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\\pi_{\\theta}$ and $\\pi_{ref}$ are typically the same)\n",
    " - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ increases the impact of the difference between\n",
    "$\\pi_{\\theta}$ and $\\pi_{ref}$ in terms of their log probabilities on the overall loss function, thereby increasing the divergence between the two models\n",
    "- To avoid bloating the code notebook with a more detailed discussion, I may write a separate standalone article with more details on these concepts in the future\n",
    "- In the meantime, if you are interested in comparing RLHF and DPO, please see the section [2.2. RLHF vs Direct Preference Optimization (DPO)](https://magazine.sebastianraschka.com/i/142924793/rlhf-vs-direct-preference-optimization-dpo) in my article [Tips for LLM Pretraining and Evaluating Reward Models](https://magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "xqVAgsyQ6LuG",
   "metadata": {
    "id": "xqVAgsyQ6LuG",
    "tags": []
   },
   "source": [
    "&nbsp;\n",
    "# 2) Preparing a preference dataset for DPO"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60b2195d-8734-469b-a52e-5031ca7ea6b1",
   "metadata": {
    "id": "60b2195d-8734-469b-a52e-5031ca7ea6b1"
   },
   "source": [
    "- Let's begin by loading and preparing the dataset, which may already answer a lot of the questions you might have before we revisit the DPO loss equation\n",
    "- Here, we work with a dataset that contains more polite and less polite responses to instruction prompts (concrete examples are shown in the next section)\n",
    "- The dataset was generated via the [create-preference-data-ollama.ipynb](create-preference-data-ollama.ipynb) notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "wHLB62Nj7haD",
   "metadata": {
    "id": "wHLB62Nj7haD"
   },
   "source": [
    "&nbsp;\n",
    "## 2.1) Loading a preference dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13e09f99-1b18-4923-ba36-af46d8e3075f",
   "metadata": {
    "id": "13e09f99-1b18-4923-ba36-af46d8e3075f"
   },
   "source": [
    "- The dataset is a json file with 1100 entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5266e66c-5ec0-45e6-a654-148971f6aee7",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5266e66c-5ec0-45e6-a654-148971f6aee7",
    "outputId": "04e8ee70-3076-441d-d2bf-7641da3d0c1d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of entries: 1100\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "\n",
    "file_path = \"instruction-data-with-preference.json\"\n",
    "\n",
    "with open(file_path, \"r\", encoding=\"utf-8\") as file:\n",
    "    data = json.load(file)\n",
    "\n",
    "print(\"Number of entries:\", len(data))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "725d2b9a-d6d2-46e2-89f8-5ab87e040e3b",
   "metadata": {
    "id": "725d2b9a-d6d2-46e2-89f8-5ab87e040e3b"
   },
   "source": [
    "- Let's take a look at two example entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5c11916f-9a26-4367-a16e-7b0c121a20a6",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5c11916f-9a26-4367-a16e-7b0c121a20a6",
    "outputId": "00a432cc-19b1-484f-80e2-e897ee5e4024"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'instruction': 'Identify the correct spelling of the following word.',\n",
      " 'input': 'Ocassion',\n",
      " 'output': \"The correct spelling is 'Occasion.'\",\n",
      " 'rejected': \"The correct spelling is obviously 'Occasion.'\",\n",
      " 'chosen': \"The correct spelling is 'Occasion.'\"}\n"
     ]
    }
   ],
   "source": [
    "import pprint\n",
    "\n",
    "pprint.pp(data[50])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "01ef804a-8c13-4a0b-9b2e-b65a4d0a870d",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "01ef804a-8c13-4a0b-9b2e-b65a4d0a870d",
    "outputId": "078cd643-83fb-4b42-ecf9-3256e8c9d239"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'instruction': \"What is an antonym of 'complicated'?\",\n",
      " 'input': '',\n",
      " 'output': \"An antonym of 'complicated' is 'simple'.\",\n",
      " 'chosen': \"A suitable antonym for 'complicated' would be 'simple'.\",\n",
      " 'rejected': \"An antonym of 'complicated' is 'simple'.\"}\n"
     ]
    }
   ],
   "source": [
    "pprint.pp(data[999])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56db5697-a089-4b40-a1f3-e928e8018220",
   "metadata": {
    "id": "56db5697-a089-4b40-a1f3-e928e8018220"
   },
   "source": [
    "\n",
    "\n",
    "```\n",
    "# This is formatted as code\n",
    "```\n",
    "\n",
    "- As we can see above, the dataset consists of 5 keys:\n",
    "    - The `'instruction'` and `'input'` that are used as LLM inputs\n",
    "    - The `'output'` contains the response the model was trained on via the instruction finetuning step in chapter 7\n",
    "    - the `'chosen'` and `'rejected'` entries are the entries we use for DPO; here `'chosen'` is the preferred response, and `'rejected'` is the dispreferred response\n",
    "- The goal is to get the model to follow the style of the chosen over the rejected responses"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86257468-a6ab-4ba3-9c9f-2fdc2c0cc284",
   "metadata": {
    "id": "86257468-a6ab-4ba3-9c9f-2fdc2c0cc284"
   },
   "source": [
    "- Below is a utility function that formats the model input by applying the Alpaca prompt style similar to chapter 7 ([../01_main-chapter-code/ch07.ipynb](../01_main-chapter-code/ch07.ipynb)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4564d55c-1c5d-46a6-b5e8-46ab568ad627",
   "metadata": {
    "id": "4564d55c-1c5d-46a6-b5e8-46ab568ad627"
   },
   "outputs": [],
   "source": [
    "def format_input(entry):\n",
    "    instruction_text = (\n",
    "        f\"Below is an instruction that describes a task. \"\n",
    "        f\"Write a response that appropriately completes the request.\"\n",
    "        f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
    "    )\n",
    "\n",
    "    input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
    "\n",
    "    return instruction_text + input_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3f38b49f-63fd-48c5-bde8-a4717b7923ea",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "3f38b49f-63fd-48c5-bde8-a4717b7923ea",
    "outputId": "9ad07c59-05b3-42ae-c5bc-68780aaf6780"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Identify the correct spelling of the following word.\n",
      "\n",
      "### Input:\n",
      "Ocassion\n"
     ]
    }
   ],
   "source": [
    "model_input = format_input(data[50])\n",
    "print(model_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dd9e4c9-88a3-463a-8c16-c60ed7e6b51e",
   "metadata": {
    "id": "7dd9e4c9-88a3-463a-8c16-c60ed7e6b51e"
   },
   "source": [
    "- Similarly, we can format the chosen and rejected responses using the Alpaca prompt style:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8ad5831a-e936-44e5-a5cf-02953fe7d848",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "8ad5831a-e936-44e5-a5cf-02953fe7d848",
    "outputId": "2c0a0cbf-c13d-43cf-fcc1-a4585c21e66f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Response:\n",
      "The correct spelling is 'Occasion.'\n"
     ]
    }
   ],
   "source": [
    "desired_response = f\"### Response:\\n{data[50]['chosen']}\"\n",
    "print(desired_response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "fc0991f6-fef7-48ab-8dee-fbd2863f784c",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "fc0991f6-fef7-48ab-8dee-fbd2863f784c",
    "outputId": "cd85406c-3470-48f8-9792-63f91affd50a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Response:\n",
      "The correct spelling is obviously 'Occasion.'\n"
     ]
    }
   ],
   "source": [
    "possible_response = f\"### Response:\\n{data[50]['rejected']}\"\n",
    "print(possible_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6G3j2Q987t_g",
   "metadata": {
    "id": "6G3j2Q987t_g"
   },
   "source": [
    "&nbsp;\n",
    "## 2.2) Creating training, validation, and test splits"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53ce2b1e-32d7-414c-8e6b-01f21a2488c2",
   "metadata": {
    "id": "53ce2b1e-32d7-414c-8e6b-01f21a2488c2"
   },
   "source": [
    "- Next, we divide the dataset into 3 subsets, 85% training data, 5% validation data, and 10% test data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "36c7b919-8531-4e33-aebf-aaf8e6dbcfbd",
   "metadata": {
    "id": "36c7b919-8531-4e33-aebf-aaf8e6dbcfbd"
   },
   "outputs": [],
   "source": [
    "train_portion = int(len(data) * 0.85)  # 85% for training\n",
    "test_portion = int(len(data) * 0.1)    # 10% for testing\n",
    "val_portion = len(data) - train_portion - test_portion  # Remaining 5% for validation\n",
    "\n",
    "train_data = data[:train_portion]\n",
    "test_data = data[train_portion:train_portion + test_portion]\n",
    "val_data = data[train_portion + test_portion:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "831a6c1b-119b-4622-9862-87f1db36e066",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "831a6c1b-119b-4622-9862-87f1db36e066",
    "outputId": "8e017483-1a75-4336-9540-ac6a69104e27"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set length: 935\n",
      "Validation set length: 55\n",
      "Test set length: 110\n"
     ]
    }
   ],
   "source": [
    "print(\"Training set length:\", len(train_data))\n",
    "print(\"Validation set length:\", len(val_data))\n",
    "print(\"Test set length:\", len(test_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c07d09f7-66af-49ed-8b9e-484f46e6a68d",
   "metadata": {
    "id": "c07d09f7-66af-49ed-8b9e-484f46e6a68d"
   },
   "source": [
    "&nbsp;\n",
    "## 2.3) Developing a `PreferenceDataset` class and batch processing function"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86101174-00c8-485d-8273-d086d5311926",
   "metadata": {
    "id": "86101174-00c8-485d-8273-d086d5311926"
   },
   "source": [
    "- In this section, we rewrite the `InstructionDataset` class from chapter 7 ([../01_main-chapter-code/ch07.ipynb](../01_main-chapter-code/ch07.ipynb)) for DPO\n",
    "- This means that instead of focusing on single output sequences (responses), we modify the dataset class to return pairs of responses where one is preferred (\"chosen\") over the other (\"rejected\")\n",
    "- Overall, the `PreferenceDataset` is almost identical to the `InstructionDataset` used in chapter 7:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "db08ad74-6dd4-4e40-b1e5-bc5f037d3d27",
   "metadata": {
    "id": "db08ad74-6dd4-4e40-b1e5-bc5f037d3d27"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch.utils.data import Dataset\n",
    "\n",
    "\n",
    "class PreferenceDataset(Dataset):\n",
    "    def __init__(self, data, tokenizer):\n",
    "        self.data = data\n",
    "\n",
    "        # Pre-tokenize texts\n",
    "        self.encoded_texts = []\n",
    "        for entry in data:\n",
    "            prompt = format_input(entry)\n",
    "            rejected_response = entry[\"rejected\"]\n",
    "            chosen_response = entry[\"chosen\"]\n",
    "\n",
    "            prompt_tokens = tokenizer.encode(prompt)\n",
    "            chosen_full_text = f\"{prompt}\\n\\n### Response:\\n{chosen_response}\"\n",
    "            rejected_full_text = f\"{prompt}\\n\\n### Response:\\n{rejected_response}\"\n",
    "            chosen_full_tokens = tokenizer.encode(chosen_full_text)\n",
    "            rejected_full_tokens = tokenizer.encode(rejected_full_text)\n",
    "\n",
    "            self.encoded_texts.append({\n",
    "                \"prompt\": prompt_tokens,\n",
    "                \"chosen\": chosen_full_tokens,\n",
    "                \"rejected\": rejected_full_tokens,\n",
    "            })\n",
    "\n",
    "    def __getitem__(self, index):\n",
    "        return self.encoded_texts[index]\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.data)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2325d183-75b9-400a-80ac-0b8d2f526561",
   "metadata": {
    "id": "2325d183-75b9-400a-80ac-0b8d2f526561"
   },
   "source": [
    "- Along with an updated `PreferenceDataset` class, we also need an updated batch collation function that we use to pad the sequences in each batch to an equal length so that we can assemble them in batches\n",
    "- I added comments to the code below to illustrate the process; however, it might be easiest to understand how it works by looking at the example inputs and outputs further below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8d3a43a6-7704-4bff-9bbc-a38632374f30",
   "metadata": {
    "id": "8d3a43a6-7704-4bff-9bbc-a38632374f30"
   },
   "outputs": [],
   "source": [
    "def custom_collate_fn(\n",
    "    batch,\n",
    "    pad_token_id=50256,\n",
    "    allowed_max_length=None,\n",
    "    mask_prompt_tokens=True,\n",
    "    device=\"cpu\"\n",
    "):\n",
    "    # Initialize lists to hold batch data\n",
    "    batch_data = {\n",
    "        \"prompt\": [],\n",
    "        \"chosen\": [],\n",
    "        \"rejected\": [],\n",
    "        \"rejected_mask\": [],\n",
    "        \"chosen_mask\": []\n",
    "\n",
    "    }\n",
    "\n",
    "    # Determine the longest sequence to set a common padding length\n",
    "    max_length_common = 0\n",
    "    if batch:\n",
    "        for key in [\"chosen\", \"rejected\"]:\n",
    "            current_max = max(len(item[key])+1 for item in batch)\n",
    "            max_length_common = max(max_length_common, current_max)\n",
    "\n",
    "    # Process each item in the batch\n",
    "    for item in batch:\n",
    "        prompt = torch.tensor(item[\"prompt\"])\n",
    "        batch_data[\"prompt\"].append(prompt)\n",
    "\n",
    "        for key in [\"chosen\", \"rejected\"]:\n",
    "            # Adjust padding according to the common maximum length\n",
    "            sequence = item[key]\n",
    "            padded = sequence + [pad_token_id] * (max_length_common - len(sequence))\n",
    "            mask = torch.ones(len(padded)).bool()\n",
    "\n",
    "            # Set mask for all padding tokens to False\n",
    "            mask[len(sequence):] = False\n",
    "\n",
    "            # Set mask for all input tokens to False\n",
    "            # +2 sets the 2 newline (\"\\n\") tokens before \"### Response\" to False\n",
    "            if mask_prompt_tokens:\n",
    "                mask[:prompt.shape[0]+2] = False\n",
    "\n",
    "            batch_data[key].append(torch.tensor(padded))\n",
    "            batch_data[f\"{key}_mask\"].append(mask)\n",
    "\n",
    "    # Final processing\n",
    "    for key in [\"chosen\", \"rejected\", \"chosen_mask\", \"rejected_mask\"]:\n",
    "        # Stack all sequences into a tensor for the given key\n",
    "        tensor_stack = torch.stack(batch_data[key])\n",
    "\n",
    "        # Optionally truncate to maximum sequence length\n",
    "        if allowed_max_length is not None:\n",
    "            tensor_stack = tensor_stack[:, :allowed_max_length]\n",
    "\n",
    "        # Move to the specified device\n",
    "        batch_data[key] = tensor_stack.to(device)\n",
    "\n",
    "    return batch_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76f3744b-9bb0-4f1e-b66b-cff35ad8fd9f",
   "metadata": {
    "id": "76f3744b-9bb0-4f1e-b66b-cff35ad8fd9f"
   },
   "source": [
    "- Before we start using the custom collate function, let's make version of it with some of its function arguments prefilled:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "d3cc137c-7ed7-4758-a518-cc4071b2817a",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "d3cc137c-7ed7-4758-a518-cc4071b2817a",
    "outputId": "598e9def-9768-441a-f886-01f6ba6e250b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Device: cuda\n"
     ]
    }
   ],
   "source": [
    "from functools import partial\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "print(\"Device:\", device)\n",
    "\n",
    "customized_collate_fn = partial(\n",
    "    custom_collate_fn,\n",
    "    device=device,            # Put the data directly on a GPU if available\n",
    "    mask_prompt_tokens=True,  # This is optional\n",
    "    allowed_max_length=1024   # The supported context length of the model\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d29e996-e267-4348-bc1d-4ac6b725cf6a",
   "metadata": {
    "id": "5d29e996-e267-4348-bc1d-4ac6b725cf6a"
   },
   "source": [
    "- Now, let's see the `customized_collate_fn` in action and apply it to some sample data from our preference dataset; for this, we take the first two entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1171057d-2a0f-48ff-bad6-4917a072f0f5",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "1171057d-2a0f-48ff-bad6-4917a072f0f5",
    "outputId": "3db3eee8-db29-4ff6-8078-6577a05d953a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "{'instruction': 'Evaluate the following phrase by transforming it into the '\n",
      "                'spelling given.',\n",
      " 'input': 'freind --> friend',\n",
      " 'output': 'The spelling of the given phrase \"freind\" is incorrect, the '\n",
      "           'correct spelling is \"friend\".',\n",
      " 'rejected': 'The spelling of the given phrase \"freind\" is flat out wrong, get '\n",
      "             'it together, the correct spelling is \"friend\".',\n",
      " 'chosen': 'The spelling of the given phrase \"freind\" is incorrect, the '\n",
      "           'correct spelling is \"friend\".'}\n",
      "\n",
      "{'instruction': 'Edit the following sentence for grammar.',\n",
      " 'input': 'He go to the park every day.',\n",
      " 'output': 'He goes to the park every day.',\n",
      " 'rejected': 'He goes to the stupid park every single day.',\n",
      " 'chosen': 'He goes to the park every day.'}\n"
     ]
    }
   ],
   "source": [
    "example_data = data[:2]\n",
    "\n",
    "for i in example_data:\n",
    "    print()\n",
    "    pprint.pp(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f1436cc-fbe5-4581-89d8-1992b5f04042",
   "metadata": {
    "id": "8f1436cc-fbe5-4581-89d8-1992b5f04042"
   },
   "source": [
    "- Next, let's instantiate an `example_dataset` and use a PyTorch `DataLoader` to create an `example_dataloader` that mimics the data loader we will use for the model training later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "db327575-c34b-4fea-b3c7-e30569c9be78",
   "metadata": {
    "id": "db327575-c34b-4fea-b3c7-e30569c9be78"
   },
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "\n",
    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
    "\n",
    "example_dataset = PreferenceDataset(example_data, tokenizer)\n",
    "\n",
    "example_dataloader = DataLoader(\n",
    "    example_dataset,\n",
    "    batch_size=2,\n",
    "    collate_fn=customized_collate_fn,\n",
    "    shuffle=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43a446b7-7037-4d9a-9f14-b4ee0f6f37af",
   "metadata": {
    "id": "43a446b7-7037-4d9a-9f14-b4ee0f6f37af"
   },
   "source": [
    "- The dataset has the following keys:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "87ed4cf9-d70a-4bc7-b676-67e76ed3ee10",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "87ed4cf9-d70a-4bc7-b676-67e76ed3ee10",
    "outputId": "fa724d65-b0e1-4239-8090-9263135ad199"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "batch.keys: dict_keys(['prompt', 'chosen', 'rejected', 'rejected_mask', 'chosen_mask'])\n"
     ]
    }
   ],
   "source": [
    "for batch in example_dataloader:\n",
    "    break\n",
    "\n",
    "print(\"batch.keys:\", batch.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5bda3193-8c68-478c-98d8-0d9d880e7077",
   "metadata": {
    "id": "5bda3193-8c68-478c-98d8-0d9d880e7077"
   },
   "source": [
    "- The prompts are a list of tensors, where each tensor contains the token IDs for a given example; since we selected a batch size of 2, we have two lists of token ID tensors here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "468995ce-2906-498f-ac99-0a3f80d13d12",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "468995ce-2906-498f-ac99-0a3f80d13d12",
    "outputId": "7f3df961-fcb5-4e49-9b0c-c99447c67cc1"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,\n",
       "           257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,\n",
       "         21017, 46486,    25,   198,    36,  2100,  4985,   262,  1708,  9546,\n",
       "           416, 25449,   340,   656,   262, 24993,  1813,    13,   198,   198,\n",
       "         21017, 23412,    25,   198, 19503,   521, 14610,  1545]),\n",
       " tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,\n",
       "           257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,\n",
       "         21017, 46486,    25,   198, 18378,   262,  1708,  6827,   329, 23491,\n",
       "            13,   198,   198, 21017, 23412,    25,   198,  1544,   467,   284,\n",
       "           262,  3952,   790,  1110,    13])]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch[\"prompt\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89cadebe-2516-4ae0-a71f-a8a623f2e1da",
   "metadata": {
    "id": "89cadebe-2516-4ae0-a71f-a8a623f2e1da"
   },
   "source": [
    "- We don't really need the responses for training; what we need to feed to the model during training are the `\"chosen\"` and `\"rejected\"` entries\n",
    "- The  `\"chosen\"` and `\"rejected\"` response entries are padded so that we can stack them as tensors; similar to the prompts, these response texts are encoded into token IDs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "e8f49c56-3989-4fe9-81ac-6bb3cce1a5b8",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "e8f49c56-3989-4fe9-81ac-6bb3cce1a5b8",
    "outputId": "ccc0bd06-6e85-4ee9-893b-d985f26a835d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,\n",
       "           257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,\n",
       "         21017, 46486,    25,   198,    36,  2100,  4985,   262,  1708,  9546,\n",
       "           416, 25449,   340,   656,   262, 24993,  1813,    13,   198,   198,\n",
       "         21017, 23412,    25,   198, 19503,   521, 14610,  1545,   198,   198,\n",
       "         21017, 18261,    25,   198,   464, 24993,   286,   262,  1813,  9546,\n",
       "           366, 19503,   521,     1,   318, 11491,    11,   262,  3376, 24993,\n",
       "           318,   366,  6726,  1911, 50256, 50256, 50256, 50256, 50256, 50256,\n",
       "         50256],\n",
       "        [21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,\n",
       "           257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,\n",
       "         21017, 46486,    25,   198, 18378,   262,  1708,  6827,   329, 23491,\n",
       "            13,   198,   198, 21017, 23412,    25,   198,  1544,   467,   284,\n",
       "           262,  3952,   790,  1110,    13,   198,   198, 21017, 18261,    25,\n",
       "           198,  1544,  2925,   284,   262,  3952,   790,  1110,    13, 50256,\n",
       "         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,\n",
       "         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,\n",
       "         50256]], device='cuda:0')"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch[\"chosen\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35a4cd6d-b2ad-45a6-b00a-ba5b720be4ea",
   "metadata": {
    "id": "35a4cd6d-b2ad-45a6-b00a-ba5b720be4ea"
   },
   "source": [
    "- The token IDs above represent the model inputs, but in this format, they are hard to interpret for us humans\n",
    "- So, let's implement a small utility function to convert them back into text so that we can inspect and interpret them more easily:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "52ea54ba-32cb-4ecb-b38b-923f42fd4615",
   "metadata": {
    "id": "52ea54ba-32cb-4ecb-b38b-923f42fd4615"
   },
   "outputs": [],
   "source": [
    "def decode_tokens_from_batch(token_ids, tokenizer):\n",
    "    ids_in_python_list = token_ids.flatten().tolist()\n",
    "    return tokenizer.decode(ids_in_python_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc9dd0ce-1fd4-419c-833f-ea5a1f8d800d",
   "metadata": {
    "id": "bc9dd0ce-1fd4-419c-833f-ea5a1f8d800d"
   },
   "source": [
    "- Let's apply the `decode_tokens_from_batch` utility function to the first prompt entry in the batch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "55ee481e-3e2c-4ff6-b614-8cb18eb16a41",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "55ee481e-3e2c-4ff6-b614-8cb18eb16a41",
    "outputId": "17ddec15-a09d-45b5-b1e8-600cd59a9600"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Evaluate the following phrase by transforming it into the spelling given.\n",
      "\n",
      "### Input:\n",
      "freind --> friend\n"
     ]
    }
   ],
   "source": [
    "text = decode_tokens_from_batch(\n",
    "    token_ids=batch[\"prompt\"][0],  # [0] for the first entry in the batch\n",
    "    tokenizer=tokenizer,\n",
    ")\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "637b95c4-d5c2-4492-9d19-a45b090eee7e",
   "metadata": {
    "id": "637b95c4-d5c2-4492-9d19-a45b090eee7e"
   },
   "source": [
    "- As we can see above, the prompt was correctly formatted; let's now do the same for the `\"chosen\"` response:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "33a24f20-5ec3-4a89-b57a-52e997163d07",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "33a24f20-5ec3-4a89-b57a-52e997163d07",
    "outputId": "e04366ee-3719-4b07-fcef-6e9dddc06310"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Evaluate the following phrase by transforming it into the spelling given.\n",
      "\n",
      "### Input:\n",
      "freind --> friend\n",
      "\n",
      "### Response:\n",
      "The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\".<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>\n"
     ]
    }
   ],
   "source": [
    "text = decode_tokens_from_batch(\n",
    "    token_ids=batch[\"chosen\"][0],\n",
    "    tokenizer=tokenizer,\n",
    ")\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac9fbdbd-1cff-401f-8e6c-cd98c134c0f2",
   "metadata": {
    "id": "ac9fbdbd-1cff-401f-8e6c-cd98c134c0f2"
   },
   "source": [
    "- As we can see above, similar to instruction finetuning, the response that is passed to the model during training also contains the input prompt\n",
    "- Also note that we included `<|endoftext|>` tokens as padding tokens, which are necessary so that we can extend the responses to a similar length to stack them as a batch\n",
    "- Don't worry; the `<|endoftext|>` tokens will be ignored in the loss later so that they won't affect the training outcome\n",
    "- Let's now also inspect the corresponding rejected response:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "db382be5-c727-4299-8597-c05424ba9308",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "db382be5-c727-4299-8597-c05424ba9308",
    "outputId": "edbd8c4a-0528-4361-aeba-9b3c3bbde33b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Evaluate the following phrase by transforming it into the spelling given.\n",
      "\n",
      "### Input:\n",
      "freind --> friend\n",
      "\n",
      "### Response:\n",
      "The spelling of the given phrase \"freind\" is flat out wrong, get it together, the correct spelling is \"friend\".<|endoftext|>\n"
     ]
    }
   ],
   "source": [
    "text = decode_tokens_from_batch(\n",
    "    token_ids=batch[\"rejected\"][0],\n",
    "    tokenizer=tokenizer,\n",
    ")\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "715dc968-aa64-4388-b577-7c295831bdcf",
   "metadata": {
    "id": "715dc968-aa64-4388-b577-7c295831bdcf"
   },
   "source": [
    "- In this case, as we can see above, the rejected response is a more impolite version of the chosen response (we don't want the model to generate impolite responses)\n",
    "- Lastly, let's talk about the data masks: if you took a closer look at our custom collate function we implemented above, we created a `\"chosen_mask\"` and a `\"rejected_mask\"` for each dataset entry\n",
    "- The masks have the same shape as the response entries, as shown below for the `\"chosen\"` entry:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "5c324eab-cf1d-4071-b3ba-797d8ec4d1da",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5c324eab-cf1d-4071-b3ba-797d8ec4d1da",
    "outputId": "742a5742-1bc0-4f74-9eb9-cbf81f936ecb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chosen inputs: torch.Size([81])\n",
      "chosen mask:   torch.Size([81])\n"
     ]
    }
   ],
   "source": [
    "print(\"chosen inputs:\", batch[\"chosen\"][0].shape)\n",
    "print(\"chosen mask:  \", batch[\"chosen_mask\"][0].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "880e95f7-cfc3-4f5f-be5e-c279fba5f674",
   "metadata": {
    "id": "880e95f7-cfc3-4f5f-be5e-c279fba5f674"
   },
   "source": [
    "- The contents of these masks are boolean (`True` and `False`) values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "da75b550-5da4-4292-9a7e-a05b842bdcb7",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "da75b550-5da4-4292-9a7e-a05b842bdcb7",
    "outputId": "e5f012c3-33ba-4e6b-aa55-3e331865218f"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([False, False, False, False, False, False, False, False, False, False,\n",
       "        False, False, False, False, False, False, False, False, False, False,\n",
       "        False, False, False, False, False, False, False, False, False, False,\n",
       "        False, False, False, False, False, False, False, False, False, False,\n",
       "        False, False, False, False, False, False, False, False, False, False,\n",
       "         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,\n",
       "         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,\n",
       "         True,  True,  True,  True, False, False, False, False, False, False,\n",
       "        False], device='cuda:0')"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch[\"chosen_mask\"][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e67b862-4430-4c99-9157-90955dde29b6",
   "metadata": {
    "id": "0e67b862-4430-4c99-9157-90955dde29b6"
   },
   "source": [
    "- The `True` values denote token IDs that correspond to the actual response\n",
    "- the `False` tokens correspond to token IDs that correspond to either prompt tokens (if we set `mask_prompt_tokens=True` in the `customized_collate_fn` function, which we previously did) or padding tokens\n",
    "- Hence, we can use the mask as a selection mask to select only the token IDs that correspond to the response, that is, stripping all prompt and padding tokens, as we can see below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "1114c6fe-524b-401c-b9fe-02260e6f0541",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "1114c6fe-524b-401c-b9fe-02260e6f0541",
    "outputId": "6d99af1d-940a-4012-c5d9-21d463a66e40"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Response:\n",
      "The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\".\n"
     ]
    }
   ],
   "source": [
    "text = decode_tokens_from_batch(\n",
    "    token_ids=batch[\"chosen\"][0][batch[\"chosen_mask\"][0]],\n",
    "    tokenizer=tokenizer,\n",
    ")\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "a89f83a4-d16e-40d2-ba43-bd410affd967",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "a89f83a4-d16e-40d2-ba43-bd410affd967",
    "outputId": "1d439c7e-c079-4594-d02a-fa83a3cb275d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Response:\n",
      "The spelling of the given phrase \"freind\" is flat out wrong, get it together, the correct spelling is \"friend\".\n"
     ]
    }
   ],
   "source": [
    "text = decode_tokens_from_batch(\n",
    "    token_ids=batch[\"rejected\"][0][batch[\"rejected_mask\"][0]],\n",
    "    tokenizer=tokenizer,\n",
    ")\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e525287f-137c-4d71-94ae-cfd6db7b057c",
   "metadata": {
    "id": "e525287f-137c-4d71-94ae-cfd6db7b057c"
   },
   "source": [
    "- We will make use of this mask to ignore prompt and padding tokens when computing the DPO loss later"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "jbafhM_R8z5q",
   "metadata": {
    "id": "jbafhM_R8z5q"
   },
   "source": [
    "&nbsp;\n",
    "## 2.4) Creating training, validation, and test set data loaders"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3c29eb8-d1b9-4abe-a155-52b3270d759a",
   "metadata": {
    "id": "b3c29eb8-d1b9-4abe-a155-52b3270d759a"
   },
   "source": [
    "- Above, we worked with a small example subsets from the preference dataset for illustration purposes\n",
    "- Let's now create the actual training, validation, and test set data loaders\n",
    "- This process is identical to creating the data loaders in the pretraining and instruction finetuning chapters and thus should be self-explanatory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "5c0068bf-bda0-4d9e-9f79-2fc4b94cbd1c",
   "metadata": {
    "id": "5c0068bf-bda0-4d9e-9f79-2fc4b94cbd1c"
   },
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "\n",
    "num_workers = 0\n",
    "batch_size = 8\n",
    "\n",
    "torch.manual_seed(123)\n",
    "\n",
    "train_dataset = PreferenceDataset(train_data, tokenizer)\n",
    "train_loader = DataLoader(\n",
    "    train_dataset,\n",
    "    batch_size=batch_size,\n",
    "    collate_fn=customized_collate_fn,\n",
    "    shuffle=True,\n",
    "    drop_last=True,\n",
    "    num_workers=num_workers\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "2f4a257b-6835-4194-abe2-5831d6a44885",
   "metadata": {
    "id": "2f4a257b-6835-4194-abe2-5831d6a44885"
   },
   "outputs": [],
   "source": [
    "val_dataset = PreferenceDataset(val_data, tokenizer)\n",
    "val_loader = DataLoader(\n",
    "    val_dataset,\n",
    "    batch_size=batch_size,\n",
    "    collate_fn=customized_collate_fn,\n",
    "    shuffle=False,\n",
    "    drop_last=False,\n",
    "    num_workers=num_workers\n",
    ")\n",
    "\n",
    "test_dataset = PreferenceDataset(test_data, tokenizer)\n",
    "test_loader = DataLoader(\n",
    "    test_dataset,\n",
    "    batch_size=batch_size,\n",
    "    collate_fn=customized_collate_fn,\n",
    "    shuffle=False,\n",
    "    drop_last=False,\n",
    "    num_workers=num_workers\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fe1ba19-a6d5-4a77-8283-7a17d7ec06e2",
   "metadata": {
    "id": "1fe1ba19-a6d5-4a77-8283-7a17d7ec06e2"
   },
   "source": [
    "- Let's iterate through the data loader and take a look at the dataset shapes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "80d61f15-facb-4eb8-a9be-6427887d24b2",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "80d61f15-facb-4eb8-a9be-6427887d24b2",
    "outputId": "dacd3bdf-f069-4b36-da2c-d6c1c6cc5405"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train loader:\n",
      "torch.Size([8, 77]) torch.Size([8, 77])\n",
      "torch.Size([8, 81]) torch.Size([8, 81])\n",
      "torch.Size([8, 94]) torch.Size([8, 94])\n",
      "torch.Size([8, 75]) torch.Size([8, 75])\n",
      "torch.Size([8, 75]) torch.Size([8, 75])\n",
      "torch.Size([8, 76]) torch.Size([8, 76])\n",
      "torch.Size([8, 99]) torch.Size([8, 99])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 67]) torch.Size([8, 67])\n",
      "torch.Size([8, 88]) torch.Size([8, 88])\n",
      "torch.Size([8, 65]) torch.Size([8, 65])\n",
      "torch.Size([8, 79]) torch.Size([8, 79])\n",
      "torch.Size([8, 80]) torch.Size([8, 80])\n",
      "torch.Size([8, 97]) torch.Size([8, 97])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 89]) torch.Size([8, 89])\n",
      "torch.Size([8, 75]) torch.Size([8, 75])\n",
      "torch.Size([8, 69]) torch.Size([8, 69])\n",
      "torch.Size([8, 84]) torch.Size([8, 84])\n",
      "torch.Size([8, 79]) torch.Size([8, 79])\n",
      "torch.Size([8, 101]) torch.Size([8, 101])\n",
      "torch.Size([8, 87]) torch.Size([8, 87])\n",
      "torch.Size([8, 73]) torch.Size([8, 73])\n",
      "torch.Size([8, 69]) torch.Size([8, 69])\n",
      "torch.Size([8, 80]) torch.Size([8, 80])\n",
      "torch.Size([8, 68]) torch.Size([8, 68])\n",
      "torch.Size([8, 73]) torch.Size([8, 73])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 91]) torch.Size([8, 91])\n",
      "torch.Size([8, 78]) torch.Size([8, 78])\n",
      "torch.Size([8, 78]) torch.Size([8, 78])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 84]) torch.Size([8, 84])\n",
      "torch.Size([8, 92]) torch.Size([8, 92])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 66]) torch.Size([8, 66])\n",
      "torch.Size([8, 73]) torch.Size([8, 73])\n",
      "torch.Size([8, 73]) torch.Size([8, 73])\n",
      "torch.Size([8, 78]) torch.Size([8, 78])\n",
      "torch.Size([8, 66]) torch.Size([8, 66])\n",
      "torch.Size([8, 76]) torch.Size([8, 76])\n",
      "torch.Size([8, 100]) torch.Size([8, 100])\n",
      "torch.Size([8, 77]) torch.Size([8, 77])\n",
      "torch.Size([8, 92]) torch.Size([8, 92])\n",
      "torch.Size([8, 93]) torch.Size([8, 93])\n",
      "torch.Size([8, 115]) torch.Size([8, 115])\n",
      "torch.Size([8, 81]) torch.Size([8, 81])\n",
      "torch.Size([8, 95]) torch.Size([8, 95])\n",
      "torch.Size([8, 81]) torch.Size([8, 81])\n",
      "torch.Size([8, 94]) torch.Size([8, 94])\n",
      "torch.Size([8, 70]) torch.Size([8, 70])\n",
      "torch.Size([8, 89]) torch.Size([8, 89])\n",
      "torch.Size([8, 90]) torch.Size([8, 90])\n",
      "torch.Size([8, 70]) torch.Size([8, 70])\n",
      "torch.Size([8, 85]) torch.Size([8, 85])\n",
      "torch.Size([8, 65]) torch.Size([8, 65])\n",
      "torch.Size([8, 76]) torch.Size([8, 76])\n",
      "torch.Size([8, 72]) torch.Size([8, 72])\n",
      "torch.Size([8, 84]) torch.Size([8, 84])\n",
      "torch.Size([8, 84]) torch.Size([8, 84])\n",
      "torch.Size([8, 65]) torch.Size([8, 65])\n",
      "torch.Size([8, 63]) torch.Size([8, 63])\n",
      "torch.Size([8, 74]) torch.Size([8, 74])\n",
      "torch.Size([8, 79]) torch.Size([8, 79])\n",
      "torch.Size([8, 93]) torch.Size([8, 93])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 99]) torch.Size([8, 99])\n",
      "torch.Size([8, 81]) torch.Size([8, 81])\n",
      "torch.Size([8, 77]) torch.Size([8, 77])\n",
      "torch.Size([8, 74]) torch.Size([8, 74])\n",
      "torch.Size([8, 75]) torch.Size([8, 75])\n",
      "torch.Size([8, 73]) torch.Size([8, 73])\n",
      "torch.Size([8, 87]) torch.Size([8, 87])\n",
      "torch.Size([8, 80]) torch.Size([8, 80])\n",
      "torch.Size([8, 75]) torch.Size([8, 75])\n",
      "torch.Size([8, 81]) torch.Size([8, 81])\n",
      "torch.Size([8, 86]) torch.Size([8, 86])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 63]) torch.Size([8, 63])\n",
      "torch.Size([8, 82]) torch.Size([8, 82])\n",
      "torch.Size([8, 68]) torch.Size([8, 68])\n",
      "torch.Size([8, 76]) torch.Size([8, 76])\n",
      "torch.Size([8, 68]) torch.Size([8, 68])\n",
      "torch.Size([8, 97]) torch.Size([8, 97])\n",
      "torch.Size([8, 72]) torch.Size([8, 72])\n",
      "torch.Size([8, 85]) torch.Size([8, 85])\n",
      "torch.Size([8, 67]) torch.Size([8, 67])\n",
      "torch.Size([8, 85]) torch.Size([8, 85])\n",
      "torch.Size([8, 87]) torch.Size([8, 87])\n",
      "torch.Size([8, 76]) torch.Size([8, 76])\n",
      "torch.Size([8, 74]) torch.Size([8, 74])\n",
      "torch.Size([8, 92]) torch.Size([8, 92])\n",
      "torch.Size([8, 85]) torch.Size([8, 85])\n",
      "torch.Size([8, 72]) torch.Size([8, 72])\n",
      "torch.Size([8, 93]) torch.Size([8, 93])\n",
      "torch.Size([8, 82]) torch.Size([8, 82])\n",
      "torch.Size([8, 76]) torch.Size([8, 76])\n",
      "torch.Size([8, 93]) torch.Size([8, 93])\n",
      "torch.Size([8, 80]) torch.Size([8, 80])\n",
      "torch.Size([8, 87]) torch.Size([8, 87])\n",
      "torch.Size([8, 69]) torch.Size([8, 69])\n",
      "torch.Size([8, 90]) torch.Size([8, 90])\n",
      "torch.Size([8, 99]) torch.Size([8, 99])\n",
      "torch.Size([8, 104]) torch.Size([8, 104])\n",
      "torch.Size([8, 101]) torch.Size([8, 101])\n",
      "torch.Size([8, 98]) torch.Size([8, 98])\n",
      "torch.Size([8, 79]) torch.Size([8, 79])\n",
      "torch.Size([8, 71]) torch.Size([8, 71])\n",
      "torch.Size([8, 76]) torch.Size([8, 76])\n",
      "torch.Size([8, 79]) torch.Size([8, 79])\n",
      "torch.Size([8, 79]) torch.Size([8, 79])\n",
      "torch.Size([8, 67]) torch.Size([8, 67])\n",
      "torch.Size([8, 84]) torch.Size([8, 84])\n",
      "torch.Size([8, 78]) torch.Size([8, 78])\n",
      "torch.Size([8, 85]) torch.Size([8, 85])\n",
      "torch.Size([8, 70]) torch.Size([8, 70])\n"
     ]
    }
   ],
   "source": [
    "print(\"Train loader:\")\n",
    "for batch in train_loader:\n",
    "    print(\n",
    "        batch[\"chosen\"].shape,\n",
    "        batch[\"rejected\"].shape,\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ff958a6-5e61-49f5-9a97-360aa34e3758",
   "metadata": {
    "id": "7ff958a6-5e61-49f5-9a97-360aa34e3758"
   },
   "source": [
    "- Each row shows the shape of the `\"chosen\"` and `\"rejected\"` entries in each batch\n",
    "- Since we applied padding on a batch-by-batch basis, each row has a different shape\n",
    "- This is for efficiency reasons because it would be inefficient to pad all samples to the longest sample in the whole dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29cb0543-1142-4374-8825-3384e20c6ac0",
   "metadata": {
    "id": "29cb0543-1142-4374-8825-3384e20c6ac0"
   },
   "source": [
    "&nbsp;\n",
    "# 3) Loading a finetuned LLM for DPO alignment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22b08881-b769-4b26-8153-5ec0e8573ed2",
   "metadata": {
    "id": "22b08881-b769-4b26-8153-5ec0e8573ed2"
   },
   "source": [
    "- LLM alignment steps, such as RLHF or DPO, assume that we already have an instruction-finetuned model\n",
    "- This section contains minimal code to load the model that was instruction finetuned and saved in chapter 7 (via [../01_main-chapter-code/ch07.ipynb](../01_main-chapter-code/ch07.ipynb))\n",
    "- Make sure you run the chapter 7 code first to create the instruction-finetuned model before you proceed\n",
    "- The code below will copy the instruction-finetuned model into the current directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "b3c6d82b-63f7-459a-b901-7125ab225e56",
   "metadata": {
    "id": "b3c6d82b-63f7-459a-b901-7125ab225e56"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "import shutil\n",
    "\n",
    "\n",
    "finetuned_model_path = Path(\"gpt2-medium355M-sft.pth\")\n",
    "if not finetuned_model_path.exists():\n",
    "\n",
    "    # Try finding the model checkpoint locally:\n",
    "    relative_path = Path(\"..\") / \"ch07\" / finetuned_model_path\n",
    "    if relative_path.exists():\n",
    "        shutil.copy(relative_path, \".\")\n",
    "\n",
    "    # If this notebook is run on Google Colab, get it from a Googe Drive folder\n",
    "    elif \"COLAB_GPU\" in os.environ or \"COLAB_TPU_ADDR\" in os.environ:\n",
    "        from google.colab import drive\n",
    "        drive.mount(\"/content/drive\")\n",
    "        google_drive_path = \"/content/drive/My Drive/Books/LLMs-From-Scratch/ch07/gpt2-medium355M-sft__colab.pth\"  # Readers need to adjust this path\n",
    "        shutil.copy(google_drive_path, \".\")\n",
    "\n",
    "        # Rename the file because my finetuned model in colab has the \"__colab.pth\" suffix\n",
    "        filename = os.path.basename(google_drive_path)\n",
    "        if filename.endswith(\"__colab.pth\"):\n",
    "            new_filename = filename.replace(\"__colab.pth\", \".pth\")\n",
    "            os.rename(filename, new_filename)\n",
    "\n",
    "    else:\n",
    "        print(\n",
    "            f\"Could not find '{finetuned_model_path}'.\\n\"\n",
    "            \"Run the `ch07.ipynb` notebook to finetune and save the finetuned model.\"\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71c8585e-4569-4033-84a7-3903d0e8aaf8",
   "metadata": {
    "id": "71c8585e-4569-4033-84a7-3903d0e8aaf8"
   },
   "source": [
    "- Next, we reuse the basic configuration from previous chapters to load the model weights:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "a8333fee-e7fe-4f8c-9411-8c1db6252d98",
   "metadata": {
    "id": "a8333fee-e7fe-4f8c-9411-8c1db6252d98"
   },
   "outputs": [],
   "source": [
    "from previous_chapters import GPTModel\n",
    "\n",
    "\n",
    "BASE_CONFIG = {\n",
    "    \"vocab_size\": 50257,     # Vocabulary size\n",
    "    \"context_length\": 1024,  # Context length\n",
    "    \"drop_rate\": 0.0,        # Dropout rate\n",
    "    \"qkv_bias\": True         # Query-key-value bias\n",
    "}\n",
    "\n",
    "model_configs = {\n",
    "    \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n",
    "    \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n",
    "    \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n",
    "    \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n",
    "}\n",
    "\n",
    "CHOOSE_MODEL = \"gpt2-medium (355M)\"\n",
    "\n",
    "BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n",
    "\n",
    "model = GPTModel(BASE_CONFIG)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "c2821403-605c-4071-a4ff-e23f4c9a11fd",
   "metadata": {
    "id": "c2821403-605c-4071-a4ff-e23f4c9a11fd"
   },
   "outputs": [],
   "source": [
    "model.load_state_dict(\n",
    "    torch.load(\n",
    "        \"gpt2-medium355M-sft.pth\",\n",
    "        map_location=torch.device(\"cpu\"),\n",
    "        weights_only=True\n",
    "    )\n",
    ")\n",
    "model.eval();"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61863bec-bd42-4194-b994-645bfe2df8be",
   "metadata": {
    "id": "61863bec-bd42-4194-b994-645bfe2df8be"
   },
   "source": [
    "- Before training the loaded model with DPO, let's make sure that the finetuned model was saved and loaded correctly by trying it out on some sample data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "4357aec5-0db2-4d73-b37b-539cd8fa80a3",
   "metadata": {
    "id": "4357aec5-0db2-4d73-b37b-539cd8fa80a3"
   },
   "outputs": [],
   "source": [
    "prompt = \"\"\"Below is an instruction that describes a task. Write a response\n",
    "that appropriately completes the request.\n",
    "\n",
    "### Instruction:\n",
    "Convert the active sentence to passive: 'The chef cooks the meal every day.'\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "541e7988-38d3-47f6-bd52-9da6564479fa",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "541e7988-38d3-47f6-bd52-9da6564479fa",
    "outputId": "278f7ddf-37c2-4c3a-d069-c510ef6f8d7a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response\n",
      "that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Convert the active sentence to passive: 'The chef cooks the meal every day.'\n",
      "\n",
      "### Response:\n",
      "The meal is cooked every day by the chef.\n"
     ]
    }
   ],
   "source": [
    "from previous_chapters import (\n",
    "    generate,\n",
    "    text_to_token_ids,\n",
    "    token_ids_to_text\n",
    ")\n",
    "\n",
    "torch.manual_seed(123)\n",
    "\n",
    "token_ids = generate(\n",
    "    model=model,\n",
    "    idx=text_to_token_ids(prompt, tokenizer),\n",
    "    max_new_tokens=35,\n",
    "    context_size=BASE_CONFIG[\"context_length\"],\n",
    "    eos_id=50256\n",
    ")\n",
    "\n",
    "response = token_ids_to_text(token_ids, tokenizer)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be87ed19-fded-4e56-8585-6c7c0367b354",
   "metadata": {
    "id": "be87ed19-fded-4e56-8585-6c7c0367b354"
   },
   "source": [
    "- As we can see above, the model gives a reasonable and correct response\n",
    "- As explained in chapter 7, in practice, we would clean up the response to only return the response text with the prompt and prompt style removed (similar to what you are familiar with from ChatGPT, for example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "0c30c4e2-af84-4ab4-95d0-9641e32c1e7f",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "0c30c4e2-af84-4ab4-95d0-9641e32c1e7f",
    "outputId": "70192bbe-fdf6-43eb-c673-f573f8c70156"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The meal is cooked every day by the chef.\n"
     ]
    }
   ],
   "source": [
    "def extract_response(response_text, input_text):\n",
    "    return response_text[len(input_text):].replace(\"### Response:\", \"\").strip()\n",
    "\n",
    "response = extract_response(response, prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80442cb9-83b1-46b8-bad0-7d44297ca52d",
   "metadata": {
    "id": "80442cb9-83b1-46b8-bad0-7d44297ca52d"
   },
   "source": [
    "- Now, we are almost ready to get to the DPO part\n",
    "- As mentioned at the beginning of this notebook, DPO works with two LLMs: a policy model (the LLM that we want to optimize) and a reference model (the original model that we keep unchanged)\n",
    "- Below, we rename the `model` as `policy_model` and instantiate a second instance of the model we refer to as the `reference_model`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "5d88cc3a-312e-4b29-bc6d-de8354c1eb9f",
   "metadata": {
    "id": "5d88cc3a-312e-4b29-bc6d-de8354c1eb9f"
   },
   "outputs": [],
   "source": [
    "policy_model = model\n",
    "\n",
    "reference_model = GPTModel(BASE_CONFIG)\n",
    "reference_model.load_state_dict(\n",
    "    torch.load(\n",
    "        \"gpt2-medium355M-sft.pth\",\n",
    "        map_location=torch.device(\"cpu\"),\n",
    "        weights_only=True\n",
    "    )\n",
    ")\n",
    "reference_model.eval()\n",
    "\n",
    "policy_model.to(device)\n",
    "reference_model.to(device);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c6c1469-0038-4914-8aa5-15b1f81877cc",
   "metadata": {
    "id": "9c6c1469-0038-4914-8aa5-15b1f81877cc"
   },
   "source": [
    "&nbsp;\n",
    "# 4) Coding the DPO Loss Function"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75dbe60c-e4ce-413e-beec-22eff0237d11",
   "metadata": {
    "id": "75dbe60c-e4ce-413e-beec-22eff0237d11"
   },
   "source": [
    "- After we took care of the model loading and dataset preparation in the previous sections, we can now get to the fun part and code the DPO loss\n",
    "- Note that the DPO loss code below is based on the method proposed in the [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290) paper\n",
    "- For reference, the core DPO equation is shown again below:\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/dpo/3.webp\" width=800px>\n",
    "\n",
    "- In the equation above,\n",
    " - \"expected value\" $\\mathbb{E}$ is statistics jargon and stands for the average or mean value of the random variable (the expression inside the brackets)\n",
    " - The $\\pi_{\\theta}$ variable is the so-called policy (a term borrowed from reinforcement learning) and represents the LLM we want to optimize; $\\pi_{ref}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\\pi_{\\theta}$ and $\\pi_{ref}$ are typically the same)\n",
    " - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ increases the impact of the difference between\n",
    "$\\pi_{\\theta}$ and $\\pi_{ref}$ in terms of their log probabilities on the overall loss function, thereby increasing the divergence between the two models\n",
    "- In code, we can implement the DPO loss as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "38CsrrwJIZiV",
   "metadata": {
    "id": "38CsrrwJIZiV"
   },
   "outputs": [],
   "source": [
    "import torch.nn.functional as F\n",
    "\n",
    "def compute_dpo_loss(\n",
    "      model_chosen_logprobs,\n",
    "      model_rejected_logprobs,\n",
    "      reference_chosen_logprobs,\n",
    "      reference_rejected_logprobs,\n",
    "      beta=0.1,\n",
    "    ):\n",
    "    \"\"\"Compute the DPO loss for a batch of policy and reference model log probabilities.\n",
    "\n",
    "    Args:\n",
    "        policy_chosen_logprobs: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)\n",
    "        policy_rejected_logprobs: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)\n",
    "        reference_chosen_logprobs: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)\n",
    "        reference_rejected_logprobs: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)\n",
    "        beta: Temperature parameter for the DPO loss; typically something in the range of 0.1 to 0.5. We ignore the reference model as beta -> 0.\n",
    "        label_smoothing: conservativeness for DPO loss.\n",
    "\n",
    "    Returns:\n",
    "        A tuple of three tensors: (loss, chosen_rewards, rejected_rewards).\n",
    "    \"\"\"\n",
    "\n",
    "    model_logratios = model_chosen_logprobs - model_rejected_logprobs\n",
    "    reference_logratios = reference_chosen_logprobs - reference_rejected_logprobs\n",
    "    logits = model_logratios - reference_logratios\n",
    "\n",
    "    # DPO (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf)\n",
    "    losses = -F.logsigmoid(beta * logits)\n",
    "\n",
    "    # Optional values to track progress during training\n",
    "    chosen_rewards = (model_chosen_logprobs - reference_chosen_logprobs).detach()\n",
    "    rejected_rewards = (model_rejected_logprobs - reference_rejected_logprobs).detach()\n",
    "\n",
    "    # .mean() to average over the samples in the batch\n",
    "    return losses.mean(), chosen_rewards.mean(), rejected_rewards.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "693be65b-38fc-4d18-bf53-a260a15436e1",
   "metadata": {
    "id": "693be65b-38fc-4d18-bf53-a260a15436e1"
   },
   "source": [
    "- If you are familiar with logarithms, note that we have the general relationship $\\log\\left(\\frac{a}{b}\\right) = \\log a - \\log b$, which we applied in the code above\n",
    "- Keeping this in mind, let's go through some of the steps (we will calculate the `logprobs` using a separate function later)\n",
    "- Let's start with the lines\n",
    "\n",
    "```python\n",
    "model_logratios = model_chosen_logprobs - model_rejected_logprobs\n",
    "reference_logratios = reference_chosen_logprobs - reference_rejected_logprobs\n",
    "```\n",
    "\n",
    "- These lines above calculate the difference in log probabilities (logits) for the chosen and rejected samples for both the policy model and the reference model (this is due to $\\log\\left(\\frac{a}{b}\\right) = \\log a - \\log b$):\n",
    "\n",
    "$$\\log \\left( \\frac{\\pi_\\theta (y_w \\mid x)}{\\pi_\\theta (y_l \\mid x)} \\right) \\quad \\text{and} \\quad \\log \\left( \\frac{\\pi_{\\text{ref}}(y_w \\mid x)}{\\pi_{\\text{ref}}(y_l \\mid x)} \\right)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5458d217-e0ad-40a5-925c-507a8fcf5795",
   "metadata": {
    "id": "5458d217-e0ad-40a5-925c-507a8fcf5795"
   },
   "source": [
    "- Next, the code `logits = model_logratios - reference_logratios` computes the difference between the model's log ratios and the reference model's log ratios, i.e., \n",
    "\n",
    "$$\\beta \\log \\left( \\frac{\\pi_\\theta (y_w \\mid x)}{\\pi_{\\text{ref}} (y_w \\mid x)} \\right)\n",
    "- \\beta \\log \\left( \\frac{\\pi_\\theta (y_l \\mid x)}{\\pi_{\\text{ref}} (y_l \\mid x)} \\right)$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f18e3e36-f5f1-407f-b662-4c20a0ac0354",
   "metadata": {
    "id": "f18e3e36-f5f1-407f-b662-4c20a0ac0354"
   },
   "source": [
    "- Finally, `losses = -F.logsigmoid(beta * logits)`  calculates the loss using the log-sigmoid function; in the original equation, the term inside the expectation is \n",
    "\n",
    "$$\\log \\sigma \\left( \\beta \\log \\left( \\frac{\\pi_\\theta (y_w \\mid x)}{\\pi_{\\text{ref}} (y_w \\mid x)} \\right)\n",
    "- \\beta \\log \\left( \\frac{\\pi_\\theta (y_l \\mid x)}{\\pi_{\\text{ref}} (y_l \\mid x)} \\right) \\right)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00a6f92d-7d64-41fe-bcaa-2bddd46027e1",
   "metadata": {
    "id": "00a6f92d-7d64-41fe-bcaa-2bddd46027e1"
   },
   "source": [
    "- Above, we assumed that the log probabilities were already computed; let's now define a `compute_logprobs` function that we can use to compute these log probabilities that were passed into the `compute_dpo_loss` function above, that is, the values $\\pi_\\theta (y_w \\mid x)$, ${\\pi_\\theta (y_l \\mid x)}$, and so forth:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "71e6507b-d2e2-4469-86b9-f057b08b5df9",
   "metadata": {
    "id": "71e6507b-d2e2-4469-86b9-f057b08b5df9"
   },
   "outputs": [],
   "source": [
    "def compute_logprobs(logits, labels, selection_mask=None):\n",
    "    \"\"\"\n",
    "    Compute log probabilities.\n",
    "\n",
    "    Args:\n",
    "      logits: Tensor of shape (batch_size, num_tokens, vocab_size)\n",
    "      labels: Tensor of shape (batch_size, snum_tokens)\n",
    "      selection_mask: Tensor for shape (batch_size, num_tokens)\n",
    "\n",
    "    Returns:\n",
    "      mean_log_prob: Mean log probability excluding padding tokens.\n",
    "    \"\"\"\n",
    "\n",
    "    # Labels are the inputs shifted by one\n",
    "    labels = labels[:, 1:].clone()\n",
    "\n",
    "    # Truncate logits to match the labels num_tokens\n",
    "    logits = logits[:, :-1, :]\n",
    "\n",
    "    log_probs = F.log_softmax(logits, dim=-1)\n",
    "\n",
    "    # Gather the log probabilities for the actual labels\n",
    "    selected_log_probs = torch.gather(\n",
    "        input=log_probs,\n",
    "        dim=-1,\n",
    "        index=labels.unsqueeze(-1)\n",
    "    ).squeeze(-1)\n",
    "\n",
    "    if selection_mask is not None:\n",
    "        mask = selection_mask[:, 1:].clone()\n",
    "\n",
    "        # Apply the mask to filter out padding tokens\n",
    "        selected_log_probs = selected_log_probs * mask\n",
    "\n",
    "        # Calculate the average log probability excluding padding tokens\n",
    "        # This averages over the tokens, so the shape is (batch_size, num_tokens)\n",
    "        avg_log_prob = selected_log_probs.sum(-1) / mask.sum(-1)\n",
    "\n",
    "        return avg_log_prob\n",
    "\n",
    "    else:\n",
    "        return selected_log_probs.mean(-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf6a71ac-3fcc-44a4-befc-1c56bbd378d7",
   "metadata": {
    "id": "cf6a71ac-3fcc-44a4-befc-1c56bbd378d7"
   },
   "source": [
    "- Note that this function above might look a bit intimidating at first due to the `torch.gather` function, but it's pretty similar to what happens under the hood in PyTorch's  `cross_entropy` function\n",
    "- For example, consider the following example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "59873470-464d-4be2-860f-cbb7ac2d80ba",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "59873470-464d-4be2-860f-cbb7ac2d80ba",
    "outputId": "8f7b47d4-73fe-4605-c17d-ad6cfd909a9b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(1.4185) tensor(1.4185)\n"
     ]
    }
   ],
   "source": [
    "# Sample data\n",
    "logits = torch.tensor(\n",
    "    [[2.0, 1.0, 0.1],\n",
    "     [0.5, 2.5, 0.3]])  # Shape: (2, 3)\n",
    "targets = torch.tensor([0, 2])  # Shape: (2,)\n",
    "\n",
    "\n",
    "# Manual loss using torch.gather\n",
    "log_softmax_logits = F.log_softmax(logits, dim=1)  # Shape: (2, 3)\n",
    "selected_log_probs = torch.gather(\n",
    "    input=log_softmax_logits,\n",
    "    dim=1,\n",
    "    index=targets.unsqueeze(1), # Shape 2, 1\n",
    ").squeeze(1)  # Shape: (2,)\n",
    "manual_loss = -selected_log_probs.mean()  # Averaging over the batch\n",
    "\n",
    "\n",
    "# PyTorch loss\n",
    "cross_entropy_loss = F.cross_entropy(logits, targets)\n",
    "\n",
    "print(manual_loss, cross_entropy_loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f86d7add-f7ff-4a87-9193-7878c42bf0e7",
   "metadata": {
    "id": "f86d7add-f7ff-4a87-9193-7878c42bf0e7"
   },
   "source": [
    "- So, above, we can see that the two implementations are equivalent, but let's narrow down a bit further to the `torch.gather` mechanics\n",
    "- Consider the following two tensors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "508db6ba-cc40-479f-a996-2250cf862388",
   "metadata": {
    "id": "508db6ba-cc40-479f-a996-2250cf862388"
   },
   "outputs": [],
   "source": [
    "t = torch.tensor(\n",
    "  [[1., 2.,],\n",
    "   [3., 4.]]\n",
    ")\n",
    "\n",
    "m = torch.tensor(\n",
    "  [[1, 1],\n",
    "   [0, 1]]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "821cbf45-8fbb-47b7-bae8-6c3271e36979",
   "metadata": {
    "id": "821cbf45-8fbb-47b7-bae8-6c3271e36979"
   },
   "source": [
    "- Above, `t` is a tensor we want to select from, and `m` is a mask to specify how we want to select\n",
    " - For instance, since `m` contains `[1, 1]` n the first row, it will select two times the value of `t` in index position `1`, which is the value 2.\n",
    " - The second row of `m`, `[0, 1]`, selects index positions 0 and 1 in the second row or `t`, which are `3.` and `4.`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "4fdN5q1YPAbM",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4fdN5q1YPAbM",
    "outputId": "e935e8ad-1519-4c4b-dbff-65adae0a15a4"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[2., 2.],\n",
       "        [3., 4.]])"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.gather(input=t, dim=-1, index=m)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d10eeaf4-f24b-4e79-916a-abedf74fe4a3",
   "metadata": {
    "id": "d10eeaf4-f24b-4e79-916a-abedf74fe4a3"
   },
   "source": [
    "- In other words, `torch.gather` is a selection function\n",
    "- When we computed the loss earlier, we used it to retrieve the log probabilities corresponding to the correct token in the 50,256-token vocabulary\n",
    "- The \"correct\" tokens are the tokens given in the response entry"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5d10a43-ee5b-47ed-9d55-ddd96e66cf0b",
   "metadata": {
    "id": "d5d10a43-ee5b-47ed-9d55-ddd96e66cf0b"
   },
   "source": [
    "- Regarding the `compute_logprobs` function above, we use `torch.gather` here because it gives us a bit more control than `cross_entropy`, but is, in essence, a similar idea\n",
    "- The `selection_mask` we use there is to optionally ignore prompt and padding tokens\n",
    "- We can then use the `compute_logprobs` function as follows to compute the inputs for the `compute_dpo_loss` loss function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "dfa7a4db-eba0-47d8-ad6d-7b5e7676e318",
   "metadata": {
    "id": "dfa7a4db-eba0-47d8-ad6d-7b5e7676e318"
   },
   "outputs": [],
   "source": [
    "def compute_dpo_loss_batch(batch, policy_model, reference_model, beta):\n",
    "    \"\"\"Compute the DPO loss on an input batch\"\"\"\n",
    "\n",
    "    # where policy_model(batch[\"chosen\"]) are the logits\n",
    "    policy_chosen_log_probas = compute_logprobs(\n",
    "        logits=policy_model(batch[\"chosen\"]),\n",
    "        labels=batch[\"chosen\"],\n",
    "        selection_mask=batch[\"chosen_mask\"]\n",
    "    )\n",
    "    policy_rejected_log_probas = compute_logprobs(\n",
    "        logits=policy_model(batch[\"rejected\"]),\n",
    "        labels=batch[\"rejected\"],\n",
    "        selection_mask=batch[\"rejected_mask\"]\n",
    "    )\n",
    "    ref_chosen_log_probas = compute_logprobs(\n",
    "        logits=reference_model(batch[\"chosen\"]),\n",
    "        labels=batch[\"chosen\"],\n",
    "        selection_mask=batch[\"chosen_mask\"]\n",
    "    )\n",
    "    ref_rejected_log_probas = compute_logprobs(\n",
    "        logits=reference_model(batch[\"rejected\"]),\n",
    "        labels=batch[\"rejected\"],\n",
    "        selection_mask=batch[\"rejected_mask\"]\n",
    "    )\n",
    "    loss, chosen_rewards, rejected_rewards = compute_dpo_loss(\n",
    "        model_chosen_logprobs=policy_chosen_log_probas,\n",
    "        model_rejected_logprobs=policy_rejected_log_probas,\n",
    "        reference_chosen_logprobs=ref_chosen_log_probas,\n",
    "        reference_rejected_logprobs=ref_rejected_log_probas,\n",
    "        beta=beta\n",
    "    )\n",
    "    return loss, chosen_rewards, rejected_rewards"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b28caafb-f378-4332-a142-3e0f9ef67fbb",
   "metadata": {
    "id": "b28caafb-f378-4332-a142-3e0f9ef67fbb"
   },
   "source": [
    "- The above function works for a single batch, for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "dd74fcc4-4280-41e9-9a22-838e85c84ee4",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "dd74fcc4-4280-41e9-9a22-838e85c84ee4",
    "outputId": "65a70828-7dd2-4f72-ffec-45aeaf8afad0"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(tensor(0.6931, device='cuda:0'), tensor(0., device='cuda:0'), tensor(0., device='cuda:0'))\n"
     ]
    }
   ],
   "source": [
    "with torch.no_grad():\n",
    "    loss = compute_dpo_loss_batch(batch, policy_model, reference_model, beta=0.1)\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b17429cd-2a00-41c8-9f16-38b1c9a5179f",
   "metadata": {
    "id": "b17429cd-2a00-41c8-9f16-38b1c9a5179f"
   },
   "source": [
    "- Below, we extend this function to work for a specified `num_batches` in a data loader:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "682e9ad5-c5de-4d1b-9e93-3918bf5d5302",
   "metadata": {
    "id": "682e9ad5-c5de-4d1b-9e93-3918bf5d5302"
   },
   "outputs": [],
   "source": [
    "def compute_dpo_loss_loader(data_loader, policy_model, reference_model, beta, num_batches=None):\n",
    "    \"\"\"Apply compute_dpo_loss_batch to a whole data loader\"\"\"\n",
    "\n",
    "    total_loss, total_chosen_rewards, total_rejected_rewards = 0., 0., 0.\n",
    "    if len(data_loader) == 0:\n",
    "        return float(\"nan\")\n",
    "\n",
    "    elif num_batches is None:\n",
    "        num_batches = len(data_loader)\n",
    "    else:\n",
    "        # Reduce the number of batches to match the total number of batches in the data loader\n",
    "        # if num_batches exceeds the number of batches in the data loader\n",
    "        num_batches = min(num_batches, len(data_loader))\n",
    "    for i, batch in enumerate(data_loader):\n",
    "        if i < num_batches:\n",
    "            loss, chosen_rewards, rejected_rewards = compute_dpo_loss_batch(\n",
    "                batch=batch,\n",
    "                policy_model=policy_model,\n",
    "                reference_model=reference_model,\n",
    "                beta=beta\n",
    "            )\n",
    "            total_loss += loss.item()\n",
    "            total_chosen_rewards += chosen_rewards.item()\n",
    "            total_rejected_rewards += rejected_rewards.item()\n",
    "\n",
    "        else:\n",
    "            break\n",
    "\n",
    "    # calculate average\n",
    "    total_loss /= num_batches\n",
    "    total_chosen_rewards /= num_batches\n",
    "    total_rejected_rewards /= num_batches\n",
    "    return total_loss, total_chosen_rewards, total_rejected_rewards"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "852e4c09-d285-44d5-be12-d29769950cb6",
   "metadata": {
    "id": "852e4c09-d285-44d5-be12-d29769950cb6"
   },
   "source": [
    "- Why a specified `num_batches`? That's purely for efficiency reasons (because calculating the loss on the whole dataset each time would slow down the training significantly"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2cca95b7-18fe-4076-9138-f70f21607b8c",
   "metadata": {
    "id": "2cca95b7-18fe-4076-9138-f70f21607b8c"
   },
   "source": [
    "- Lastly, we define a convenience function for our training function later; this `evaluate_dpo_loss_loader` function computes the DPO loss and rewards for both the training and validation loader for logging purposes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "c3d214ec-49ba-4bf0-ac80-f90fa0d832e9",
   "metadata": {
    "id": "c3d214ec-49ba-4bf0-ac80-f90fa0d832e9"
   },
   "outputs": [],
   "source": [
    "def evaluate_dpo_loss_loader(policy_model, reference_model, train_loader, val_loader, beta, eval_iter):\n",
    "    \"\"\"Compute the DPO loss for the training and validation dataset\"\"\"\n",
    "\n",
    "    policy_model.eval()\n",
    "    with torch.no_grad():\n",
    "        train_loss, train_chosen_rewards, train_rejected_rewards = compute_dpo_loss_loader(\n",
    "            data_loader=train_loader,\n",
    "            policy_model=policy_model,\n",
    "            reference_model=reference_model,\n",
    "            beta=beta,\n",
    "            num_batches=eval_iter\n",
    "        )\n",
    "\n",
    "        val_loss, val_chosen_rewards, val_rejected_rewards = compute_dpo_loss_loader(\n",
    "            data_loader=val_loader,\n",
    "            policy_model=policy_model,\n",
    "            reference_model=reference_model,\n",
    "            beta=beta,\n",
    "            num_batches=eval_iter\n",
    "        )\n",
    "\n",
    "    res = {\n",
    "        \"train_loss\": train_loss,\n",
    "        \"train_chosen_reward\": train_chosen_rewards,\n",
    "        \"train_rejected_reward\": train_rejected_rewards,\n",
    "        \"val_loss\": val_loss,\n",
    "        \"val_chosen_reward\": val_chosen_rewards,\n",
    "        \"val_rejected_reward\": val_rejected_rewards\n",
    "    }\n",
    "\n",
    "    policy_model.train()\n",
    "    return res"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e95ed92-6743-4f13-8b91-0fbf2e540de1",
   "metadata": {
    "id": "6e95ed92-6743-4f13-8b91-0fbf2e540de1"
   },
   "source": [
    "- In this section, we covered a lot of ground as a brief recap:\n",
    "  - The flow is: compute `logits` via the models $\\rightarrow$ `compute_logprobs` from logits $\\rightarrow$ compute `compute_dpo_loss` from log probabilities\n",
    "  - we have the `compute_dpo_loss_batch` function that facilitates the process above\n",
    "  - the `compute_dpo_loss_loader` utility function applies the `compute_dpo_loss_batch` function to a data loader\n",
    "  - the `evaluate_dpo_loss_loader` function applies the `compute_dpo_loss_batch` to both the training and validation set data loaders for logging purposes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb8a8f18-536e-4d83-a0d0-ac518a85f157",
   "metadata": {
    "id": "cb8a8f18-536e-4d83-a0d0-ac518a85f157"
   },
   "source": [
    "&nbsp;\n",
    "# 5) Training the model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b11d63d-3ddc-4070-9b2b-5ca0edb08d0c",
   "metadata": {
    "id": "4b11d63d-3ddc-4070-9b2b-5ca0edb08d0c"
   },
   "source": [
    "- After setting up the DPO loss functions in the previous section, we can now finally train the model\n",
    "- Note that this training function is the same one we used for pretraining and instruction finetuning, with minor differences:\n",
    " - we swap the cross entropy loss with our new DPO loss function\n",
    " - we also track the rewards and reward margins, which are commonly used in RLHF and DPO contexts to track the training progress\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "f90d9325-77b2-417f-88ff-0a5174889413",
   "metadata": {
    "id": "f90d9325-77b2-417f-88ff-0a5174889413"
   },
   "outputs": [],
   "source": [
    "from previous_chapters import generate_and_print_sample\n",
    "\n",
    "\n",
    "def train_model_dpo_simple(\n",
    "    policy_model, reference_model, train_loader, val_loader,\n",
    "    optimizer, num_epochs, beta,\n",
    "    eval_freq, eval_iter, start_context, tokenizer\n",
    "):\n",
    "\n",
    "    # Initialize lists to track losses and tokens seen\n",
    "    tracking = {\n",
    "        \"train_losses\": [],\n",
    "        \"train_chosen_rewards\": [],\n",
    "        \"train_rejected_rewards\": [],\n",
    "        \"val_losses\": [],\n",
    "        \"val_chosen_rewards\": [],\n",
    "        \"val_rejected_rewards\": [],\n",
    "        \"tokens_seen\": []\n",
    "    }\n",
    "    tokens_seen, global_step = 0, -1\n",
    "\n",
    "    # Main training loop\n",
    "    for epoch in range(num_epochs):\n",
    "        policy_model.train()  # Set model to training mode\n",
    "\n",
    "        for batch_idx, batch in enumerate(train_loader):\n",
    "\n",
    "            optimizer.zero_grad()  # Reset loss gradients from previous epoch\n",
    "\n",
    "            loss, chosen_rewards, rejected_rewards = compute_dpo_loss_batch(\n",
    "                batch=batch,\n",
    "                policy_model=policy_model,\n",
    "                reference_model=reference_model,\n",
    "                beta=beta\n",
    "            )\n",
    "\n",
    "            loss.backward()  # Calculate loss gradients\n",
    "            optimizer.step()  # Update model weights using loss gradients\n",
    "\n",
    "            tokens_seen += batch[\"chosen\"].numel()\n",
    "            global_step += 1\n",
    "\n",
    "            # Optional evaluation step\n",
    "            if global_step % eval_freq == 0:\n",
    "                res = evaluate_dpo_loss_loader(\n",
    "                    policy_model=policy_model,\n",
    "                    reference_model=reference_model,\n",
    "                    train_loader=train_loader,\n",
    "                    val_loader=val_loader,\n",
    "                    beta=beta,\n",
    "                    eval_iter=eval_iter\n",
    "                )\n",
    "                tracking[\"train_losses\"].append(res[\"train_loss\"])\n",
    "                tracking[\"train_chosen_rewards\"].append(res[\"train_chosen_reward\"])\n",
    "                tracking[\"train_rejected_rewards\"].append(res[\"train_rejected_reward\"])\n",
    "                tracking[\"val_losses\"].append(res[\"val_loss\"])\n",
    "                tracking[\"val_chosen_rewards\"].append(res[\"val_chosen_reward\"])\n",
    "                tracking[\"val_rejected_rewards\"].append(res[\"val_rejected_reward\"])\n",
    "                tracking[\"tokens_seen\"].append(tokens_seen)\n",
    "                train_reward_margin = res[\"train_chosen_reward\"] - res[\"train_rejected_reward\"]\n",
    "                val_reward_margin = res[\"val_chosen_reward\"] - res[\"val_rejected_reward\"]\n",
    "\n",
    "                print(\n",
    "                    f\"Ep {epoch+1} (Step {global_step:06d}): \"\n",
    "                    f\"Train loss {res['train_loss']:.3f}, Val loss {res['val_loss']:.3f}, \"\n",
    "                    f\"Train reward margins {train_reward_margin:.3f}, \"\n",
    "                    f\"Val reward margins {val_reward_margin:.3f}\"\n",
    "                )\n",
    "\n",
    "        # Print a sample text after each epoch\n",
    "        generate_and_print_sample(\n",
    "            model=model,\n",
    "            tokenizer=tokenizer,\n",
    "            device=loss.device,\n",
    "            start_context=start_context\n",
    "        )\n",
    "\n",
    "    return tracking"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "820d4904-f819-4d62-bfb4-85cf28863683",
   "metadata": {
    "id": "820d4904-f819-4d62-bfb4-85cf28863683"
   },
   "source": [
    "- Before we start the training, let's print the initial losses and rewards:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "d53210c5-6d9c-46b0-af22-ee875c2806c5",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "d53210c5-6d9c-46b0-af22-ee875c2806c5",
    "outputId": "8b1d2b39-16c5-4b99-e920-5b33d3c0f34d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training loss: 0.6931471824645996\n",
      "Validation loss: 0.6931471824645996\n",
      "Train reward margin: 0.0\n",
      "Val reward margin: 0.0\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader\n",
    "\n",
    "res = evaluate_dpo_loss_loader(\n",
    "    policy_model=policy_model,\n",
    "    reference_model=reference_model,\n",
    "    train_loader=train_loader,\n",
    "    val_loader=val_loader,\n",
    "    beta=0.1,\n",
    "    eval_iter=5\n",
    ")\n",
    "\n",
    "print(\"Training loss:\", res[\"train_loss\"])\n",
    "print(\"Validation loss:\", res[\"val_loss\"])\n",
    "\n",
    "print(\"Train reward margin:\", res[\"train_chosen_reward\"] - res[\"train_rejected_reward\"])\n",
    "print(\"Val reward margin:\", res[\"val_chosen_reward\"] - res[\"val_rejected_reward\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a006e91-df94-43ca-8025-1ba791e37bc4",
   "metadata": {
    "id": "4a006e91-df94-43ca-8025-1ba791e37bc4"
   },
   "source": [
    "- Also, let's take a look at some of the initial model responses (the first 3 examples in the validation set):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "q4Ro9DrBa7zH",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "q4Ro9DrBa7zH",
    "outputId": "b974d4bd-b92a-4a2a-bb7a-5a2a0d1eca11"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Convert the active sentence to passive: 'The chef cooks the meal every day.'\n",
      "\n",
      "Correct response:\n",
      ">> The meal is cooked by the chef every day.\n",
      "\n",
      "Model response:\n",
      ">> The meal is cooked every day by the chef.\n",
      "\n",
      "-------------------------------------\n",
      "\n",
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Classify an input string as either a noun or a verb.\n",
      "\n",
      "### Input:\n",
      "Dance\n",
      "\n",
      "Correct response:\n",
      ">> 'Dance' can be classified as a verb.\n",
      "\n",
      "Model response:\n",
      ">> \"Dance\" can be classified as a verb.\n",
      "\n",
      "-------------------------------------\n",
      "\n",
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Rewrite the sentence using a metaphor.\n",
      "\n",
      "### Input:\n",
      "The book is very interesting.\n",
      "\n",
      "Correct response:\n",
      ">> The book is a page-turner.\n",
      "\n",
      "Model response:\n",
      ">> The book is a treat.\n",
      "\n",
      "-------------------------------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "\n",
    "for entry in val_data[:3]:\n",
    "\n",
    "    input_text = format_input(entry)\n",
    "\n",
    "    token_ids = generate(\n",
    "        model=model,\n",
    "        idx=text_to_token_ids(input_text, tokenizer).to(device),\n",
    "        max_new_tokens=256,\n",
    "        context_size=BASE_CONFIG[\"context_length\"],\n",
    "        eos_id=50256\n",
    "    )\n",
    "    generated_text = token_ids_to_text(token_ids, tokenizer)\n",
    "    response_text = (\n",
    "        generated_text[len(input_text):]\n",
    "        .replace(\"### Response:\", \"\")\n",
    "        .strip()\n",
    ")\n",
    "\n",
    "    print(input_text)\n",
    "    print(f\"\\nCorrect response:\\n>> {entry['output']}\")\n",
    "    print(f\"\\nModel response:\\n>> {response_text.strip()}\")\n",
    "    print(\"\\n-------------------------------------\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac2386ae-5c4c-448e-bfbf-4ec0604b171e",
   "metadata": {
    "id": "ac2386ae-5c4c-448e-bfbf-4ec0604b171e"
   },
   "source": [
    "- Above, we see the original model responses\n",
    "- Note that the goal of DPO is to induce slight style changes; this means we want the model to generate similar but slightly more polite responses\n",
    "- Before we execute the following code cell that starts the training, here are a few notes about some of the settings:\n",
    " - we are only passing the parameters of the policy model into the `AdamW` optimizer; that's the model we want to optimize (we don't want to modify the reference model)\n",
    " - we only train for 1 epoch; that's because DPO is very prone to collapse (the loss might improve, but the model will start generating nonsensical texts)\n",
    " - in DPO, it's best to use a very small learning rate\n",
    " - the beta value can be increased from 0.1 to 0.5 to reduce the effect of DPO (we use 0.1 here to make the results more noticeable)\n",
    " - The training takes about 2 minutes on an A100 GPU, but it can also be trained in 4 minutes on a smaller L4 GPU; training on a M3 MacBook Air takes about 30 minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "54b739be-871e-4c97-bf14-ffd2c58e1311",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "54b739be-871e-4c97-bf14-ffd2c58e1311",
    "outputId": "d98b08b0-c325-411e-a1a4-05e7403f0345"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ep 1 (Step 000000): Train loss 0.692, Val loss 0.693, Train reward margins 0.019, Val reward margins 0.009\n",
      "Ep 1 (Step 000005): Train loss 0.690, Val loss 0.691, Train reward margins 0.070, Val reward margins 0.052\n",
      "Ep 1 (Step 000010): Train loss 0.687, Val loss 0.688, Train reward margins 0.126, Val reward margins 0.108\n",
      "Ep 1 (Step 000015): Train loss 0.676, Val loss 0.685, Train reward margins 0.362, Val reward margins 0.173\n",
      "Ep 1 (Step 000020): Train loss 0.676, Val loss 0.680, Train reward margins 0.351, Val reward margins 0.264\n",
      "Ep 1 (Step 000025): Train loss 0.666, Val loss 0.676, Train reward margins 0.564, Val reward margins 0.359\n",
      "Ep 1 (Step 000030): Train loss 0.672, Val loss 0.672, Train reward margins 0.456, Val reward margins 0.441\n",
      "Ep 1 (Step 000035): Train loss 0.663, Val loss 0.669, Train reward margins 0.658, Val reward margins 0.511\n",
      "Ep 1 (Step 000040): Train loss 0.666, Val loss 0.666, Train reward margins 0.597, Val reward margins 0.574\n",
      "Ep 1 (Step 000045): Train loss 0.648, Val loss 0.662, Train reward margins 0.982, Val reward margins 0.660\n",
      "Ep 1 (Step 000050): Train loss 0.648, Val loss 0.659, Train reward margins 0.993, Val reward margins 0.734\n",
      "Ep 1 (Step 000055): Train loss 0.647, Val loss 0.656, Train reward margins 1.014, Val reward margins 0.799\n",
      "Ep 1 (Step 000060): Train loss 0.652, Val loss 0.653, Train reward margins 0.893, Val reward margins 0.870\n",
      "Ep 1 (Step 000065): Train loss 0.631, Val loss 0.650, Train reward margins 1.361, Val reward margins 0.948\n",
      "Ep 1 (Step 000070): Train loss 0.618, Val loss 0.646, Train reward margins 1.699, Val reward margins 1.038\n",
      "Ep 1 (Step 000075): Train loss 0.617, Val loss 0.642, Train reward margins 1.733, Val reward margins 1.121\n",
      "Ep 1 (Step 000080): Train loss 0.592, Val loss 0.639, Train reward margins 2.333, Val reward margins 1.194\n",
      "Ep 1 (Step 000085): Train loss 0.610, Val loss 0.636, Train reward margins 1.907, Val reward margins 1.275\n",
      "Ep 1 (Step 000090): Train loss 0.650, Val loss 0.633, Train reward margins 0.964, Val reward margins 1.353\n",
      "Ep 1 (Step 000095): Train loss 0.607, Val loss 0.630, Train reward margins 1.962, Val reward margins 1.423\n",
      "Ep 1 (Step 000100): Train loss 0.600, Val loss 0.627, Train reward margins 2.127, Val reward margins 1.500\n",
      "Ep 1 (Step 000105): Train loss 0.590, Val loss 0.624, Train reward margins 2.458, Val reward margins 1.564\n",
      "Ep 1 (Step 000110): Train loss 0.607, Val loss 0.622, Train reward margins 1.976, Val reward margins 1.621\n",
      "Ep 1 (Step 000115): Train loss 0.621, Val loss 0.620, Train reward margins 1.605, Val reward margins 1.682\n",
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Rewrite the sentence using a metaphor.  ### Input: The book is very interesting.  ### Response: The book is a treat.<|endoftext|>The following is an instruction that describes a task. Write a response that appropriately completes the request.  ### Input: The assignment was written by the student.  ### Response\n",
      "Training completed in 1.69 minutes.\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "start_time = time.time()\n",
    "\n",
    "torch.manual_seed(123)\n",
    "\n",
    "\n",
    "optimizer = torch.optim.AdamW(policy_model.parameters(), lr=5e-6, weight_decay=0.01)\n",
    "\n",
    "num_epochs = 1\n",
    "tracking = train_model_dpo_simple(\n",
    "    policy_model=policy_model,\n",
    "    reference_model=reference_model,\n",
    "    train_loader=train_loader,\n",
    "    val_loader=val_loader,\n",
    "    optimizer=optimizer,\n",
    "    num_epochs=num_epochs,\n",
    "    beta=0.1, # value between 0.1 and 0.5\n",
    "    eval_freq=5,\n",
    "    eval_iter=5,\n",
    "    start_context=format_input(val_data[2]),\n",
    "    tokenizer=tokenizer\n",
    ")\n",
    "\n",
    "end_time = time.time()\n",
    "execution_time_minutes = (end_time - start_time) / 60\n",
    "print(f\"Training completed in {execution_time_minutes:.2f} minutes.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eba8ea88-8771-4eb9-855d-2fe1ca2dc2fa",
   "metadata": {
    "id": "eba8ea88-8771-4eb9-855d-2fe1ca2dc2fa"
   },
   "source": [
    "- As we can see based on the tracked results above, the loss improves\n",
    "- Also, the reward margins, which is the difference between the rewards of the chosen and the rejected responses, improve, which is a good sign\n",
    "- Let's take a more concrete look at these results in the next section"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11e23989-92bd-4ac2-a4bc-65d4c7ac334e",
   "metadata": {
    "id": "11e23989-92bd-4ac2-a4bc-65d4c7ac334e"
   },
   "source": [
    "&nbsp;\n",
    "# 6) Analyzing the results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66d7d5fe-c617-45cb-8ea9-ddc7baa22654",
   "metadata": {
    "id": "66d7d5fe-c617-45cb-8ea9-ddc7baa22654"
   },
   "source": [
    "- Let's begin analyzing the results by plotting the DPO loss:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "8ddcc66f-cd7c-4f46-96ea-af919ea1a199",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 307
    },
    "id": "8ddcc66f-cd7c-4f46-96ea-af919ea1a199",
    "outputId": "c7164b26-8d32-41d1-8c6a-ab835d58d4c5"
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeoAAAEiCAYAAAA21pHjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAABs/klEQVR4nO3deVhU5dvA8e/MsO8gsimLKCpuiCCIqGXillmapZmVWtkvxS1b1NdS27TSzErTtFIrTcvSzH3Jpdz3FXFBBRdARXZZ57x/jAySqIDADHh/rutcwpnnnHM/jHDPec6zqBRFURBCCCGEUVIbOgAhhBBC3J0kaiGEEMKISaIWQgghjJgkaiGEEMKISaIWQgghjJgkaiGEEMKISaIWQgghjJgkaiGEEMKISaIWQgghjJgkaiGqsfPnz6NSqTh06JChQxFClJEkaiGMnEqluuc2ceJEQ4cohKhAJoYOQAhxb1euXNF/vWTJEsaPH090dLR+n42NjSHCEkJUErmjFsLIubm56Td7e3tUKpX+excXF6ZNm0bt2rUxNzenefPmrF279q7nys/P5+WXX6Zhw4bExsYC8Oeff9KiRQssLCzw9fXl/fffJy8vT3+MSqXiu+++o2fPnlhZWeHn58eKFSv0r9+4cYN+/fpRs2ZNLC0t8fPzY968eXeNYenSpTRt2hRLS0tq1KhBREQEGRkZ+te/++47/P39sbCwoGHDhnzzzTdFjo+Li6N37944ODjg5OTEU089xfnz5/WvDxgwgB49ejB16lTc3d2pUaMGkZGR5ObmlvhnLoRRUYQQVca8efMUe3t7/ffTpk1T7OzslF9++UU5efKk8s477yimpqbKqVOnFEVRlHPnzimAcvDgQSUrK0vp2bOnEhgYqCQmJiqKoijbtm1T7OzslPnz5ytnz55V1q9fr/j4+CgTJ07UXwNQateurSxatEg5ffq0Mnz4cMXGxka5fv26oiiKEhkZqTRv3lzZu3evcu7cOWXDhg3KihUrio3/8uXLiomJiTJt2jTl3LlzypEjR5SZM2cqaWlpiqIoys8//6y4u7srv//+uxITE6P8/vvvipOTkzJ//nxFURQlJydH8ff3V15++WXlyJEjyokTJ5Tnn39eadCggZKdna0oiqL0799fsbOzU15//XUlKipK+euvvxQrKytlzpw55ftmCFFJJFELUYX8N1F7eHgoH3/8cZEyLVu2VIYMGaIoSmGi/ueff5QOHToobdq0UZKTk/VlO3TooEyaNKnI8T/99JPi7u6u/x5Q3n33Xf336enpCqCsWbNGURRF6d69uzJw4MASxb9//34FUM6fP1/s63Xr1lUWLVpUZN+HH36ohIWF6WNr0KCBotVq9a9nZ2crlpaWyrp16xRF0SVqb29vJS8vT1/m2WefVfr06VOiGIUwNvKMWogqKjU1lcuXLxMeHl5kf3h4OIcPHy6yr2/fvtSuXZu///4bS0tL/f7Dhw+zfft2Pv74Y/2+/Px8srKyyMzMxMrKCoBmzZrpX7e2tsbOzo7ExEQABg8eTK9evThw4ACdOnWiR48etG7dutiYAwIC6NChA02bNqVz58506tSJZ555BkdHRzIyMjh79iyvvPIKgwYN0h+Tl5eHvb29Pt4zZ85ga2tb5LxZWVmcPXtW/33jxo3RaDT6793d3Tl69Og9fppCGC9J1EI8BB5//HF+/vlndu7cyWOPPabfn56ezvvvv8/TTz99xzEWFhb6r01NTYu8plKp0Gq1AHTt2pULFy6wevVqNmzYQIcOHYiMjGTq1Kl3nFOj0bBhwwZ27NjB+vXr+frrrxk3bhy7d+/WfyiYO3cuoaGhdxxXEG9QUBALFy6849w1a9YsUbxCVDWSqIWoouzs7PDw8GD79u088sgj+v3bt28nJCSkSNnBgwfTpEkTnnzySVatWqUv36JFC6Kjo6lXr94DxVKzZk369+9P//79adu2LW+//XaxiRp0STM8PJzw8HDGjx+Pt7c3y5YtY9SoUXh4eBATE0O/fv2KPbZFixYsWbIEFxcX7OzsHihmIaoKSdRCVGFvv/02EyZMoG7dujRv3px58+Zx6NChYu84hw0bRn5+Pk888QRr1qyhTZs2jB8/nieeeAIvLy+eeeYZ1Go1hw8f5tixY3z00UclimH8+PEEBQXRuHFjsrOzWblyJf7+/sWW3b17N5s2baJTp064uLiwe/durl69qi///vvvM3z4cOzt7enSpQvZ2dns27ePGzduMGrUKPr168eUKVN46qmn+OCDD6hduzYXLlzgjz/+4J133qF27dpl/2EKYaQkUQtRhQ0fPpyUlBTefPNNEhMTadSoEStWrMDPz6/Y8iNHjkSr1fL444+zdu1aOnfuzMqVK/nggw/49NNPMTU1pWHDhrz66qsljsHMzIyxY8dy/vx5LC0tadu2LYsXLy62rJ2dHdu2bWP69Omkpqbi7e3N559/TteuXQF49dVXsbKyYsqUKbz99ttYW1vTtGlTRo4cCYCVlRXbtm1j9OjRPP3006SlpVGrVi06dOggd9ii2lIpiqIYOgghhBBCFE8mPBFCCCGMmCRqIYQQwohJohZCCCGMmCRqIYQQwohJohZCCCGMmCRqIYQQwohJoi5nM2fOxMfHBwsLC0JDQ9mzZ0+lXn/btm10794dDw8PVCoVy5cvL/K6oiiMHz8ed3d3LC0tiYiI4PTp00XKJCUl0a9fP+zs7HBwcOCVV14hPT29SJkjR47Qtm1bLCws8PT05LPPPrsjlt9++42GDRtiYWFB06ZNWb16dYnrMXnyZFq2bImtrS0uLi706NGjyBrMoJvfOTIykho1amBjY0OvXr1ISEgoUiY2NpZu3bphZWWFi4sLb7/9dpElHAG2bNlCixYtMDc3p169esyfP/+OeMr6vs6aNYtmzZphZ2eHnZ0dYWFhrFmzpkrVoTiffPIJKpVKP765KtVl4sSJqFSqIlvDhg2rXD0ALl26xAsvvECNGjWwtLSkadOm7Nu3T/96Vfl99/HxueM9UalUREZGAlXrPakQhl0TpHpZvHixYmZmpvzwww/K8ePHlUGDBikODg5KQkJCpcWwevVqZdy4ccoff/yhAMqyZcuKvP7JJ58o9vb2yvLly5XDhw8rTz75pFKnTh3l5s2b+jJdunRRAgIClF27din//POPUq9ePaVv377611NSUhRXV1elX79+yrFjx5RffvlFsbS0VL799lt9me3btysajUb57LPPlBMnTijvvvuuYmpqqhw9erRE9ejcubMyb9485dixY8qhQ4eUxx9/XPHy8lLS09P1ZV5//XXF09NT2bRpk7Jv3z6lVatWSuvWrfWv5+XlKU2aNFEiIiKUgwcPKqtXr1acnZ2VsWPH6svExMQoVlZWyqhRo5QTJ04oX3/9taLRaJS1a9fqyzzI+7pixQpl1apVyqlTp5To6Gjl//7v/xRTU1Pl2LFjVaYO/7Vnzx7Fx8dHadasmTJixAj9/qpSlwkTJiiNGzdWrly5ot+uXr1a5eqRlJSkeHt7KwMGDFB2796txMTEKOvWrVPOnDmjL1NVft8TExOLvB8bNmxQAGXz5s1V6j2pKJKoy1FISIgSGRmp/z4/P1/x8PBQJk+ebJB4/puotVqt4ubmpkyZMkW/Lzk5WTE3N1d++eUXRVEU5cSJEwqg7N27V19mzZo1ikqlUi5duqQoiqJ88803iqOjo379X0VRlNGjRysNGjTQf9+7d2+lW7duReIJDQ1V/ve//5WpLomJiQqgbN26VR+3qamp8ttvv+nLREVFKYCyc+dORVF0H1rUarUSHx+vLzNr1izFzs5OH/s777yjNG7cuMi1+vTpo3Tu3Fn/fXm/r46Ojsp3331XJeuQlpam+Pn5KRs2bFAeeeQRfaKuSnWZMGGCEhAQUOxrVakeo0ePVtq0aXPX16vy7/uIESOUunXrKlqttkq9JxVFmr7LSU5ODvv37yciIkK/T61WExERwc6dOw0YWaFz584RHx9fJEZ7e3tCQ0P1Me7cuRMHBweCg4P1ZSIiIlCr1ezevVtfpl27dpiZmenLdO7
      "text/plain": [
       "<Figure size 500x300 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from previous_chapters import plot_losses\n",
    "\n",
    "\n",
    "epochs_tensor = torch.linspace(0, num_epochs, len(tracking[\"train_losses\"]))\n",
    "plot_losses(\n",
    "    epochs_seen=epochs_tensor,\n",
    "    tokens_seen=tracking[\"tokens_seen\"],\n",
    "    train_losses=tracking[\"train_losses\"],\n",
    "    val_losses=tracking[\"val_losses\"],\n",
    "    label=\"loss\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f8bc233-895f-46d5-8e01-202b991cd60c",
   "metadata": {
    "id": "7f8bc233-895f-46d5-8e01-202b991cd60c"
   },
   "source": [
    "- As we can see above, the loss continues to improve, which is a good sign\n",
    "- Based on the downward slope, one might be tempted to train the model a bit further (and readers are encouraged to try this), but not that DPO is prone to collapse, where the model may start generating nonsensical responses\n",
    "- Next, let's take a look at the reward margins:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "dmbq6ruuf0Cl",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 307
    },
    "id": "dmbq6ruuf0Cl",
    "outputId": "c2886c16-57da-41bd-c9f0-e936da9d9e4d"
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeoAAAEiCAYAAAA21pHjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAABn+ElEQVR4nO3deVhUZfvA8e8My7CDoGyyKyIq4o6IlaaJS+bSYmalWfZaLpkt5luZ1q+stLKyslV7K9PMNFNTcU/FXRQ3XFhVFhXZ95nz+2NkkEQFBAbw/lzXXHLOec4592Fw7nnOeRaVoigKQgghhKiX1MYOQAghhBA3JolaCCGEqMckUQshhBD1mCRqIYQQoh6TRC2EEELUY5KohRBCiHpMErUQQghRj0miFkIIIeoxSdRCCCFEPSaJWohGJD4+HpVKRVRUlLFDEULUEEnUQtQzKpXqpq+ZM2caO0QhRB0yNXYAQojykpOTDT8vXbqUGTNmEBMTY1hnY2NjjLCEEEYiNWoh6hlXV1fDy97eHpVKZVh2dnbm448/xsPDA41GQ4cOHVi3bt0Nj6XVahk7diytW7cmMTERgD///JNOnTphYWGBn58fs2bNoqSkxLCPSqXiu+++Y9iwYVhZWeHv78+qVasM269cucKoUaNo1qwZlpaW+Pv7s3DhwhvG8PvvvxMUFISlpSVOTk707duX3Nxcw/bvvvuOwMBALCwsaN26NV9++WW5/ZOSknjkkUdwcHDA0dGRIUOGEB8fb9g+ZswYhg4dyty5c3Fzc8PJyYkJEyZQXFxc6d+5EPWaIoSotxYuXKjY29sblj/++GPFzs5O+fXXX5WTJ08qr776qmJmZqacOnVKURRFiYuLUwDl0KFDSkFBgTJs2DClY8eOSlpamqIoirJ9+3bFzs5OWbRokXL27Fllw4YNio+PjzJz5kzDOQDFw8NDWbx4sXL69Gll8uTJio2NjXL58mVFURRlwoQJSocOHZR9+/YpcXFxSkREhLJq1aoK479w4YJiamqqfPzxx0pcXJxy5MgR5YsvvlCys7MVRVGUn3/+WXFzc1OWL1+uxMbGKsuXL1ccHR2VRYsWKYqiKEVFRUpgYKAyduxY5ciRI8rx48eVxx57TAkICFAKCwsVRVGU0aNHK3Z2dsr48eOVEydOKH/99ZdiZWWlfPPNNzX7ZghhJJKohajH/p2o3d3dlXfffbdcma5duyrPP/+8oihlifqff/5R+vTpo/Ts2VPJyMgwlO3Tp4/y3nvvldv/p59+Utzc3AzLgPLGG28YlnNychRA+fvvvxVFUZTBgwcrTz31VKXiP3DggAIo8fHxFW5v0aKFsnjx4nLr3nnnHSU0NNQQW0BAgKLT6QzbCwsLFUtLS2X9+vWKougTtbe3t1JSUmIo8/DDDysjRoyoVIxC1HfyjFqIBiIrK4sLFy4QFhZWbn1YWBiHDx8ut27kyJF4eHiwefNmLC0tDesPHz7Mzp07effddw3rtFotBQUF5OXlYWVlBUD79u0N262trbGzsyMtLQ2A5557jgcffJCDBw/Sr18/hg4dSo8ePSqMOTg4mD59+hAUFER4eDj9+vXjoYceokmTJuTm5nL27Fmefvppxo0bZ9inpKQEe3t7Q7xnzpzB1ta23HELCgo4e/asYblt27aYmJgYlt3c3IiOjr7Jb1OIhkMStRCN0MCBA/n555+JjIzk3nvvNazPyclh1qxZDB8+/Lp9LCwsDD+bmZmV26ZSqdDpdAAMGDCAhIQE1q5dS0REBH369GHChAnMnTv3umOamJgQERHBrl272LBhA59//jmvv/46e/bsMXwp+PbbbwkJCbluv9J4O3fuzC+//HLdsZs1a1apeIVo6CRRC9FA2NnZ4e7uzs6dO7nnnnsM63fu3Em3bt3KlX3uuedo164dDzzwAGvWrDGU79SpEzExMbRs2fK2YmnWrBmjR49m9OjR3HXXXbzyyisVJmrQJ82wsDDCwsKYMWMG3t7erFixgqlTp+Lu7k5sbCyjRo2qcN9OnTqxdOlSnJ2dsbOzu62YhWioJFEL0YC88sorvPXWW7Ro0YIOHTqwcOFCoqKiKqxxTpo0Ca1Wy/3338/ff/9Nz549mTFjBvfffz9eXl489NBDqNVqDh8+zNGjR/m///u/SsUwY8YMOnfuTNu2bSksLGT16tUEBgZWWHbPnj1s2rSJfv364ezszJ49e7h48aKh/KxZs5g8eTL29vb079+fwsJC9u/fz5UrV5g6dSqjRo1izpw5DBkyhLfffhsPDw8SEhL4448/ePXVV/Hw8Kj+L1OIBkIStRANyOTJk8nMzOSll14iLS2NNm3asGrVKvz9/SssP2XKFHQ6HQMHDmTdunWEh4ezevVq3n77bT744APMzMxo3bo1zzzzTKVjMDc3Z/r06cTHx2Npacldd93FkiVLKixrZ2fH9u3bmTdvHllZWXh7e/PRRx8xYMAAAJ555hmsrKyYM2cOr7zyCtbW1gQFBTFlyhQArKys2L59O9OmTWP48OFkZ2fTvHlz+vTpIzVsccdQKYqiGDsIIYQQQlRMBjwRQggh6jFJ1EIIIUQ9JolaCCGEqMckUQshhBD1mCRqIYQQoh6TRC2EEELUY5Koq+iLL77Ax8cHCwsLQkJC2Lt3b52ef/v27QwePBh3d3dUKhUrV64st11RFGbMmIGbmxuWlpb07duX06dPlyuTnp7OqFGjsLOzw8HBgaeffpqcnJxyZY4cOcJdd92FhYUFnp6efPjhh9fFsmzZMlq3bo2FhQVBQUGsXbu2Stcye/Zsunbtiq2tLc7OzgwdOrTcvMugH9N5woQJODk5YWNjw4MPPkhqamq5MomJiQwaNAgrKyucnZ155ZVXyk3bCLB161Y6deqERqOhZcuWLFq06Lp4bve9/eqrr2jfvj12dnbY2dkRGhrK33//3SCv5d/ef/99VCqVoX9zQ7uemTNnolKpyr1at27dIK8F4Pz58zz++OM4OTlhaWlJUFAQ+/fvN2xvSJ8DPj4+1703KpWKCRMmAA3vvakVxp0TpGFZsmSJYm5urvzwww/KsWPHlHHjxikODg5KampqncWwdu1a5fXXX1f++OMPBVBWrFhRbvv777+v2NvbKytXrlQOHz6sPPDAA4qvr6+Sn59vKNO/f38lODhY2b17t/LPP/8oLVu2VEaOHGnYnpmZqbi4uCijRo1Sjh49qvz666+KpaWl8vXXXxvK7Ny5UzExMVE+/PBD5fjx48obb7yhmJmZKdHR0ZW+lvDwcGXhwoXK0aNHlaioKGXgwIGKl5eXkpOTYygzfvx4xdPTU9m0aZOyf/9+pXv37kqPHj0M20tKSpR27dopffv2VQ4dOqSsXbtWadq0qTJ9+nRDmdjYWMXKykqZOnWqcvz4ceXzzz9XTExMlHXr1hnK1MR7u2rVKmXNmjXKqVOnlJiYGOW///2vYmZmphw9erTBXcu19u7dq/j4+Cjt27dXXnjhBcP6hnQ9b731ltK2bVslOTnZ8Lp48WKDvJb09HTF29tbGTNmjLJnzx4lNjZWWb9+vXLmzBlDmYb0OZCWllbufYmIiFAAZcuWLYqiNKz3prZIoq6Cbt26KRMmTDAsa7Vaxd3dXZk9e7ZR4vl3otbpdIqrq6syZ84cw7qMjAxFo9Eov/76q6IoinL8+HEFUPbt22co8/fffysqlUo5f/68oiiK8uWXXypNmjQxzPerKIoybdo0JSAgwLD8yCOPKIMGDSoXT0hIiPKf//yn2teTlpamAMq2bdsMsZuZmSnLli0zlDlx4oQCKJGRkYqi6L+4qNVqJSUlxVDmq6++Uuzs7Azxv/rqq0rbtm3LnWvEiBFKeHi4Ybm23tsmTZoo3333XYO9luzsbMXf31+JiIhQ7rnnHkOibmjX89ZbbynBwcEVbmto1zJt2jSlZ8+eN9ze0D8HXnjhBaVFixaKTqdrcO9NbZFb35VUVFTEgQMH6Nu3r2GdWq2mb9++REZGGjGyMnFxcaSkpJSL0d7enpCQEEOMkZGRODg40KVLF0OZvn37olar2bNnj6HM3Xffjbm5uaFMeHg
      "text/plain": [
       "<Figure size 500x300 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "train_reward_margins = [i-j for i,j in zip(tracking[\"train_chosen_rewards\"], tracking[\"train_rejected_rewards\"])]\n",
    "val_reward_margins = [i-j for i,j in zip(tracking[\"val_chosen_rewards\"], tracking[\"val_rejected_rewards\"])]\n",
    "\n",
    "plot_losses(\n",
    "    epochs_seen=epochs_tensor,\n",
    "    tokens_seen=tracking[\"tokens_seen\"],\n",
    "    train_losses=train_reward_margins,\n",
    "    val_losses=val_reward_margins,\n",
    "    label=\"loss\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69756011-acd6-404c-a5fc-7fe252cf20c8",
   "metadata": {
    "id": "69756011-acd6-404c-a5fc-7fe252cf20c8"
   },
   "source": [
    "- As we can see, and as it's desired, the reward margins improve; this mirrors the loss curve and is a good sign\n",
    "- Note that DPO losses and reward margins are valuable metrics to track during training; however, they don't tell the whole store\n",
    "- Lastly, and most importantly, we have to conduct a qualitative check of the responses\n",
    "- Here, we will look at the response (in addition, you could use an LLM to score the responses similar to chapter 7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "5EfUXJGOali8",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5EfUXJGOali8",
    "outputId": "7ec7db47-d775-4646-f660-0d7f7e7c8503"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Convert the active sentence to passive: 'The chef cooks the meal every day.'\n",
      "\n",
      "Correct response:\n",
      ">> The meal is cooked by the chef every day.\n",
      "\n",
      "Reference model response:\n",
      ">> The meal is cooked every day by the chef.\n",
      "\n",
      "Policy model response:\n",
      ">> The meal is prepared by the chef.\n",
      "\n",
      "-------------------------------------\n",
      "\n",
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Classify an input string as either a noun or a verb.\n",
      "\n",
      "### Input:\n",
      "Dance\n",
      "\n",
      "Correct response:\n",
      ">> 'Dance' can be classified as a verb.\n",
      "\n",
      "Reference model response:\n",
      ">> \"Dance\" can be classified as a verb.\n",
      "\n",
      "Policy model response:\n",
      ">> The input string \"Dance\" could be classified as a verb.\n",
      "\n",
      "-------------------------------------\n",
      "\n",
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Rewrite the sentence using a metaphor.\n",
      "\n",
      "### Input:\n",
      "The book is very interesting.\n",
      "\n",
      "Correct response:\n",
      ">> The book is a page-turner.\n",
      "\n",
      "Reference model response:\n",
      ">> The book is a treat.\n",
      "\n",
      "Policy model response:\n",
      ">> The book is a treat.\n",
      "\n",
      "-------------------------------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "\n",
    "for entry in val_data[:3]:\n",
    "\n",
    "    input_text = format_input(entry)\n",
    "\n",
    "    token_ids = generate(\n",
    "        model=reference_model,\n",
    "        idx=text_to_token_ids(input_text, tokenizer).to(device),\n",
    "        max_new_tokens=256,\n",
    "        context_size=BASE_CONFIG[\"context_length\"],\n",
    "        eos_id=50256\n",
    "    )\n",
    "    generated_text = token_ids_to_text(token_ids, tokenizer)\n",
    "    reference_response_text = (\n",
    "        generated_text[len(input_text):]\n",
    "        .replace(\"### Response:\", \"\")\n",
    "        .strip()\n",
    "    )\n",
    "\n",
    "    token_ids = generate(\n",
    "        model=policy_model,\n",
    "        idx=text_to_token_ids(input_text, tokenizer).to(device),\n",
    "        max_new_tokens=256,\n",
    "        context_size=BASE_CONFIG[\"context_length\"],\n",
    "        eos_id=50256\n",
    "    )\n",
    "    generated_text = token_ids_to_text(token_ids, tokenizer)\n",
    "    policy_response_text = (\n",
    "        generated_text[len(input_text):]\n",
    "        .replace(\"### Response:\", \"\")\n",
    "        .strip()\n",
    "    )\n",
    "\n",
    "    print(input_text)\n",
    "    print(f\"\\nCorrect response:\\n>> {entry['output']}\")\n",
    "    print(f\"\\nReference model response:\\n>> {reference_response_text.strip()}\")\n",
    "    print(f\"\\nPolicy model response:\\n>> {policy_response_text.strip()}\")\n",
    "    print(\"\\n-------------------------------------\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "RmcKVg0JlHVF",
   "metadata": {
    "id": "RmcKVg0JlHVF"
   },
   "source": [
    "- As we can see based on the reference model and policy model responses above, the optimized model (i.e., the policy model) indeed slightly changed its style compared to the original model (i.e., reference model)\n",
    "- For instance, `\"Dance\" can be classified as a verb.` changed to `The input string \"Dance\" could be classified as a verb.` which is a slightly more polite response (the use of \"could\" instead of \"can\" makes the statement sound less assertive and more tentative)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "jJSwb2hzQwdP",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "jJSwb2hzQwdP",
    "outputId": "6e755db4-9524-42a8-a58b-2218bf03e39a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Rewrite the sentence using a simile.\n",
      "\n",
      "### Input:\n",
      "The car is very fast.\n",
      "\n",
      "Correct response:\n",
      ">> The car is as fast as lightning.\n",
      "\n",
      "Reference model response:\n",
      ">> The car is as fast as a cheetah.\n",
      "\n",
      "Policy model response:\n",
      ">> The car is as fast as a cheetah.\n",
      "\n",
      "-------------------------------------\n",
      "\n",
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "What type of cloud is typically associated with thunderstorms?\n",
      "\n",
      "Correct response:\n",
      ">> The type of cloud typically associated with thunderstorms is cumulonimbus.\n",
      "\n",
      "Reference model response:\n",
      ">> A thunderstorm is a type of storm that typically produces thunder or lightning.\n",
      "\n",
      "Policy model response:\n",
      ">> The type of cloud typically associated with thunderstorms is a cumulus.\n",
      "\n",
      "-------------------------------------\n",
      "\n",
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
      "\n",
      "### Instruction:\n",
      "Name the author of 'Pride and Prejudice'.\n",
      "\n",
      "Correct response:\n",
      ">> Jane Austen.\n",
      "\n",
      "Reference model response:\n",
      ">> The author of 'Pride and Prejudice' is Jane Austen.\n",
      "\n",
      "Policy model response:\n",
      ">> The author of 'Pride and Prejudice' is Jane Austen.\n",
      "\n",
      "-------------------------------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "\n",
    "for entry in test_data[:3]:\n",
    "\n",
    "    input_text = format_input(entry)\n",
    "\n",
    "    token_ids = generate(\n",
    "        model=reference_model,\n",
    "        idx=text_to_token_ids(input_text, tokenizer).to(device),\n",
    "        max_new_tokens=256,\n",
    "        context_size=BASE_CONFIG[\"context_length\"],\n",
    "        eos_id=50256\n",
    "    )\n",
    "    generated_text = token_ids_to_text(token_ids, tokenizer)\n",
    "    reference_response_text = (\n",
    "        generated_text[len(input_text):]\n",
    "        .replace(\"### Response:\", \"\")\n",
    "        .strip()\n",
    "    )\n",
    "\n",
    "    token_ids = generate(\n",
    "        model=policy_model,\n",
    "        idx=text_to_token_ids(input_text, tokenizer).to(device),\n",
    "        max_new_tokens=256,\n",
    "        context_size=BASE_CONFIG[\"context_length\"],\n",
    "        eos_id=50256\n",
    "    )\n",
    "    generated_text = token_ids_to_text(token_ids, tokenizer)\n",
    "    policy_response_text = (\n",
    "        generated_text[len(input_text):]\n",
    "        .replace(\"### Response:\", \"\")\n",
    "        .strip()\n",
    "    )\n",
    "\n",
    "    print(input_text)\n",
    "    print(f\"\\nCorrect response:\\n>> {entry['output']}\")\n",
    "    print(f\"\\nReference model response:\\n>> {reference_response_text.strip()}\")\n",
    "    print(f\"\\nPolicy model response:\\n>> {policy_response_text.strip()}\")\n",
    "    print(\"\\n-------------------------------------\\n\")"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "A100",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}