LLMs-from-scratch/ch04/01_main-chapter-code/ch04.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ce9295b2-182b-490b-8325-83a67c4a001d",
   "metadata": {},
   "source": [
    "# Chapter 4: Implementing a GPT model from Scratch To Generate Text "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02",
   "metadata": {},
   "source": [
    "- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d4f11e0-4434-4979-9dee-e1207df0eb01",
   "metadata": {},
   "source": [
    "<img src=\"figures/mental-model.webp\" width=450px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53fe99ab-0bcf-4778-a6b5-6db81fb826ef",
   "metadata": {},
   "source": [
    "## 4.1 Coding an LLM architecture"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b",
   "metadata": {},
   "source": [
    "- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture\n",
    "- Therefore, these LLMs are often referred to as \"decoder-like\" LLMs\n",
    "- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code\n",
    "- We'll see that many elements are repeated in an LLM's architecture"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c5213e9-bd1c-437e-aee8-f5e8fb717251",
   "metadata": {},
   "source": [
    "<img src=\"figures/mental-model-2.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99",
   "metadata": {},
   "source": [
    "- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page\n",
    "- In this chapter, we consider embedding and model sizes akin to a small GPT-2 model\n",
    "- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)\n",
    "- Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21baa14d-24b8-4820-8191-a2808f7fbabc",
   "metadata": {},
   "source": [
    "- Configuration details for the 124 million parameter GPT-2 model include:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5ed66875-1f24-445d-add6-006aae3c5707",
   "metadata": {},
   "outputs": [],
   "source": [
    "GPT_CONFIG_124M = {\n",
    "    \"vocab_size\": 50257,  # Vocabulary size\n",
    "    \"ctx_len\": 1024,      # Context length\n",
    "    \"emb_dim\": 768,       # Embedding dimension\n",
    "    \"n_heads\": 12,        # Number of attention heads\n",
    "    \"n_layers\": 12,       # Number of layers\n",
    "    \"drop_rate\": 0.1,     # Dropout rate\n",
    "    \"qkv_bias\": False     # Query-Key-Value bias\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4",
   "metadata": {},
   "source": [
    "- We use short variable names to avoid long lines of code later\n",
    "- `\"vocab_size\"` indicates a vocabulary size of 50,257 words, supported by the BPE tokenizer discussed in Chapter 2\n",
    "- `\"ctx_len\"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2\n",
    "- `\"emb_dim\"` is the embedding size for token inputs, converting each input token into a 768-dimensional vector\n",
    "- `\"n_heads\"` is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3\n",
    "- `\"n_layers\"` is the number of transformer blocks within the model, which we'll implement in upcoming sections\n",
    "- `\"drop_rate\"` is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting\n",
    "- `\"qkv_bias\"` decides if the `Linear` layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs; however, we'll revisit this later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 6"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4adce779-857b-4418-9501-12a7f3818d88",
   "metadata": {},
   "source": [
    "<img src=\"figures/chapter-steps.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "619c2eed-f8ea-4ff5-92c3-feda0f29b227",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "\n",
    "class DummyGPTModel(nn.Module):\n",
    "    def __init__(self, cfg):\n",
    "        super().__init__()\n",
    "        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
    "        self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
    "        self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
    "        \n",
    "        # Use a placeholder for TransformerBlock\n",
    "        self.trf_blocks = nn.Sequential(\n",
    "            *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
    "        \n",
    "        # Use a placeholder for LayerNorm\n",
    "        self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n",
    "        self.out_head = nn.Linear(\n",
    "            cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
    "        )\n",
    "\n",
    "    def forward(self, in_idx):\n",
    "        batch_size, seq_len = in_idx.shape\n",
    "        tok_embeds = self.tok_emb(in_idx)\n",
    "        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
    "        x = tok_embeds + pos_embeds\n",
    "        x = self.drop_emb(x)\n",
    "        x = self.trf_blocks(x)\n",
    "        x = self.final_norm(x)\n",
    "        logits = self.out_head(x)\n",
    "        return logits\n",
    "\n",
    "\n",
    "class DummyTransformerBlock(nn.Module):\n",
    "    def __init__(self, cfg):\n",
    "        super().__init__()\n",
    "        # A simple placeholder\n",
    "\n",
    "    def forward(self, x):\n",
    "        # This block does nothing and just returns its input.\n",
    "        return x\n",
    "\n",
    "\n",
    "class DummyLayerNorm(nn.Module):\n",
    "    def __init__(self, normalized_shape, eps=1e-5):\n",
    "        super().__init__()\n",
    "        # The parameters here are just to mimic the LayerNorm interface.\n",
    "\n",
    "    def forward(self, x):\n",
    "        # This layer does nothing and just returns its input.\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9665e8ab-20ca-4100-b9b9-50d9bdee33be",
   "metadata": {},
   "source": [
    "<img src=\"figures/gpt-in-out.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "794b6b6c-d36f-411e-a7db-8ac566a87fee",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[6109, 3626, 6100,  345],\n",
      "        [6109, 1110, 6622,  257]])\n"
     ]
    }
   ],
   "source": [
    "import tiktoken\n",
    "\n",
    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
    "\n",
    "batch = []\n",
    "\n",
    "txt1 = \"Every effort moves you\"\n",
    "txt2 = \"Every day holds a\"\n",
    "\n",
    "batch.append(torch.tensor(tokenizer.encode(txt1)))\n",
    "batch.append(torch.tensor(tokenizer.encode(txt2)))\n",
    "batch = torch.stack(batch, dim=0)\n",
    "print(batch)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "009238cd-0160-4834-979c-309710986bb0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output shape: torch.Size([2, 4, 50257])\n",
      "tensor([[[-1.2034,  0.3201, -0.7130,  ..., -1.5548, -0.2390, -0.4667],\n",
      "         [-0.1192,  0.4539, -0.4432,  ...,  0.2392,  1.3469,  1.2430],\n",
      "         [ 0.5307,  1.6720, -0.4695,  ...,  1.1966,  0.0111,  0.5835],\n",
      "         [ 0.0139,  1.6755, -0.3388,  ...,  1.1586, -0.0435, -1.0400]],\n",
      "\n",
      "        [[-1.0908,  0.1798, -0.9484,  ..., -1.6047,  0.2439, -0.4530],\n",
      "         [-0.7860,  0.5581, -0.0610,  ...,  0.4835, -0.0077,  1.6621],\n",
      "         [ 0.3567,  1.2698, -0.6398,  ..., -0.0162, -0.1296,  0.3717],\n",
      "         [-0.2407, -0.7349, -0.5102,  ...,  2.0057, -0.3694,  0.1814]]],\n",
      "       grad_fn=<UnsafeViewBackward0>)\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "model = DummyGPTModel(GPT_CONFIG_124M)\n",
    "\n",
    "logits = model(batch)\n",
    "print(\"Output shape:\", logits.shape)\n",
    "print(logits)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8332a00-98da-4eb4-b882-922776a89917",
   "metadata": {},
   "source": [
    "## 4.2 Normalizing activations with layer normalization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "066cfb81-d59b-4d95-afe3-e43cf095f292",
   "metadata": {},
   "source": [
    "- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1\n",
    "- This stabilizes training and enables faster convergence to effective weights\n",
    "- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "314ac47a-69cc-4597-beeb-65bed3b5910f",
   "metadata": {},
   "source": [
    "<img src=\"figures/layernorm.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ab49940-6b35-4397-a80e-df8d092770a7",
   "metadata": {},
   "source": [
    "- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "79e1b463-dc3f-44ac-9cdb-9d5b6f64eb9d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],\n",
      "        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],\n",
      "       grad_fn=<ReluBackward0>)\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "# create 2 training examples with 5 dimensions (features) each\n",
    "batch_example = torch.randn(2, 5) \n",
    "\n",
    "layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n",
    "out = layer(batch_example)\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e",
   "metadata": {},
   "source": [
    "- Let's compute the mean and variance for each of the 2 inputs above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "9888f79e-8e69-44aa-8a19-cd34292adbf5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean:\n",
      " tensor([[0.1324],\n",
      "        [0.2170]], grad_fn=<MeanBackward1>)\n",
      "Variance:\n",
      " tensor([[0.0231],\n",
      "        [0.0398]], grad_fn=<VarBackward0>)\n"
     ]
    }
   ],
   "source": [
    "mean = out.mean(dim=-1, keepdim=True)\n",
    "var = out.var(dim=-1, keepdim=True)\n",
    "\n",
    "print(\"Mean:\\n\", mean)\n",
    "print(\"Variance:\\n\", var)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "052eda3e-b395-48c4-acd4-eb8083bab958",
   "metadata": {},
   "source": [
    "- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "570db83a-205c-4f6f-b219-1f6195dde1a7",
   "metadata": {},
   "source": [
    "<img src=\"figures/layernorm2.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99",
   "metadata": {},
   "source": [
    "- Subtracting the mean and dividing by the square-root of the variance (standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "9a1d1bb9-3341-4c9a-bc2a-d2489bf89cda",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Normalized layer outputs:\n",
      " tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],\n",
      "        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],\n",
      "       grad_fn=<DivBackward0>)\n",
      "Mean:\n",
      " tensor([[    0.0000],\n",
      "        [    0.0000]], grad_fn=<MeanBackward1>)\n",
      "Variance:\n",
      " tensor([[1.],\n",
      "        [1.]], grad_fn=<VarBackward0>)\n"
     ]
    }
   ],
   "source": [
    "out_norm = (out - mean) / torch.sqrt(var)\n",
    "print(\"Normalized layer outputs:\\n\", out_norm)\n",
    "\n",
    "mean = out_norm.mean(dim=-1, keepdim=True)\n",
    "var = out_norm.var(dim=-1, keepdim=True)\n",
    "print(\"Mean:\\n\", mean)\n",
    "print(\"Variance:\\n\", var)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac62b90c-7156-4979-9a79-ce1fb92969c1",
   "metadata": {},
   "source": [
    "- Each input is centered at 0 and has a unit variance of 1; to improve readability, we can disable PyTorch's scientific notation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "3e06c34b-c68a-4b36-afbe-b30eda4eca39",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean:\n",
      " tensor([[    0.0000],\n",
      "        [    0.0000]], grad_fn=<MeanBackward1>)\n",
      "Variance:\n",
      " tensor([[1.],\n",
      "        [1.]], grad_fn=<VarBackward0>)\n"
     ]
    }
   ],
   "source": [
    "torch.set_printoptions(sci_mode=False)\n",
    "print(\"Mean:\\n\", mean)\n",
    "print(\"Variance:\\n\", var)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "944fb958-d4ed-43cc-858d-00052bb6b31a",
   "metadata": {},
   "source": [
    "- Above, we normalized the features of each input\n",
    "- Now, using the same idea, we can implement a `LayerNorm` class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "3333a305-aa3d-460a-bcce-b80662d464d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "class LayerNorm(nn.Module):\n",
    "    def __init__(self, emb_dim):\n",
    "        super().__init__()\n",
    "        self.eps = 1e-5\n",
    "        self.scale = nn.Parameter(torch.ones(emb_dim))\n",
    "        self.shift = nn.Parameter(torch.zeros(emb_dim))\n",
    "\n",
    "    def forward(self, x):\n",
    "        mean = x.mean(dim=-1, keepdim=True)\n",
    "        var = x.var(dim=-1, keepdim=True, unbiased=False)\n",
    "        norm_x = (x - mean) / torch.sqrt(var + self.eps)\n",
    "        return self.scale * norm_x + self.shift"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e56c3908-7544-4808-b8cb-5d0a55bcca72",
   "metadata": {},
   "source": [
    "**Scale and shift**\n",
    "\n",
    "- Note that in addition to performing the normalization by subtracting the mean and dividing by the variance, we added two trainable parameters, a `scale` and a `shift` parameter\n",
    "- The initial `scale` (multiplying by 1) and `shift` (adding 0) values don't have any effect; however, `scale` and `shift` are trainable parameters that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task\n",
    "- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing\n",
    "- Note that we also add a smaller value (`eps`) before computing the square root of the variance; this is to avoid division-by-zero errors if the variance is 0\n",
    "\n",
    "**Biased variance**\n",
    "- In the variance calculation above, setting `unbiased=False` means using the formula $\\frac{\\sum_i (x_i - \\bar{x})^2}{n}$ to compute the variance where n is the sample size (here, the number of features or columns); this formula does not include Bessel's correction (which uses `n-1` in the denominator), thus providing a biased estimate of the  variance \n",
    "- For LLMs, where the embedding dimension `n` is very large, the difference between using n and `n-1`\n",
    " is negligible\n",
    "- However, GPT-2 was trained with a biased variance in the normalization layers, which is why we also adopted this setting for compatibility reasons with the pretrained weights that we will load in later chapters\n",
    "\n",
    "- Let's now try out `LayerNorm` in practice:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "23b1000a-e613-4b43-bd90-e54deed8d292",
   "metadata": {},
   "outputs": [],
   "source": [
    "ln = LayerNorm(emb_dim=5)\n",
    "out_ln = ln(batch_example)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "94c12de2-1cab-46e0-a099-e2e470353bff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean:\n",
      " tensor([[    -0.0000],\n",
      "        [     0.0000]], grad_fn=<MeanBackward1>)\n",
      "Variance:\n",
      " tensor([[1.0000],\n",
      "        [1.0000]], grad_fn=<VarBackward0>)\n"
     ]
    }
   ],
   "source": [
    "mean = out_ln.mean(dim=-1, keepdim=True)\n",
    "var = out_ln.var(dim=-1, unbiased=False, keepdim=True)\n",
    "\n",
    "print(\"Mean:\\n\", mean)\n",
    "print(\"Variance:\\n\", var)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e136cfc4-7c89-492e-b120-758c272bca8c",
   "metadata": {},
   "source": [
    "<img src=\"figures/overview-after-ln.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11190e7d-8c29-4115-824a-e03702f9dd54",
   "metadata": {},
   "source": [
    "## 4.3 Implementing a feed forward network with GELU activations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169",
   "metadata": {},
   "source": [
    "- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs\n",
    "- We start with the activation function\n",
    "- In deep learning, ReLU (Rectified Linear Unit) activation functions are commonly used due to their simplicity and effectiveness in various neural network architectures\n",
    "- In LLMs, various other types of activation functions are used beyond the traditional ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Sigmoid-Weighted Linear Unit)\n",
    "- GELU and SwiGLU are more complex, smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively, offering better performance for deep learning models, unlike the simpler, piecewise linear function of ReLU"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc",
   "metadata": {},
   "source": [
    "- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as GELU(x)=x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.\n",
    "- In practice, it's common to implement a computationally cheaper approximation: $\\text{GELU}(x) \\approx 0.5 \\cdot x \\cdot \\left(1 + \\tanh\\left[\\sqrt{\\frac{2}{\\pi}} \\cdot \\left(x + 0.044715 \\cdot x^3\\right)\\right]\\right)\n",
    "$ (the original GPT-2 model was also trained with this approximation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "f84694b7-95f3-4323-b6d6-0a73df278e82",
   "metadata": {},
   "outputs": [],
   "source": [
    "class GELU(nn.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "\n",
    "    def forward(self, x):\n",
    "        return 0.5 * x * (1 + torch.tanh(\n",
    "            torch.sqrt(torch.tensor(2.0 / torch.pi)) * \n",
    "            (x + 0.044715 * torch.pow(x, 3))\n",
    "        ))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "fc5487d2-2576-4118-80a7-56c4caac2e71",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAxYAAAEiCAYAAABkykQ1AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAABn2klEQVR4nO3deVhUZfsH8O8My7AJiiAoICIqigsqpKG5lYpbRSnZoqKmqWHlkiX+SjPfpDK33K2UJM19KTMTTVJzB1HRJBcQFzZllWUYZs7vD2QSAWXYzpnh+7muud53zpzlvmdyHu55zvM8MkEQBBAREREREVWBXOwAiIiIiIhI/7GwICIiIiKiKmNhQUREREREVcbCgoiIiIiIqoyFBRERERERVRkLCyIiIiIiqjIWFkREREREVGUsLIiIiIiIqMpYWBARERERUZWxsCAqw2effQaZTCbKtUNDQyGTyRAfH1/r1y4sLMRHH30EFxcXyOVy+Pv713oMFSHme0REddvo0aPRrFkzUa4tZtv04MEDjBs3Do6OjpDJZJgyZYoocTyNmO8RsbCok+Li4jB58mS0atUKFhYWsLCwgKenJ4KCgnDhwoUS+xb/Ay3vkZSUBACIj4+HTCbDN998U+51mzVrhiFDhpT52tmzZyGTyRAaGlpteT5Nbm4uPvvsM0RERNTaNR81f/587N69W5Rrl2fdunVYsGABhg0bhh9//BFTp04VNR4pvkdEhqy4aC9+GBsbw8nJCaNHj8adO3cqdc6IiAjIZDJs37693H1kMhkmT55c5mvbt2+HTCar1e/qu3fv4rPPPkN0dHStXbOY2G1TeebPn4/Q0FBMmjQJYWFhGDlypGixSPU9IsBY7ACodu3duxfDhw+HsbEx3nrrLXh5eUEul+PKlSvYuXMnVq1ahbi4OLi6upY4btWqVbCysip1vvr169dS5NUvNzcXc+fOBQD07t27xGuffPIJZs6cWaPXnz9/PoYNG1aqV2DkyJF4/fXXoVAoavT6Zfnzzz/h5OSExYsX1/q1yyLF94ioLvj888/h5uaG/Px8nDx5EqGhoTh27BhiYmJgZmYmdng17u7du5g7dy6aNWuGjh07lnjtu+++g0ajqbFri902lefPP//Es88+izlz5ohy/UdJ9T0iFhZ1yvXr1/H666/D1dUVhw4dQuPGjUu8/tVXX2HlypWQy0t3ZA0bNgx2dna1FarojI2NYWwszj8PIyMjGBkZiXLtlJQUvSgWxXyPiOqCgQMHwsfHBwAwbtw42NnZ4auvvsIvv/yC1157TeToxGViYiLatcVsm1JSUuDp6SnKtXUh5ntEvBWqTvn666+Rk5OD9evXlyoqgKJ/jO+//z5cXFxEiK5i0tLS8OGHH6J9+/awsrKCtbU1Bg4ciPPnz5faNz8/H5999hlatWoFMzMzNG7cGK+++iquX7+O+Ph42NvbAwDmzp2r7fb/7LPPAJS+R7Ndu3bo06dPqWtoNBo4OTlh2LBh2m3ffPMNunXrhoYNG8Lc3Bze3t6lbgGQyWTIycnBjz/+qL326NGjAZQ/fmDlypVo27YtFAoFmjRpgqCgIGRkZJTYp3fv3mjXrh0uX76MPn36wMLCAk5OTvj666+f+L4W38p2+PBhXLp0SRtTRESE9jaGx7uci4959Pa10aNHw8rKCnfu3IG/vz+srKxgb2+PDz/8EGq1utR7t3TpUrRv3x5mZmawt7fHgAEDcPbsWUm+R0R1WY8ePQAU/UD1qCtXrmDYsGGwtbWFmZkZfHx88Msvv4gRIm7evIl3330XHh4eMDc3R8OGDREQEFDmWKyMjAxMnToVzZo1g0KhgLOzM0aNGoV79+4hIiICzzzzDABgzJgx2u+f4u+6R8dYqFQq2NraYsyYMaWukZWVBTMzM3z44YcAgIKCAsyePRve3t6wsbGBpaUlevTogcOHD2uP0bVtAorGxs2bNw/u7u5QKBRo1qwZZs2aBaVSWWK/4tuRjx07hi5dusDMzAzNmzfHhg0bnvi+FrcBcXFx+O2337QxxcfHl/tdXFa7oct3b3W237XxHtF/WFjUIXv37kWLFi3QtWtXnY9NS0vDvXv3Sjwe/4OtNty4cQO7d+/GkCFDsGjRIsyYMQMXL15Er169cPfuXe1+arUaQ4YMwdy5c+Ht7Y2FCxfigw8+QGZmJmJiYmBvb49Vq1YBAF555RWEhYUhLCwMr776apnXHT58OI4cOaIdU1Ls2LFjuHv3Ll5//XXttqVLl6JTp074/PPPMX/+fBgbGyMgIAC//fabdp+wsDAoFAr06NFDe+0JEyaUm/dnn32GoKAgNGnSBAsXLsTQoUOxZs0a9O/fHyqVqsS+6enpGDBgALy8vLBw4UK0bt0aH3/8MX7//fdyz29vb4+wsDC0bt0azs7O2pjatGlT7jHlUavV8PPzQ8OGDfHNN9+gV69eWLhwIdauXVtiv7fffhtTpkyBi4sLvvrqK8ycORNmZmY4efKkJN8jorqs+A/HBg0aaLddunQJzz77LP755x/MnDkTCxcuhKWlJfz9/bFr165aj/HMmTM4fvw4Xn/9dXz77beYOHEiDh06hN69eyM3N1e734MHD9CjRw8sW7YM/fv3x9KlSzFx4kRcuXIFt2/fRps2bfD5558DAN555x3t90/Pnj1LXdPExASvvPIKdu/ejYKCghKv7d69G0qlUts+ZGVl4fvvv0fv3r3x1Vdf4bPPPkNqair8/Py0Yzl0bZuAoh6l2bNno3Pnzli8eDF69eqFkJCQEu1SsWvXrmHYsGHo168fFi5ciAYNGmD06NG4dOlSuedv06YNwsLCYGdnh44dO2pjKv7jXhcV+e6t7va7Nt4jeoRAdUJmZqYAQPD39y/1Wnp6upCamqp95Obmal+bM2eOAKDMh4eHh3a/uLg4AYCwYMGCcmNwdXUVBg8eXOZrZ86cEQAI69evf2Ie+fn5glqtLrEtLi5OUCgUwueff67dtm7dOgGAsGjRolLn0Gg0giAIQmpqqgBAmDNnTql9ivMuFhsbKwAQli1bVmK/d999V7Cysirxnj36/wVBEAoKCoR27doJzz//fIntlpaWQmBgYKlrr1+/XgAgxMXFCYIgCCkpKYKpqanQv3//ErkvX75cACCsW7dOu61Xr14CAGHDhg3abUqlUnB0dBSGDh1a6lqP69Wrl9C2bdsS2w4fPiwAEA4fPlxie/Fn/uhnFhgYKAAo8VkIgiB06tRJ8Pb21j7/888/BQDC+++/XyqG4s9HEKT5HhEZsuJ/WwcPHhRSU1OFW7duCdu3bxfs7e0FhUIh3Lp1S7vvCy+8ILRv317Iz8/XbtNoNEK3bt2Eli1barcVf4ds27at3OsCEIKCgsp8bdu2bWV+Bz3u8e9eQRCEEydOlPr3Pnv2bAGAsHPnzlL7F3//PKlNCgwMFFxdXbXP//jjDwGA8Ouvv5bYb9CgQULz5s21zwsLCwWlUllin/T0dMHBwUEYO3asdpsubVN0dLQAQBg3blyJ/T788EMBgPDnn39qt7m6ugoAhCNHjmi3paSkCAqFQpg+fXqpaz2urDb88e/iYmW1GxX97q3u9rs23yMSBPZY1BFZWVkAUOYA7N69e8Pe3l77WLFiRal9duzYgfDw8BKP9evX13jcj1MoFNoxIGq1Gvfv34eVlRU8PDwQFRVVIl47Ozu89957pc5RmWnoWrVqhY4dO2LLli3abWq1Gtu3b8eLL74Ic3Nz7fZH/396ejoyMzPRo0ePEvHp4uDBgygoKMCUKVNKjH8ZP348rK2tS/SEAEWf8YgRI7TPTU1N0aVLF9y4caNS16+MiRMnlnjeo0ePEtffsWMHZDJZmYMAK/P56ON7RCRlffv2hb29PVxcXDBs2DBYWlril19+gbOzM4CiXuw///wTr732GrKzs7U92ffv34efnx+uXr1a6VmkKuvR716VSoX79++jRYsWqF+/fqn2wcvLC6+88kqpc1Tm++f555+HnZ1difYhPT0d4eHhGD58uHabkZERTE1NARTdCpqWlobCwkL4+PhUun3Yt28fAGDatGkltk+fPh0
      "text/plain": [
       "<Figure size 800x300 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "gelu, relu = GELU(), nn.ReLU()\n",
    "\n",
    "# Some sample data\n",
    "x = torch.linspace(-3, 3, 100)\n",
    "y_gelu, y_relu = gelu(x), relu(x)\n",
    "\n",
    "plt.figure(figsize=(8, 3))\n",
    "for i, (y, label) in enumerate(zip([y_gelu, y_relu], [\"GELU\", \"ReLU\"]), 1):\n",
    "    plt.subplot(1, 2, i)\n",
    "    plt.plot(x, y)\n",
    "    plt.title(f\"{label} activation function\")\n",
    "    plt.xlabel(\"x\")\n",
    "    plt.ylabel(f\"{label}(x)\")\n",
    "    plt.grid(True)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cd01662-14cb-43fd-bffd-2d702813de2d",
   "metadata": {},
   "source": [
    "- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero\n",
    "- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values\n",
    "\n",
    "- Next, let's implement the small neural network module, `FeedForward`, that we will be using in the LLM's transformer block later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "9275c879-b148-4579-a107-86827ca14d4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "class FeedForward(nn.Module):\n",
    "    def __init__(self, cfg):\n",
    "        super().__init__()\n",
    "        self.layers = nn.Sequential(\n",
    "            nn.Linear(cfg[\"emb_dim\"], 4 * cfg[\"emb_dim\"]),\n",
    "            GELU(),\n",
    "            nn.Linear(4 * cfg[\"emb_dim\"], cfg[\"emb_dim\"]),\n",
    "            nn.Dropout(cfg[\"drop_rate\"])\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.layers(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "7c4976e2-0261-418e-b042-c5be98c2ccaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "768\n"
     ]
    }
   ],
   "source": [
    "print(GPT_CONFIG_124M[\"emb_dim\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fdcaacfa-3cfc-4c9e-b668-b71a2753145a",
   "metadata": {},
   "source": [
    "<img src=\"figures/ffn.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "928e7f7c-d0b1-499f-8d07-4cadb428a6f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([2, 3, 768])\n"
     ]
    }
   ],
   "source": [
    "ffn = FeedForward(GPT_CONFIG_124M)\n",
    "\n",
    "# input shape: [batch_size, num_token, emb_size]\n",
    "x = torch.rand(2, 3, 768) \n",
    "out = ffn(x)\n",
    "print(out.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f8756c5-6b04-443b-93d0-e555a316c377",
   "metadata": {},
   "source": [
    "<img src=\"figures/mental-model-3.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ffcb905-53c7-4886-87d2-4464c5fecf89",
   "metadata": {},
   "source": [
    "## 4.4 Adding shortcut connections"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffae416c-821e-4bfa-a741-8af4ba5db00e",
   "metadata": {},
   "source": [
    "- Next, let's talk about the concept behind shortcut connections, also called skip or residual connections\n",
    "- Originally, shortcut connections were proposed in deep networks for computer vision (residual networks) to mitigate vanishing gradient problems\n",
    "- A shortcut connection creates an alternative shorter path for the gradient to flow through the network\n",
    "- This is achieved by adding the output of one layer to the output of a later layer, usually skipping one or more layers in between\n",
    "- Let's illustrate this idea with a small example network:\n",
    "\n",
    "<img src=\"figures/shortcut-example.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14cfd241-a32e-4601-8790-784b82f2f23e",
   "metadata": {},
   "source": [
    "- In code, it looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "05473938-799c-49fd-86d4-8ed65f94fee6",
   "metadata": {},
   "outputs": [],
   "source": [
    "class ExampleDeepNeuralNetwork(nn.Module):\n",
    "    def __init__(self, layer_sizes, use_shortcut):\n",
    "        super().__init__()\n",
    "        self.use_shortcut = use_shortcut\n",
    "        self.layers = nn.ModuleList([\n",
    "            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),\n",
    "            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),\n",
    "            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),\n",
    "            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),\n",
    "            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())\n",
    "        ])\n",
    "\n",
    "    def forward(self, x):\n",
    "        for layer in self.layers:\n",
    "            # Compute the output of the current layer\n",
    "            layer_output = layer(x)\n",
    "            # Check if shortcut can be applied\n",
    "            if self.use_shortcut and x.shape == layer_output.shape:\n",
    "                x = x + layer_output\n",
    "            else:\n",
    "                x = layer_output\n",
    "        return x\n",
    "\n",
    "\n",
    "def print_gradients(model, x):\n",
    "    # Forward pass\n",
    "    output = model(x)\n",
    "    target = torch.tensor([[0.]])\n",
    "\n",
    "    # Calculate loss based on how close the target\n",
    "    # and output are\n",
    "    loss = nn.MSELoss()\n",
    "    loss = loss(output, target)\n",
    "    \n",
    "    # Backward pass to calculate the gradients\n",
    "    loss.backward()\n",
    "\n",
    "    for name, param in model.named_parameters():\n",
    "        if 'weight' in name:\n",
    "            # Print the mean absolute gradient of the weights\n",
    "            print(f\"{name} has gradient mean of {param.grad.abs().mean().item()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b39bf277-b3db-4bb1-84ce-7a20caff1011",
   "metadata": {},
   "source": [
    "- Let's print the gradient values first **without** shortcut connections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "c75f43cc-6923-4018-b980-26023086572c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "layers.0.0.weight has gradient mean of 0.00020173587836325169\n",
      "layers.1.0.weight has gradient mean of 0.0001201116101583466\n",
      "layers.2.0.weight has gradient mean of 0.0007152041653171182\n",
      "layers.3.0.weight has gradient mean of 0.001398873864673078\n",
      "layers.4.0.weight has gradient mean of 0.005049646366387606\n"
     ]
    }
   ],
   "source": [
    "layer_sizes = [3, 3, 3, 3, 3, 1]  \n",
    "\n",
    "sample_input = torch.tensor([[1., 0., -1.]])\n",
    "\n",
    "torch.manual_seed(123)\n",
    "model_without_shortcut = ExampleDeepNeuralNetwork(\n",
    "    layer_sizes, use_shortcut=False\n",
    ")\n",
    "print_gradients(model_without_shortcut, sample_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "837fd5d4-7345-4663-97f5-38f19dfde621",
   "metadata": {},
   "source": [
    "- Next, let's print the gradient values **with** shortcut connections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "11b7c0c2-f9dd-4dd5-b096-a05c48c5f6d6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "layers.0.0.weight has gradient mean of 0.22169792652130127\n",
      "layers.1.0.weight has gradient mean of 0.20694105327129364\n",
      "layers.2.0.weight has gradient mean of 0.32896995544433594\n",
      "layers.3.0.weight has gradient mean of 0.2665732502937317\n",
      "layers.4.0.weight has gradient mean of 1.3258541822433472\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "model_with_shortcut = ExampleDeepNeuralNetwork(\n",
    "    layer_sizes, use_shortcut=True\n",
    ")\n",
    "print_gradients(model_with_shortcut, sample_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79ff783a-46f0-49c5-a7a9-26a525764b6e",
   "metadata": {},
   "source": [
    "- As we can see based on the output above, shortcut connections prevent the gradients from vanishing in the early layers (towards `layer.0`)\n",
    "- We will use this concept of a shortcut connection next when we implement a transformer block"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cae578ca-e564-42cf-8635-a2267047cdff",
   "metadata": {},
   "source": [
    "## 4.5 Connecting attention and linear layers in a transformer block"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3daac6f-6545-4258-8f2d-f45a7394f429",
   "metadata": {},
   "source": [
    "- In this section, we now combine the previous concepts into a so-called transformer block\n",
    "- A transformer block combines the causal multi-head attention module from the previous chapter with the linear layers, the feed forward neural network we implemented in an earlier section\n",
    "- In addition, the transformer block also uses dropout and shortcut connections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "0e1e8176-e5e3-4152-b1aa-0bbd7891dfd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from previous_chapters import MultiHeadAttention\n",
    "\n",
    "\n",
    "class TransformerBlock(nn.Module):\n",
    "    def __init__(self, cfg):\n",
    "        super().__init__()\n",
    "        self.att = MultiHeadAttention(\n",
    "            d_in=cfg[\"emb_dim\"],\n",
    "            d_out=cfg[\"emb_dim\"],\n",
    "            block_size=cfg[\"ctx_len\"],\n",
    "            num_heads=cfg[\"n_heads\"], \n",
    "            dropout=cfg[\"drop_rate\"],\n",
    "            qkv_bias=cfg[\"qkv_bias\"])\n",
    "        self.ff = FeedForward(cfg)\n",
    "        self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n",
    "        self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n",
    "        self.drop_resid = nn.Dropout(cfg[\"drop_rate\"])\n",
    "\n",
    "    def forward(self, x):\n",
    "        # Shortcut connection for attention block\n",
    "        shortcut = x\n",
    "        x = self.norm1(x)\n",
    "        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]\n",
    "        x = self.drop_resid(x)\n",
    "        x = x + shortcut  # Add the original input back\n",
    "\n",
    "        # Shortcut connection for feed forward block\n",
    "        shortcut = x\n",
    "        x = self.norm2(x)\n",
    "        x = self.ff(x)\n",
    "        x = self.drop_resid(x)\n",
    "        x = x + shortcut  # Add the original input back\n",
    "\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36b64d16-94a6-4d13-8c85-9494c50478a9",
   "metadata": {},
   "source": [
    "<img src=\"figures/transformer-block.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54d2d375-87bd-4153-9040-63a1e6a2b7cb",
   "metadata": {},
   "source": [
    "- Suppose we have 2 input samples with 6 tokens each, where each token is a 768-dimensional embedding vector; then this transformer block applies self-attention, followed by linear layers, to produce an output of similar size\n",
    "- You can think of the output as an augmented version of the context vectors we discussed in the previous chapter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "3fb45a63-b1f3-4b08-b525-dafbc8228405",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input shape: torch.Size([2, 4, 768])\n",
      "Output shape: torch.Size([2, 4, 768])\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "x = torch.rand(2, 4, 768)  # Shape: [batch_size, num_tokens, emb_dim]\n",
    "block = TransformerBlock(GPT_CONFIG_124M)\n",
    "output = block(x)\n",
    "\n",
    "print(\"Input shape:\", x.shape)\n",
    "print(\"Output shape:\", output.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "01e737a6-fc99-42bb-9f7e-4da899168811",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input shape: torch.Size([2, 4, 768])\n",
      "Output shape: torch.Size([2, 4, 768])\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "x = torch.rand(2, 4, 768)  # Shape: [batch_size, num_tokens, emb_dim]\n",
    "block = TransformerBlock(GPT_CONFIG_124M)\n",
    "output = block(x)\n",
    "\n",
    "print(\"Input shape:\", x.shape)\n",
    "print(\"Output shape:\", output.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91f502e4-f3e4-40cb-8268-179eec002394",
   "metadata": {},
   "source": [
    "<img src=\"figures/mental-model-final.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46618527-15ac-4c32-ad85-6cfea83e006e",
   "metadata": {},
   "source": [
    "## 4.6 Coding the GPT model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dec7d03d-9ff3-4ca3-ad67-01b67c2f5457",
   "metadata": {},
   "source": [
    "- We are almost there: now let's plug in the transformer block into the architecture we coded at the very beginning of this chapter so that we obtain a useable GPT architecture\n",
    "- Note that the transformer block is repeated multiple times; in the case of the smallest 124M GPT-2 model, we repeat it 12 times:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b7b362d-f8c5-48d2-8ebd-722480ac5073",
   "metadata": {},
   "source": [
    "<img src=\"figures/gpt.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "324e4b5d-ed89-4fdf-9a52-67deee0593bc",
   "metadata": {},
   "source": [
    "- The corresponding code implementation, where `cfg[\"n_layers\"] = 12`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "c61de39c-d03c-4a32-8b57-f49ac3834857",
   "metadata": {},
   "outputs": [],
   "source": [
    "class GPTModel(nn.Module):\n",
    "    def __init__(self, cfg):\n",
    "        super().__init__()\n",
    "        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
    "        self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
    "        self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
    "        \n",
    "        self.trf_blocks = nn.Sequential(\n",
    "            *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
    "        \n",
    "        self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n",
    "        self.out_head = nn.Linear(\n",
    "            cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
    "        )\n",
    "\n",
    "    def forward(self, in_idx):\n",
    "        batch_size, seq_len = in_idx.shape\n",
    "        tok_embeds = self.tok_emb(in_idx)\n",
    "        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
    "        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]\n",
    "        x = self.trf_blocks(x)\n",
    "        x = self.final_norm(x)\n",
    "        logits = self.out_head(x)\n",
    "        return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2750270f-c45d-4410-8767-a6adbd05d5c3",
   "metadata": {},
   "source": [
    "- Using the configuration of the 124M parameter model, we can now instantiate this GPT model with random initial weights as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "252b78c2-4404-483b-84fe-a412e55c16fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input batch:\n",
      " tensor([[6109, 3626, 6100,  345],\n",
      "        [6109, 1110, 6622,  257]])\n",
      "\n",
      "Output shape: torch.Size([2, 4, 50257])\n",
      "tensor([[[-0.0055,  0.3224,  0.2185,  ...,  0.2539,  0.4578, -0.4747],\n",
      "         [ 0.2663, -0.2975, -0.5040,  ..., -0.3903,  0.5328, -0.4224],\n",
      "         [ 1.1146, -0.0923,  0.1303,  ...,  0.1521, -0.4494,  0.0276],\n",
      "         [-0.8239,  0.1174, -0.2566,  ...,  1.1197,  0.1036, -0.3993]],\n",
      "\n",
      "        [[-0.1027,  0.1752, -0.1048,  ...,  0.2258,  0.1559, -0.8747],\n",
      "         [ 0.2230,  0.1246,  0.0492,  ...,  0.8573, -0.2933,  0.3036],\n",
      "         [ 0.9409,  1.3068, -0.1610,  ...,  0.8244,  0.1763,  0.0811],\n",
      "         [ 0.4395,  0.2753,  0.1540,  ...,  1.3410, -0.3709,  0.1643]]],\n",
      "       grad_fn=<UnsafeViewBackward0>)\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "model = GPTModel(GPT_CONFIG_124M)\n",
    "\n",
    "out = model(batch)\n",
    "print(\"Input batch:\\n\", batch)\n",
    "print(\"\\nOutput shape:\", out.shape)\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d616e7a-568b-4921-af29-bd3f4683cd2e",
   "metadata": {},
   "source": [
    "- We will train this model in the next chapter\n",
    "- However, a quick note about its size: we previously referred to it as a 124M parameter model; we can double check this number as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "84fb8be4-9d3b-402b-b3da-86b663aac33a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of parameters: 163,009,536\n"
     ]
    }
   ],
   "source": [
    "total_params = sum(p.numel() for p in model.parameters())\n",
    "print(f\"Total number of parameters: {total_params:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b67d13dd-dd01-4ba6-a2ad-31ca8a9fd660",
   "metadata": {},
   "source": [
    "- As we see above, this model has 163M, not 124M parameters; why?\n",
    "- In the original GPT-2 paper, the researchers applied weight tying, which means that they reused the token embedding layer (`tok_emb`) as the output layer, which means setting `self.out_head.weight = self.tok_emb.weight`\n",
    "- The token embedding layer projects the 50,257-dimensional one-hot encoded input tokens to a 768-dimensional embedding representation\n",
    "- The output layer projects 768-dimensional embeddings back into a 50,257-dimensional representation so that we can convert these back into words (more about that in the next section)\n",
    "- So, the embedding and output layer have the same number of weight parameters, as we can see based on the shape of their weight matrices: the next chapter\n",
    "- However, a quick note about its size: we previously referred to it as a 124M parameter model; we can double check this number as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "e3b43233-e9b8-4f5a-b72b-a263ec686982",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token embedding layer shape: torch.Size([50257, 768])\n",
      "Output layer shape: torch.Size([50257, 768])\n"
     ]
    }
   ],
   "source": [
    "print(\"Token embedding layer shape:\", model.tok_emb.weight.shape)\n",
    "print(\"Output layer shape:\", model.out_head.weight.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f02259f6-6f79-4c89-a866-4ebeae1c3289",
   "metadata": {},
   "source": [
    "- In the original GPT-2 paper, the researchers reused the token embedding matrix as an output matrix\n",
    "- Correspondingly, if we subtracted the number of parameters of the output layer, we'd get a 124M parameter model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "95a22e02-50d3-48b3-a4e0-d9863343c164",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of trainable parameters considering weight tying: 124,412,160\n"
     ]
    }
   ],
   "source": [
    "total_params_gpt2 =  total_params - sum(p.numel() for p in model.out_head.parameters())\n",
    "print(f\"Number of trainable parameters considering weight tying: {total_params_gpt2:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40b03f80-b94c-46e7-9d42-d0df399ff3db",
   "metadata": {},
   "source": [
    "- In practice, I found it easier to train the model without weight-tying, which is why we didn't implement it here\n",
    "- However, we will revisit and apply this weight-tying idea later when we load the pretrained weights in Chapter 6\n",
    "- Lastly, we can compute the memory requirements of the model as follows, which can be a helpful reference point:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "5131a752-fab8-4d70-a600-e29870b33528",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total size of the model: 621.83 MB\n"
     ]
    }
   ],
   "source": [
    "# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)\n",
    "total_size_bytes = total_params * 4\n",
    "\n",
    "# Convert to megabytes\n",
    "total_size_mb = total_size_bytes / (1024 * 1024)\n",
    "\n",
    "print(f\"Total size of the model: {total_size_mb:.2f} MB\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "309a3be4-c20a-4657-b4e0-77c97510b47c",
   "metadata": {},
   "source": [
    "- Exercise: you can try the following other configurations, which are referenced in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), as well.\n",
    "\n",
    "    - **GPT2-small** (the 124M configuration we already implemented):\n",
    "        - \"emb_dim\" = 768\n",
    "        - \"n_layers\" = 12\n",
    "        - \"n_heads\" = 12\n",
    "\n",
    "    - **GPT2-medium:**\n",
    "        - \"emb_dim\" = 1024\n",
    "        - \"n_layers\" = 24\n",
    "        - \"n_heads\" = 16\n",
    "    \n",
    "    - **GPT2-large:**\n",
    "        - \"emb_dim\" = 1280\n",
    "        - \"n_layers\" = 36\n",
    "        - \"n_heads\" = 20\n",
    "    \n",
    "    - **GPT2-XL:**\n",
    "        - \"emb_dim\" = 1600\n",
    "        - \"n_layers\" = 48\n",
    "        - \"n_heads\" = 25"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da5d9bc0-95ab-45d4-9378-417628d86e35",
   "metadata": {},
   "source": [
    "## 4.7 Generating text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48da5deb-6ee0-4b9b-8dd2-abed7ed65172",
   "metadata": {},
   "source": [
    "- LLMs like the GPT model we implemented above are used to generate one word at a time"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caade12a-fe97-480f-939c-87d24044edff",
   "metadata": {},
   "source": [
    "<img src=\"figures/iterative-gen.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7061524-a3bd-4803-ade6-2e3b7b79ac13",
   "metadata": {},
   "source": [
    "- The following `generate_text_simple` function implements greedy decoding, which is a simple and fast method to generate text\n",
    "- In greedy decoding, at each step, the model chooses the word (or token) with the highest probability as its next output (the highest logit corresponds to the highest probability, so we technically wouldn't even have to compute the softmax function explicitly)\n",
    "- In the next chapter, we will implement a more advanced `generate_text` function\n",
    "- The figure below depicts how the GPT model, given an input context, generates the next word token"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ee0f32c-c18c-445e-b294-a879de2aa187",
   "metadata": {},
   "source": [
    "<img src=\"figures/generate-text.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "c9b428a9-8764-4b36-80cd-7d4e00595ba6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_text_simple(model, idx, max_new_tokens, context_size):\n",
    "    # idx is (batch, n_tokens) array of indices in the current context\n",
    "    for _ in range(max_new_tokens):\n",
    "        \n",
    "        # Crop current context if it exceeds the supported context size\n",
    "        # E.g., if LLM supports only 5 tokens, and the context size is 10\n",
    "        # then only the last 5 tokens are used as context\n",
    "        idx_cond = idx[:, -context_size:]\n",
    "        \n",
    "        # Get the predictions\n",
    "        with torch.no_grad():\n",
    "            logits = model(idx_cond)\n",
    "        \n",
    "        # Focus only on the last time step\n",
    "        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)\n",
    "        logits = logits[:, -1, :]  \n",
    "\n",
    "        # Apply softmax to get probabilities\n",
    "        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)\n",
    "\n",
    "        # Get the idx of the vocab entry with the highest probability value\n",
    "        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)\n",
    "\n",
    "        # Append sampled index to the running sequence\n",
    "        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)\n",
    "\n",
    "    return idx"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6515f2c1-3cc7-421c-8d58-cc2f563b7030",
   "metadata": {},
   "source": [
    "- The `generate_text_simple` above implements an iterative process, where it creates one token at a time\n",
    "\n",
    "<img src=\"figures/iterative-generate.webp\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f682eac4-f9bd-438b-9dec-6b1cc7bc05ce",
   "metadata": {},
   "source": [
    "- Let's prepare an input example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "bb3ffc8e-f95f-4a24-a978-939b8953ea3e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([-1.4929,  4.4812, -1.6093], grad_fn=<SliceBackward0>)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "tensor([    0.0000,     0.0012,     0.0000,  ...,     0.0000,     0.0000,\n",
       "            0.0000], grad_fn=<SoftmaxBackward0>)"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b = logits[0, -1, :]\n",
    "b[0] = -1.4929\n",
    "b[1] = 4.4812\n",
    "b[2] = -1.6093\n",
    "\n",
    "print(b[:3])\n",
    "torch.softmax(b, dim=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "3d7e3e94-df0f-4c0f-a6a1-423f500ac1d3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "encoded: [15496, 11, 314, 716]\n",
      "encoded_tensor.shape: torch.Size([1, 4])\n"
     ]
    }
   ],
   "source": [
    "start_context = \"Hello, I am\"\n",
    "\n",
    "encoded = tokenizer.encode(start_context)\n",
    "print(\"encoded:\", encoded)\n",
    "\n",
    "encoded_tensor = torch.tensor(encoded).unsqueeze(0)\n",
    "print(\"encoded_tensor.shape:\", encoded_tensor.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "a72a9b60-de66-44cf-b2f9-1e638934ada4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])\n",
      "Output length: 10\n"
     ]
    }
   ],
   "source": [
    "model.eval() # disable dropout\n",
    "\n",
    "out = generate_text_simple(\n",
    "    model=model,\n",
    "    idx=encoded_tensor, \n",
    "    max_new_tokens=6, \n",
    "    context_size=GPT_CONFIG_124M[\"ctx_len\"]\n",
    ")\n",
    "\n",
    "print(\"Output:\", out)\n",
    "print(\"Output length:\", len(out[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d131c00-1787-44ba-bec3-7c145497b2c3",
   "metadata": {},
   "source": [
    "- Remove batch dimension and convert back into text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "053d99f6-5710-4446-8d52-117fb34ea9f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello, I am Featureiman Byeswickattribute argue\n"
     ]
    }
   ],
   "source": [
    "decoded_text = tokenizer.decode(out.squeeze(0).tolist())\n",
    "print(decoded_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a894003-51f6-4ccc-996f-3b9c7d5a1d70",
   "metadata": {},
   "source": [
    "- Note that the model is untrained; hence the random output texts above\n",
    "- We will train the model in the next chapter"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}