LLMs-from-scratch/ch03/03_understanding-buffers/understanding-buffers.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "V6BXGeEJ_s-8"
   },
   "source": [
    "# Understanding PyTorch Buffers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aQt9Ob1Y_8EH"
   },
   "source": [
    "In essence, PyTorch buffers are tensors registered on a PyTorch module, similar to parameters. Unlike parameters, however, buffers are not returned by `model.parameters()`, so the optimizer does not update them during training.\n",
    "\n",
    "Buffers are particularly useful for GPU computations because PyTorch transfers them between devices (for example, from CPU to GPU) together with the model's parameters whenever the module is moved. Buffers do not require gradient computation, but they still need to be on the same device as the tensors they interact with so that all computations run correctly.\n",
    "\n",
    "In chapter 3, we use PyTorch buffers via `self.register_buffer`, which is only briefly explained in the book. Since the concept and purpose are not immediately clear, this code notebook offers a longer explanation with a hands-on example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dAwGo_gYLY45"
   },
   "source": [
    "## An example without buffers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0qBQC9IPAJVZ"
   },
   "source": [
    "Suppose we have the following code, which is based on code from chapter 3. This version has been modified to exclude buffers. It implements the causal self-attention mechanism used in LLMs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "7wx-_rokAN04"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "class CausalAttentionWithoutBuffers(nn.Module):\n",
    "\n",
    "    def __init__(self, d_in, d_out, context_length,\n",
    "                 dropout, qkv_bias=False):\n",
    "        super().__init__()\n",
    "        self.d_out = d_out\n",
    "        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
    "        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
    "        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
    "        self.dropout = nn.Dropout(dropout)\n",
    "        self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        b, num_tokens, d_in = x.shape\n",
    "        keys = self.W_key(x)\n",
    "        queries = self.W_query(x)\n",
    "        values = self.W_value(x)\n",
    "\n",
    "        attn_scores = queries @ keys.transpose(1, 2)\n",
    "        attn_scores.masked_fill_(\n",
    "            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)\n",
    "        attn_weights = torch.softmax(\n",
    "            attn_scores / keys.shape[-1]**0.5, dim=-1\n",
    "        )\n",
    "        attn_weights = self.dropout(attn_weights)\n",
    "\n",
    "        context_vec = attn_weights @ values\n",
    "        return context_vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nNrK-wLaNSi7"
   },
   "source": [
    "We can initialize and run the module as follows on some example data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "e1MZiIsPA0Py",
    "outputId": "a0746523-3cf3-492f-e996-495c21371837"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[[-0.4519,  0.2216],\n",
      "         [-0.5874,  0.0058],\n",
      "         [-0.6300, -0.0632],\n",
      "         [-0.5675, -0.0843],\n",
      "         [-0.5526, -0.0981],\n",
      "         [-0.5299, -0.1081]],\n",
      "\n",
      "        [[-0.4519,  0.2216],\n",
      "         [-0.5874,  0.0058],\n",
      "         [-0.6300, -0.0632],\n",
      "         [-0.5675, -0.0843],\n",
      "         [-0.5526, -0.0981],\n",
      "         [-0.5299, -0.1081]]])\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "inputs = torch.tensor(\n",
    "  [[0.43, 0.15, 0.89], # Your     (x^1)\n",
    "   [0.55, 0.87, 0.66], # journey  (x^2)\n",
    "   [0.57, 0.85, 0.64], # starts   (x^3)\n",
    "   [0.22, 0.58, 0.33], # with     (x^4)\n",
    "   [0.77, 0.25, 0.10], # one      (x^5)\n",
    "   [0.05, 0.80, 0.55]] # step     (x^6)\n",
    ")\n",
    "\n",
    "batch = torch.stack((inputs, inputs), dim=0)\n",
    "context_length = batch.shape[1]\n",
    "d_in = inputs.shape[1]\n",
    "d_out = 2\n",
    "\n",
    "ca_without_buffer = CausalAttentionWithoutBuffers(d_in, d_out, context_length, 0.0)\n",
    "\n",
    "with torch.no_grad():\n",
    "    context_vecs = ca_without_buffer(batch)\n",
    "\n",
    "print(context_vecs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7_hqz6AgCCc1"
   },
   "source": [
    "So far, everything has worked fine.\n",
    "\n",
    "However, when training LLMs, we typically use GPUs to accelerate the process. Therefore, let's transfer the `CausalAttentionWithoutBuffers` module onto a GPU device.\n",
    "\n",
    "Please note that this operation requires the code to be run in an environment equipped with GPUs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "PYwn44HWCPJS",
    "outputId": "1aa6bfe9-e9a9-477f-e944-65388820498d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Machine has GPU: True\n"
     ]
    }
   ],
   "source": [
    "print(\"Machine has GPU:\", torch.cuda.is_available())\n",
    "\n",
    "batch = batch.to(\"cuda\")\n",
    "ca_without_buffer.to(\"cuda\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4_lMki2_CoIR"
   },
   "source": [
    "Now, let's run the code again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 338
    },
    "id": "KE9iLcjGC1V1",
    "outputId": "110f444f-f887-4a0a-a156-a263b444941f"
   },
   "outputs": [
    {
     "ename": "RuntimeError",
     "evalue": "expected self and mask to be on the same device, but got mask on cpu and self on cuda:0",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mRuntimeError\u001b[0m                              Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-4-1e0d2e6638f6>\u001b[0m in \u001b[0;36m<cell line: 1>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mno_grad\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m     \u001b[0mcontext_vecs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mca_without_buffer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcontext_vecs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1530\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compiled_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m  \u001b[0;31m# type: ignore[misc]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1531\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1532\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1533\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1534\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1539\u001b[0m                 \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1540\u001b[0m                 or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1541\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1542\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1543\u001b[0m         \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<ipython-input-1-cf1dad0dd611>\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m     21\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     22\u001b[0m         \u001b[0mattn_scores\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mqueries\u001b[0m \u001b[0;34m@\u001b[0m \u001b[0mkeys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtranspose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m         attn_scores.masked_fill_(\n\u001b[0m\u001b[1;32m     24\u001b[0m             self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)\n\u001b[1;32m     25\u001b[0m         attn_weights = torch.softmax(\n",
      "\u001b[0;31mRuntimeError\u001b[0m: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0"
     ]
    }
   ],
   "source": [
    "with torch.no_grad():\n",
    "    context_vecs = ca_without_buffer(batch)\n",
    "\n",
    "print(context_vecs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "I7V26PLrC2gk"
   },
   "source": [
    "Running the code resulted in an error. What happened? The error message says that the in-place `masked_fill_` call received `attn_scores` on the GPU but the `mask` on the CPU. But we moved the module to the GPU!?\n",
    "\n",
    "\n",
    "Let's double-check the device locations of some of the tensors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "vvYDPBRIDHfU",
    "outputId": "5765233b-432d-4078-9064-26260b5ea672"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "W_query.device: cuda:0\n",
      "mask.device: cpu\n"
     ]
    }
   ],
   "source": [
    "print(\"W_query.device:\", ca_without_buffer.W_query.weight.device)\n",
    "print(\"mask.device:\", ca_without_buffer.mask.device)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "d11nX-FFOJ3C",
    "outputId": "3f8b2dac-378c-49b7-c544-61b91fe36351"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Tensor"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(ca_without_buffer.mask)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Ojay-KY-DL5M"
   },
   "source": [
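    "As we can see, the `mask` was not moved onto the GPU. That's because it is a plain `torch.Tensor` attribute rather than a registered PyTorch parameter like the weights (e.g., `W_query.weight`), so calling `.to(\"cuda\")` on the module does not touch it.\n",
    "\n",
    "This means we have to manually move it to the GPU via `.to(\"cuda\")`:"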
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "QYirQ63zDYsW",
    "outputId": "46af1038-23fd-400c-f013-f56bc8a0e730"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mask.device: cuda:0\n"
     ]
    }
   ],
   "source": [
    "ca_without_buffer.mask = ca_without_buffer.mask.to(\"cuda\")\n",
    "print(\"mask.device:\", ca_without_buffer.mask.device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4OoTqzkpDfAm"
   },
   "source": [
    "Let's try our code again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "WfF0yBZODdAZ",
    "outputId": "c7425750-c995-43a6-ca2f-f2dfc402a4fb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[[-0.4519,  0.2216],\n",
      "         [-0.5874,  0.0058],\n",
      "         [-0.6300, -0.0632],\n",
      "         [-0.5675, -0.0843],\n",
      "         [-0.5526, -0.0981],\n",
      "         [-0.5299, -0.1081]],\n",
      "\n",
      "        [[-0.4519,  0.2216],\n",
      "         [-0.5874,  0.0058],\n",
      "         [-0.6300, -0.0632],\n",
      "         [-0.5675, -0.0843],\n",
      "         [-0.5526, -0.0981],\n",
      "         [-0.5299, -0.1081]]], device='cuda:0')\n"
     ]
    }
   ],
   "source": [
    "with torch.no_grad():\n",
    "    context_vecs = ca_without_buffer(batch)\n",
    "\n",
    "print(context_vecs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oUrVgWuuD7UE"
   },
   "source": [
    "This time, it worked!\n",
    "\n",
    "However, remembering to move individual tensors to the GPU can be tedious. As we will see in the next section, it's easier to use `register_buffer` to register the `mask` as a buffer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "StS2wUrBLeuW"
   },
   "source": [
    "## An example with buffers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nEqD2NFzPO6l"
   },
   "source": [
    "Let's now modify the causal attention class to register the causal `mask` as a buffer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "ndsYj3Zf6N8U"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "class CausalAttentionWithBuffer(nn.Module):\n",
    "\n",
    "    def __init__(self, d_in, d_out, context_length,\n",
    "                 dropout, qkv_bias=False):\n",
    "        super().__init__()\n",
    "        self.d_out = d_out\n",
    "        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
    "        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
    "        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
    "        self.dropout = nn.Dropout(dropout)\n",
    "        # Old:\n",
    "        # self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)\n",
    "\n",
    "        # New:\n",
    "        self.register_buffer(\"mask\", torch.triu(torch.ones(context_length, context_length), diagonal=1))\n",
    "\n",
    "    def forward(self, x):\n",
    "        b, num_tokens, d_in = x.shape\n",
    "        keys = self.W_key(x)\n",
    "        queries = self.W_query(x)\n",
    "        values = self.W_value(x)\n",
    "\n",
    "        attn_scores = queries @ keys.transpose(1, 2)\n",
    "        attn_scores.masked_fill_(\n",
    "            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)\n",
    "        attn_weights = torch.softmax(\n",
    "            attn_scores / keys.shape[-1]**0.5, dim=-1\n",
    "        )\n",
    "        attn_weights = self.dropout(attn_weights)\n",
    "\n",
    "        context_vec = attn_weights @ values\n",
    "        return context_vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_AL1X6y3Eb7S"
   },
   "source": [
    "Now, conveniently, if we move the module to the GPU, the mask will be located on the GPU as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "8_VCxEa76j00",
    "outputId": "7152e74b-ce7a-44fb-c8d9-46da0908190e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "W_query.device: cuda:0\n",
      "mask.device: cuda:0\n"
     ]
    }
   ],
   "source": [
    "ca_with_buffer = CausalAttentionWithBuffer(d_in, d_out, context_length, 0.0)\n",
    "ca_with_buffer.to(\"cuda\")\n",
    "\n",
    "print(\"W_query.device:\", ca_with_buffer.W_query.weight.device)\n",
    "print(\"mask.device:\", ca_with_buffer.mask.device)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "TBWvKlMe7bbB",
    "outputId": "63aa3589-4fb9-4b75-b161-458afb7d72e2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[[0.4772, 0.1063],\n",
      "         [0.5891, 0.3257],\n",
      "         [0.6202, 0.3860],\n",
      "         [0.5478, 0.3589],\n",
      "         [0.5321, 0.3428],\n",
      "         [0.5077, 0.3493]],\n",
      "\n",
      "        [[0.4772, 0.1063],\n",
      "         [0.5891, 0.3257],\n",
      "         [0.6202, 0.3860],\n",
      "         [0.5478, 0.3589],\n",
      "         [0.5321, 0.3428],\n",
      "         [0.5077, 0.3493]]], device='cuda:0')\n"
     ]
    }
   ],
   "source": [
    "with torch.no_grad():\n",
    "    context_vecs = ca_with_buffer(batch)\n",
    "\n",
    "print(context_vecs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xvOTh4NNPjef"
   },
   "source": [
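    "As we can see above, registering a tensor as a buffer can make our lives a lot easier: We don't have to remember to move tensors to a target device like a GPU manually.\n",
    "\n",
    "Note that the output values differ from the earlier example only because `ca_with_buffer` is a new module with freshly initialized random weights; the random seed was not reset before creating it."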
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "L4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
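
The device transfer shown in the notebook is one effect of registering a buffer. As a minimal sketch, the following CPU-only example compares a plain tensor attribute, a regular buffer, and a non-persistent buffer. The `MaskHolder` class and its attribute names are hypothetical and not part of the book's code; the dtype conversion via `.to(torch.float64)` stands in for the `.to("cuda")` call, on the assumption that no GPU is available.

```python
import torch
import torch.nn as nn


class MaskHolder(nn.Module):
    """Hypothetical toy module (not from the book) that stores a causal mask three ways."""

    def __init__(self, context_length):
        super().__init__()
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        # Plain attribute: invisible to nn.Module bookkeeping
        self.plain_mask = mask.clone()
        # Regular buffer: follows .to(...) and is saved in the state_dict
        self.register_buffer("mask", mask.clone())
        # Non-persistent buffer: follows .to(...) but is left out of the state_dict
        self.register_buffer("tmp_mask", mask.clone(), persistent=False)
        self.linear = nn.Linear(context_length, context_length, bias=False)


holder = MaskHolder(context_length=4)

buffer_names = [name for name, _ in holder.named_buffers()]
param_names = [name for name, _ in holder.named_parameters()]
state_keys = list(holder.state_dict().keys())

print("buffers:", buffer_names)
print("parameters:", param_names)
print("state_dict keys:", state_keys)

# Module-wide conversions apply to parameters and buffers only,
# so the plain attribute keeps its original dtype
holder.to(torch.float64)
print("mask dtype:", holder.mask.dtype)
print("tmp_mask dtype:", holder.tmp_mask.dtype)
print("plain_mask dtype:", holder.plain_mask.dtype)
```

In this sketch, only the two registered buffers are converted along with the weights, and only the persistent buffer appears in the `state_dict`. Passing `persistent=False` can be a reasonable choice for a mask like this one, since it can be rebuilt from `context_length` and does not need to be stored in checkpoints.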
Understanding PyTorch Buffers (#288) 2024-07-26 08:45:36 -05:00			`{`
			`"cells": [`
update 2024-07-27 07:12:42 -05:00			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"<table style=\"width:100%\">\n",`
			`"<tr>\n",`
			`"<td style=\"vertical-align:middle; text-align:left;\">\n",`
			`"<font size=\"2\">\n",`
			`"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",`
			`"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",`
			`"</font>\n",`
			`"</td>\n",`
			`"<td style=\"vertical-align:middle; text-align:left;\">\n",`
			`"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",`
			`"</td>\n",`
			`"</tr>\n",`
			`"</table>"`
			`]`
			`},`
Understanding PyTorch Buffers (#288) 2024-07-26 08:45:36 -05:00			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "V6BXGeEJ_s-8"`
			`},`
			`"source": [`
			`"# Understanding PyTorch Buffers"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "aQt9Ob1Y_8EH"`
			`},`
			`"source": [`
			`"In essence, PyTorch buffers are tensor attributes associated with a PyTorch module or model similar to parameters, but unlike parameters, buffers are not updated during training.\n",`
			`"\n",`
			`"Buffers in PyTorch are particularly useful when dealing with GPU computations, as they need to be transferred between devices (like from CPU to GPU) alongside the model's parameters. Unlike parameters, buffers do not require gradient computation, but they still need to be on the correct device to ensure that all computations are performed correctly.\n",`
			`"\n",`
			"In chapter 3, we use PyTorch buffers via `self.register_buffer`, which is only briefly explained in the book. Since the concept and purpose are not immediately clear, this code notebook offers a longer explanation with a hands-on example."
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "dAwGo_gYLY45"`
			`},`
			`"source": [`
			`"## An example without buffers"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "0qBQC9IPAJVZ"`
			`},`
			`"source": [`
			`"Suppose we have the following code, which is based on code from chapter 3. This version has been modified to exclude buffers. It implements the causal self-attention mechanism used in LLMs:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 1,`
			`"metadata": {`
			`"id": "7wx-_rokAN04"`
			`},`
			`"outputs": [],`
			`"source": [`
			`"import torch\n",`
			`"import torch.nn as nn\n",`
			`"\n",`
			`"class CausalAttentionWithoutBuffers(nn.Module):\n",`
			`"\n",`
			`" def __init__(self, d_in, d_out, context_length,\n",`
			`" dropout, qkv_bias=False):\n",`
			`" super().__init__()\n",`
			`" self.d_out = d_out\n",`
			`" self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",`
			`" self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n",`
			`" self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",`
			`" self.dropout = nn.Dropout(dropout)\n",`
			`" self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)\n",`
			`"\n",`
			`" def forward(self, x):\n",`
			`" b, num_tokens, d_in = x.shape\n",`
			`" keys = self.W_key(x)\n",`
			`" queries = self.W_query(x)\n",`
			`" values = self.W_value(x)\n",`
			`"\n",`
			`" attn_scores = queries @ keys.transpose(1, 2)\n",`
			`" attn_scores.masked_fill_(\n",`
			`" self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)\n",`
			`" attn_weights = torch.softmax(\n",`
			`" attn_scores / keys.shape[-1]**0.5, dim=-1\n",`
			`" )\n",`
			`" attn_weights = self.dropout(attn_weights)\n",`
			`"\n",`
			`" context_vec = attn_weights @ values\n",`
			`" return context_vec"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "nNrK-wLaNSi7"`
			`},`
			`"source": [`
			`"We can initialize and run the module as follows on some example data:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 2,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "e1MZiIsPA0Py",`
			`"outputId": "a0746523-3cf3-492f-e996-495c21371837"`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"tensor([[[-0.4519, 0.2216],\n",`
			`" [-0.5874, 0.0058],\n",`
			`" [-0.6300, -0.0632],\n",`
			`" [-0.5675, -0.0843],\n",`
			`" [-0.5526, -0.0981],\n",`
			`" [-0.5299, -0.1081]],\n",`
			`"\n",`
			`" [[-0.4519, 0.2216],\n",`
			`" [-0.5874, 0.0058],\n",`
			`" [-0.6300, -0.0632],\n",`
			`" [-0.5675, -0.0843],\n",`
			`" [-0.5526, -0.0981],\n",`
			`" [-0.5299, -0.1081]]])\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"torch.manual_seed(123)\n",`
			`"\n",`
			`"inputs = torch.tensor(\n",`
			`" [[0.43, 0.15, 0.89], # Your (x^1)\n",`
			`" [0.55, 0.87, 0.66], # journey (x^2)\n",`
			`" [0.57, 0.85, 0.64], # starts (x^3)\n",`
			`" [0.22, 0.58, 0.33], # with (x^4)\n",`
			`" [0.77, 0.25, 0.10], # one (x^5)\n",`
			`" [0.05, 0.80, 0.55]] # step (x^6)\n",`
			`")\n",`
			`"\n",`
			`"batch = torch.stack((inputs, inputs), dim=0)\n",`
			`"context_length = batch.shape[1]\n",`
			`"d_in = inputs.shape[1]\n",`
			`"d_out = 2\n",`
			`"\n",`
			`"ca_without_buffer = CausalAttentionWithoutBuffers(d_in, d_out, context_length, 0.0)\n",`
			`"\n",`
			`"with torch.no_grad():\n",`
			`" context_vecs = ca_without_buffer(batch)\n",`
			`"\n",`
			`"print(context_vecs)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "7_hqz6AgCCc1"`
			`},`
			`"source": [`
			`"So far, everything has worked fine so far.\n",`
			`"\n",`
			"However, when training LLMs, we typically use GPUs to accelerate the process. Therefore, let's transfer the `CausalAttentionWithoutBuffers` module onto a GPU device.\n",
			`"\n",`
			`"Please note that this operation requires the code to be run in an environment equipped with GPUs."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 3,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "PYwn44HWCPJS",`
			`"outputId": "1aa6bfe9-e9a9-477f-e944-65388820498d"`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Machine has GPU: True\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"print(\"Machine has GPU:\", torch.cuda.is_available())\n",`
			`"\n",`
			`"batch = batch.to(\"cuda\")\n",`
			`"ca_without_buffer.to(\"cuda\");"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "4_lMki2_CoIR"`
			`},`
			`"source": [`
			`"Now, let's run the code again:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 4,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/",`
			`"height": 338`
			`},`
			`"id": "KE9iLcjGC1V1",`
			`"outputId": "110f444f-f887-4a0a-a156-a263b444941f"`
			`},`
			`"outputs": [`
			`{`
			`"ename": "RuntimeError",`
			`"evalue": "expected self and mask to be on the same device, but got mask on cpu and self on cuda:0",`
			`"output_type": "error",`
			`"traceback": [`
			`"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",`
			`"\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)",`
			"\u001b[0;32m<ipython-input-4-1e0d2e6638f6>\u001b[0m in \u001b[0;36m<cell line: 1>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mno_grad\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mcontext_vecs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mca_without_buffer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcontext_vecs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
			"\u001b[0;32m/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_wrapped_call_impl\u001b[0;34m(self, args, kwargs)\u001b[0m\n\u001b[1;32m 1530\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compiled_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# type: ignore[misc]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1531\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1532\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1533\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1534\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
			"\u001b[0;32m/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, args, kwargs)\u001b[0m\n\u001b[1;32m 1539\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1540\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1541\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1542\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1543\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
			"\u001b[0;32m<ipython-input-1-cf1dad0dd611>\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 22\u001b[0m \u001b[0mattn_scores\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mqueries\u001b[0m \u001b[0;34m@\u001b[0m \u001b[0mkeys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtranspose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m attn_scores.masked_fill_(\n\u001b[0m\u001b[1;32m 24\u001b[0m self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)\n\u001b[1;32m 25\u001b[0m attn_weights = torch.softmax(\n",
			`"\u001b[0;31mRuntimeError\u001b[0m: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0"`
			`]`
			`}`
			`],`
			`"source": [`
			`"with torch.no_grad():\n",`
			`"    context_vecs = ca_without_buffer(batch)\n",`
			`"\n",`
			`"print(context_vecs)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "I7V26PLrC2gk"`
			`},`
			`"source": [`
			`"Running the code resulted in an error. What happened? According to the error message, the masked-fill operation received attention scores on the GPU but a mask on the CPU. But we moved the module to the GPU!?\n",`
			`"\n",`
			`"\n",`
			`"Let's double-check the device locations of some of the tensors:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 5,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "vvYDPBRIDHfU",`
			`"outputId": "5765233b-432d-4078-9064-26260b5ea672"`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"W_query.device: cuda:0\n",`
			`"mask.device: cpu\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"print(\"W_query.device:\", ca_without_buffer.W_query.weight.device)\n",`
			`"print(\"mask.device:\", ca_without_buffer.mask.device)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 6,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "d11nX-FFOJ3C",`
			`"outputId": "3f8b2dac-378c-49b7-c544-61b91fe36351"`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"torch.Tensor"`
			`]`
			`},`
			`"execution_count": 6,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"type(ca_without_buffer.mask)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "Ojay-KY-DL5M"`
			`},`
			`"source": [`
			"As we can see, the `mask` was not moved onto the GPU. That's because it's not a PyTorch parameter like the weights (e.g., `W_query.weight`).\n",
			`"\n",`
			"This means we have to manually move it to the GPU via `.to(\"cuda\")`:"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 7,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "QYirQ63zDYsW",`
			`"outputId": "46af1038-23fd-400c-f013-f56bc8a0e730"`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"mask.device: cuda:0\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"ca_without_buffer.mask = ca_without_buffer.mask.to(\"cuda\")\n",`
			`"print(\"mask.device:\", ca_without_buffer.mask.device)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "4OoTqzkpDfAm"`
			`},`
			`"source": [`
			`"Let's try our code again:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 8,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "WfF0yBZODdAZ",`
			`"outputId": "c7425750-c995-43a6-ca2f-f2dfc402a4fb"`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"tensor([[[-0.4519,  0.2216],\n",`
			`"         [-0.5874,  0.0058],\n",`
			`"         [-0.6300, -0.0632],\n",`
			`"         [-0.5675, -0.0843],\n",`
			`"         [-0.5526, -0.0981],\n",`
			`"         [-0.5299, -0.1081]],\n",`
			`"\n",`
			`"        [[-0.4519,  0.2216],\n",`
			`"         [-0.5874,  0.0058],\n",`
			`"         [-0.6300, -0.0632],\n",`
			`"         [-0.5675, -0.0843],\n",`
			`"         [-0.5526, -0.0981],\n",`
			`"         [-0.5299, -0.1081]]], device='cuda:0')\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"with torch.no_grad():\n",`
			`"    context_vecs = ca_without_buffer(batch)\n",`
			`"\n",`
			`"print(context_vecs)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "oUrVgWuuD7UE"`
			`},`
			`"source": [`
			`"This time, it worked!\n",`
			`"\n",`
			"However, remembering to move individual tensors to the GPU can be tedious. As we will see in the next section, it's easier to use `register_buffer` to register the `mask` as a buffer."
			`]`
			`},`
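			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"To see why moving the module did not move the mask, it can help to list what a module actually tracks. The cell below is a minimal, CPU-only sketch using a hypothetical toy module (`PlainAttr` is made up for illustration and is not part of the book's code): a tensor stored as a plain attribute appears in neither `named_parameters()` nor `named_buffers()`, and those two collections are what `.to()` moves."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"import torch\n",`
			`"import torch.nn as nn\n",`
			`"\n",`
			`"\n",`
			`"class PlainAttr(nn.Module):  # hypothetical toy module, for illustration only\n",`
			`"\n",`
			`"    def __init__(self):\n",`
			`"        super().__init__()\n",`
			`"        self.linear = nn.Linear(3, 2, bias=False)\n",`
			`"        # Plain tensor attribute: the module does not register it\n",`
			`"        self.mask = torch.triu(torch.ones(4, 4), diagonal=1)\n",`
			`"\n",`
			`"\n",`
			`"toy = PlainAttr()\n",`
			`"\n",`
			`"# Only registered tensors are listed here; the plain attribute is absent from both\n",`
			`"print(\"parameters:\", [name for name, _ in toy.named_parameters()])\n",`
			`"print(\"buffers:\", [name for name, _ in toy.named_buffers()])"`
			`]`
			`},`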
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "StS2wUrBLeuW"`
			`},`
			`"source": [`
			`"## An example with buffers"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "nEqD2NFzPO6l"`
			`},`
			`"source": [`
			"Let's now modify the causal attention class to register the causal `mask` as a buffer:"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"metadata": {`
			`"id": "ndsYj3Zf6N8U"`
			`},`
			`"outputs": [],`
			`"source": [`
			`"import torch\n",`
			`"import torch.nn as nn\n",`
			`"\n",`
			`"class CausalAttentionWithBuffer(nn.Module):\n",`
			`"\n",`
			`"    def __init__(self, d_in, d_out, context_length,\n",`
			`"                 dropout, qkv_bias=False):\n",`
			`"        super().__init__()\n",`
			`"        self.d_out = d_out\n",`
			`"        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",`
			`"        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n",`
			`"        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",`
			`"        self.dropout = nn.Dropout(dropout)\n",`
			`"        # Old:\n",`
			`"        # self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)\n",`
			`"\n",`
			`"        # New:\n",`
			`"        self.register_buffer(\"mask\", torch.triu(torch.ones(context_length, context_length), diagonal=1))\n",`
			`"\n",`
			`"    def forward(self, x):\n",`
			`"        b, num_tokens, d_in = x.shape\n",`
			`"        keys = self.W_key(x)\n",`
			`"        queries = self.W_query(x)\n",`
			`"        values = self.W_value(x)\n",`
			`"\n",`
			`"        attn_scores = queries @ keys.transpose(1, 2)\n",`
			`"        attn_scores.masked_fill_(\n",`
			`"            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)\n",`
			`"        attn_weights = torch.softmax(\n",`
			`"            attn_scores / keys.shape[-1]**0.5, dim=-1\n",`
			`"        )\n",`
			`"        attn_weights = self.dropout(attn_weights)\n",`
			`"\n",`
			`"        context_vec = attn_weights @ values\n",`
			`"        return context_vec"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "_AL1X6y3Eb7S"`
			`},`
			`"source": [`
			`"Now, conveniently, if we move the module to the GPU, the mask will be located on the GPU as well:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 10,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "8_VCxEa76j00",`
			`"outputId": "7152e74b-ce7a-44fb-c8d9-46da0908190e"`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"W_query.device: cuda:0\n",`
			`"mask.device: cuda:0\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"ca_with_buffer = CausalAttentionWithBuffer(d_in, d_out, context_length, 0.0)\n",`
			`"ca_with_buffer.to(\"cuda\")\n",`
			`"\n",`
			`"print(\"W_query.device:\", ca_with_buffer.W_query.weight.device)\n",`
			`"print(\"mask.device:\", ca_with_buffer.mask.device)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 11,`
			`"metadata": {`
			`"colab": {`
			`"base_uri": "https://localhost:8080/"`
			`},`
			`"id": "TBWvKlMe7bbB",`
			`"outputId": "63aa3589-4fb9-4b75-b161-458afb7d72e2"`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"tensor([[[0.4772, 0.1063],\n",`
			`"         [0.5891, 0.3257],\n",`
			`"         [0.6202, 0.3860],\n",`
			`"         [0.5478, 0.3589],\n",`
			`"         [0.5321, 0.3428],\n",`
			`"         [0.5077, 0.3493]],\n",`
			`"\n",`
			`"        [[0.4772, 0.1063],\n",`
			`"         [0.5891, 0.3257],\n",`
			`"         [0.6202, 0.3860],\n",`
			`"         [0.5478, 0.3589],\n",`
			`"         [0.5321, 0.3428],\n",`
			`"         [0.5077, 0.3493]]], device='cuda:0')\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"with torch.no_grad():\n",`
			`"    context_vecs = ca_with_buffer(batch)\n",`
			`"\n",`
			`"print(context_vecs)"`
			`]`
			`},`
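			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"Registering the mask has effects beyond device placement. As a minimal, CPU-only sketch with a hypothetical toy module (`WithBuffer` is made up for illustration), the cell below shows that a registered buffer is listed by `named_buffers()` and, since buffers are persistent by default, also saved in `state_dict()`, while it stays out of `named_parameters()` and does not require gradients:"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"import torch\n",`
			`"import torch.nn as nn\n",`
			`"\n",`
			`"\n",`
			`"class WithBuffer(nn.Module):  # hypothetical toy module, for illustration only\n",`
			`"\n",`
			`"    def __init__(self):\n",`
			`"        super().__init__()\n",`
			`"        self.linear = nn.Linear(3, 2, bias=False)\n",`
			`"        # Registered buffer: tracked by the module, but not trainable\n",`
			`"        self.register_buffer(\"mask\", torch.triu(torch.ones(4, 4), diagonal=1))\n",`
			`"\n",`
			`"\n",`
			`"toy = WithBuffer()\n",`
			`"\n",`
			`"print(\"parameters:\", [name for name, _ in toy.named_parameters()])\n",`
			`"print(\"buffers:\", [name for name, _ in toy.named_buffers()])\n",`
			`"print(\"state_dict keys:\", list(toy.state_dict().keys()))\n",`
			`"print(\"mask.requires_grad:\", toy.mask.requires_grad)"`
			`]`
			`},`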
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"id": "xvOTh4NNPjef"`
			`},`
			`"source": [`
			`"As we can see above, registering a tensor as a buffer can make our lives a lot easier: We don't have to remember to move tensors to a target device like a GPU manually."`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"accelerator": "GPU",`
			`"colab": {`
			`"gpuType": "L4",`
			`"provenance": []`
			`},`
			`"kernelspec": {`
			`"display_name": "Python 3 (ipykernel)",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.10.6"`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 4`
			`}`