LLMs-from-scratch/ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
      "metadata": {
        "id": "6f678e62-7bcb-4405-86ae-dce94f494303"
      },
      "source": [
        "# Efficient Multi-Head Attention Implementations"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "b742938a-4bfc-4527-a1f1-d5963508967d",
      "metadata": {
        "id": "b742938a-4bfc-4527-a1f1-d5963508967d"
      },
      "source": [
        "This code notebook compares different ways to implement causal multi-head attention used in decoder-style LLMs like GPT, Llama, etc."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "7898551e-f582-48ac-9f66-3632abe2a93f",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7898551e-f582-48ac-9f66-3632abe2a93f",
        "outputId": "2ddf0145-94d3-4490-8087-d1ffeb6f30ab"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "PyTorch version: 2.2.1+cu121\n",
            "Running on cuda\n"
          ]
        }
      ],
      "source": [
        "import torch\n",
        "\n",
        "torch.manual_seed(123)\n",
        "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
        "print(f\"PyTorch version: {torch.__version__}\")\n",
        "print(f\"Running on {device}\")\n",
        "\n",
        "batch_size = 8\n",
        "context_len = 1024\n",
        "embed_dim = 768\n",
        "embeddings = torch.randn((batch_size, context_len, embed_dim), device=device)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2f9bb1b6-a1e5-4e0a-884d-0f31b374a8d6",
      "metadata": {
        "id": "2f9bb1b6-a1e5-4e0a-884d-0f31b374a8d6"
      },
      "source": [
        "## 1) CausalAttention MHA wrapper class from chapter 3"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "297c93ed-aec0-4896-bb89-42c4b294d3d1",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "297c93ed-aec0-4896-bb89-42c4b294d3d1",
        "outputId": "ae6d707f-eae8-467a-ed4d-a88051bf776f"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([8, 1024, 768])\n"
          ]
        }
      ],
      "source": [
        "from ch03 import MultiHeadAttentionWrapper as Ch03_MHA_Wrapper\n",
        "\n",
        "mha_ch03_wrapper = Ch03_MHA_Wrapper(\n",
        "    d_in=embed_dim,\n",
        "    d_out=embed_dim//12,\n",
        "    block_size=context_len,\n",
        "    dropout=0.0,\n",
        "    num_heads=12,\n",
        "    qkv_bias=False\n",
        ").to(device)\n",
        "\n",
        "out = mha_ch03_wrapper(embeddings)\n",
        "print(out.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "21930804-b327-40b1-8e63-94dcad39ce7b",
      "metadata": {
        "id": "21930804-b327-40b1-8e63-94dcad39ce7b"
      },
      "source": [
        "## 2) The multi-head attention class from chapter 3"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "4ee6a61b-d25c-4a0c-8a59-f285544e3710",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4ee6a61b-d25c-4a0c-8a59-f285544e3710",
        "outputId": "5df88462-8b1a-4b1f-ce71-3909ad2ca9c2"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([8, 1024, 768])\n"
          ]
        }
      ],
      "source": [
        "from ch03 import MultiHeadAttention as Ch03_MHA\n",
        "\n",
        "mha_ch03 = Ch03_MHA(\n",
        "    d_in=embed_dim,\n",
        "    d_out=embed_dim,\n",
        "    block_size=context_len,\n",
        "    dropout=0.0,\n",
        "    num_heads=12,\n",
        "    qkv_bias=False\n",
        ").to(device)\n",
        "\n",
        "out = mha_ch03(embeddings)\n",
        "print(out.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "73cd11da-ea3b-4081-b483-c4965dfefbc4",
      "metadata": {
        "id": "73cd11da-ea3b-4081-b483-c4965dfefbc4"
      },
      "source": [
        "## 3) An alternative multi-head attention with combined weights"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1fa1a5ea-eaff-4d2d-aaf0-b34cdb6fd4dd",
      "metadata": {
        "id": "1fa1a5ea-eaff-4d2d-aaf0-b34cdb6fd4dd"
      },
      "source": [
        "- The code for the `MultiHeadAttentionCombinedQKV` class below is based on code that was kindly shared by [Rayed Bin Wahed](https://github.com/rasbt/LLMs-from-scratch/discussions/51)\n",
        "- The main difference between the `MultiHeadAttentionCombinedQKV` class and the `MultiHeadAttention` class used in chapter 3 is that `MultiHeadAttentionCombinedQKV` uses a single weight matrix, `self.qkv = nn.Linear(d_in, 3 * d_out, bias=qkv_bias)`, instead of separate weight matrices:\n",
        "\n",
        "  - `self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)`\n",
        "  - `self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)`\n",
        "  - `self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)`\n",
        "\n",
        "- Here, `self.qkv` combines all three weight matrices `self.W_query`, `self.W_key`, and `self.W_value` to carry out the query, key, and value computation in a single step\n",
        "- Using `queries, keys, values = qkv.unbind(0)`, we obtain the individual query, key, and value tensors, which are then used similarly to the query, key, and value tensors in the `MultiHeadAttention` class in chapter 3"
      ]
    },
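    {
      "cell_type": "markdown",
      "id": "sketch-fused-qkv-note",
      "metadata": {
        "id": "sketch-fused-qkv-note"
      },
      "source": [
        "- As a minimal sketch of the point above (not part of the original comparison), the next cell checks on small, arbitrary sizes that one fused `nn.Linear(d_in, 3 * d_out)` layer yields the same queries, keys, and values as three separate layers when its weight is the row-wise concatenation of theirs\n",
        "- The names `sketch_W_query`, `sketch_W_key`, `sketch_W_value`, and `sketch_fused`, as well as the sizes, are illustrative assumptions; `chunk` is used instead of the reshape/`unbind` route for brevity, and it splits the last dimension in the same query, key, value order"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "sketch-fused-qkv-code",
      "metadata": {
        "id": "sketch-fused-qkv-code"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "import torch.nn as nn\n",
        "\n",
        "sketch_d_in, sketch_d_out = 8, 6  # small illustrative sizes (assumption)\n",
        "\n",
        "# Three separate projections, as in the chapter 3 class\n",
        "sketch_W_query = nn.Linear(sketch_d_in, sketch_d_out, bias=False)\n",
        "sketch_W_key = nn.Linear(sketch_d_in, sketch_d_out, bias=False)\n",
        "sketch_W_value = nn.Linear(sketch_d_in, sketch_d_out, bias=False)\n",
        "\n",
        "# One fused projection whose weight stacks the three matrices row-wise\n",
        "sketch_fused = nn.Linear(sketch_d_in, 3 * sketch_d_out, bias=False)\n",
        "with torch.no_grad():\n",
        "    sketch_fused.weight.copy_(\n",
        "        torch.cat([sketch_W_query.weight, sketch_W_key.weight, sketch_W_value.weight], dim=0)\n",
        "    )\n",
        "\n",
        "sketch_x = torch.randn(2, 4, sketch_d_in)\n",
        "sketch_q, sketch_k, sketch_v = sketch_fused(sketch_x).chunk(3, dim=-1)\n",
        "\n",
        "# Each chunk is compared against the corresponding separate projection\n",
        "print(torch.allclose(sketch_q, sketch_W_query(sketch_x), atol=1e-5))\n",
        "print(torch.allclose(sketch_k, sketch_W_key(sketch_x), atol=1e-5))\n",
        "print(torch.allclose(sketch_v, sketch_W_value(sketch_x), atol=1e-5))"
      ]
    },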
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "9a6bd0a2-f27c-4602-afa0-c96cd295c1a6",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9a6bd0a2-f27c-4602-afa0-c96cd295c1a6",
        "outputId": "1240afaf-139a-4d01-ddac-4a186ff4a4fd"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([8, 1024, 768])\n"
          ]
        }
      ],
      "source": [
        "import torch.nn as nn\n",
        "\n",
        "\n",
        "class MultiHeadAttentionCombinedQKV(nn.Module):\n",
        "    def __init__(self, d_in, d_out, num_heads, block_size, dropout=0.0, qkv_bias=False):\n",
        "        super().__init__()\n",
        "\n",
        "        assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n",
        "\n",
        "        self.num_heads = num_heads\n",
        "        self.block_size = block_size\n",
        "        self.head_dim = d_out // num_heads\n",
        "\n",
        "        self.qkv = nn.Linear(d_in, 3 * d_out, bias=qkv_bias)\n",
        "        # The output projection acts on the concatenated heads, which have size d_out\n",
        "        self.proj = nn.Linear(d_out, d_out)\n",
        "        self.dropout = nn.Dropout(dropout)\n",
        "\n",
        "        self.register_buffer(\n",
        "            \"mask\", torch.triu(torch.ones(block_size, block_size), diagonal=1)\n",
        "        )\n",
        "\n",
        "    def forward(self, x):\n",
        "        batch_size, num_tokens, embed_dim = x.shape\n",
        "\n",
        "        # (b, num_tokens, embed_dim) --> (b, num_tokens, 3 * embed_dim)\n",
        "        qkv = self.qkv(x)\n",
        "\n",
        "        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)\n",
        "        qkv = qkv.reshape(batch_size, num_tokens, 3, self.num_heads, self.head_dim)\n",
        "\n",
        "        # (b, num_tokens, 3, num_heads, head_dim) --> (3, b, num_heads, num_tokens, head_dim)\n",
        "        qkv = qkv.permute(2, 0, 3, 1, 4)\n",
        "\n",
        "        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)\n",
        "        queries, keys, values = qkv.unbind(0)\n",
        "\n",
        "        # (b, num_heads, num_tokens, head_dim) --> (b, num_heads, num_tokens, num_tokens)\n",
        "        attn_scores = queries @ keys.transpose(-2, -1)\n",
        "        attn_scores = attn_scores.masked_fill(\n",
        "            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf\n",
        "        )\n",
        "\n",
        "        # Scale the scores by sqrt(head_dim) before the softmax\n",
        "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
        "        attn_weights = self.dropout(attn_weights)\n",
        "\n",
        "        # (b, num_heads, num_tokens, num_tokens) --> (b, num_heads, num_tokens, head_dim)\n",
        "        context_vec = attn_weights @ values\n",
        "\n",
        "        # (b, num_heads, num_tokens, head_dim) --> (b, num_tokens, num_heads, head_dim)\n",
        "        context_vec = context_vec.transpose(1, 2)\n",
        "\n",
        "        # (b, num_tokens, num_heads, head_dim) --> (b, num_tokens, num_heads * head_dim)\n",
        "        context_vec = context_vec.reshape(batch_size, num_tokens, self.num_heads * self.head_dim)\n",
        "\n",
        "        context_vec = self.proj(context_vec)\n",
        "\n",
        "        return context_vec\n",
        "\n",
        "\n",
        "mha_combined_qkv = MultiHeadAttentionCombinedQKV(\n",
        "    d_in=embed_dim,\n",
        "    d_out=embed_dim,\n",
        "    block_size=context_len,\n",
        "    dropout=0.0,\n",
        "    num_heads=12,\n",
        "    qkv_bias=False\n",
        ").to(device)\n",
        "\n",
        "out = mha_combined_qkv(embeddings)\n",
        "print(out.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "48a042d3-ee78-4c29-bf63-d92fe6706632",
      "metadata": {
        "id": "48a042d3-ee78-4c29-bf63-d92fe6706632"
      },
      "source": [
        "## 4) Multihead attention with PyTorch's scaled dot product attention"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f78e346f-3b85-44e6-9feb-f01131381148",
      "metadata": {
        "id": "f78e346f-3b85-44e6-9feb-f01131381148"
      },
      "source": [
        "- The implementation below uses PyTorch's [`scaled_dot_product_attention`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) function, which can dispatch to a memory-optimized implementation of self-attention called [FlashAttention](https://arxiv.org/abs/2205.14135)"
      ]
    },
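    {
      "cell_type": "markdown",
      "id": "sketch-sdpa-check-note",
      "metadata": {
        "id": "sketch-sdpa-check-note"
      },
      "source": [
        "- A minimal sketch, assuming PyTorch 2.0 or newer: the next cell compares a hand-written causal attention (mask, divide by the square root of the head dimension, softmax) against `scaled_dot_product_attention` with `is_causal=True` on small random tensors\n",
        "- The tensor sizes and the `sketch_`-prefixed names are illustrative assumptions; which backend kernel PyTorch selects depends on the device, dtype, and PyTorch version, so this only checks numerical agreement, not speed"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "sketch-sdpa-check-code",
      "metadata": {
        "id": "sketch-sdpa-check-code"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "import torch.nn.functional as F\n",
        "\n",
        "# (batch, num_heads, num_tokens, head_dim); sizes are arbitrary (assumption)\n",
        "sketch_queries = torch.randn(2, 3, 5, 4)\n",
        "sketch_keys = torch.randn(2, 3, 5, 4)\n",
        "sketch_values = torch.randn(2, 3, 5, 4)\n",
        "\n",
        "# Manual causal attention: mask future positions, scale, softmax, weight the values\n",
        "sketch_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)\n",
        "sketch_scores = sketch_queries @ sketch_keys.transpose(-2, -1)\n",
        "sketch_scores = sketch_scores.masked_fill(sketch_mask, float('-inf'))\n",
        "sketch_weights = torch.softmax(sketch_scores / sketch_keys.shape[-1]**0.5, dim=-1)\n",
        "sketch_manual = sketch_weights @ sketch_values\n",
        "\n",
        "# The same computation through PyTorch's fused function\n",
        "sketch_fused_out = F.scaled_dot_product_attention(\n",
        "    sketch_queries, sketch_keys, sketch_values, is_causal=True\n",
        ")\n",
        "\n",
        "print(torch.allclose(sketch_manual, sketch_fused_out, atol=1e-5))"
      ]
    },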
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "1b8e5a0d-1f65-4a03-bf6e-723f0cc428f5",
      "metadata": {
        "id": "1b8e5a0d-1f65-4a03-bf6e-723f0cc428f5"
      },
      "outputs": [],
      "source": [
        "class MHAPyTorchScaledDotProduct(nn.Module):\n",
        "    def __init__(self, d_in, d_out, num_heads, block_size, dropout=0.0, qkv_bias=False):\n",
        "        super().__init__()\n",
        "\n",
        "        assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n",
        "\n",
        "        self.num_heads = num_heads\n",
        "        self.block_size = block_size\n",
        "        self.head_dim = d_out // num_heads\n",
        "        self.d_out = d_out\n",
        "\n",
        "        self.qkv = nn.Linear(d_in, 3 * d_out, bias=qkv_bias)\n",
        "        # The output projection acts on the concatenated heads, which have size d_out\n",
        "        self.proj = nn.Linear(d_out, d_out)\n",
        "        self.dropout = dropout\n",
        "\n",
        "        self.register_buffer(\n",
        "            \"mask\", torch.triu(torch.ones(block_size, block_size), diagonal=1)\n",
        "        )\n",
        "\n",
        "    def forward(self, x):\n",
        "        batch_size, num_tokens, embed_dim = x.shape\n",
        "\n",
        "        # (b, num_tokens, embed_dim) --> (b, num_tokens, 3 * embed_dim)\n",
        "        qkv = self.qkv(x)\n",
        "\n",
        "        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)\n",
        "        qkv = qkv.reshape(batch_size, num_tokens, 3, self.num_heads, self.head_dim)\n",
        "\n",
        "        # (b, num_tokens, 3, num_heads, head_dim) --> (3, b, num_heads, num_tokens, head_dim)\n",
        "        qkv = qkv.permute(2, 0, 3, 1, 4)\n",
        "\n",
        "        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)\n",
        "        queries, keys, values = qkv.unbind(0)\n",
        "\n",
        "        use_dropout = 0. if not self.training else self.dropout\n",
        "        context_vec = nn.functional.scaled_dot_product_attention(\n",
        "            queries, keys, values, attn_mask=None, dropout_p=use_dropout, is_causal=True)\n",
        "\n",
        "        # Combine heads, where self.d_out = self.num_heads * self.head_dim\n",
        "        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)\n",
        "\n",
        "        # Apply the output projection, as the other implementations do\n",
        "        context_vec = self.proj(context_vec)\n",
        "\n",
        "        return context_vec"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "id": "fbc8ba92-3471-41cb-b1b2-4c0ef5be392b",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fbc8ba92-3471-41cb-b1b2-4c0ef5be392b",
        "outputId": "83ef0a2f-3fe6-4123-c8de-f481f2a9e415"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([8, 1024, 768])\n"
          ]
        }
      ],
      "source": [
        "mha_pytorch_scaled = MHAPyTorchScaledDotProduct(\n",
        "    d_in=embed_dim,\n",
        "    d_out=embed_dim,\n",
        "    block_size=context_len,\n",
        "    dropout=0.0,\n",
        "    num_heads=12,\n",
        "    qkv_bias=False\n",
        ").to(device)\n",
        "\n",
        "out = mha_pytorch_scaled(embeddings)\n",
        "print(out.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "351c318f-4835-4d74-8d58-a070222447c4",
      "metadata": {
        "id": "351c318f-4835-4d74-8d58-a070222447c4"
      },
      "source": [
        "## 5) Using PyTorch's torch.nn.MultiheadAttention"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "74a6d060-6324-48fa-a35c-cb09f2a48965",
      "metadata": {
        "id": "74a6d060-6324-48fa-a35c-cb09f2a48965"
      },
      "source": [
        "- Below, we use PyTorch's [torch.nn.MultiheadAttention](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) implementation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "id": "3799c7ef-3155-42c6-a829-f95656453ae0",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3799c7ef-3155-42c6-a829-f95656453ae0",
        "outputId": "aabf134e-c9bc-474b-ee57-0c24b5fb604c"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([8, 1024, 768])\n"
          ]
        }
      ],
      "source": [
        "import torch.nn as nn\n",
        "\n",
        "\n",
        "class MHAPyTorchClass(nn.Module):\n",
        "    def __init__(self, d_in, d_out, num_heads, block_size, dropout=0.0, qkv_bias=False, need_weights=True):\n",
        "        super().__init__()\n",
        "\n",
        "        self.block_size = block_size\n",
        "        self.multihead_attn = nn.MultiheadAttention(\n",
        "            embed_dim=d_out,\n",
        "            num_heads=num_heads,\n",
        "            dropout=dropout,\n",
        "            bias=qkv_bias,\n",
        "            add_bias_kv=qkv_bias,\n",
        "            batch_first=True,\n",
        "        )\n",
        "\n",
        "        self.need_weights = need_weights\n",
        "        self.proj = nn.Linear(d_out, d_out)\n",
        "        self.register_buffer(\"mask\", torch.triu(torch.ones(block_size, block_size), diagonal=1).bool())\n",
        "\n",
        "    def forward(self, x):\n",
        "        batch_size, num_tokens, _ = x.shape\n",
        "\n",
        "        # Ensure attn_mask is compatible with expected shape and `batch_first=True`\n",
        "        # No need to manually adjust for num_heads; ensure it's right for the sequence\n",
        "        if self.block_size >= num_tokens:\n",
        "            attn_mask = self.mask[:num_tokens, :num_tokens]\n",
        "        else:\n",
        "            attn_mask = self.mask[:self.block_size, :self.block_size]\n",
        "\n",
        "        # attn_mask broadcasting will handle batch_size dimension implicitly\n",
        "        attn_output, _ = self.multihead_attn(\n",
        "            x, x, x, attn_mask=attn_mask, need_weights=self.need_weights\n",
        "        )\n",
        "\n",
        "        output = self.proj(attn_output)\n",
        "\n",
        "        return output\n",
        "\n",
        "\n",
        "mha_pytorch_class_default = MHAPyTorchClass(\n",
        "    d_in=embed_dim,\n",
        "    d_out=embed_dim,\n",
        "    block_size=context_len,\n",
        "    dropout=0.0,\n",
        "    num_heads=12,\n",
        "    qkv_bias=False\n",
        ").to(device)\n",
        "\n",
        "out = mha_pytorch_class_default(embeddings)\n",
        "print(out.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a3953bff-1056-4de2-bfd1-dfccf659eee4",
      "metadata": {
        "id": "a3953bff-1056-4de2-bfd1-dfccf659eee4"
      },
      "source": [
        "## 6) Using PyTorch's torch.nn.MultiheadAttention with `scaled_dot_product_attention`"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d2164859-31a0-4537-b4fb-27d57675ba77",
      "metadata": {
        "id": "d2164859-31a0-4537-b4fb-27d57675ba77"
      },
      "source": [
        "- Set `need_weights` (default `True`) to `False` so that `MultiheadAttention` uses `scaled_dot_product_attention`, [according to the documentation](https://github.com/pytorch/pytorch/blob/71d020262793542974cf13b30f2a9099773f015c/torch/nn/modules/activation.py#L1096)\n",
        "\n",
        ">  need_weights: If specified, returns ``attn_output_weights`` in addition to ``attn_outputs``.\n",
        "            Set ``need_weights=False`` to use the optimized ``scaled_dot_product_attention``\n",
        "            and achieve the best performance for MHA.\n",
        "            Default: ``True``."
      ]
    },
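    {
      "cell_type": "markdown",
      "id": "sketch-need-weights-note",
      "metadata": {
        "id": "sketch-need-weights-note"
      },
      "source": [
        "- A minimal sketch on small, arbitrary sizes: the next cell calls one `nn.MultiheadAttention` module with and without `need_weights` to show what the flag changes in the return value\n",
        "- The module size and the `sketch_`-prefixed names are illustrative assumptions; the cell compares returned values only and says nothing about speed"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "sketch-need-weights-code",
      "metadata": {
        "id": "sketch-need-weights-code"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "import torch.nn as nn\n",
        "\n",
        "# Small module and input; the sizes are arbitrary (assumption)\n",
        "sketch_mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, dropout=0.0, batch_first=True)\n",
        "sketch_input = torch.randn(2, 5, 16)\n",
        "\n",
        "# Boolean causal mask: True marks positions that must not be attended to\n",
        "sketch_causal_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)\n",
        "\n",
        "sketch_out_w, sketch_attn_weights = sketch_mha(\n",
        "    sketch_input, sketch_input, sketch_input,\n",
        "    attn_mask=sketch_causal_mask, need_weights=True\n",
        ")\n",
        "sketch_out_nw, sketch_no_weights = sketch_mha(\n",
        "    sketch_input, sketch_input, sketch_input,\n",
        "    attn_mask=sketch_causal_mask, need_weights=False\n",
        ")\n",
        "\n",
        "# With need_weights=True, the second return value holds the attention weights\n",
        "print(sketch_attn_weights.shape)\n",
        "# With need_weights=False, no weights are returned\n",
        "print(sketch_no_weights is None)\n",
        "# The attention outputs themselves are compared numerically\n",
        "print(torch.allclose(sketch_out_w, sketch_out_nw, atol=1e-5))"
      ]
    },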
    {
      "cell_type": "code",
      "execution_count": 8,
      "id": "4a4c2afe-5e1f-4bd7-a118-67031176f147",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4a4c2afe-5e1f-4bd7-a118-67031176f147",
        "outputId": "5b577a7c-4199-4e52-8d08-a0974a5a3685"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([8, 1024, 768])\n"
          ]
        }
      ],
      "source": [
        "mha_pytorch_class_noweights = MHAPyTorchClass(\n",
        "    d_in=embed_dim,\n",
        "    d_out=embed_dim,\n",
        "    block_size=context_len,\n",
        "    dropout=0.0,\n",
        "    num_heads=12,\n",
        "    qkv_bias=False,\n",
        "    need_weights=False # NEW!\n",
        ").to(device)\n",
        "\n",
        "out = mha_pytorch_class_noweights(embeddings)\n",
        "print(out.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8877de71-f84f-4f6d-bc87-7552013b6301",
      "metadata": {
        "id": "8877de71-f84f-4f6d-bc87-7552013b6301"
      },
      "source": [
        "## Quick speed comparison (M1 MacBook Air CPU)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a97c0b2e-6593-49d8-98bc-2267b3aa610f",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "a97c0b2e-6593-49d8-98bc-2267b3aa610f",
        "outputId": "ebe635b2-5c03-4e9b-da3a-951d308acf7b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1.15 s ± 86.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
          ]
        }
      ],
      "source": [
        "## 1) CausalAttention MHA wrapper class from chapter 3\n",
        "%timeit mha_ch03_wrapper(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "19db9c2c-8e75-431a-8eef-0b4d8284e6e6",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "19db9c2c-8e75-431a-8eef-0b4d8284e6e6",
        "outputId": "c6e7bcff-661c-45a6-da82-b1e3f89cf761"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "273 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
          ]
        }
      ],
      "source": [
        "## 2) The multi-head attention class from chapter 3\n",
        "%timeit mha_ch03(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "aa526ee0-7a88-4f34-a49a-f8f97da83779",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aa526ee0-7a88-4f34-a49a-f8f97da83779",
        "outputId": "92b634f8-43f8-468f-87a1-bb774b64c212"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "324 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
          ]
        }
      ],
      "source": [
        "## 3) An alternative multi-head attention with combined weights\n",
        "%timeit mha_combined_qkv(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cc2b4256-16d8-4c34-9fd0-d4b4af0e60fa",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cc2b4256-16d8-4c34-9fd0-d4b4af0e60fa",
        "outputId": "80c6e314-0771-470e-b090-628984ce2d85"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "106 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
          ]
        }
      ],
      "source": [
        "## 4) Multihead attention with PyTorch's scaled dot product attention\n",
        "%timeit mha_pytorch_scaled(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0f209e70-ebb6-4a1a-b608-1ff42e41c01d",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0f209e70-ebb6-4a1a-b608-1ff42e41c01d",
        "outputId": "3cd37b53-04d4-4dd0-9450-6fc8ebaac083"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "351 ms ± 7.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
          ]
        }
      ],
      "source": [
        "## 5) Using PyTorch's torch.nn.MultiheadAttention\n",
        "%timeit mha_pytorch_class_default(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3f4968c2-8d40-4ab9-8dba-052b4f77d756",
      "metadata": {
        "tags": [],
        "id": "3f4968c2-8d40-4ab9-8dba-052b4f77d756",
        "outputId": "2e86bdb4-7fa0-4051-b000-4a2b591060a2"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "333 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
          ]
        }
      ],
      "source": [
        "## 6) Using PyTorch's torch.nn.MultiheadAttention disabling `need_weights`\n",
        "%timeit mha_pytorch_class_noweights(embeddings)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a78ff594-6cc2-496d-a302-789fa104c3c9",
      "metadata": {
        "id": "a78ff594-6cc2-496d-a302-789fa104c3c9"
      },
      "source": [
        "## Quick speed comparison (Nvidia A100 GPU)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "707a2a14-a089-48a8-88aa-d328e1e0a9d0",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "707a2a14-a089-48a8-88aa-d328e1e0a9d0",
        "outputId": "07a711f6-f7ff-496c-ce16-be67308aeadf"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "41.1 ms ± 5.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
          ]
        }
      ],
      "source": [
        "## 1) CausalAttention MHA wrapper class from chapter 3\n",
        "%timeit mha_ch03_wrapper(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "id": "8686dd69-3655-40e4-a57b-a2c55532a010",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8686dd69-3655-40e4-a57b-a2c55532a010",
        "outputId": "b0c29336-55e8-4194-89e4-9201f77e5375"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "6.58 ms ± 256 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
          ]
        }
      ],
      "source": [
        "## 2) The multi-head attention class from chapter 3\n",
        "%timeit mha_ch03(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "id": "2209d7df-e54b-4910-ae2b-c78cf684d9bf",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2209d7df-e54b-4910-ae2b-c78cf684d9bf",
        "outputId": "ba357440-47d4-450d-b859-08031056ccf8"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "7.19 ms ± 590 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
          ]
        }
      ],
      "source": [
        "## 3) An alternative multi-head attention with combined weights\n",
        "%timeit mha_combined_qkv(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "id": "1075abe2-4839-4fd6-af3e-c09bb3651e26",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1075abe2-4839-4fd6-af3e-c09bb3651e26",
        "outputId": "b2126630-7fae-4c44-8180-226ff5509d78"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "2.37 ms ± 569 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
          ]
        }
      ],
      "source": [
        "## 4) Multihead attention with PyTorch's scaled dot product attention\n",
        "%timeit mha_pytorch_scaled(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "id": "868e3670-8edc-47bc-9e06-eb505e44dc9d",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "868e3670-8edc-47bc-9e06-eb505e44dc9d",
        "outputId": "453d9b7b-3f45-4907-b4fd-77d395534d6b"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "6.66 ms ± 301 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
          ]
        }
      ],
      "source": [
        "## 5) Using PyTorch's torch.nn.MultiheadAttention\n",
        "%timeit mha_pytorch_class_default(embeddings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "id": "944870e6-de54-4e3b-a455-b8f21f6f92c8",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "944870e6-de54-4e3b-a455-b8f21f6f92c8",
        "outputId": "ccfe127c-c069-4dcd-f37d-ea6a40406955"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "4.52 ms ± 317 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
          ]
        }
      ],
      "source": [
        "## 6) Using PyTorch's torch.nn.MultiheadAttention disabling `need_weights`\n",
        "%timeit mha_pytorch_class_noweights(embeddings)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "dabc6575-0316-4640-a729-e616d5c17b73",
      "metadata": {
        "id": "dabc6575-0316-4640-a729-e616d5c17b73"
      },
      "source": [
        "## Speed comparison (Nvidia A100 GPU) with warmup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "29b63d3d-6d0b-43bb-9c68-d5514dc81000",
      "metadata": {
        "id": "29b63d3d-6d0b-43bb-9c68-d5514dc81000"
      },
      "outputs": [],
      "source": [
        "# CUDA benchmark code shared by Andrei Aksionov\n",
        "# and based on code from\n",
        "# https://github.com/cuda-mode/lectures/blob/main/lecture1/pytorch_square.py\n",
        "\n",
        "def time_pytorch_function(func, *args, num_repeats=1_000):\n",
        "    # CUDA kernels run asynchronously, so Python's time module would not\n",
        "    # measure GPU execution time reliably; use CUDA events instead\n",
        "    start = torch.cuda.Event(enable_timing=True)\n",
        "    end = torch.cuda.Event(enable_timing=True)\n",
        "\n",
        "    # Warmup\n",
        "    for _ in range(5):\n",
        "        func(*args)\n",
        "    torch.cuda.synchronize()\n",
        "\n",
        "    start.record()\n",
        "    for _ in range(num_repeats):\n",
        "        func(*args)\n",
        "        torch.cuda.synchronize()\n",
        "    end.record()\n",
        "    torch.cuda.synchronize()\n",
        "    # elapsed_time returns milliseconds; report the mean per call\n",
        "    return start.elapsed_time(end) / num_repeats"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "\n",
        "embeddings_cuda = embeddings.to(torch.device(\"cuda\"))\n",
        "\n",
        "functions = {\n",
        "    \"1) MHA wrapper class\": mha_ch03_wrapper,\n",
        "    \"2) MHA Ch03\": mha_ch03,\n",
        "    \"3) MHA with combined QKV weights\": mha_combined_qkv,\n",
        "    \"4) MHA with PyTorch scaled_dot_product_attention\": mha_pytorch_scaled,\n",
        "    \"5) PyTorch MHA class defaults\": mha_pytorch_class_default,\n",
        "    \"6) PyTorch MHA with need_weights=False\": mha_pytorch_class_noweights\n",
        "}\n",
        "execution_times = [time_pytorch_function(fn, embeddings_cuda) for name,fn in functions.items()]\n",
        "\n",
        "\n",
        "# Plotting\n",
        "\n",
        "# Customize further for dark mode aesthetics\n",
        "plt.rcParams['figure.facecolor'] = '#121212'  # Dark figure background\n",
        "plt.rcParams['axes.facecolor'] = '#121212'    # Dark axes background\n",
        "plt.rcParams['axes.edgecolor'] = 'white'      # White axes border\n",
        "plt.rcParams['axes.labelcolor'] = 'white'     # White labels\n",
        "plt.rcParams['text.color'] = 'white'          # White text\n",
        "plt.rcParams['xtick.color'] = 'white'         # White x ticks\n",
        "plt.rcParams['ytick.color'] = 'white'         # White y ticks\n",
        "plt.rcParams['grid.color'] = '#444444'        # Lighter grid lines for contrast\n",
        "plt.rcParams['lines.linewidth'] = 2           # Thicker plot lines for visibility\n",
        "plt.rcParams['lines.markersize'] = 8          # Larger markers for visibility\n",
        "\n",
        "fig, ax = plt.subplots()\n",
        "bars = plt.bar(functions.keys(), execution_times)\n",
        "\n",
        "plt.ylabel('Execution time (ms)')\n",
        "plt.xticks(rotation=45, ha=\"right\")\n",
        "\n",
        "# Calculate new ylim with a margin\n",
        "max_execution_time = max(execution_times)\n",
        "upper_ylim = max_execution_time + 0.2 * max_execution_time  # Adding a 20% margin\n",
        "\n",
        "plt.ylim(0, upper_ylim)  # Setting new ylim\n",
        "\n",
        "# Annotate bars with execution times\n",
        "for bar in bars:\n",
        "    yval = bar.get_height()\n",
        "    plt.text(bar.get_x() + bar.get_width()/2, yval + (0.05 * upper_ylim), round(yval, 2), ha='center', va='bottom')\n",
        "\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.savefig(\"1.pdf\")\n",
        "plt.show()\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 488
        },
        "id": "CDJAPZaszaqx",
        "outputId": "47c9ef93-438e-4455-faef-7e253eaaaa8d"
      },
      "id": "CDJAPZaszaqx",
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 640x480 with 1 Axes>"
            ]
          },
          "metadata": {}
        }
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "A100",
      "machine_shape": "hm",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.6"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}