LLMs-from-scratch/ch04/01_main-chapter-code/exercise-solutions.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "ba450fb1-8a26-4894-ab7a-5d7bfefe90ce",
"metadata": {},
"source": [
"<table style=\"width:100%\">\n",
"<tr>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<font size=\"2\">\n",
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
"</font>\n",
"</td>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
"</td>\n",
"</tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14",
"metadata": {},
"source": [
"# Chapter 4 Exercise solutions"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5b2fac7a-fdcd-437c-b1c4-0b35a31cd489",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch version: 2.4.0\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"print(\"torch version:\", version(\"torch\"))"
]
},
{
"cell_type": "markdown",
"id": "5fea8be3-30a1-4623-a6d7-b095c6c1092e",
"metadata": {},
"source": [
"# Exercise 4.1: Parameters in the feed forward versus attention module"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2751b0e5-ffd3-4be2-8db3-e20dd4d61d69",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TransformerBlock(\n",
" (att): MultiHeadAttention(\n",
" (W_query): Linear(in_features=768, out_features=768, bias=False)\n",
" (W_key): Linear(in_features=768, out_features=768, bias=False)\n",
" (W_value): Linear(in_features=768, out_features=768, bias=False)\n",
" (out_proj): Linear(in_features=768, out_features=768, bias=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (layers): Sequential(\n",
" (0): Linear(in_features=768, out_features=3072, bias=True)\n",
" (1): GELU()\n",
" (2): Linear(in_features=3072, out_features=768, bias=True)\n",
" )\n",
" )\n",
" (norm1): LayerNorm()\n",
" (norm2): LayerNorm()\n",
" (drop_shortcut): Dropout(p=0.1, inplace=False)\n",
")\n"
]
}
],
"source": [
"from gpt import TransformerBlock\n",
"\n",
"GPT_CONFIG_124M = {\n",
" \"vocab_size\": 50257,\n",
" \"context_length\": 1024,\n",
" \"emb_dim\": 768,\n",
" \"n_heads\": 12,\n",
" \"n_layers\": 12,\n",
" \"drop_rate\": 0.1,\n",
" \"qkv_bias\": False\n",
"}\n",
"\n",
"block = TransformerBlock(GPT_CONFIG_124M)\n",
"print(block)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1bcaffd1-0cf6-4f8f-bd53-ab88a37f443e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of parameters in feed forward module: 4,722,432\n"
]
}
],
"source": [
"total_params = sum(p.numel() for p in block.ff.parameters())\n",
"print(f\"Total number of parameters in feed forward module: {total_params:,}\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c1dd06c1-ab6c-4df7-ba73-f9cd54b31138",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of parameters in attention module: 2,360,064\n"
]
}
],
"source": [
"total_params = sum(p.numel() for p in block.att.parameters())\n",
"print(f\"Total number of parameters in attention module: {total_params:,}\")"
]
},
{
"cell_type": "markdown",
"id": "15463dec-520a-47b4-b3ad-e180394fd076",
"metadata": {},
"source": [
"- The results above are for a single transformer block\n",
"- Optionally multiply by 12 to capture all transformer blocks in the 124M GPT model"
]
},
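{
"cell_type": "markdown",
"id": "7c3f1a2e-5b8d-4c4a-9e2f-0a1b2c3d4e5f",
"metadata": {},
"source": [
"- The optional cell below is not part of the book; it is a minimal check that reuses the `block` instance from above and multiplies the per-block counts by 12"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d4a2b3f-6c9e-4d5b-8f3a-1b2c3d4e5f60",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch (not from the book): scale the per-block counts\n",
"# from above to all 12 transformer blocks of the 124M model\n",
"ff_params = sum(p.numel() for p in block.ff.parameters())\n",
"att_params = sum(p.numel() for p in block.att.parameters())\n",
"\n",
"print(f\"Feed forward parameters across 12 blocks: {12 * ff_params:,}\")\n",
"print(f\"Attention parameters across 12 blocks: {12 * att_params:,}\")"
]
},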
{
"cell_type": "markdown",
"id": "597e9251-e0a9-4972-8df6-f280f35939f9",
"metadata": {},
"source": [
"**Bonus: Mathematical breakdown**\n",
"\n",
"- For those interested in how these parameter counts are calculated mathematically, you can find the breakdown below (assuming `emb_dim=768`):\n",
"\n",
"\n",
"Feed forward module:\n",
"\n",
"- 1st `Linear` layer: 768 inputs × 4×768 outputs + 4×768 bias units = 2,362,368\n",
"- 2nd `Linear` layer: 4×768 inputs × 768 outputs + 768 bias units = 2,360,064\n",
"- Total: 1st `Linear` layer + 2nd `Linear` layer = 2,362,368 + 2,360,064 = 4,722,432\n",
"\n",
"Attention module:\n",
"\n",
"- `W_query`: 768 inputs × 768 outputs = 589,824 \n",
"- `W_key`: 768 inputs × 768 outputs = 589,824\n",
"- `W_value`: 768 inputs × 768 outputs = 589,824 \n",
"- `out_proj`: 768 inputs × 768 outputs + 768 bias units = 590,592\n",
"- Total: `W_query` + `W_key` + `W_value` + `out_proj` = 3×589,824 + 590,592 = 2,360,064 "
]
},
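{
"cell_type": "markdown",
"id": "9e5b3c4a-7d0f-4e6c-9a4b-2c3d4e5f6071",
"metadata": {},
"source": [
"- The optional cell below (not part of the book) reproduces the breakdown above in code, assuming `emb_dim=768`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af6c4d5b-8e1a-4f7d-8b5c-3d4e5f607182",
"metadata": {},
"outputs": [],
"source": [
"# Optional check (not from the book): reproduce the breakdown above\n",
"emb_dim = 768\n",
"\n",
"# 1st Linear (768 -> 3072) + 2nd Linear (3072 -> 768), both with bias\n",
"ff_params = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)\n",
"\n",
"# W_query, W_key, W_value without bias, out_proj with bias\n",
"att_params = 3 * emb_dim * emb_dim + (emb_dim * emb_dim + emb_dim)\n",
"\n",
"print(f\"Feed forward module: {ff_params:,}\")  # expected: 4,722,432\n",
"print(f\"Attention module: {att_params:,}\")  # expected: 2,360,064"
]
},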
{
"cell_type": "markdown",
"id": "0f7b7c7f-0fa1-4d30-ab44-e499edd55b6d",
"metadata": {},
"source": [
"# Exercise 4.2: Initialize larger GPT models"
]
},
{
"cell_type": "markdown",
"id": "310b2e05-3ec8-47fc-afd9-83bf03d4aad8",
"metadata": {},
"source": [
"- **GPT2-small** (the 124M configuration we already implemented):\n",
" - \"emb_dim\" = 768\n",
" - \"n_layers\" = 12\n",
" - \"n_heads\" = 12\n",
"\n",
"- **GPT2-medium:**\n",
" - \"emb_dim\" = 1024\n",
" - \"n_layers\" = 24\n",
" - \"n_heads\" = 16\n",
"\n",
"- **GPT2-large:**\n",
" - \"emb_dim\" = 1280\n",
" - \"n_layers\" = 36\n",
" - \"n_heads\" = 20\n",
"\n",
"- **GPT2-XL:**\n",
" - \"emb_dim\" = 1600\n",
" - \"n_layers\" = 48\n",
" - \"n_heads\" = 25"
]
},
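{
"cell_type": "markdown",
"id": "b07d5e6c-9f2b-4a8e-9c6d-4e5f60718293",
"metadata": {},
"source": [
"- Before instantiating the models, the optional cell below (not part of the book) estimates each configuration's total parameter count analytically, assuming the chapter's architecture: `qkv_bias=False`, a `LayerNorm` with scale and shift parameters, and an untied output head"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c18e6f7d-a03c-4b9f-8d7e-5f6071829304",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch (not from the book): analytical parameter-count estimate,\n",
"# assuming the chapter's architecture (qkv_bias=False, LayerNorm with scale\n",
"# and shift parameters, untied output head)\n",
"\n",
"def estimated_params(emb_dim, n_layers, vocab_size=50257, context_length=1024):\n",
"    per_block = 12 * emb_dim**2 + 10 * emb_dim  # attention + feed forward + 2 LayerNorms\n",
"    return (2 * vocab_size * emb_dim            # token embedding + output head\n",
"            + context_length * emb_dim          # positional embedding\n",
"            + n_layers * per_block\n",
"            + 2 * emb_dim)                      # final LayerNorm\n",
"\n",
"\n",
"for name, emb_dim, n_layers in [\n",
"    (\"gpt2-small\", 768, 12), (\"gpt2-medium\", 1024, 24),\n",
"    (\"gpt2-large\", 1280, 36), (\"gpt2-xl\", 1600, 48),\n",
"]:\n",
"    print(f\"{name}: {estimated_params(emb_dim, n_layers):,}\")"
]
},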
{
"cell_type": "code",
"execution_count": 5,
"id": "90185dea-81ca-4cdc-aef7-4aaf95cba946",
"metadata": {},
"outputs": [],
"source": [
"GPT_CONFIG_124M = {\n",
" \"vocab_size\": 50257,\n",
" \"context_length\": 1024,\n",
" \"emb_dim\": 768,\n",
" \"n_heads\": 12,\n",
" \"n_layers\": 12,\n",
" \"drop_rate\": 0.1,\n",
" \"qkv_bias\": False\n",
"}\n",
"\n",
"\n",
"def get_config(base_config, model_name=\"gpt2-small\"):\n",
" GPT_CONFIG = base_config.copy()\n",
"\n",
" if model_name == \"gpt2-small\":\n",
" GPT_CONFIG[\"emb_dim\"] = 768\n",
" GPT_CONFIG[\"n_layers\"] = 12\n",
" GPT_CONFIG[\"n_heads\"] = 12\n",
"\n",
" elif model_name == \"gpt2-medium\":\n",
" GPT_CONFIG[\"emb_dim\"] = 1024\n",
" GPT_CONFIG[\"n_layers\"] = 24\n",
" GPT_CONFIG[\"n_heads\"] = 16\n",
"\n",
" elif model_name == \"gpt2-large\":\n",
" GPT_CONFIG[\"emb_dim\"] = 1280\n",
" GPT_CONFIG[\"n_layers\"] = 36\n",
" GPT_CONFIG[\"n_heads\"] = 20\n",
"\n",
" elif model_name == \"gpt2-xl\":\n",
" GPT_CONFIG[\"emb_dim\"] = 1600\n",
" GPT_CONFIG[\"n_layers\"] = 48\n",
" GPT_CONFIG[\"n_heads\"] = 25\n",
"\n",
" else:\n",
" raise ValueError(f\"Incorrect model name {model_name}\")\n",
"\n",
" return GPT_CONFIG\n",
"\n",
"\n",
"def calculate_size(model): # based on chapter code\n",
" \n",
" total_params = sum(p.numel() for p in model.parameters())\n",
" print(f\"Total number of parameters: {total_params:,}\")\n",
"\n",
" total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())\n",
" print(f\"Number of trainable parameters considering weight tying: {total_params_gpt2:,}\")\n",
" \n",
" # Calculate the total size in bytes (assuming float32, 4 bytes per parameter)\n",
" total_size_bytes = total_params * 4\n",
" \n",
" # Convert to megabytes\n",
" total_size_mb = total_size_bytes / (1024 * 1024)\n",
" \n",
" print(f\"Total size of the model: {total_size_mb:.2f} MB\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2587e011-78a4-479c-a8fd-961cc40a5fd4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"gpt2-small:\n",
"Total number of parameters: 163,009,536\n",
"Number of trainable parameters considering weight tying: 124,412,160\n",
"Total size of the model: 621.83 MB\n",
"\n",
"\n",
"gpt2-medium:\n",
"Total number of parameters: 406,212,608\n",
"Number of trainable parameters considering weight tying: 354,749,440\n",
"Total size of the model: 1549.58 MB\n",
"\n",
"\n",
"gpt2-large:\n",
"Total number of parameters: 838,220,800\n",
"Number of trainable parameters considering weight tying: 773,891,840\n",
"Total size of the model: 3197.56 MB\n",
"\n",
"\n",
"gpt2-xl:\n",
"Total number of parameters: 1,637,792,000\n",
"Number of trainable parameters considering weight tying: 1,557,380,800\n",
"Total size of the model: 6247.68 MB\n"
]
}
],
"source": [
"from gpt import GPTModel\n",
"\n",
"\n",
"for model_abbrev in (\"small\", \"medium\", \"large\", \"xl\"):\n",
" model_name = f\"gpt2-{model_abbrev}\"\n",
" CONFIG = get_config(GPT_CONFIG_124M, model_name=model_name)\n",
" model = GPTModel(CONFIG)\n",
" print(f\"\\n\\n{model_name}:\")\n",
" calculate_size(model)"
]
},
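{
"cell_type": "markdown",
"id": "d29f708e-b14d-4c0a-9e8f-607182930415",
"metadata": {},
"source": [
"- The \"considering weight tying\" numbers above exclude the output head parameters because GPT-2 shares (ties) the output head's weight matrix with the token embedding\n",
"- The optional cell below (not part of the book) sketches how such tying could be applied to the last model created above (`gpt2-xl`); since `model.parameters()` deduplicates shared parameters, the resulting count should match the weight-tied figure printed above"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3a0819f-c25e-4d1b-8f90-718293041526",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch (not from the book): tie the output head to the token\n",
"# embedding of the last model from the loop above (gpt2-xl); both weights\n",
"# have shape (vocab_size, emb_dim)\n",
"model.out_head.weight = model.tok_emb.weight\n",
"\n",
"# model.parameters() deduplicates shared parameters by default\n",
"tied_params = sum(p.numel() for p in model.parameters())\n",
"print(f\"Number of parameters with weight tying: {tied_params:,}\")"
]
},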
{
"cell_type": "markdown",
"id": "f5f2306e-5dc8-498e-92ee-70ae7ec37ac1",
"metadata": {},
"source": [
"# Exercise 4.3: Using separate dropout parameters"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5fee2cf5-61c3-4167-81b5-44ea155bbaf2",
"metadata": {},
"outputs": [],
"source": [
"GPT_CONFIG_124M = {\n",
" \"vocab_size\": 50257,\n",
" \"context_length\": 1024,\n",
" \"emb_dim\": 768,\n",
" \"n_heads\": 12,\n",
" \"n_layers\": 12,\n",
" \"drop_rate_emb\": 0.1, # NEW: dropout for embedding layers\n",
" \"drop_rate_attn\": 0.1, # NEW: dropout for multi-head attention \n",
" \"drop_rate_shortcut\": 0.1, # NEW: dropout for shortcut connections \n",
" \"qkv_bias\": False\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5aa1b0c1-d78a-48fc-ad08-4802458b43f7",
"metadata": {},
"outputs": [],
"source": [
"import torch.nn as nn\n",
"from gpt import MultiHeadAttention, LayerNorm, FeedForward\n",
"\n",
"\n",
"class TransformerBlock(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" self.att = MultiHeadAttention(\n",
" d_in=cfg[\"emb_dim\"],\n",
" d_out=cfg[\"emb_dim\"],\n",
" context_length=cfg[\"context_length\"],\n",
" num_heads=cfg[\"n_heads\"], \n",
" dropout=cfg[\"drop_rate_attn\"], # NEW: dropout for multi-head attention\n",
" qkv_bias=cfg[\"qkv_bias\"])\n",
" self.ff = FeedForward(cfg)\n",
" self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n",
" self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n",
" self.drop_shortcut = nn.Dropout(cfg[\"drop_rate_shortcut\"])\n",
"\n",
" def forward(self, x):\n",
" # Shortcut connection for attention block\n",
" shortcut = x\n",
" x = self.norm1(x)\n",
" x = self.att(x) # Shape [batch_size, num_tokens, emb_size]\n",
" x = self.drop_shortcut(x)\n",
" x = x + shortcut # Add the original input back\n",
"\n",
" # Shortcut connection for feed-forward block\n",
" shortcut = x\n",
" x = self.norm2(x)\n",
" x = self.ff(x)\n",
" x = self.drop_shortcut(x)\n",
" x = x + shortcut # Add the original input back\n",
"\n",
" return x\n",
"\n",
"\n",
"class GPTModel(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
" self.pos_emb = nn.Embedding(cfg[\"context_length\"], cfg[\"emb_dim\"])\n",
" self.drop_emb = nn.Dropout(cfg[\"drop_rate_emb\"]) # NEW: dropout for embedding layers\n",
"\n",
" self.trf_blocks = nn.Sequential(\n",
" *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
"\n",
" self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n",
" self.out_head = nn.Linear(cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False)\n",
"\n",
" def forward(self, in_idx):\n",
" batch_size, seq_len = in_idx.shape\n",
" tok_embeds = self.tok_emb(in_idx)\n",
" pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
" x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]\n",
" x = self.drop_emb(x)\n",
" x = self.trf_blocks(x)\n",
" x = self.final_norm(x)\n",
" logits = self.out_head(x)\n",
" return logits"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1d013d32-c275-4f42-be21-9010f1537227",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"torch.manual_seed(123)\n",
"model = GPTModel(GPT_CONFIG_124M)"
]
}
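,
{
"cell_type": "markdown",
"id": "f4b192a0-d36f-4e2c-9a01-829304152637",
"metadata": {},
"source": [
"- As a quick, optional check (not part of the book), the cell below reads back the dropout probabilities of the embedding, attention, and shortcut dropout modules to confirm that the three separate configuration values end up where intended"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05c2a3b1-e470-4f3d-8b12-930415263748",
"metadata": {},
"outputs": [],
"source": [
"# Optional check (not from the book): confirm the separate dropout rates\n",
"# are wired to the intended modules\n",
"print(\"Embedding dropout:\", model.drop_emb.p)\n",
"print(\"Attention dropout:\", model.trf_blocks[0].att.dropout.p)\n",
"print(\"Shortcut dropout:\", model.trf_blocks[0].drop_shortcut.p)"
]
}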
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}