From 6a585e08bc7f32fb1d21b2e45955f22f7f1b0ff5 Mon Sep 17 00:00:00 2001 From: rasbt Date: Mon, 11 Mar 2024 07:07:36 -0500 Subject: [PATCH] Add appendix D --- README.md | 31 +- .../01_main-chapter-code/appendix-D.ipynb | 738 ++++++++++++++++++ .../01_main-chapter-code/previous_chapters.py | 318 ++++++++ .../01_main-chapter-code/the-verdict.txt | 165 ++++ 4 files changed, 1236 insertions(+), 16 deletions(-) create mode 100644 appendix-D/01_main-chapter-code/appendix-D.ipynb create mode 100644 appendix-D/01_main-chapter-code/previous_chapters.py create mode 100644 appendix-D/01_main-chapter-code/the-verdict.txt diff --git a/README.md b/README.md index c223930..fabf0fa 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ This repository contains the code for coding, pretraining, and finetuning a GPT- -In [*Build a Large Language Model (from Scratch)*](http://mng.bz/orYv), you'll discover how LLMs work from the inside out. In this book, I'll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. +In [*Build a Large Language Model (From Scratch)*](http://mng.bz/orYv), you'll discover how LLMs work from the inside out. In this book, I'll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. The method described in this book for training and developing your own small-but-functional model for educational purposes mirrors the approach used in creating large-scale foundational models such as those behind ChatGPT. @@ -31,21 +31,20 @@ Alternatively, you can view this and other files on GitHub at [https://github.co

-| Chapter Title | Main Code (for quick access) | All Code + Supplementary | -|------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------| -| Ch 1: Understanding Large Language Models | No code | No code | -| Ch 2: Working with Text Data | - [ch02.ipynb](ch02/01_main-chapter-code/ch02.ipynb)
- [dataloader.ipynb](ch02/01_main-chapter-code/dataloader.ipynb) (summary)
- [exercise-solutions.ipynb](ch02/01_main-chapter-code/exercise-solutions.ipynb) | [./ch02](./ch02) | -| Ch 3: Coding Attention Mechanisms | - [ch03.ipynb](ch03/01_main-chapter-code/ch03.ipynb)
- [multihead-attention.ipynb](ch03/01_main-chapter-code/multihead-attention.ipynb) (summary)
- [exercise-solutions.ipynb](ch03/01_main-chapter-code/exercise-solutions.ipynb)| [./ch03](./ch03) | -| Ch 4: Implementing a GPT Model from Scratch | - [ch04.ipynb](ch04/01_main-chapter-code/ch04.ipynb)
- [gpt.py](ch04/01_main-chapter-code/gpt.py) (summary)
- [exercise-solutions.ipynb](ch04/01_main-chapter-code/exercise-solutions.ipynb) | [./ch04](./ch04) | -| Ch 5: Pretraining on Unlabeled Data | Q1 2024 | ... | -| Ch 6: Finetuning for Text Classification | Q2 2024 | ... | -| Ch 7: Finetuning with Human Feedback | Q2 2024 | ... | -| Ch 8: Using Large Language Models in Practice | Q2/3 2024 | ... | -| Appendix A: Introduction to PyTorch | - [code-part1.ipynb](appendix-A/03_main-chapter-code/code-part1.ipynb)
- [code-part2.ipynb](appendix-A/03_main-chapter-code/code-part2.ipynb)
- [DDP-script.py](appendix-A/03_main-chapter-code/DDP-script.py)
- [exercise-solutions.ipynb](appendix-A/03_main-chapter-code/exercise-solutions.ipynb) | [./appendix-A](./appendix-A) | -| Appendix B: References and Further Reading | No code | | -| Appendix C: Exercises | No code | | - - +| Chapter Title | Main Code (for quick access) | All Code + Supplementary | +|------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------| +| Ch 1: Understanding Large Language Models | No code | - | +| Ch 2: Working with Text Data | - [ch02.ipynb](ch02/01_main-chapter-code/ch02.ipynb)
- [dataloader.ipynb](ch02/01_main-chapter-code/dataloader.ipynb) (summary)
- [exercise-solutions.ipynb](ch02/01_main-chapter-code/exercise-solutions.ipynb) | [./ch02](./ch02) | +| Ch 3: Coding Attention Mechanisms | - [ch03.ipynb](ch03/01_main-chapter-code/ch03.ipynb)
- [multihead-attention.ipynb](ch03/01_main-chapter-code/multihead-attention.ipynb) (summary)
- [exercise-solutions.ipynb](ch03/01_main-chapter-code/exercise-solutions.ipynb)| [./ch03](./ch03) | +| Ch 4: Implementing a GPT Model from Scratch | - [ch04.ipynb](ch04/01_main-chapter-code/ch04.ipynb)
- [gpt.py](ch04/01_main-chapter-code/gpt.py) (summary)
- [exercise-solutions.ipynb](ch04/01_main-chapter-code/exercise-solutions.ipynb) | [./ch04](./ch04) | +| Ch 5: Pretraining on Unlabeled Data | Q1 2024 | ... | +| Ch 6: Finetuning for Text Classification | Q2 2024 | ... | +| Ch 7: Finetuning with Human Feedback | Q2 2024 | ... | +| Ch 8: Using Large Language Models in Practice | Q2/3 2024 | ... | +| Appendix A: Introduction to PyTorch | - [code-part1.ipynb](appendix-A/03_main-chapter-code/code-part1.ipynb)
- [code-part2.ipynb](appendix-A/03_main-chapter-code/code-part2.ipynb)
- [DDP-script.py](appendix-A/03_main-chapter-code/DDP-script.py)
- [exercise-solutions.ipynb](appendix-A/03_main-chapter-code/exercise-solutions.ipynb) | [./appendix-A](./appendix-A) | +| Appendix B: References and Further Reading | No code | - | +| Appendix C: Exercises | No code | - | +| Appendix D: Adding Bells and Whistles to the Training Loop | - [appendix-D.ipynb](appendix-D/01_main-chapter-code/appendix-D.ipynb) | [./appendix-D](./appendix-D) |
> [!TIP] diff --git a/appendix-D/01_main-chapter-code/appendix-D.ipynb b/appendix-D/01_main-chapter-code/appendix-D.ipynb new file mode 100644 index 0000000..4f34140 --- /dev/null +++ b/appendix-D/01_main-chapter-code/appendix-D.ipynb @@ -0,0 +1,738 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "af53bcb1-ff9d-49c7-a0bc-5b8d32ff975b", + "metadata": {}, + "source": [ + "## Appendix D: Adding Bells and Whistles to the Training Loop" + ] + }, + { + "cell_type": "markdown", + "id": "4f58c142-9434-49af-b33a-356b80a45b86", + "metadata": {}, + "source": [ + "- In this appendix, we add a few more advanced features to the training function, which are used in typical pretraining and finetuning; finetuning is covered in chapters 6 and 7\n", + "- The next three sections below discuss learning rate warmup, cosine decay, and gradient clipping\n", + "- The final section adds these techniques to the training function" + ] + }, + { + "cell_type": "markdown", + "id": "744def4f-c03f-42ee-97bb-5d7d5b89b723", + "metadata": {}, + "source": [ + "- We start by initializing a model reusing the code from chapter 5:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "8755bd5e-bc06-4e6e-9e63-c7c82b816cbe", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "torch version: 2.2.1\n", + "tiktoken version: 0.5.1\n" + ] + } + ], + "source": [ + "from importlib.metadata import version\n", + "import torch\n", + "import tiktoken\n", + "\n", + "print(\"torch version:\", version(\"torch\"))\n", + "print(\"tiktoken version:\", version(\"tiktoken\"))\n", + "\n", + "\n", + "from previous_chapters import GPTModel\n", + "\n", + "GPT_CONFIG_124M = {\n", + " \"vocab_size\": 50257, # Vocabulary size\n", + " \"ctx_len\": 256, # Shortened context length (orig: 1024)\n", + " \"emb_dim\": 768, # Embedding dimension\n", + " \"n_heads\": 12, # Number of attention heads\n", + " \"n_layers\": 12, # Number of layers\n", + " \"drop_rate\": 0.1, 
# Dropout rate\n", + " \"qkv_bias\": False # Query-key-value bias\n", + "}\n", + "\n", + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "\n", + "torch.manual_seed(123)\n", + "model = GPTModel(GPT_CONFIG_124M)\n", + "model.eval(); # Disable dropout during inference" + ] + }, + { + "cell_type": "markdown", + "id": "51574e57-a098-412c-83e8-66dafa5a0b99", + "metadata": {}, + "source": [ + "- Next, using the same code we used in chapter 5, we initialize the data loaders:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "386ca110-2bb4-42f1-bd54-8836df80acaa", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import urllib.request\n", + "\n", + "file_path = \"the-verdict.txt\"\n", + "url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n", + "\n", + "if not os.path.exists(file_path):\n", + " with urllib.request.urlopen(url) as response:\n", + " text_data = response.read().decode('utf-8')\n", + " with open(file_path, \"w\", encoding=\"utf-8\") as file:\n", + " file.write(text_data)\n", + "else:\n", + " with open(file_path, \"r\", encoding=\"utf-8\") as file:\n", + " text_data = file.read()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ae96992b-536a-4684-a924-658b9ffb7e9c", + "metadata": {}, + "outputs": [], + "source": [ + "from previous_chapters import create_dataloader_v1\n", + "\n", + "# Train/validation ratio\n", + "train_ratio = 0.90\n", + "split_idx = int(train_ratio * len(text_data))\n", + "\n", + "\n", + "torch.manual_seed(123)\n", + "\n", + "train_loader = create_dataloader_v1(\n", + " text_data[:split_idx],\n", + " batch_size=2,\n", + " max_length=GPT_CONFIG_124M[\"ctx_len\"],\n", + " stride=GPT_CONFIG_124M[\"ctx_len\"],\n", + " drop_last=True,\n", + " shuffle=True\n", + ")\n", + "\n", + "val_loader = create_dataloader_v1(\n", + " text_data[split_idx:],\n", + " batch_size=2,\n", + " 
max_length=GPT_CONFIG_124M[\"ctx_len\"],\n", + " stride=GPT_CONFIG_124M[\"ctx_len\"],\n", + " drop_last=False,\n", + " shuffle=False\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "939c08d8-257a-41c6-b842-019f7897ac74", + "metadata": {}, + "source": [ + "## D.1 Learning rate warmup" + ] + }, + { + "cell_type": "markdown", + "id": "7fafcd30-ddf7-4a9f-bcf4-b13c052b3133", + "metadata": {}, + "source": [ + "- When training complex models like LLMs, implementing learning rate warmup can help stabilize the training\n", + "- In learning rate warmup, we gradually increase the learning rate from a very low value (`initial_lr`) to a user-specified maximum (`peak_lr`)\n", + "- This way, the model will start the training with small weight updates, which helps decrease the risk of large destabilizing updates during the training" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "2bb4790b-b8b6-4e9e-adf4-704a04b31ddf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "135\n" + ] + } + ], + "source": [ + "n_epochs = 15\n", + "peak_lr = 0.01\n", + "initial_lr = 0.0001\n", + "\n", + "optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)\n", + "total_training_steps = len(train_loader) * n_epochs\n", + "\n", + "print(total_training_steps)" + ] + }, + { + "cell_type": "markdown", + "id": "5bf3a8da-abc4-4b80-a5d8-f1cc1c7cc5f3", + "metadata": {}, + "source": [ + "- Typically, the number of warmup steps is between 10% and 20% of the total number of steps\n", + "- We can compute the increment as the difference between the `peak_lr` and `initial_lr` divided by the number of warmup steps" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "e075f80e-a398-4809-be1d-8019e1d31c90", + "metadata": {}, + "outputs": [], + "source": [ + "warmup_steps = 20\n", + "lr_increment = (peak_lr - initial_lr) / warmup_steps\n", + "\n", + "global_step = -1\n", + "track_lrs = []\n", + "\n", + 
"for epoch in range(n_epochs):\n", + " for input_batch, target_batch in train_loader:\n", + " optimizer.zero_grad()\n", + " global_step += 1\n", + " \n", + " if global_step < warmup_steps:\n", + " lr = initial_lr + global_step * lr_increment\n", + " else:\n", + " lr = peak_lr\n", + " \n", + " # Apply the calculated learning rate to the optimizer\n", + " for param_group in optimizer.param_groups:\n", + " param_group[\"lr\"] = lr\n", + " track_lrs.append(optimizer.param_groups[0][\"lr\"])\n", + " \n", + " # Calculate loss and update weights" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "cb6da121-eeed-4023-bdd8-3666c594b4ed", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.ylabel(\"Learning rate\")\n", + "plt.xlabel(\"Step\")\n", + "plt.plot(range(total_training_steps), track_lrs);" + ] + }, + { + "cell_type": "markdown", + "id": "7b3996b6-3f7a-420a-8584-c5760249f3d8", + "metadata": {}, + "source": [ + "## D.2 Cosine decay" + ] + }, + { + "cell_type": "markdown", + "id": "c5216214-de79-40cf-a733-b1049a73023c", + "metadata": {}, + "source": [ + "- Another popular technique for training complex deep neural networks is cosine decay, which also adjusts the learning rate across training epochs\n", + "- In cosine decay, the learning rate follows a cosine curve, decreasing from its initial value to near zero following a half-cosine cycle\n", + "- This gradual reduction is designed to slow the pace of learning as the model begins to improve its weights; it reduces the risk of overshooting minima as the training progresses, which is crucial for stabilizing the training in its later stages\n", + "- Cosine decay is often preferred over linear decay for its smoother transition in learning rate adjustments, but linear decay is also used in practice (for example, [OLMo: Accelerating the Science of Language Models](https://arxiv.org/abs/2402.00838))" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "4e8d2068-a057-4abf-b478-f02cc37191f6", + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "\n", + "min_lr = 0.1 * initial_lr\n", + "track_lrs = []\n", + "\n", + "lr_increment = (peak_lr - initial_lr) / warmup_steps\n", + "global_step = -1\n", + "\n", + "for epoch in range(n_epochs):\n", + " for input_batch, target_batch in train_loader:\n", + " optimizer.zero_grad()\n", + " global_step += 1\n", + " \n", + " # Adjust the learning rate based on the current phase (warmup or cosine annealing)\n", + " if global_step < warmup_steps:\n", + " # Linear warmup\n", + " lr = initial_lr + 
global_step * lr_increment \n", + " else:\n", + " # Cosine annealing after warmup\n", + " progress = ((global_step - warmup_steps) / \n", + " (total_training_steps - warmup_steps))\n", + " lr = min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))\n", + " \n", + " # Apply the calculated learning rate to the optimizer\n", + " for param_group in optimizer.param_groups:\n", + " param_group[\"lr\"] = lr\n", + " track_lrs.append(optimizer.param_groups[0][\"lr\"])\n", + " \n", + " # Calculate loss and update weights" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "0e779e33-8a44-4984-bb23-be0603dc4158", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.ylabel(\"Learning rate\")\n", + "plt.xlabel(\"Step\")\n", + "plt.plot(range(total_training_steps), track_lrs);" + ] + }, + { + "cell_type": "markdown", + "id": "e7512808-b48d-4146-86a1-5931b1e3aec1", + "metadata": {}, + "source": [ + "## D.3 Gradient clipping" + ] + }, + { + "cell_type": "markdown", + "id": "c0a74f76-8d2b-4974-a03c-d645445cdc21", + "metadata": {}, + "source": [ + "- Gradient clipping is yet another technique used to stabilize LLM training\n", + "- By setting a threshold, gradients exceeding this limit are scaled down to a predetermined maximum magnitude to ensure that the updates to the model's parameters during training remain within a manageable range\n", + "- For instance, using the `max_norm=1.0` setting in PyTorch's `clip_grad_norm_` function means that the gradients are rescaled such that their norm does not exceed 1.0\n", + "- The \"norm\" refers to a measure of the gradient vector's length (or magnitude) in the parameter space of the model\n", + "- Specifically, it's the L2 norm, also known as the Euclidean norm\n", + "- Mathematically, for a vector $\\mathbf{v}$ with components $\\mathbf{v} = [v_1, v_2, \\ldots, v_n]$, the L2 norm is defined as:\n", + "$$\n", + "\\| \\mathbf{v} \\|_2 = \\sqrt{v_1^2 + v_2^2 + \\ldots + v_n^2}\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "id": "d44838a6-4322-47b2-a935-c00d3a88355f", + "metadata": {}, + "source": [ + "- The L2 norm is calculated similarly for matrices, treating all entries as one flattened vector (this is also known as the Frobenius norm).\n", + "- Let's assume our gradient matrix is:\n", + "$$\n", + "G = \\begin{bmatrix}\n", + "1 & 2 \\\\\n", + "2 & 4\n", + "\\end{bmatrix}\n", + "$$\n", + "\n", + "- And we want to clip these gradients with a `max_norm` of 1.\n", + "\n", + "- First, we calculate the L2 norm of these gradients:\n", + "$$\n", + "\\|G\\|_2 = \\sqrt{1^2 + 2^2 + 2^2 + 4^2} = \\sqrt{25} = 5\n", + "$$\n", + "\n", + "- Since $\\|G\\|_2 = 5$ is
greater than our `max_norm` of 1, we need to scale down the gradients so that their norm is exactly 1. The scaling factor is calculated as $\\frac{max\\_norm}{\\|G\\|_2} = \\frac{1}{5}$.\n", + "\n", + "- Therefore, the scaled gradient matrix $G'$ will be as follows:\n", + "$$\n", + "G' = \\frac{1}{5} \\times G = \\begin{bmatrix}\n", + "\\frac{1}{5} & \\frac{2}{5} \\\\\n", + "\\frac{2}{5} & \\frac{4}{5}\n", + "\\end{bmatrix}\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "id": "eeb0c3c1-2cff-46f5-8127-24412184428c", + "metadata": {}, + "source": [ + "- Let's see this in action\n", + "- First, we initialize a new model and calculate the loss for a training batch like we would do in the regular training loop" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e199e1ff-58c4-413a-855e-5edbe9292649", + "metadata": {}, + "outputs": [], + "source": [ + "from previous_chapters import calc_loss_batch\n", + "\n", + "torch.manual_seed(123)\n", + "model = GPTModel(GPT_CONFIG_124M)\n", + "\n", + "loss = calc_loss_batch(input_batch, target_batch, model, device)\n", + "loss.backward()" + ] + }, + { + "cell_type": "markdown", + "id": "76b60f3a-15ec-4846-838d-fdef3df99899", + "metadata": {}, + "source": [ + "- If we call `.backward()`, PyTorch will calculate the gradients and store them in a `.grad` attribute for each weight (parameter) matrix\n", + "- Let's define a utility function to calculate the highest gradient based on all model weights" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e70729a3-24d1-411d-a002-2529cd3a8a9e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "tensor(0.0373)\n" + ] + } + ], + "source": [ + "def find_highest_gradient(model):\n", + " max_grad = None\n", + " for param in model.parameters():\n", + " if param.grad is not None:\n", + " grad_values = param.grad.data.flatten()\n", + " max_grad_param = grad_values.max()\n", + " if max_grad is None or 
max_grad_param > max_grad:\n", + " max_grad = max_grad_param\n", + " return max_grad\n", + "\n", + "print(find_highest_gradient(model))" + ] + }, + { + "cell_type": "markdown", + "id": "734f30e6-6b24-4d4b-ae91-e9a4b871113f", + "metadata": {}, + "source": [ + "- Applying gradient clipping, we can see that the largest gradient is now substantially smaller:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "fa81ef8b-4280-400f-a93e-5210f3e62ff0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "tensor(0.0166)\n" + ] + } + ], + "source": [ + "torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n", + "print(find_highest_gradient(model))" + ] + }, + { + "cell_type": "markdown", + "id": "b62c2af0-dac3-4742-be4b-4292c6753099", + "metadata": {}, + "source": [ + "## D.4 The modified training function" + ] + }, + { + "cell_type": "markdown", + "id": "76715332-94ec-4185-922a-75cb420819d5", + "metadata": {}, + "source": [ + "- Now let's add the three concepts covered above (learning rate warmup, cosine decay, and gradient clipping) to the `train_model_simple` function covered in chapter 5 to create the more sophisticated `train_model` function below:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "46eb9c84-a293-4016-a523-7ad726e171e9", + "metadata": {}, + "outputs": [], + "source": [ + "from previous_chapters import evaluate_model, generate_and_print_sample\n", + "\n", + "\n", + "def train_model(model, train_loader, val_loader, optimizer, device, n_epochs,\n", + " eval_freq, eval_iter, start_context, warmup_steps=10,\n", + " initial_lr=3e-05, min_lr=1e-6):\n", + "\n", + " train_losses, val_losses, track_tokens_seen, track_lrs = [], [], [], []\n", + " tokens_seen, global_step = 0, -1\n", + "\n", + " # Retrieve the maximum learning rate from the optimizer\n", + " peak_lr = optimizer.param_groups[0][\"lr\"]\n", + "\n", + " # Calculate the total number of iterations in the 
training process\n", + " total_training_steps = len(train_loader) * n_epochs\n", + "\n", + " # Calculate the learning rate increment during the warmup phase\n", + " lr_increment = (peak_lr - initial_lr) / warmup_steps\n", + "\n", + " for epoch in range(n_epochs):\n", + " model.train()\n", + " for input_batch, target_batch in train_loader:\n", + " optimizer.zero_grad()\n", + " global_step += 1\n", + "\n", + " # Adjust the learning rate based on the current phase (warmup or cosine annealing)\n", + " if global_step < warmup_steps:\n", + " # Linear warmup\n", + " lr = initial_lr + global_step * lr_increment \n", + " else:\n", + " # Cosine annealing after warmup\n", + " progress = ((global_step - warmup_steps) / \n", + " (total_training_steps - warmup_steps))\n", + " lr = min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))\n", + "\n", + " # Apply the calculated learning rate to the optimizer\n", + " for param_group in optimizer.param_groups:\n", + " param_group[\"lr\"] = lr\n", + " track_lrs.append(lr) # Store the current learning rate\n", + "\n", + " # Calculate and backpropagate the loss\n", + " loss = calc_loss_batch(input_batch, target_batch, model, device)\n", + " loss.backward()\n", + "\n", + " # Apply gradient clipping after the warmup phase to avoid exploding gradients\n", + " if global_step > warmup_steps:\n", + " torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n", + " \n", + " optimizer.step()\n", + " tokens_seen += input_batch.numel()\n", + "\n", + " # Periodically evaluate the model on the training and validation sets\n", + " if global_step % eval_freq == 0:\n", + " train_loss, val_loss = evaluate_model(\n", + " model, train_loader, val_loader,\n", + " device, eval_iter\n", + " )\n", + " train_losses.append(train_loss)\n", + " val_losses.append(val_loss)\n", + " track_tokens_seen.append(tokens_seen)\n", + " # Print the current losses\n", + " print(f\"Ep {epoch+1} (Iter {global_step:06d}): \"\n", + " f\"Train loss 
{train_loss:.3f}, Val loss {val_loss:.3f}\")\n", + "\n", + " # Generate and print a sample from the model to monitor progress\n", + " generate_and_print_sample(\n", + " model, train_loader.dataset.tokenizer,\n", + " device, start_context\n", + " )\n", + "\n", + " return train_losses, val_losses, track_tokens_seen, track_lrs" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "55fcd247-ba9d-4b93-a757-0f7ce04fee41", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Ep 1 (Iter 000000): Train loss 10.914, Val loss 10.940\n", + "Ep 1 (Iter 000005): Train loss 8.903, Val loss 9.313\n", + "Every effort moves you,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\n", + "Ep 2 (Iter 000010): Train loss 7.362, Val loss 7.789\n", + "Ep 2 (Iter 000015): Train loss 6.273, Val loss 6.814\n", + "Every effort moves you,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\n", + "Ep 3 (Iter 000020): Train loss 5.958, Val loss 6.609\n", + "Ep 3 (Iter 000025): Train loss 5.675, Val loss 6.592\n", + "Every effort moves you. \n", + "Ep 4 (Iter 000030): Train loss 5.607, Val loss 6.565\n", + "Ep 4 (Iter 000035): Train loss 5.063, Val loss 6.483\n", + "Every effort moves you, and, and the to the to the to the to the to the, and, and the, and the, and, and, and the, and the, and, and the, and, and, and the, and, and the\n", + "Ep 5 (Iter 000040): Train loss 4.384, Val loss 6.379\n", + "Every effort moves you, I was, and I had been. \"I, I had the picture, as a little's his pictures, I had been, I was his\n", + "Ep 6 (Iter 000045): Train loss 4.638, Val loss 6.306\n", + "Ep 6 (Iter 000050): Train loss 3.690, Val loss 6.196\n", + "Every effort moves you know the to me a little of his pictures--I had been. \"I was the's--and, I felt to see a little of his pictures--I had been. \"I of Jack's \"strong. 
\"I\n", + "Ep 7 (Iter 000055): Train loss 3.157, Val loss 6.148\n", + "Ep 7 (Iter 000060): Train loss 2.498, Val loss 6.157\n", + "Every effort moves you know it was not that, and he was to the fact of the of a and he was--his, the fact of the donkey, in the of the his head to have. \"I had been his pictures--and by his\n", + "Ep 8 (Iter 000065): Train loss 2.182, Val loss 6.178\n", + "Ep 8 (Iter 000070): Train loss 1.998, Val loss 6.193\n", + "Every effort moves you know,\" was not that my dear, his pictures--so handsome, in a so that he was a year after Jack's resolve had been his painting. \"Oh, I had the donkey. \"There were, with his\n", + "Ep 9 (Iter 000075): Train loss 1.824, Val loss 6.211\n", + "Ep 9 (Iter 000080): Train loss 1.742, Val loss 6.201\n", + "Every effort moves you know,\" was not that my hostess was \"interesting\": on that point I could have given Miss Croft the fact, and. \"Oh, as I turned, and down the room, in his\n", + "Ep 10 (Iter 000085): Train loss 1.285, Val loss 6.234\n", + "Every effort moves you?\" \"Yes--quite insensible to the fact with a little: \"Yes--and by me to me to have to see a smile behind his close grayish beard--as if he had the donkey. \"There were days when I\n", + "Ep 11 (Iter 000090): Train loss 1.256, Val loss 6.236\n", + "Ep 11 (Iter 000095): Train loss 0.803, Val loss 6.255\n", + "Every effort moves you?\" \"Yes--quite insensible to the irony. She wanted him vindicated--and by me!\" He laughed again, and threw back his head to look up at the sketch of the donkey. \"There were days when I\n", + "Ep 12 (Iter 000100): Train loss 0.731, Val loss 6.284\n", + "Ep 12 (Iter 000105): Train loss 0.889, Val loss 6.299\n", + "Every effort moves you?\" \"Yes--quite insensible to the irony. She wanted him vindicated--and by me!\" He laughed again, and threw back his head to look up at the sketch of the donkey. 
\"There were days when I\n", + "Ep 13 (Iter 000110): Train loss 0.703, Val loss 6.316\n", + "Ep 13 (Iter 000115): Train loss 0.517, Val loss 6.315\n", + "Every effort moves you?\" \"Yes--quite insensible to the irony. She wanted him vindicated--and by me!\" He laughed again, and threw back his head to look up at the sketch of the donkey. \"There were days when I\n", + "Ep 14 (Iter 000120): Train loss 0.594, Val loss 6.324\n", + "Ep 14 (Iter 000125): Train loss 0.481, Val loss 6.325\n", + "Every effort moves you?\" \"Yes--quite insensible to the irony. She wanted him vindicated--and by me!\" He laughed again, and threw back his head to look up at the sketch of the donkey. \"There were days when I\n", + "Ep 15 (Iter 000130): Train loss 0.529, Val loss 6.324\n", + "Every effort moves you?\" \"Yes--quite insensible to the irony. She wanted him vindicated--and by me!\" He laughed again, and threw back his head to look up at the sketch of the donkey. \"There were days when I\n" + ] + } + ], + "source": [ + "torch.manual_seed(123)\n", + "model = GPTModel(GPT_CONFIG_124M)\n", + "model.to(device)\n", + "\n", + "peak_lr = 5e-4\n", + "optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)\n", + "\n", + "n_epochs = 15\n", + "train_losses, val_losses, tokens_seen, lrs = train_model(\n", + " model, train_loader, val_loader, optimizer, device, n_epochs=n_epochs,\n", + " eval_freq=5, eval_iter=1, start_context=\"Every effort moves you\",\n", + " warmup_steps=10, initial_lr=1e-5, min_lr=1e-5\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "827e8d5e-0872-4b90-98ac-200c80ee2d53", + "metadata": {}, + "source": [ + "- Looking at the results above, we can see that the model starts out generating incomprehensible strings of words, whereas, towards the end, it's able to produce grammatically more or less correct sentences\n", + "- If we were to check a few passages it writes towards the end, we would find that they are contained in the training set 
verbatim -- it simply memorizes the training data\n", + "- Note that the overfitting here occurs because we have a very, very small training set, and we iterate over it so many times\n", + " - The LLM training here primarily serves educational purposes; we mainly want to see that the model can learn to produce coherent text\n", + " - Instead of spending weeks or months training this model on vast amounts of expensive hardware, in practice we would load pretrained weights" + ] + }, + { + "cell_type": "markdown", + "id": "9decec45-4fdf-4ff6-85a7-1806613f8af7", + "metadata": {}, + "source": [ + "- A quick check that the learning rate behaves as intended" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "d8ebb8d2-8308-4a83-a2a6-730c3bf84452", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(range(len(lrs)), lrs)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e80ac790-f9c3-45b8-9ea4-d2e5bf8fbf28", + "metadata": {}, + "outputs": [], + "source": [ + "# And a quick look at the loss curves" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "445d8155-6eae-4b50-a381-d0820ebc27cc", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from previous_chapters import plot_losses\n", + "\n", + "epochs_tensor = torch.linspace(1, n_epochs, len(train_losses))\n", + "plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/appendix-D/01_main-chapter-code/previous_chapters.py b/appendix-D/01_main-chapter-code/previous_chapters.py new file mode 100644 index 0000000..ba18f50 --- /dev/null +++ b/appendix-D/01_main-chapter-code/previous_chapters.py @@ -0,0 +1,318 @@ +# This file collects all the relevant code that we covered thus far +# throughout Chapters 2-5. +# This file can be run as a standalone script.
+ +import tiktoken +import torch +import torch.nn as nn +from torch.utils.data import Dataset, DataLoader +import matplotlib.pyplot as plt + + +##################################### +# Chapter 2 +##################################### + +class GPTDatasetV1(Dataset): + def __init__(self, txt, tokenizer, max_length, stride): + self.tokenizer = tokenizer + self.input_ids = [] + self.target_ids = [] + + # Tokenize the entire text + token_ids = tokenizer.encode(txt) + + # Use a sliding window to chunk the book into overlapping sequences of max_length + for i in range(0, len(token_ids) - max_length, stride): + input_chunk = token_ids[i:i + max_length] + target_chunk = token_ids[i + 1: i + max_length + 1] + self.input_ids.append(torch.tensor(input_chunk)) + self.target_ids.append(torch.tensor(target_chunk)) + + def __len__(self): + return len(self.input_ids) + + def __getitem__(self, idx): + return self.input_ids[idx], self.target_ids[idx] + + +def create_dataloader_v1(txt, batch_size=4, max_length=256, + stride=128, shuffle=True, drop_last=True): + # Initialize the tokenizer + tokenizer = tiktoken.get_encoding("gpt2") + + # Create dataset + dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) + + # Create dataloader + dataloader = DataLoader( + dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last) + + return dataloader + + +##################################### +# Chapter 3 +##################################### + +class MultiHeadAttention(nn.Module): + def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False): + super().__init__() + assert d_out % num_heads == 0, "d_out must be divisible by n_heads" + + self.d_out = d_out + self.num_heads = num_heads + self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim + + self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) + self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) + self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) + self.out_proj = 
nn.Linear(d_out, d_out) # Linear layer to combine head outputs + self.dropout = nn.Dropout(dropout) + self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1)) + + def forward(self, x): + b, num_tokens, d_in = x.shape + + keys = self.W_key(x) # Shape: (b, num_tokens, d_out) + queries = self.W_query(x) + values = self.W_value(x) + + # We implicitly split the matrix by adding a `num_heads` dimension + # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) + keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) + values = values.view(b, num_tokens, self.num_heads, self.head_dim) + queries = queries.view(b, num_tokens, self.num_heads, self.head_dim) + + # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim) + keys = keys.transpose(1, 2) + queries = queries.transpose(1, 2) + values = values.transpose(1, 2) + + # Compute scaled dot-product attention (aka self-attention) with a causal mask + attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head + + # Original mask truncated to the number of tokens and converted to boolean + mask_bool = self.mask.bool()[:num_tokens, :num_tokens] + + # Use the mask to fill attention scores + attn_scores.masked_fill_(mask_bool, -torch.inf) + + attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) + attn_weights = self.dropout(attn_weights) + + # Shape: (b, num_tokens, num_heads, head_dim) + context_vec = (attn_weights @ values).transpose(1, 2) + + # Combine heads, where self.d_out = self.num_heads * self.head_dim + context_vec = context_vec.reshape(b, num_tokens, self.d_out) + context_vec = self.out_proj(context_vec) # optional projection + + return context_vec + + +##################################### +# Chapter 4 +##################################### + +class LayerNorm(nn.Module): + def __init__(self, emb_dim): + super().__init__() + self.eps = 1e-5 + self.scale = nn.Parameter(torch.ones(emb_dim)) + 
self.shift = nn.Parameter(torch.zeros(emb_dim)) + + def forward(self, x): + mean = x.mean(dim=-1, keepdim=True) + var = x.var(dim=-1, keepdim=True, unbiased=False) + norm_x = (x - mean) / torch.sqrt(var + self.eps) + return self.scale * norm_x + self.shift + + +class GELU(nn.Module): + def __init__(self): + super().__init__() + + def forward(self, x): + return 0.5 * x * (1 + torch.tanh( + torch.sqrt(torch.tensor(2.0 / torch.pi)) * + (x + 0.044715 * torch.pow(x, 3)) + )) + + +class FeedForward(nn.Module): + def __init__(self, cfg): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]), + GELU(), + nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]), + nn.Dropout(cfg["drop_rate"]) + ) + + def forward(self, x): + return self.layers(x) + + +class TransformerBlock(nn.Module): + def __init__(self, cfg): + super().__init__() + self.att = MultiHeadAttention( + d_in=cfg["emb_dim"], + d_out=cfg["emb_dim"], + block_size=cfg["ctx_len"], + num_heads=cfg["n_heads"], + dropout=cfg["drop_rate"], + qkv_bias=cfg["qkv_bias"]) + self.ff = FeedForward(cfg) + self.norm1 = LayerNorm(cfg["emb_dim"]) + self.norm2 = LayerNorm(cfg["emb_dim"]) + self.drop_resid = nn.Dropout(cfg["drop_rate"]) + + def forward(self, x): + # Shortcut connection for attention block + shortcut = x + x = self.norm1(x) + x = self.att(x) # Shape [batch_size, num_tokens, emb_size] + x = self.drop_resid(x) + x = x + shortcut # Add the original input back + + # Shortcut connection for feed-forward block + shortcut = x + x = self.norm2(x) + x = self.ff(x) + x = self.drop_resid(x) + x = x + shortcut # Add the original input back + + return x + + +class GPTModel(nn.Module): + def __init__(self, cfg): + super().__init__() + self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"]) + self.pos_emb = nn.Embedding(cfg["ctx_len"], cfg["emb_dim"]) + self.drop_emb = nn.Dropout(cfg["drop_rate"]) + + self.trf_blocks = nn.Sequential( + *[TransformerBlock(cfg) for _ in 
range(cfg["n_layers"])]) + + self.final_norm = LayerNorm(cfg["emb_dim"]) + self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False) + + def forward(self, in_idx): + batch_size, seq_len = in_idx.shape + tok_embeds = self.tok_emb(in_idx) + pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) + x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size] + x = self.drop_emb(x) + x = self.trf_blocks(x) + x = self.final_norm(x) + logits = self.out_head(x) + return logits + + +def generate_text_simple(model, idx, max_new_tokens, context_size): + # idx is (B, T) array of indices in the current context + for _ in range(max_new_tokens): + + # Crop current context if it exceeds the supported context size + # E.g., if LLM supports only 5 tokens, and the context size is 10 + # then only the last 5 tokens are used as context + idx_cond = idx[:, -context_size:] + + # Get the predictions + with torch.no_grad(): + logits = model(idx_cond) + + # Focus only on the last time step + # (batch, n_token, vocab_size) becomes (batch, vocab_size) + logits = logits[:, -1, :] + + # Get the idx of the vocab entry with the highest logit value (greedy decoding) + idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch, 1) + + # Append the selected index to the running sequence + idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1) + + return idx + + +##################################### +# Chapter 5 +##################################### + + +def calc_loss_batch(input_batch, target_batch, model, device): + input_batch, target_batch = input_batch.to(device), target_batch.to(device) + + logits = model(input_batch) + logits = logits.view(-1, logits.size(-1)) + loss = torch.nn.functional.cross_entropy(logits, target_batch.view(-1)) + return loss + + +def calc_loss_loader(data_loader, model, device, num_batches=None): + total_loss, batches_seen = 0., 0. 
+ if num_batches is None: + num_batches = len(data_loader) + for i, (input_batch, target_batch) in enumerate(data_loader): + if i < num_batches: + loss = calc_loss_batch(input_batch, target_batch, model, device) + total_loss += loss.item() + batches_seen += 1 + else: + break + return total_loss / batches_seen + + +def evaluate_model(model, train_loader, val_loader, device, eval_iter): + model.eval() + with torch.no_grad(): + train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter) + val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter) + model.train() + return train_loss, val_loss + + +def generate_and_print_sample(model, tokenizer, device, start_context): + model.eval() + context_size = model.pos_emb.weight.shape[0] + encoded = text_to_token_ids(start_context, tokenizer).to(device) + with torch.no_grad(): + token_ids = generate_text_simple( + model=model, idx=encoded, + max_new_tokens=50, context_size=context_size) + decoded_text = token_ids_to_text(token_ids, tokenizer) + print(decoded_text.replace("\n", " ")) # Compact print format + model.train() + + +def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses): + fig, ax1 = plt.subplots() + + # Plot training and validation loss against epochs + ax1.plot(epochs_seen, train_losses, label="Training loss") + ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss") + ax1.set_xlabel("Epochs") + ax1.set_ylabel("Loss") + ax1.legend(loc="upper right") + + # Create a second x-axis for tokens seen + ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis + ax2.plot(tokens_seen, train_losses, alpha=0) # Invisible plot for aligning ticks + ax2.set_xlabel("Tokens seen") + + fig.tight_layout() # Adjust layout to make room + plt.show() + + +def text_to_token_ids(text, tokenizer): + encoded = tokenizer.encode(text) + encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension + return encoded_tensor + + +def 
token_ids_to_text(token_ids, tokenizer): + flat = token_ids.squeeze(0) # remove batch dimension + return tokenizer.decode(flat.tolist()) \ No newline at end of file diff --git a/appendix-D/01_main-chapter-code/the-verdict.txt b/appendix-D/01_main-chapter-code/the-verdict.txt new file mode 100644 index 0000000..6b651c7 --- /dev/null +++ b/appendix-D/01_main-chapter-code/the-verdict.txt @@ -0,0 +1,165 @@ +I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.) + +"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"? + +Well!--even through the prism of Hermia's tears I felt able to face the fact with equanimity. Poor Jack Gisburn! The women had made him--it was fitting that they should mourn him. Among his own sex fewer regrets were heard, and in his own trade hardly a murmur. Professional jealousy? Perhaps. If it were, the honour of the craft was vindicated by little Claude Nutley, who, in all good faith, brought out in the Burlington a very handsome "obituary" on Jack--one of those showy articles stocked with random technicalities that I have heard (I won't say by whom) compared to Gisburn's painting. 
And so--his resolve being apparently irrevocable--the discussion gradually died out, and, as Mrs. Thwing had predicted, the price of "Gisburns" went up. + +It was not till three years later that, in the course of a few weeks' idling on the Riviera, it suddenly occurred to me to wonder why Gisburn had given up his painting. On reflection, it really was a tempting problem. To accuse his wife would have been too easy--his fair sitters had been denied the solace of saying that Mrs. Gisburn had "dragged him down." For Mrs. Gisburn--as such--had not existed till nearly a year after Jack's resolve had been taken. It might be that he had married her--since he liked his ease--because he didn't want to go on painting; but it would have been hard to prove that he had given up his painting because he had married her. + +Of course, if she had not dragged him down, she had equally, as Miss Croft contended, failed to "lift him up"--she had not led him back to the easel. To put the brush into his hand again--what a vocation for a wife! But Mrs. Gisburn appeared to have disdained it--and I felt it might be interesting to find out why. + +The desultory life of the Riviera lends itself to such purely academic speculations; and having, on my way to Monte Carlo, caught a glimpse of Jack's balustraded terraces between the pines, I had myself borne thither the next day. + +I found the couple at tea beneath their palm-trees; and Mrs. Gisburn's welcome was so genial that, in the ensuing weeks, I claimed it frequently. It was not that my hostess was "interesting": on that point I could have given Miss Croft the fullest reassurance. It was just because she was _not_ interesting--if I may be pardoned the bull--that I found her so. For Jack, all his life, had been surrounded by interesting women: they had fostered his art, it had been reared in the hot-house of their adulation. 
And it was therefore instructive to note what effect the "deadening atmosphere of mediocrity" (I quote Miss Croft) was having on him. + +I have mentioned that Mrs. Gisburn was rich; and it was immediately perceptible that her husband was extracting from this circumstance a delicate but substantial satisfaction. It is, as a rule, the people who scorn money who get most out of it; and Jack's elegant disdain of his wife's big balance enabled him, with an appearance of perfect good-breeding, to transmute it into objects of art and luxury. To the latter, I must add, he remained relatively indifferent; but he was buying Renaissance bronzes and eighteenth-century pictures with a discrimination that bespoke the amplest resources. + +"Money's only excuse is to put beauty into circulation," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gisburn, beaming on him, added for my enlightenment: "Jack is so morbidly sensitive to every form of beauty." + +Poor Jack! It had always been his fate to have women say such things of him: the fact should be set down in extenuation. What struck me now was that, for the first time, he resented the tone. I had seen him, so often, basking under similar tributes--was it the conjugal note that robbed them of their savour? No--for, oddly enough, it became apparent that he was fond of Mrs. Gisburn--fond enough not to see her absurdity. It was his own absurdity he seemed to be wincing under--his own attitude as an object for garlands and incense. + +"My dear, since I've chucked painting people don't say that stuff about me--they say it about Victor Grindle," was his only protest, as he rose from the table and strolled out onto the sunlit terrace. + +I glanced after him, struck by his last word. Victor Grindle was, in fact, becoming the man of the moment--as Jack himself, one might put it, had been the man of the hour. 
The younger artist was said to have formed himself at my friend's feet, and I wondered if a tinge of jealousy underlay the latter's mysterious abdication. But no--for it was not till after that event that the _rose Dubarry_ drawing-rooms had begun to display their "Grindles." + +I turned to Mrs. Gisburn, who had lingered to give a lump of sugar to her spaniel in the dining-room. + +"Why _has_ he chucked painting?" I asked abruptly. + +She raised her eyebrows with a hint of good-humoured surprise. + +"Oh, he doesn't _have_ to now, you know; and I want him to enjoy himself," she said quite simply. + +I looked about the spacious white-panelled room, with its _famille-verte_ vases repeating the tones of the pale damask curtains, and its eighteenth-century pastels in delicate faded frames. + +"Has he chucked his pictures too? I haven't seen a single one in the house." + +A slight shade of constraint crossed Mrs. Gisburn's open countenance. "It's his ridiculous modesty, you know. He says they're not fit to have about; he's sent them all away except one--my portrait--and that I have to keep upstairs." + +His ridiculous modesty--Jack's modesty about his pictures? My curiosity was growing like the bean-stalk. I said persuasively to my hostess: "I must really see your portrait, you know." + +She glanced out almost timorously at the terrace where her husband, lounging in a hooded chair, had lit a cigar and drawn the Russian deerhound's head between his knees. + +"Well, come while he's not looking," she said, with a laugh that tried to hide her nervousness; and I followed her between the marble Emperors of the hall, and up the wide stairs with terra-cotta nymphs poised among flowers at each landing. + +In the dimmest corner of her boudoir, amid a profusion of delicate and distinguished objects, hung one of the familiar oval canvases, in the inevitable garlanded frame. The mere outline of the frame called up all Gisburn's past! + +Mrs. 
Gisburn drew back the window-curtains, moved aside a _jardiniere_ full of pink azaleas, pushed an arm-chair away, and said: "If you stand here you can just manage to see it. I had it over the mantel-piece, but he wouldn't let it stay." + +Yes--I could just manage to see it--the first portrait of Jack's I had ever had to strain my eyes over! Usually they had the place of honour--say the central panel in a pale yellow or _rose Dubarry_ drawing-room, or a monumental easel placed so that it took the light through curtains of old Venetian point. The more modest place became the picture better; yet, as my eyes grew accustomed to the half-light, all the characteristic qualities came out--all the hesitations disguised as audacities, the tricks of prestidigitation by which, with such consummate skill, he managed to divert attention from the real business of the picture to some pretty irrelevance of detail. Mrs. Gisburn, presenting a neutral surface to work on--forming, as it were, so inevitably the background of her own picture--had lent herself in an unusual degree to the display of this false virtuosity. The picture was one of Jack's "strongest," as his admirers would have put it--it represented, on his part, a swelling of muscles, a congesting of veins, a balancing, straddling and straining, that reminded one of the circus-clown's ironic efforts to lift a feather. It met, in short, at every point the demand of lovely woman to be painted "strongly" because she was tired of being painted "sweetly"--and yet not to lose an atom of the sweetness. + +"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride. "The last but one," she corrected herself--"but the other doesn't count, because he destroyed it." + +"Destroyed it?" I was about to follow up this clue when I heard a footstep and saw Jack himself on the threshold. 
+ +As he stood there, his hands in the pockets of his velveteen coat, the thin brown waves of hair pushed back from his white forehead, his lean sunburnt cheeks furrowed by a smile that lifted the tips of a self-confident moustache, I felt to what a degree he had the same quality as his pictures--the quality of looking cleverer than he was. + +His wife glanced at him deprecatingly, but his eyes travelled past her to the portrait. + +"Mr. Rickham wanted to see it," she began, as if excusing herself. He shrugged his shoulders, still smiling. + +"Oh, Rickham found me out long ago," he said lightly; then, passing his arm through mine: "Come and see the rest of the house." + +He showed it to me with a kind of naive suburban pride: the bath-rooms, the speaking-tubes, the dress-closets, the trouser-presses--all the complex simplifications of the millionaire's domestic economy. And whenever my wonder paid the expected tribute he said, throwing out his chest a little: "Yes, I really don't see how people manage to live without that." + +Well--it was just the end one might have foreseen for him. Only he was, through it all and in spite of it all--as he had been through, and in spite of, his pictures--so handsome, so charming, so disarming, that one longed to cry out: "Be dissatisfied with your leisure!" as once one had longed to say: "Be dissatisfied with your work!" + +But, with the cry on my lips, my diagnosis suffered an unexpected check. + +"This is my own lair," he said, leading me into a dark plain room at the end of the florid vista. It was square and brown and leathery: no "effects"; no bric-a-brac, none of the air of posing for reproduction in a picture weekly--above all, no least sign of ever having been used as a studio. + +The fact brought home to me the absolute finality of Jack's break with his old life. + +"Don't you ever dabble with paint any more?" I asked, still looking about for a trace of such activity. + +"Never," he said briefly. 
+ +"Or water-colour--or etching?" + +His confident eyes grew dim, and his cheeks paled a little under their handsome sunburn. + +"Never think of it, my dear fellow--any more than if I'd never touched a brush." + +And his tone told me in a flash that he never thought of anything else. + +I moved away, instinctively embarrassed by my unexpected discovery; and as I turned, my eye fell on a small picture above the mantel-piece--the only object breaking the plain oak panelling of the room. + +"Oh, by Jove!" I said. + +It was a sketch of a donkey--an old tired donkey, standing in the rain under a wall. + +"By Jove--a Stroud!" I cried. + +He was silent; but I felt him close behind me, breathing a little quickly. + +"What a wonder! Made with a dozen lines--but on everlasting foundations. You lucky chap, where did you get it?" + +He answered slowly: "Mrs. Stroud gave it to me." + +"Ah--I didn't know you even knew the Strouds. He was such an inflexible hermit." + +"I didn't--till after. . . . She sent for me to paint him when he was dead." + +"When he was dead? You?" + +I must have let a little too much amazement escape through my surprise, for he answered with a deprecating laugh: "Yes--she's an awful simpleton, you know, Mrs. Stroud. Her only idea was to have him done by a fashionable painter--ah, poor Stroud! She thought it the surest way of proclaiming his greatness--of forcing it on a purblind public. And at the moment I was _the_ fashionable painter." + +"Ah, poor Stroud--as you say. Was _that_ his history?" + +"That was his history. She believed in him, gloried in him--or thought she did. But she couldn't bear not to have all the drawing-rooms with her. She couldn't bear the fact that, on varnishing days, one could always get near enough to see his pictures. Poor woman! She's just a fragment groping for other fragments. Stroud is the only whole I ever knew." + +"You ever knew? But you just said--" + +Gisburn had a curious smile in his eyes. 
+ +"Oh, I knew him, and he knew me--only it happened after he was dead." + +I dropped my voice instinctively. "When she sent for you?" + +"Yes--quite insensible to the irony. She wanted him vindicated--and by me!" + +He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I couldn't look at that thing--couldn't face it. But I forced myself to put it here; and now it's cured me--cured me. That's the reason why I don't dabble any more, my dear Rickham; or rather Stroud himself is the reason." + +For the first time my idle curiosity about my companion turned into a serious desire to understand him better. + +"I wish you'd tell me how it happened," I said. + +He stood looking up at the sketch, and twirling between his fingers a cigarette he had forgotten to light. Suddenly he turned toward me. + +"I'd rather like to tell you--because I've always suspected you of loathing my work." + +I made a deprecating gesture, which he negatived with a good-humoured shrug. + +"Oh, I didn't care a straw when I believed in myself--and now it's an added tie between us!" + +He laughed slightly, without bitterness, and pushed one of the deep arm-chairs forward. "There: make yourself comfortable--and here are the cigars you like." + +He placed them at my elbow and continued to wander up and down the room, stopping now and then beneath the picture. + +"How it happened? I can tell you in five minutes--and it didn't take much longer to happen. . . . I can remember now how surprised and pleased I was when I got Mrs. Stroud's note. Of course, deep down, I had always _felt_ there was no one like him--only I had gone with the stream, echoed the usual platitudes about him, till I half got to think he was a failure, one of the kind that are left behind. By Jove, and he _was_ left behind--because he had come to stay! 
The rest of us had to let ourselves be swept along or go under, but he was high above the current--on everlasting foundations, as you say. + +"Well, I went off to the house in my most egregious mood--rather moved, Lord forgive me, at the pathos of poor Stroud's career of failure being crowned by the glory of my painting him! Of course I meant to do the picture for nothing--I told Mrs. Stroud so when she began to stammer something about her poverty. I remember getting off a prodigious phrase about the honour being _mine_--oh, I was princely, my dear Rickham! I was posing to myself like one of my own sitters. + +"Then I was taken up and left alone with him. I had sent all my traps in advance, and I had only to set up the easel and get to work. He had been dead only twenty-four hours, and he died suddenly, of heart disease, so that there had been no preliminary work of destruction--his face was clear and untouched. I had met him once or twice, years before, and thought him insignificant and dingy. Now I saw that he was superb. + +"I was glad at first, with a merely aesthetic satisfaction: glad to have my hand on such a 'subject.' Then his strange life-likeness began to affect me queerly--as I blocked the head in I felt as if he were watching me do it. The sensation was followed by the thought: if he _were_ watching me, what would he say to my way of working? My strokes began to go a little wild--I felt nervous and uncertain. + +"Once, when I looked up, I seemed to see a smile behind his close grayish beard--as if he had the secret, and were amusing himself by holding it back from me. That exasperated me still more. The secret? Why, I had a secret worth twenty of his! I dashed at the canvas furiously, and tried some of my bravura tricks. But they failed me, they crumbled. I saw that he wasn't watching the showy bits--I couldn't distract his attention; he just kept his eyes on the hard passages between. 
Those were the ones I had always shirked, or covered up with some lying paint. And how he saw through my lies! + +"I looked up again, and caught sight of that sketch of the donkey hanging on the wall near his bed. His wife told me afterward it was the last thing he had done--just a note taken with a shaking hand, when he was down in Devonshire recovering from a previous heart attack. Just a note! But it tells his whole history. There are years of patient scornful persistence in every line. A man who had swum with the current could never have learned that mighty up-stream stroke. . . . + +"I turned back to my work, and went on groping and muddling; then I looked at the donkey again. I saw that, when Stroud laid in the first stroke, he knew just what the end would be. He had possessed his subject, absorbed it, recreated it. When had I done that with any of my things? They hadn't been born of me--I had just adopted them. . . . + +"Hang it, Rickham, with that face watching me I couldn't do another stroke. The plain truth was, I didn't know where to put it--_I had never known_. Only, with my sitters and my public, a showy splash of colour covered up the fact--I just threw paint into their faces. . . . Well, paint was the one medium those dead eyes could see through--see straight to the tottering foundations underneath. Don't you know how, in talking a foreign language, even fluently, one says half the time not what one wants to but what one can? Well--that was the way I painted; and as he lay there and watched me, the thing they called my 'technique' collapsed like a house of cards. He didn't sneer, you understand, poor Stroud--he just lay there quietly watching, and on his lips, through the gray beard, I seemed to hear the question: 'Are you sure you know where you're coming out?' + +"If I could have painted that face, with that question on it, I should have done a great thing. The next greatest thing was to see that I couldn't--and that grace was given me. 
But, oh, at that minute, Rickham, was there anything on earth I wouldn't have given to have Stroud alive before me, and to hear him say: 'It's not too late--I'll show you how'? + +"It _was_ too late--it would have been, even if he'd been alive. I packed up my traps, and went down and told Mrs. Stroud. Of course I didn't tell her _that_--it would have been Greek to her. I simply said I couldn't paint him, that I was too moved. She rather liked the idea--she's so romantic! It was that that made her give me the donkey. But she was terribly upset at not getting the portrait--she did so want him 'done' by some one showy! At first I was afraid she wouldn't let me off--and at my wits' end I suggested Grindle. Yes, it was I who started Grindle: I told Mrs. Stroud he was the 'coming' man, and she told somebody else, and so it got to be true. . . . And he painted Stroud without wincing; and she hung the picture among her husband's things. . . ." + +He flung himself down in the arm-chair near mine, laid back his head, and clasping his arms beneath it, looked up at the picture above the chimney-piece. + +"I like to fancy that Stroud himself would have given it to me, if he'd been able to say what he thought that day." + +And, in answer to a question I put half-mechanically--"Begin again?" he flashed out. "When the one thing that brings me anywhere near him is that I knew enough to leave off?" + +He stood up and laid his hand on my shoulder with a laugh. "Only the irony of it is that I _am_ still painting--since Grindle's doing it for me! The Strouds stand alone, and happen once--but there's no exterminating our kind of art." \ No newline at end of file