From e818be42e1a8a4bc7a129ccf818522b34bb5d775 Mon Sep 17 00:00:00 2001
From: Sebastian Raschka
Date: Fri, 14 Feb 2025 08:03:01 -0600
Subject: [PATCH] Update link to vocab size increase (#526)

* Update link to vocab size increase

* Update ch05/10_llm-training-speed/README.md

* Update ch05/10_llm-training-speed/README.md

---
 ch05/10_llm-training-speed/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ch05/10_llm-training-speed/README.md b/ch05/10_llm-training-speed/README.md
index 16b3924..2923f15 100644
--- a/ch05/10_llm-training-speed/README.md
+++ b/ch05/10_llm-training-speed/README.md
@@ -169,7 +169,7 @@ After:
 
 ### 9. Using a nicer vocab size value
 
-- This is a tip suggested to me by my former colleague Carlos Moccholi, who mentioned that this tip comes from Andrej Karpathy (I suspect it's from the [nanoGPT](https://github.com/karpathy/nanoGPT/blob/93a43d9a5c22450bbf06e78da2cb6eeef084b717/model.py#L111) repository)
+- Here, we slightly increase the vocabulary size from 50,257 to 50,304, which is the nearest multiple of 64. This tip was suggested to me by my former colleague Carlos Mocholi, who mentioned that it originally came from Andrej Karpathy (likely from [this post](https://x.com/karpathy/status/1621578354024677377)). Karpathy's recommendation is based on [NVIDIA's guidelines on tensor shapes](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensor-core-shape), where batch sizes and linear layer dimensions are commonly chosen as multiples of certain values.
 
 Before:
 - `Step tok/sec: 112046`
@@ -204,4 +204,4 @@ Before (single GPU):
 
 After (4 GPUs):
 - `Step tok/sec: 419259`
-- `Reserved memory: 22.7969 GB`
\ No newline at end of file
+- `Reserved memory: 22.7969 GB`
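
The README line added by this patch rounds the GPT-2 vocabulary size of 50,257 up to 50,304, the nearest multiple of 64, following NVIDIA's tensor-shape guidance. Below is a minimal sketch (not part of the patch or the book's code) showing how such a padded value can be computed; the helper name `pad_vocab_size` is hypothetical and chosen only for illustration.

```python
# Minimal sketch: round a vocabulary size up to the nearest multiple of 64
# so that the embedding and output-projection dimensions are
# tensor-core-friendly. `pad_vocab_size` is a hypothetical helper name.

def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    gpt2_vocab = 50_257              # original GPT-2 BPE vocabulary size
    print(pad_vocab_size(gpt2_vocab))  # 50304, the value used in the README
```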