Clarify dataset length in chapter 2 (#589)

Sebastian Raschka 2025-03-30 16:01:37 -05:00 committed by GitHub
parent 4e3b752e5e
commit 0bdcce4e40


@@ -1296,6 +1296,7 @@
"\n",
" # Tokenize the entire text\n",
" token_ids = tokenizer.encode(txt, allowed_special={\"<|endoftext|>\"})\n",
" assert len(token_ids) > max_length, \"Number of tokenized inputs must at least be equal to max_length+1\"\n",
"\n",
" # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
" for i in range(0, len(token_ids) - max_length, stride):\n",