Sebastian Raschka fecfdd16ff
Add simpler BPE, and make previous BPE better (#870)
2025-10-08 22:22:34 -05:00

# Byte Pair Encoding (BPE) Tokenizer From Scratch
- [bpe-from-scratch-simple.ipynb](bpe-from-scratch-simple.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood; it is geared toward simplicity and readability.
- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) implements a more sophisticated (and much more complicated) BPE tokenizer that behaves similarly to tiktoken with respect to all the edge cases; it also has additional functionality for loading the official GPT-2 vocab.
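
For orientation, the core idea behind BPE can be sketched in a few lines of Python. This is a minimal illustration and not the code from either notebook: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`, `decode`) are hypothetical, and the sketch assumes a byte-level starting vocabulary (IDs 0 to 255) with no regex pre-splitting, no special tokens, and none of the edge-case handling that tiktoken-compatible behavior requires.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most frequent adjacent pair of token IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None
    return max(pairs, key=pairs.get)


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (assumption)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id_a, id_b) -> new_id, kept in the order the merges were learned
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break  # fewer than two tokens left, nothing to merge
        new_id = 256 + step  # new IDs start right after the 256 byte values
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return ids, merges


def decode(ids, merges):
    """Map token IDs back to text by expanding merges into their byte sequences."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order matches merge order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Calling `train_bpe("the cat in the hat", 5)` returns the compressed ID sequence together with the learned merge table, and passing both to `decode` recovers the original string. The notebooks cover what this sketch leaves out.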