9 Commits

Author SHA1 Message Date
Sebastian Raschka
4ff743051e
BPE cosmetics (#629)
* Llama3 from scratch improvements

* Cosmetic BPE improvements

* restore

* Update ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

* Update ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

* endoftext whitespace
2025-04-18 18:57:09 -05:00
Sebastian Raschka
72efebd7f8
add special token handling to bpe from scratch code (#616) 2025-04-13 12:38:22 -05:00
Sebastian Raschka
2f41429cf4
Cosmetic improvements to the BPE code (#562) 2025-03-09 10:49:40 -05:00
Sebastian Raschka
f63f04d8d5
Fix BPE bonus materials (#561)
* Fix BPE bonus materials

* fix bpe implementation

* update

* Add 'Hello, world. Is this-- a test?' test case

* update link to test file

* update path handling

* update path handling

* fix pytest paths
2025-03-08 17:21:30 -06:00
Kasen
7bd36dccb4
Improve BPE vocabulary saving and pair frequency handling (#539) 2025-02-19 09:51:04 -06:00
Kasen
b47884ced0
Fix incorrect indentation (#536) 2025-02-18 14:47:31 -06:00
Austin Welch
0f35e370ed
fix: preserve newline tokens in BPE encoder (#495)
* fix: preserve newline tokens in BPE encoder

* further fixes

* more fixes

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-21 12:47:15 -06:00
Daniel Kleine
60acb94894
BPE: fixed typo (#492)
* fixed typo

* use rel path if exists

* mod gitignore and use existing vocab files

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-20 20:49:53 -06:00
Sebastian Raschka
0d4967eda6
Implementing the BPE Tokenizer from Scratch (#487) 2025-01-17 12:22:00 -06:00