106 Commits

Author SHA1 Message Date
casinca
c4b19d7eb6 fix issue #664 - inverted token and pos emb layers (#665)
* fix inverted token and pos layers

* remove redundant code

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-06-22 12:15:01 -05:00
Shimpei Kojio
1446cfd824 fixed video link (#646) 2025-06-13 08:16:18 -05:00
Sebastian Raschka
02ca4ac42d BPE cosmetics (#629)
* Llama3 from scratch improvements

* Cosmetic BPE improvements

* restore

* Update ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

* Update ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

* endoftext whitespace
2025-04-18 18:57:09 -05:00
Sebastian Raschka
48e98abc8e add special token handling to bpe from scratch code (#616) 2025-04-13 12:38:22 -05:00
Sebastian Raschka
0bdcce4e40 Clarify dataset length in chapter 2 (#589) 2025-03-30 16:01:37 -05:00
Sebastian Raschka
634a531223 Cosmetic improvements to the BPE code (#562) 2025-03-09 10:49:40 -05:00
Sebastian Raschka
6aec412421 Fix BPE bonus materials (#561)
* Fix BPE bonus materials

* fix bpe implementation

* update

* Add 'Hello, world. Is this-- a test?' test case

* update link to test file

* update path handling

* update path handling

* fix pytest paths
2025-03-08 17:21:30 -06:00
Sebastian Raschka
5be0e3cbbd add link to supplementary ch02 video (#553) 2025-03-02 13:17:42 -06:00
Sebastian Raschka
839a7e9bfc Use correct ch02 title (#551) 2025-02-28 10:16:21 -06:00
Sebastian Raschka
db58925d7f Add BPE from scratch link (#550) 2025-02-28 09:57:41 -06:00
Kasen
af4b73ca7b Improve BPE vocabulary saving and pair frequency handling (#539) 2025-02-19 09:51:04 -06:00
Kasen
0a5214b804 Fix incorrect indentation (#536) 2025-02-18 14:47:31 -06:00
Sebastian Raschka
d684ff418a Fix typo in Ch02 comments (#516) 2025-02-04 20:16:07 -06:00
Sebastian Raschka
dcaac28b92 Bonus material: extending tokenizers (#496)
* Bonus material: extending tokenizers

* small wording update
2025-01-22 09:26:54 -06:00
Daniel Kleine
9175590ea4 add GPT2TokenizerFast to BPE comparison (#498)
* added HF BPE Fast

* update benchmarks

* add note about performance

* revert accidental changes

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-22 09:26:44 -06:00
Austin Welch
654734053a fix: preserve newline tokens in BPE encoder (#495)
* fix: preserve newline tokens in BPE encoder

* further fixes

* more fixes

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-21 12:47:15 -06:00
Daniel Kleine
3f9facbc55 BPE: fixed typo (#492)
* fixed typo

* use rel path if exists

* mod gitignore and use existing vocab files

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-20 20:49:53 -06:00
Sebastian Raschka
b17d097742 Implementing the BPE Tokenizer from Scratch (#487) 2025-01-17 12:22:00 -06:00
Henry Shi
15af754304 Print out embeddings for more illustrative learning (#481)
* print out embeddings for illustrative learning

* suggestion print embedding contents

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-13 14:44:06 -06:00
Tao Qian
65ee619d3b Minor readability improvement in dataloader.ipynb (#461)
* Minor readability improvement in dataloader.ipynb

- The tokenizer and encoded_text variables at the root level are unused.
- The default params for create_dataloader_v1 are confusing, especially for the default batch_size 4, which happens to be the same as the max_length.

* readability improvements

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-04 11:26:10 -06:00
Sebastian Raschka
42b703fc0b Note about SSL certificates (#404) 2024-10-19 16:27:19 -05:00
Sebastian Raschka
6a9bedc2ec Update bonus section formatting (#400) 2024-10-12 10:26:08 -05:00
rasbt
3cebcce639 minor spelling fix 2024-09-08 15:35:36 -05:00
Gustavo Monti
190910e3d6 updating README from chapter 02 including 04_bonus section (#344)
* updating README from chapter 02 including 04_bonus section

* Update ch02/README.md

---------

Co-authored-by: Gustavo Monti Rocha <gustavo.rocha@intelliway.com.br>
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-09-05 08:09:46 +02:00
Sebastian Raschka
f66c089f0b Test with PyTorch 2.0 and 2.4 (#290)
* Test with PyTorch 2.0 and 2.4

* Update basic-tests-old-pytorch.yml

* skip version cell
2024-07-27 15:09:02 -05:00
Sebastian Raschka
6dd8666d9c Test code in pytorch 2.4 (#285)
* test code in pytorch 2.4

* update
2024-07-24 21:53:41 -05:00
Sebastian Raschka
3f6f2af3a3 Simplify embedding vs linear layer code (#278) 2024-07-21 12:21:10 -05:00
Thanh Tran
a2bb045984 fix typos & inconsistent texts (#269)
Co-authored-by: TRAN <you@example.com>
2024-07-17 07:34:51 -05:00
rasbt
ee1d4730ba fixes bold font #267 2024-07-16 17:51:15 -05:00
Daniel Kleine
7e0dd7f765 minor: removed redundant imports (#260)
* removed duplicated imports

* removed empty cell
2024-07-05 15:33:19 -05:00
rasbt
bd216fdade update decode method 2024-07-05 08:34:27 -05:00
Suman Debnath
46f4d9e575 fixing the regular expression used in the SimpleTokenizer (#259)
* fixing the regular expression used in the SimpleTokenizer class and a typo in the 2.7 Creating token embedding introduction section

* rerun

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2024-07-04 12:27:27 -05:00
rasbt
64536ca40f update figures 2024-07-02 17:12:42 -05:00
rasbt
5e24a042c1 add links to summary sections 2024-06-29 07:33:26 -05:00
Sebastian Raschka
4fef19e016 remove redundant code lines (#247) 2024-06-25 21:44:19 -05:00
rasbt
f46441d53f update with latest versions 2024-06-25 21:09:27 -05:00
Daniel Kleine
7a54d383e7 minor fixes (#246)
* removed duplicated white spaces

* Update ch07/01_main-chapter-code/ch07.ipynb

* Update ch07/05_dataset-generation/llama3-ollama.ipynb

* removed duplicated white spaces

* fixed title again

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-06-25 17:30:30 -05:00
rasbt
c1f9361428 add main and optional sections 2024-06-19 17:48:25 -05:00
Daniel Kleine
73be1c592f fixed num_workers (#229)
* fixed num_workers

* ch06 & ch07: added num_workers to create_dataloader_v1
2024-06-19 17:36:46 -05:00
rasbt
b2ff989174 distinguish better between main chapter code and bonus materials 2024-06-11 21:07:42 -05:00
rasbt
e1adeb14f3 add allowed_special={"<|endoftext|>"} 2024-06-09 06:04:02 -05:00
Sebastian Raschka
40ba3a4068 Remove leftover instances of self.tokenizer (#201)
* Remove leftover instances of self.tokenizer

* add endoftext token
2024-06-08 14:57:34 -05:00
rasbt
20f1ef553c update figure 2.13 2024-06-01 09:38:33 -05:00
rasbt
fe8bb9291e update formatting 2024-05-24 07:20:37 -05:00
rasbt
1407085f07 reset cell count for better nbdiff 2024-05-22 20:27:09 -05:00
rasbt
85c3210105 update regex 2024-05-22 20:15:31 -05:00
rasbt
678fad50bc formatting for consistency with production chapter 2024-05-18 11:03:42 -05:00
rasbt
6c6321f671 simplify code 2024-05-16 20:16:25 -05:00
Sebastian Raschka
0f03c20483 Data loader intuition with numbers (#132)
* data loader intuition with numbers

* fix link

* fix tests
2024-04-27 07:56:41 -05:00
rasbt
379a8ab39c update figures in bonus notebook 2024-04-23 21:01:27 -05:00