Sebastian Raschka
dcaac28b92
Bonus material: extending tokenizers ( #496 )
* Bonus material: extending tokenizers
* small wording update
2025-01-22 09:26:54 -06:00
Daniel Kleine
9175590ea4
add GPT2TokenizerFast to BPE comparison ( #498 )
* added HF BPE Fast
* update benchmarks
* add note about performance
* revert accidental changes
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-22 09:26:44 -06:00
Austin Welch
654734053a
fix: preserve newline tokens in BPE encoder ( #495 )
* fix: preserve newline tokens in BPE encoder
* further fixes
* more fixes
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-21 12:47:15 -06:00
Daniel Kleine
3f9facbc55
BPE: fixed typo ( #492 )
* fixed typo
* use rel path if exists
* mod gitignore and use existing vocab files
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-20 20:49:53 -06:00
Sebastian Raschka
b17d097742
Implementing the BPE Tokenizer from Scratch ( #487 )
2025-01-17 12:22:00 -06:00
Henry Shi
15af754304
Print out embeddings for more illustrative learning ( #481 )
* print out embeddings for illustrative learning
* suggestion print embedding contents
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-13 14:44:06 -06:00
Tao Qian
65ee619d3b
Minor readability improvement in dataloader.ipynb ( #461 )
* Minor readability improvement in dataloader.ipynb
- The tokenizer and encoded_text variables at the root level are unused.
- The default params for create_dataloader_v1 are confusing, especially for the default batch_size 4, which happens to be the same as the max_length.
* readability improvements
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
2025-01-04 11:26:10 -06:00
Sebastian Raschka
42b703fc0b
Note about SSL certificates ( #404 )
2024-10-19 16:27:19 -05:00
Sebastian Raschka
6a9bedc2ec
Update bonus section formatting ( #400 )
2024-10-12 10:26:08 -05:00
rasbt
3cebcce639
minor spelling fix
2024-09-08 15:35:36 -05:00
Gustavo Monti
190910e3d6
updating README from chapter 02 including 04_bonus section ( #344 )
* updating README from chapter 02 including 04_bonus section
* Update ch02/README.md
---------
Co-authored-by: Gustavo Monti Rocha <gustavo.rocha@intelliway.com.br>
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-09-05 08:09:46 +02:00
Sebastian Raschka
f66c089f0b
Test with PyTorch 2.0 and 2.4 ( #290 )
* Test with PyTorch 2.0 and 2.4
* Update basic-tests-old-pytorch.yml
* skip version cell
2024-07-27 15:09:02 -05:00
Sebastian Raschka
6dd8666d9c
Test code in pytorch 2.4 ( #285 )
* test code in pytorch 2.4
* update
2024-07-24 21:53:41 -05:00
Sebastian Raschka
3f6f2af3a3
Simplify embedding vs linear layer code ( #278 )
2024-07-21 12:21:10 -05:00
Thanh Tran
a2bb045984
fix typos & inconsistent texts ( #269 )
Co-authored-by: TRAN <you@example.com>
2024-07-17 07:34:51 -05:00
rasbt
ee1d4730ba
fixes bold font #267
2024-07-16 17:51:15 -05:00
Daniel Kleine
7e0dd7f765
minor: removed redundant imports ( #260 )
* removed duplicated imports
* removed empty cell
2024-07-05 15:33:19 -05:00
rasbt
bd216fdade
update decode method
2024-07-05 08:34:27 -05:00
Suman Debnath
46f4d9e575
fixing the regular expression used in the SimpleTokenizer ( #259 )
* fixing the regular expression used in the SimpleTokenizer class and a typo in the 2.7 Creating token embedding introduction section
* rerun
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
2024-07-04 12:27:27 -05:00
rasbt
64536ca40f
update figures
2024-07-02 17:12:42 -05:00
rasbt
5e24a042c1
add links to summary sections
2024-06-29 07:33:26 -05:00
Sebastian Raschka
4fef19e016
remove redundant code lines ( #247 )
2024-06-25 21:44:19 -05:00
rasbt
f46441d53f
update with latest versions
2024-06-25 21:09:27 -05:00
Daniel Kleine
7a54d383e7
minor fixes ( #246 )
* removed duplicated white spaces
* Update ch07/01_main-chapter-code/ch07.ipynb
* Update ch07/05_dataset-generation/llama3-ollama.ipynb
* removed duplicated white spaces
* fixed title again
---------
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-06-25 17:30:30 -05:00
rasbt
c1f9361428
add main and optional sections
2024-06-19 17:48:25 -05:00
Daniel Kleine
73be1c592f
fixed num_workers ( #229 )
* fixed num_workers
* ch06 & ch07: added num_workers to create_dataloader_v1
2024-06-19 17:36:46 -05:00
rasbt
b2ff989174
distinguish better between main chapter code and bonus materials
2024-06-11 21:07:42 -05:00
rasbt
e1adeb14f3
add allowed_special={"<|endoftext|>"}
2024-06-09 06:04:02 -05:00
Sebastian Raschka
40ba3a4068
Remove leftover instances of self.tokenizer ( #201 )
* Remove leftover instances of self.tokenizer
* add endoftext token
2024-06-08 14:57:34 -05:00
rasbt
20f1ef553c
update figure 2.13
2024-06-01 09:38:33 -05:00
rasbt
fe8bb9291e
update formatting
2024-05-24 07:20:37 -05:00
rasbt
1407085f07
reset cell count for better nbdiff
2024-05-22 20:27:09 -05:00
rasbt
85c3210105
update regex
2024-05-22 20:15:31 -05:00
rasbt
678fad50bc
formatting for consistency with production chapter
2024-05-18 11:03:42 -05:00
rasbt
6c6321f671
simplify code
2024-05-16 20:16:25 -05:00
Sebastian Raschka
0f03c20483
Data loader intuition with numbers ( #132 )
* data loader intuition with numbers
* fix link
* fix tests
2024-04-27 07:56:41 -05:00
rasbt
379a8ab39c
update figures in bonus notebook
2024-04-23 21:01:27 -05:00
Sebastian Raschka
44a009f7e6
update stride wording
2024-04-22 20:40:48 -05:00
Sebastian Raschka
bae4b0fb08
Make datasets and loaders compatible with multiprocessing ( #118 )
2024-04-13 13:57:56 -05:00
Sebastian Raschka
bbce1cb143
Automated link checking ( #117 )
* Automated link checking
* Fix links in Jupyter Nbs
2024-04-12 19:08:34 -04:00
James Holcombe
0b866c133f
Use instance tokenizer ( #116 )
* Use instance tokenizer
* consistency updates
---------
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-04-10 21:16:19 -04:00
Sebastian Raschka
ccd7cebbb3
Rename variable to context_length to make it easier on readers ( #106 )
* rename to context length
* fix spacing
2024-04-04 07:27:41 -05:00
rasbt
edcae09884
improve importlib experience for windows users
2024-04-03 06:31:15 -05:00
Intelligence-Manifesto
d081928e90
code -> markdown ( #101 )
2024-04-02 14:37:45 -05:00
rasbt
1c173e4f44
update figures
2024-03-30 09:43:51 -05:00
rasbt
ca96b7aee5
minor updates
2024-03-29 20:42:32 -05:00
Jeff Hammerbacher
5b222e2d6f
Fix small typos in ch02.ipynb ( #89 )
2024-03-29 08:25:52 -05:00
Sebastian Raschka
cf39abac04
Add and link bonus material ( #84 )
2024-03-23 07:27:43 -05:00
rasbt
001507481e
add colon and semicolon to tokenizer
2024-03-23 06:50:34 -05:00
Sebastian Raschka
a2cd8436cb
Ch05 supplementary code ( #81 )
2024-03-19 09:26:26 -05:00