TITC 09a3a73f2d
remove all non-English texts and notice (#304)
* remove all non-English texts and notice

1. About 18 GB of text remains after filtering with `is_english`.
2. Remove the notice using gutenberg's `strip_headers`.
3. After re-running `get_data.py`, all data appears to be under the `gutenberg/data/.mirror` folder.
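
The commit message names `is_english` but does not show how it works. As a rough illustration only, one common heuristic is to keep a text when most of its characters are ASCII. The function body, the `threshold` default, and the `keep_english` helper below are assumptions made for this sketch, not the code from this commit:

```python
def is_english(text: str, threshold: float = 0.9) -> bool:
    """Heuristic sketch: treat text as English if most characters are ASCII.

    The 0.9 threshold is an assumed value for illustration.
    """
    if not text:
        return False  # empty input: nothing worth keeping
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) > threshold


def keep_english(texts):
    """Hypothetical helper: return only the texts that pass the heuristic."""
    return [t for t in texts if is_english(t)]


if __name__ == "__main__":
    samples = [
        "Call me Ishmael.",
        "Все счастливые семьи похожи друг на друга.",
    ]
    print(keep_english(samples))
```

A character-ratio check like this is cheap enough to run over many gigabytes of text, but it cannot tell English apart from other languages written mostly in ASCII, so it is best read as a coarse first pass.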

* some improvements

* update readme

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>
2024-08-09 17:09:14 -05:00

Chapter 5: Pretraining on Unlabeled Data

Main Chapter Code

Bonus Materials