mirror of
https://github.com/rasbt/LLMs-from-scratch.git
synced 2025-08-31 12:00:23 +00:00
Updated devcontainer, .gitignore and README for gutenberg project (#107)
* added ch05/03_bonus_pretraining_on_gutenberg model checkpoints and preprocessing output folders to .gitignore
* removed prettier extension, added github alerts markdown extension
* specified download instructions and fixed code markdown
* Update ch05/03_bonus_pretraining_on_gutenberg/README.md
* Update ch05/03_bonus_pretraining_on_gutenberg/README.md

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
parent 25f533efe0
commit 7d0b9b78b0
@@ -11,7 +11,7 @@
 "ms-python.python",
 "ms-azuretools.vscode-docker",
 "ms-toolsai.jupyter",
-"esbenp.prettier-vscode"
+"yahyabatulu.vscode-markdown-alert"
 ]
 }
 }
3
.gitignore
vendored
@@ -12,7 +12,10 @@ ch05/01_main-chapter-code/gpt2/
 ch05/02_alternative_weight_loading/checkpoints
 ch05/01_main-chapter-code/model.pth
 ch05/01_main-chapter-code/model_and_optimizer.pth
+ch05/03_bonus_pretraining_on_gutenberg/model_checkpoints
 
+# Preprocessing output folders
+ch05/03_bonus_pretraining_on_gutenberg/gutenberg_preprocessed
 
 # Temporary OS-related files
 .DS_Store
@@ -23,16 +23,35 @@ As of this writing, this will require approximately 50 GB of disk space, but it
 
 Linux and macOS users can follow these steps to download the dataset (if you are a Windows user, please see the note below):
 
+Set the `03_bonus_pretraining_on_gutenberg` folder as working directory to clone the `gutenberg` repository locally in this folder (this is necessary to run the provided scripts `prepare_dataset.py` and `pretraining_simple.py`). For instance, when being in the `LLMs-from-scratch` repository's folder, navigate into the *03_bonus_pretraining_on_gutenberg* folder via:
+```bash
+cd ch05/03_bonus_pretraining_on_gutenberg
+```
 
-1. `git clone https://github.com/pgcorpus/gutenberg.git`
+2. Clone the `gutenberg` repository in there:
+```bash
+git clone https://github.com/pgcorpus/gutenberg.git
+```
 
-2. `cd gutenberg`
+3. Navigate into the locally cloned `gutenberg` repository's folder:
+```bash
+cd gutenberg
+```
 
-3. `pip install -r requirements.txt`
+4. Install the required packages defined in *requirements.txt* from the `gutenberg` repository's folder:
+```bash
+pip install -r requirements.txt
+```
 
-4. `python get_data.py`
+5. Download the data:
+```bash
+python get_data.py
+```
 
-5. `cd ..`
+6. Go back into the `03_bonus_pretraining_on_gutenberg` folder
+```bash
+cd ..
+```
 
 
 #### Special instructions for Windows users
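The download steps added in this hunk could also be scripted. The following is a minimal sketch under stated assumptions, not part of the commit: the helper names `download_steps` and `run_download` are hypothetical, the `cd` steps are replaced by passing a working directory to `subprocess.run`, and `pip` is invoked as `python -m pip` so that it matches the running interpreter.

```python
import subprocess
import sys
from pathlib import Path


def download_steps(project_dir):
    """Return (command, working_directory) pairs mirroring the README steps.

    Hypothetical helper for illustration only.
    """
    project_dir = Path(project_dir)
    repo_dir = project_dir / "gutenberg"
    return [
        # Clone the gutenberg repository inside the project folder.
        (["git", "clone", "https://github.com/pgcorpus/gutenberg.git"], project_dir),
        # Install the requirements from inside the cloned repository.
        ([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], repo_dir),
        # Download the data.
        ([sys.executable, "get_data.py"], repo_dir),
    ]


def run_download(project_dir, dry_run=True):
    """Run each step in order; with dry_run=True only print the commands."""
    for command, cwd in download_steps(project_dir):
        print(f"[{cwd}] {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)  # stop on first failure


if __name__ == "__main__":
    # Dry run by default: prints the three commands without executing them.
    run_download("ch05/03_bonus_pretraining_on_gutenberg")
```

Calling `run_download(..., dry_run=False)` from the `LLMs-from-scratch` repository root would execute the steps; because each command receives its own working directory, no final `cd ..` is needed.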
@@ -54,14 +73,14 @@ sudo apt-get install -y rsync && \
 > [!NOTE]
 > Instructions about how to set up Python and installing packages can be found in [Appendix A: Optional Python Setup Preferences](../../appendix-A/01_optional-python-setup-preferences/README.md) and [Appendix A: Installing Python Libraries](../../appendix-A/02_installing-python-libraries/README.md).
 >
-> Optionally, a Docker image running Ubuntu is provided with this repository. When having cloned the [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) GitHub repository, copy the *.devcontainer* folder of this `LLMs-from-scratch` repository and paste it to the locally cloned `gutenberg` repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md).
+> Optionally, a Docker image running Ubuntu is provided with this repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md).
 
 
 ### 2) Prepare the dataset
 
 Next, run the `prepare_dataset.py` script, which concatenates the (as of this writing, 60,173) text files into fewer larger files so that they can be more efficiently transferred and accessed:
 
-```
+```bash
 python prepare_dataset.py \
 --data_dir gutenberg/data \
 --max_size_mb 500 \
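The README text in this hunk describes `prepare_dataset.py` as concatenating many small text files into fewer files capped by `--max_size_mb`. As an illustration of that idea only, here is a minimal sketch. The function name `combine_files` and the `<|endoftext|>` separator are assumptions made for this example, and the `combined_{n}.txt` naming merely follows the file names visible in the training log; none of this is taken from the script itself.

```python
from pathlib import Path


def combine_files(paths, target_dir, max_size_mb=500, separator="<|endoftext|>"):
    """Concatenate text files into chunks of at most roughly max_size_mb each.

    Hypothetical helper for illustration; not the repository's implementation.
    Separator bytes are not counted toward the size cap.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    max_bytes = int(max_size_mb * 1024 * 1024)

    written = []               # paths of the combined files created so far
    chunk, chunk_bytes = [], 0

    def flush():
        """Write the current chunk to the next combined_{n}.txt file."""
        nonlocal chunk, chunk_bytes
        if not chunk:
            return
        out = target_dir / f"combined_{len(written) + 1}.txt"
        out.write_text(separator.join(chunk), encoding="utf-8")
        written.append(out)
        chunk, chunk_bytes = [], 0

    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        size = len(text.encode("utf-8"))
        # Start a new output file if adding this text would exceed the cap.
        if chunk and chunk_bytes + size > max_bytes:
            flush()
        chunk.append(text)
        chunk_bytes += size

    flush()  # write whatever is left
    return written
```

A call such as `combine_files(sorted(Path("gutenberg/data").rglob("*.txt")), "gutenberg_preprocessed")` would then yield a handful of large files instead of tens of thousands of small ones; the directory layout in that call is likewise an assumption.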
@@ -90,34 +109,32 @@ python pretraining_simple.py \
 
 The output will be formatted in the following way:
 
-```
-Total files: 3
-Tokenizing file 1 of 3: data_small/combined_1.txt
-Training ...
-Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
-Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
-Ep 1 (Step 200): Train loss 6.543, Val loss 6.434
-Ep 1 (Step 300): Train loss 5.772, Val loss 6.313
-Ep 1 (Step 400): Train loss 5.547, Val loss 6.249
-Ep 1 (Step 500): Train loss 6.182, Val loss 6.155
-Ep 1 (Step 600): Train loss 5.742, Val loss 6.122
-Ep 1 (Step 700): Train loss 6.309, Val loss 5.984
-Ep 1 (Step 800): Train loss 5.435, Val loss 5.975
-Ep 1 (Step 900): Train loss 5.582, Val loss 5.935
-...
-Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946
-Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939
-Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961
-Saved model_checkpoints/model_pg_32188.pth
-Book processed 3h 46m 55s
-Total time elapsed 3h 46m 55s
-ETA for remaining books: 7h 33m 50s
-Tokenizing file 2 of 3: data_small/combined_2.txt
-Training ...
-Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094
-Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097
-...
-```
+> Total files: 3
+> Tokenizing file 1 of 3: data_small/combined_1.txt
+> Training ...
+> Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
+> Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
+> Ep 1 (Step 200): Train loss 6.543, Val loss 6.434
+> Ep 1 (Step 300): Train loss 5.772, Val loss 6.313
+> Ep 1 (Step 400): Train loss 5.547, Val loss 6.249
+> Ep 1 (Step 500): Train loss 6.182, Val loss 6.155
+> Ep 1 (Step 600): Train loss 5.742, Val loss 6.122
+> Ep 1 (Step 700): Train loss 6.309, Val loss 5.984
+> Ep 1 (Step 800): Train loss 5.435, Val loss 5.975
+> Ep 1 (Step 900): Train loss 5.582, Val loss 5.935
+> ...
+> Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946
+> Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939
+> Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961
+> Saved model_checkpoints/model_pg_32188.pth
+> Book processed 3h 46m 55s
+> Total time elapsed 3h 46m 55s
+> ETA for remaining books: 7h 33m 50s
+> Tokenizing file 2 of 3: data_small/combined_2.txt
+> Training ...
+> Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094
+> Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097
+> ...
 
 
 
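The timing lines in the sample log are consistent with a simple estimate: the average time per processed file multiplied by the number of files still to go (3h 46m 55s for one of three files gives 7h 33m 50s for the remaining two). A minimal sketch of that arithmetic follows, assuming this averaging scheme; the helper names `format_hms` and `eta_remaining` are hypothetical and not taken from `pretraining_simple.py`.

```python
def format_hms(seconds):
    """Format a duration in seconds as 'Hh Mm Ss'."""
    seconds = int(seconds)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


def eta_remaining(elapsed_seconds, files_done, files_total):
    """Estimate remaining time as (average time per file) * (files left)."""
    if files_done <= 0:
        raise ValueError("need at least one processed file to estimate")
    average = elapsed_seconds / files_done
    return average * (files_total - files_done)


elapsed = 3 * 3600 + 46 * 60 + 55  # 3h 46m 55s, taken from the sample log
print(format_hms(eta_remaining(elapsed, files_done=1, files_total=3)))
# prints: 7h 33m 50s
```

Because the estimate is a plain average, it assumes the remaining files take about as long as the ones already processed.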