From 7d0b9b78b02ad8cbb1e21fd334cbdace8b6742e6 Mon Sep 17 00:00:00 2001
From: Daniel Kleine <53251018+d-kleine@users.noreply.github.com>
Date: Fri, 5 Apr 2024 13:53:01 +0200
Subject: [PATCH] Updated devcontainer, .gitignore and README for gutenberg project (#107)

* added ch05/03_bonus_pretraining_on_gutenberg model checkpoints and preprocessing output folders to .gitignore
* removed prettier extension, added github alerts markdown extension
* specified download instructions and fixed code markdown
* Update ch05/03_bonus_pretraining_on_gutenberg/README.md
* Update ch05/03_bonus_pretraining_on_gutenberg/README.md

---------

Co-authored-by: Sebastian Raschka
---
 .devcontainer/devcontainer.json |  2 +-
 .gitignore                      |  3 +
 .../README.md                   | 89 +++++++++++--------
 3 files changed, 57 insertions(+), 37 deletions(-)

diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json
index e693ea7..decf415 100644
--- a/.devcontainer/devcontainer.json
+++ b/.devcontainer/devcontainer.json
@@ -11,7 +11,7 @@
 "ms-python.python",
 "ms-azuretools.vscode-docker",
 "ms-toolsai.jupyter",
- "esbenp.prettier-vscode"
+ "yahyabatulu.vscode-markdown-alert"
 ]
 }
 }
diff --git a/.gitignore b/.gitignore
index 56a5ff1..fc5f24d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -12,7 +12,10 @@ ch05/01_main-chapter-code/gpt2/
 ch05/02_alternative_weight_loading/checkpoints
 ch05/01_main-chapter-code/model.pth
 ch05/01_main-chapter-code/model_and_optimizer.pth
+ch05/03_bonus_pretraining_on_gutenberg/model_checkpoints
+# Preprocessing output folders
+ch05/03_bonus_pretraining_on_gutenberg/gutenberg_preprocessed
 # Temporary OS-related files
 .DS_Store
diff --git a/ch05/03_bonus_pretraining_on_gutenberg/README.md b/ch05/03_bonus_pretraining_on_gutenberg/README.md
index 36f9c96..bef41d1 100644
--- a/ch05/03_bonus_pretraining_on_gutenberg/README.md
+++ b/ch05/03_bonus_pretraining_on_gutenberg/README.md
@@ -23,16 +23,35 @@ As of this writing, this will require approximately 50 GB of disk space, but it
 Linux and macOS users can follow these steps to download the dataset (if you are a Windows user, please see the note below):
+Set the `03_bonus_pretraining_on_gutenberg` folder as working directory to clone the `gutenberg` repository locally in this folder (this is necessary to run the provided scripts `prepare_dataset.py` and `pretraining_simple.py`). For instance, when being in the `LLMs-from-scratch` repository's folder, navigate into the *03_bonus_pretraining_on_gutenberg* folder via:
+```bash
+cd ch05/03_bonus_pretraining_on_gutenberg
+```
-1. `git clone https://github.com/pgcorpus/gutenberg.git`
+2. Clone the `gutenberg` repository in there:
+```bash
+git clone https://github.com/pgcorpus/gutenberg.git
+```
-2. `cd gutenberg`
+3. Navigate into the locally cloned `gutenberg` repository's folder:
+```bash
+cd gutenberg
+```
-3. `pip install -r requirements.txt`
+4. Install the required packages defined in *requirements.txt* from the `gutenberg` repository's folder:
+```bash
+pip install -r requirements.txt
+```
+
+5. Download the data:
+```bash
+python get_data.py
+```
-4. `python get_data.py`
-
-5. `cd ..`
+6. Go back into the `03_bonus_pretraining_on_gutenberg` folder
+```bash
+cd ..
+```
 #### Special instructions for Windows users
@@ -54,14 +73,14 @@ sudo apt-get install -y rsync && \
 > [!NOTE]
 > Instructions about how to set up Python and installing packages can be found in [Appendix A: Optional Python Setup Preferences](../../appendix-A/01_optional-python-setup-preferences/README.md) and [Appendix A: Installing Python Libraries](../../appendix-A/02_installing-python-libraries/README.md).
 >
-> Optionally, a Docker image running Ubuntu is provided with this repository. When having cloned the [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) GitHub repository, copy the *.devcontainer* folder of this `LLMs-from-scratch` repository and paste it to the locally cloned `gutenberg` repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md).
+> Optionally, a Docker image running Ubuntu is provided with this repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md).
 ### 2) Prepare the dataset
 Next, run the `prepare_dataset.py` script, which concatenates the (as of this writing, 60,173) text files into fewer larger files so that they can be more efficiently transferred and accessed:
-```
+```bash
 python prepare_dataset.py \
 --data_dir gutenberg/data \
 --max_size_mb 500 \
@@ -90,34 +109,32 @@ python pretraining_simple.py \
 The output will be formatted in the following way:
-```
-Total files: 3
-Tokenizing file 1 of 3: data_small/combined_1.txt
-Training ...
-Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
-Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
-Ep 1 (Step 200): Train loss 6.543, Val loss 6.434
-Ep 1 (Step 300): Train loss 5.772, Val loss 6.313
-Ep 1 (Step 400): Train loss 5.547, Val loss 6.249
-Ep 1 (Step 500): Train loss 6.182, Val loss 6.155
-Ep 1 (Step 600): Train loss 5.742, Val loss 6.122
-Ep 1 (Step 700): Train loss 6.309, Val loss 5.984
-Ep 1 (Step 800): Train loss 5.435, Val loss 5.975
-Ep 1 (Step 900): Train loss 5.582, Val loss 5.935
-...
-Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946
-Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939
-Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961
-Saved model_checkpoints/model_pg_32188.pth
-Book processed 3h 46m 55s
-Total time elapsed 3h 46m 55s
-ETA for remaining books: 7h 33m 50s
-Tokenizing file 2 of 3: data_small/combined_2.txt
-Training ...
-Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094
-Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097
-...
-```
+> Total files: 3
+> Tokenizing file 1 of 3: data_small/combined_1.txt
+> Training ...
+> Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
+> Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
+> Ep 1 (Step 200): Train loss 6.543, Val loss 6.434
+> Ep 1 (Step 300): Train loss 5.772, Val loss 6.313
+> Ep 1 (Step 400): Train loss 5.547, Val loss 6.249
+> Ep 1 (Step 500): Train loss 6.182, Val loss 6.155
+> Ep 1 (Step 600): Train loss 5.742, Val loss 6.122
+> Ep 1 (Step 700): Train loss 6.309, Val loss 5.984
+> Ep 1 (Step 800): Train loss 5.435, Val loss 5.975
+> Ep 1 (Step 900): Train loss 5.582, Val loss 5.935
+> ...
+> Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946
+> Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939
+> Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961
+> Saved model_checkpoints/model_pg_32188.pth
+> Book processed 3h 46m 55s
+> Total time elapsed 3h 46m 55s
+> ETA for remaining books: 7h 33m 50s
+> Tokenizing file 2 of 3: data_small/combined_2.txt
+> Training ...
+> Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094
+> Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097
+> ...