mirror of https://github.com/rasbt/LLMs-from-scratch.git, synced 2025-10-25 15:00:03 +00:00
	Updated devcontainer, .gitignore and README for gutenberg project (#107)
* added ch05/03_bonus_pretraining_on_gutenberg model checkpoints and preprocessing output folders to .gitignore
* removed prettier extension, added github alerts markdown extension
* specified download instructions and fixed code markdown
* Update ch05/03_bonus_pretraining_on_gutenberg/README.md
* Update ch05/03_bonus_pretraining_on_gutenberg/README.md

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
parent 25f533efe0
commit 7d0b9b78b0
				| @ -11,7 +11,7 @@ | |||||||
|         "ms-python.python", |         "ms-python.python", | ||||||
|         "ms-azuretools.vscode-docker", |         "ms-azuretools.vscode-docker", | ||||||
|         "ms-toolsai.jupyter", |         "ms-toolsai.jupyter", | ||||||
|         "esbenp.prettier-vscode" |         "yahyabatulu.vscode-markdown-alert" | ||||||
|       ] |       ] | ||||||
|     } |     } | ||||||
|   } |   } | ||||||
|  | |||||||
							
								
								
									
.gitignore
							| @ -12,7 +12,10 @@ ch05/01_main-chapter-code/gpt2/ | |||||||
| ch05/02_alternative_weight_loading/checkpoints | ch05/02_alternative_weight_loading/checkpoints | ||||||
| ch05/01_main-chapter-code/model.pth | ch05/01_main-chapter-code/model.pth | ||||||
| ch05/01_main-chapter-code/model_and_optimizer.pth | ch05/01_main-chapter-code/model_and_optimizer.pth | ||||||
|  | ch05/03_bonus_pretraining_on_gutenberg/model_checkpoints | ||||||
| 
 | 
 | ||||||
|  | # Preprocessing output folders | ||||||
|  | ch05/03_bonus_pretraining_on_gutenberg/gutenberg_preprocessed | ||||||
| 
 | 
 | ||||||
| # Temporary OS-related files | # Temporary OS-related files | ||||||
| .DS_Store | .DS_Store | ||||||
|  | |||||||
| @ -23,16 +23,35 @@ As of this writing, this will require approximately 50 GB of disk space, but it | |||||||
| 
 | 
 | ||||||
| Linux and macOS users can follow these steps to download the dataset (if you are a Windows user, please see the note below): | Linux and macOS users can follow these steps to download the dataset (if you are a Windows user, please see the note below): | ||||||
| 
 | 
 | ||||||
|  | 1. Set the `03_bonus_pretraining_on_gutenberg` folder as the working directory so that the `gutenberg` repository is cloned locally into this folder (this is necessary to run the provided scripts `prepare_dataset.py` and `pretraining_simple.py`). For instance, from the `LLMs-from-scratch` repository's root folder, navigate into the *03_bonus_pretraining_on_gutenberg* folder via: | ||||||
|  | ```bash | ||||||
|  | cd ch05/03_bonus_pretraining_on_gutenberg | ||||||
|  | ``` | ||||||
| 
 | 
 | ||||||
| 1. `git clone https://github.com/pgcorpus/gutenberg.git` | 2. Clone the `gutenberg` repository in there: | ||||||
|  | ```bash | ||||||
|  | git clone https://github.com/pgcorpus/gutenberg.git | ||||||
|  | ``` | ||||||
| 
 | 
 | ||||||
| 2. `cd gutenberg` | 3. Navigate into the locally cloned `gutenberg` repository's folder: | ||||||
|  | ```bash | ||||||
|  | cd gutenberg | ||||||
|  | ``` | ||||||
| 
 | 
 | ||||||
| 3. `pip install -r requirements.txt` | 4. Install the required packages defined in *requirements.txt* from the `gutenberg` repository's folder: | ||||||
|  | ```bash | ||||||
|  | pip install -r requirements.txt | ||||||
|  | ``` | ||||||
|   |   | ||||||
| 4. `python get_data.py` | 5. Download the data: | ||||||
|  | ```bash | ||||||
|  | python get_data.py | ||||||
|  | ``` | ||||||
| 
 | 
 | ||||||
| 5. `cd ..` | 6. Go back into the `03_bonus_pretraining_on_gutenberg` folder: | ||||||
|  | ```bash | ||||||
|  | cd .. | ||||||
|  | ``` | ||||||
| 
 | 
 | ||||||
|   |   | ||||||
| #### Special instructions for Windows users | #### Special instructions for Windows users | ||||||
| @ -54,14 +73,14 @@ sudo apt-get install -y rsync && \ | |||||||
| > [!NOTE] | > [!NOTE] | ||||||
| > Instructions on how to set up Python and install packages can be found in [Appendix A: Optional Python Setup Preferences](../../appendix-A/01_optional-python-setup-preferences/README.md) and [Appendix A: Installing Python Libraries](../../appendix-A/02_installing-python-libraries/README.md). | > Instructions on how to set up Python and install packages can be found in [Appendix A: Optional Python Setup Preferences](../../appendix-A/01_optional-python-setup-preferences/README.md) and [Appendix A: Installing Python Libraries](../../appendix-A/02_installing-python-libraries/README.md). | ||||||
| > | > | ||||||
| > Optionally, a Docker image running Ubuntu is provided with this repository. When having cloned the [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) GitHub repository, copy the *.devcontainer* folder of this `LLMs-from-scratch` repository and paste it to the locally cloned `gutenberg` repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md). | > Optionally, a Docker image running Ubuntu is provided with this repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md). | ||||||
| 
 | 
 | ||||||
|   |   | ||||||
| ### 2) Prepare the dataset | ### 2) Prepare the dataset | ||||||
| 
 | 
 | ||||||
| Next, run the `prepare_dataset.py` script, which concatenates the (as of this writing, 60,173) text files into fewer larger files so that they can be more efficiently transferred and accessed: | Next, run the `prepare_dataset.py` script, which concatenates the (as of this writing, 60,173) text files into fewer larger files so that they can be more efficiently transferred and accessed: | ||||||
| 
 | 
 | ||||||
| ``` | ```bash | ||||||
| python prepare_dataset.py \ | python prepare_dataset.py \ | ||||||
|   --data_dir gutenberg/data \ |   --data_dir gutenberg/data \ | ||||||
|   --max_size_mb 500 \ |   --max_size_mb 500 \ | ||||||
| @ -90,34 +109,32 @@ python pretraining_simple.py \ | |||||||
| 
 | 
 | ||||||
| The output will be formatted in the following way: | The output will be formatted in the following way: | ||||||
| 
 | 
 | ||||||
| ``` | > Total files: 3   | ||||||
| Total files: 3 | > Tokenizing file 1 of 3: data_small/combined_1.txt   | ||||||
| Tokenizing file 1 of 3: data_small/combined_1.txt | > Training ...   | ||||||
| Training ... | > Ep 1 (Step 0): Train loss 9.694, Val loss 9.724   | ||||||
| Ep 1 (Step 0): Train loss 9.694, Val loss 9.724 | > Ep 1 (Step 100): Train loss 6.672, Val loss 6.683   | ||||||
| Ep 1 (Step 100): Train loss 6.672, Val loss 6.683 | > Ep 1 (Step 200): Train loss 6.543, Val loss 6.434   | ||||||
| Ep 1 (Step 200): Train loss 6.543, Val loss 6.434 | > Ep 1 (Step 300): Train loss 5.772, Val loss 6.313   | ||||||
| Ep 1 (Step 300): Train loss 5.772, Val loss 6.313 | > Ep 1 (Step 400): Train loss 5.547, Val loss 6.249   | ||||||
| Ep 1 (Step 400): Train loss 5.547, Val loss 6.249 | > Ep 1 (Step 500): Train loss 6.182, Val loss 6.155   | ||||||
| Ep 1 (Step 500): Train loss 6.182, Val loss 6.155 | > Ep 1 (Step 600): Train loss 5.742, Val loss 6.122   | ||||||
| Ep 1 (Step 600): Train loss 5.742, Val loss 6.122 | > Ep 1 (Step 700): Train loss 6.309, Val loss 5.984   | ||||||
| Ep 1 (Step 700): Train loss 6.309, Val loss 5.984 | > Ep 1 (Step 800): Train loss 5.435, Val loss 5.975   | ||||||
| Ep 1 (Step 800): Train loss 5.435, Val loss 5.975 | > Ep 1 (Step 900): Train loss 5.582, Val loss 5.935   | ||||||
| Ep 1 (Step 900): Train loss 5.582, Val loss 5.935 | > ...   | ||||||
| ... | > Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946   | ||||||
| Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946 | > Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939   | ||||||
| Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939 | > Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961   | ||||||
| Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961 | > Saved model_checkpoints/model_pg_32188.pth   | ||||||
| Saved model_checkpoints/model_pg_32188.pth | > Book processed 3h 46m 55s    | ||||||
| Book processed 3h 46m 55s  | > Total time elapsed 3h 46m 55s    | ||||||
| Total time elapsed 3h 46m 55s  | > ETA for remaining books: 7h 33m 50s   | ||||||
| ETA for remaining books: 7h 33m 50s | > Tokenizing file 2 of 3: data_small/combined_2.txt   | ||||||
| Tokenizing file 2 of 3: data_small/combined_2.txt | > Training ...   | ||||||
| Training ... | > Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094   | ||||||
| Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094 | > Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097   | ||||||
| Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097 | > ... | ||||||
| ... |  | ||||||
| ``` |  | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|   |   | ||||||
|  | |||||||
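
The training log shown in the last hunk follows a regular line format, so it can be read back programmatically, for example to plot the loss curve. The following is a minimal sketch, not part of the repository: the regular expressions and the helper names `parse_loss_lines`, `to_seconds`, and `format_duration` are illustrative assumptions based only on the sample output above. The ETA check assumes the remaining time is estimated as the time per processed file multiplied by the number of remaining files, which is consistent with the sample (3h 46m 55s for one of three files, 7h 33m 50s remaining) but is not confirmed by the diff itself.

```python
import re

# Matches lines such as "Ep 1 (Step 100): Train loss 6.672, Val loss 6.683"
LOSS_LINE = re.compile(
    r"Ep (\d+) \(Step (\d+)\): Train loss ([\d.]+), Val loss ([\d.]+)"
)
# Matches durations such as "3h 46m 55s"
DURATION = re.compile(r"(\d+)h (\d+)m (\d+)s")


def parse_loss_lines(log_text):
    """Return (epoch, step, train_loss, val_loss) tuples found in log_text."""
    records = []
    for match in LOSS_LINE.finditer(log_text):
        epoch, step, train_loss, val_loss = match.groups()
        records.append((int(epoch), int(step), float(train_loss), float(val_loss)))
    return records


def to_seconds(duration):
    """Convert a string like '3h 46m 55s' to a number of seconds."""
    hours, minutes, seconds = map(int, DURATION.search(duration).groups())
    return hours * 3600 + minutes * 60 + seconds


def format_duration(total_seconds):
    """Format a number of seconds in the log's 'Xh Ym Zs' style."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes}m {seconds}s"


# A few lines copied from the sample output above
sample_log = """\
Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
Saved model_checkpoints/model_pg_32188.pth
Book processed 3h 46m 55s
"""

records = parse_loss_lines(sample_log)

# Assumed ETA rule: time per processed file times the number of remaining files
remaining_files = 2
eta = format_duration(to_seconds("3h 46m 55s") * remaining_files)

print(records)
print(eta)
```

Because the patterns are searched anywhere in the text, the same sketch also works on the blockquote-formatted output (lines prefixed with `> `) introduced by this commit.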
	 Daniel Kleine