Gutenberg for Windows users (#99)

Sebastian Raschka 2024-04-02 08:54:24 -05:00 committed by GitHub
parent f30dd2dd2b
commit 5af3834760

@ -13,6 +13,8 @@ Please read the [Project Gutenberg Permissions, Licensing and other Common Reque
### 1) Download the dataset
In this section, we download books from Project Gutenberg using code from the [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) GitHub repository.
As of this writing, this requires approximately 50 GB of disk space, but it may require more depending on how much Project Gutenberg has grown since then.
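
Before starting the download, it can help to confirm that the target drive has enough room. The snippet below is a minimal sketch using Python's standard library; the helper name `has_enough_space` and the `REQUIRED_GB` constant are hypothetical and simply reuse the approximate 50 GB figure mentioned above.

```python
import shutil

# Approximate requirement taken from the text above; adjust if the corpus has grown.
REQUIRED_GB = 50


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return (ok, free_gb) for the drive that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb, free_gb


ok, free_gb = has_enough_space(".")
status = "enough" if ok else "possibly not enough"
print(f"{free_gb:.1f} GB free: {status} space for the download")
```
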
Follow these steps to download the dataset:
@ -28,6 +30,10 @@ Follow these steps to download the dataset:
5. `cd ..`
 
> [!NOTE]
> The [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) code is compatible with both Linux and macOS. However, Windows users need to make small adjustments, such as adding `shell=True` to the `subprocess` calls and replacing `rsync`. Alternatively, an easier way to run this code on Windows is to use the "Windows Subsystem for Linux" (WSL) feature, which allows users to run a Linux environment in Windows. For more information, please read [Microsoft's official installation instructions](https://learn.microsoft.com/en-us/windows/wsl/install) and [tutorial](https://learn.microsoft.com/en-us/training/modules/wsl-introduction/).
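
For readers who prefer to adapt the code rather than use WSL, the `shell=True` adjustment mentioned in the note above could look roughly like the following. This is only an illustrative sketch, not code from the `pgcorpus/gutenberg` repository: the helper names `subprocess_kwargs` and `run_command` are hypothetical, and the sketch does not cover replacing `rsync`.

```python
import platform
import subprocess
import sys


def subprocess_kwargs(system=None):
    """Return extra keyword arguments for subprocess.run on the given OS."""
    system = system or platform.system()
    # Enable shell=True only on Windows. On Linux/macOS, combining
    # shell=True with a list of arguments would run only the first element.
    return {"shell": True} if system == "Windows" else {}


def run_command(args):
    """Run a command given as a list of arguments and capture its output."""
    return subprocess.run(
        args, check=True, capture_output=True, text=True, **subprocess_kwargs()
    )


# Usage: run a harmless command with the current Python interpreter.
result = run_command([sys.executable, "-c", "print('ok')"])
print(result.stdout.strip())
```

Keeping the platform check in one helper means each existing `subprocess` call only needs the extra keyword arguments, rather than a separate Windows-specific branch.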
 
### 2) Prepare the dataset