add note about duplicated cell

rasbt 2024-08-19 21:04:18 -05:00
parent 11e2f56af5
commit f4e45a3f40

@@ -70,11 +70,12 @@
"id": "5860ba9f-2db3-4480-b96b-4be1c68981eb",
"metadata": {},
"source": [
"We can print the number of times the word \"pizza\" is sampled using the `print_sampled_tokens` function we defined in this section. Let's start with the code we defined in section 5.3.1.\n",
"- We can print the number of times the word \"pizza\" is sampled using the `print_sampled_tokens` function we defined in this section\n",
"- Let's start with the code we defined in section 5.3.1\n",
"\n",
"It is sampled 0x if the temperature is 0 or 0.1, and it is sampled 32x if the temperature is scaled up to 5. The estimated probability is 32/1000 * 100% = 3.2%.\n",
"- It is sampled 0x if the temperature is 0 or 0.1, and it is sampled 32x if the temperature is scaled up to 5. The estimated probability is 32/1000 * 100% = 3.2%\n",
"\n",
"The actual probability is 4.3% and contained in the rescaled softmax probability tensor (`scaled_probas[2][6]`)."
"- The actual probability is 4.3% and contained in the rescaled softmax probability tensor (`scaled_probas[2][6]`)"
]
},
{
@@ -82,7 +83,7 @@
"id": "9cba59c2-a8a3-4af3-add4-70230795225e",
"metadata": {},
"source": [
"Below is a self-contained example using code from chapter 5:"
"- Below is a self-contained example using code from chapter 5:"
]
},
{
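The "self-contained example using code from chapter 5" that the cell above refers to is unchanged by this commit and therefore not shown in the diff. Below is a minimal sketch of that kind of setup, assuming a toy vocabulary in which "pizza" is the 7th entry (index 6) and the temperature values 1, 0.1, and 5; the logit values here are illustrative placeholders, not necessarily the book's exact numbers:

```python
import torch

# Toy vocabulary; "pizza" is the 7th entry (index 6) -- assumption for illustration
vocab = {
    "closer": 0, "every": 1, "effort": 2, "forward": 3, "inches": 4,
    "moves": 5, "pizza": 6, "toward": 7, "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}

# Illustrative next-token logits (placeholder values)
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

def softmax_with_temperature(logits, temperature):
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the distribution
    return torch.softmax(logits / temperature, dim=0)

temperatures = [1, 0.1, 5]  # original, low, and high temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]
```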
@@ -133,7 +134,7 @@
"id": "1ee0f9f3-4132-42c7-8324-252fd8f59145",
"metadata": {},
"source": [
"Now, we can iterate over the `scaled_probas` and print the sampling frequencies in each case:"
"- Now, we can iterate over the `scaled_probas` and print the sampling frequencies in each case:"
]
},
{
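A sketch of the iteration the cell above describes, continuing the toy setup from the previous sketch; `print_sampled_tokens` is assumed to draw 1,000 samples per distribution, as stated in the surrounding cells:

```python
def print_sampled_tokens(probas):
    torch.manual_seed(123)  # fixed seed so the counts are reproducible
    sample = [torch.multinomial(probas, num_samples=1).item() for _ in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample), minlength=len(probas))
    for token_id, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[token_id]}")

for temperature, probas in zip(temperatures, scaled_probas):
    print(f"\nTemperature: {temperature}")
    print_sampled_tokens(probas)
```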
@@ -194,9 +195,11 @@
"id": "fbf88c97-19c4-462c-924a-411c8c765d2c",
"metadata": {},
"source": [
"Note that sampling offers an approximation of the actual probabilities when the word \"pizza\" is sampled. E.g., if it is sampled 32/1000 times, the estimated probability is 3.2%. To obtain the actual probability, we can check the probabilities directly by accessing the corresponding entry in `scaled_probas`.\n",
"- Note that sampling offers an approximation of the actual probabilities when the word \"pizza\" is sampled\n",
"- E.g., if it is sampled 32/1000 times, the estimated probability is 3.2%\n",
"- To obtain the actual probability, we can check the probabilities directly by accessing the corresponding entry in `scaled_probas`\n",
"\n",
"Since \"pizza\" is the 7th entry in the vocabulary, for the temperature of 5, we obtain it as follows:"
"- Since \"pizza\" is the 7th entry in the vocabulary, for the temperature of 5, we obtain it as follows:"
]
},
{
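The lookup described in the cell above could look like the following sketch; index 2 assumes that the temperature of 5 is the third entry in the `temperatures` list, as in the sketch further up:

```python
temp5_idx = 2   # temperature 5 is the third temperature setting (assumption)
pizza_idx = 6   # "pizza" is the 7th vocabulary entry, i.e., index 6
print(f"Probability of 'pizza' at temperature 5: {scaled_probas[temp5_idx][pizza_idx]:.4f}")
```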
@@ -228,7 +231,7 @@
"id": "d3dcb438-5f18-4332-9627-66009f30a1a4",
"metadata": {},
"source": [
"There is a 4.3% probability that the word \"pizza\" is sampled if the temperature is set to 5."
"There is a 4.3% probability that the word \"pizza\" is sampled if the temperature is set to 5"
]
},
{
@@ -379,6 +382,14 @@
"print(\"Output text:\\n\", token_ids_to_text(token_ids, tokenizer))"
]
},
{
"cell_type": "markdown",
"id": "c85b1f11-37a5-477d-9c2d-170a6865e669",
"metadata": {},
"source": [
"- Note that re-executing the previous code cell will produce the exact same generated text:"
]
},
{
"cell_type": "code",
"execution_count": 9,
@@ -422,9 +433,10 @@
"id": "f40044e8-a0f5-476c-99fd-489b999fd80a",
"metadata": {},
"source": [
"If we are still in the Python session where you first trained the model in chapter 5, to continue the pretraining for one more epoch, we just have to load the model and optimizer that we saved in the main chapter and call the `train_model_simple` function again.\n",
"- If we are still in the Python session where you first trained the model in chapter 5, to continue the pretraining for one more epoch, we just have to load the model and optimizer that we saved in the main chapter and call the `train_model_simple` function again\n",
"\n",
"It takes a couple more steps to make this reproducible in this new code environment. First, we load the tokenizer, model, and optimizer:"
"- It takes a couple more steps to make this reproducible in this new code environment\n",
"- First, we load the tokenizer, model, and optimizer:"
]
},
{
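The loading code itself is unchanged and not shown in this diff; below is a sketch of what it could look like, assuming the `GPTModel` class from the chapter 5 code (e.g., a local `previous_chapters.py` module), the chapter's training configuration, and the `model_and_optimizer.pth` checkpoint name used there:

```python
import tiktoken
import torch

# Assumption: GPTModel is available from the chapter 5 code, e.g. a local module
from previous_chapters import GPTModel

# Configuration values following chapter 5 (256-token context for training)
GPT_CONFIG_124M = {
    "vocab_size": 50257, "context_length": 256, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": False,
}

tokenizer = tiktoken.get_encoding("gpt2")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("model_and_optimizer.pth", map_location=device)  # assumed filename

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()
```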
@@ -468,7 +480,7 @@
"id": "688fce4a-9ab2-4d97-a95c-fef02c32b4f3",
"metadata": {},
"source": [
"Next, we initialize the data loader:"
"- Next, we initialize the data loader:"
]
},
{
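The data-loader cell is likewise unchanged in this commit. A sketch, continuing the one above and assuming the `create_dataloader_v1` helper and the 90/10 train/validation split from chapter 5 (the file name is also an assumption):

```python
from previous_chapters import create_dataloader_v1  # assumed chapter 5 helper

file_path = "the-verdict.txt"  # assumption: same short story used for pretraining in chapter 5
with open(file_path, "r", encoding="utf-8") as f:
    text_data = f.read()

# 90/10 train/validation split
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    text_data[:split_idx],
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0,
)
val_loader = create_dataloader_v1(
    text_data[split_idx:],
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0,
)
```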
@@ -531,7 +543,7 @@
"id": "76598ef8-165c-4bcc-af5e-b6fe72398365",
"metadata": {},
"source": [
"Lastly, we use the `train_model_simple` function to train the model:"
"- Lastly, we use the `train_model_simple` function to train the model:"
]
},
{
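And a sketch of the training call, assuming the chapter 5 `train_model_simple` signature; the evaluation settings here are illustrative:

```python
from previous_chapters import train_model_simple  # assumed chapter 5 helper

num_epochs = 1
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)
```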
@@ -574,21 +586,22 @@
"id": "7cb1140b-2027-4156-8d19-600ac849edbe",
"metadata": {},
"source": [
"We can use the following code to calculate the training and validation set losses of the GPT model:\n",
"- We can use the following code to calculate the training and validation set losses of the GPT model:\n",
"\n",
"```python\n",
"train_loss = calc_loss_loader(train_loader, gpt, device)\n",
"val_loss = calc_loss_loader(val_loader, gpt, device)\n",
"```\n",
"\n",
"The resulting losses for the 124M parameter are as follows:\n",
"- The resulting losses for the 124M parameter are as follows:\n",
"\n",
"```\n",
"Training loss: 3.754748503367106\n",
"Validation loss: 3.559617757797241\n",
"```\n",
"\n",
"The main observation is that the training and validation set performances are in the same ballpark. This can have multiple explanations.\n",
"- The main observation is that the training and validation set performances are in the same ballpark\n",
"- This can have multiple explanations:\n",
"\n",
"1. The Verdict was not part of the pretraining dataset when OpenAI trained GPT-2. Hence, the model is not explicitly overfitting to the training set and performs similarly well on The Verdict's training and validation set portions. (The validation set loss is slightly lower than the training set loss, which is unusual in deep learning. However, it's likely due to random noise since the dataset is relatively small. In practice, if there is no overfitting, the training and validation set performances are expected to be roughly identical).\n",
"\n",
@@ -849,14 +862,17 @@
"id": "b3d313f4-0038-4bc9-a340-84b3b55dc0e3",
"metadata": {},
"source": [
"In the main chapter, we experimented with the smallest GPT-2 model, which has only 124M parameters. The reason was to keep the resource requirements as low as possible. However, you can easily experiment with larger models with minimal code changes. For example, instead of loading the 1558M instead of 124M model in chapter 5, the only 2 lines of code that we have to change are\n",
"- In the main chapter, we experimented with the smallest GPT-2 model, which has only 124M parameters\n",
"- The reason was to keep the resource requirements as low as possible\n",
"- However, you can easily experiment with larger models with minimal code changes\n",
"- For example, instead of loading the 1558M instead of 124M model in chapter 5, the only 2 lines of code that we have to change are\n",
"\n",
"```python\n",
"settings, params = download_and_load_gpt2(model_size=\"124M\", models_dir=\"gpt2\")\n",
"model_name = \"gpt2-small (124M)\"\n",
"```\n",
"\n",
"The updated code becomes\n",
"- The updated code becomes\n",
"\n",
"\n",
"```python\n",
@@ -992,7 +1008,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.10.6"
}
},
"nbformat": 4,