make spam spelling consistent

rasbt 2024-05-08 06:48:28 -05:00
parent 9682b0e22d
commit 24e9110fa8
2 changed files with 9 additions and 9 deletions

View File

@@ -1415,7 +1415,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.4"
}
},
"nbformat": 4,

View File

@@ -152,7 +152,7 @@
},
"source": [
"- This section prepares the dataset we use for classification finetuning\n",
"- We use a dataset consisting of SPAM and non-SPAM text messages to finetune the LLM to classify them\n",
"- We use a dataset consisting of spam and non-spam text messages to finetune the LLM to classify them\n",
"- First, we download and unzip the dataset"
]
},
@@ -354,7 +354,7 @@
"id": "e7b6e631-4f0b-4aab-82b9-8898e6663109"
},
"source": [
"- When we check the class distribution, we see that the data contains \"ham\" (i.e., not-SPAM) much more frequently than \"spam\""
"- When we check the class distribution, we see that the data contains \"ham\" (i.e., \"not spam\") much more frequently than \"spam\""
]
},
{
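For illustration, here is a minimal sketch of such a class-distribution check, assuming the messages are loaded into a pandas DataFrame with a "Label" column (as in the balancing code shown further below); the example data is made up:

```python
import pandas as pd

# Toy stand-in for the SMS dataset; the notebook loads the real data
# from the downloaded file into a DataFrame with a "Label" column
df = pd.DataFrame({
    "Label": ["ham", "ham", "ham", "spam"],
    "Text": ["msg 1", "msg 2", "msg 3", "msg 4"],
})

# Class distribution; in the real dataset, "ham" is far more frequent than "spam"
print(df["Label"].value_counts())
```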
@@ -424,7 +424,7 @@
" # Count the instances of \"spam\"\n",
" num_spam = df[df[\"Label\"] == \"spam\"].shape[0]\n",
" \n",
" # Randomly sample \"ham' instances to match the number of 'spam' instances\n",
" # Randomly sample \"ham\" instances to match the number of \"spam\" instances\n",
" ham_subset = df[df[\"Label\"] == \"ham\"].sample(num_spam, random_state=123)\n",
" \n",
" # Combine ham \"subset\" with \"spam\"\n",
@@ -443,7 +443,7 @@
"id": "d3fd2f5a-06d8-4d30-a2e3-230b86c559d6"
},
"source": [
"- Next, we change the \"string\" class labels \"ham\" and \"spam\" into integer class labels 0 and 1:"
"- Next, we change the string class labels \"ham\" and \"spam\" into integer class labels 0 and 1:"
]
},
{
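A sketch of that label conversion; the DataFrame name `balanced_df` and the toy data are assumptions, but any DataFrame with a "Label" column works the same way:

```python
import pandas as pd

# Assumed to be the balanced DataFrame produced in the previous step
balanced_df = pd.DataFrame({"Label": ["ham", "spam"], "Text": ["x", "y"]})

# Convert the string class labels to integer class labels: "ham" -> 0, "spam" -> 1
balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})
print(balanced_df)
```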
@@ -1330,7 +1330,7 @@
"metadata": {},
"source": [
"- Then, we replace the output layer (`model.out_head`), which originally maps the layer inputs to 50,257 dimensions (the size of the vocabulary)\n",
"- Since we finetune the model for binary classification (predicting 2 classes, \"spam\" and \"ham\"), we can replace the output layer as shown below, which will be trainable by default\n",
"- Since we finetune the model for binary classification (predicting 2 classes, \"spam\" and \"not spam\"), we can replace the output layer as shown below, which will be trainable by default\n",
"- Note that we use `BASE_CONFIG[\"emb_dim\"]` (which is equal to 768 in the `\"gpt2-small (124M)\"` model) to keep the code below more general"
]
},
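A sketch of what such a head replacement can look like; `model` and `BASE_CONFIG` are defined here as stand-ins for the pretrained GPT model and its configuration dict mentioned above:

```python
import torch

# Stand-ins so the snippet runs on its own; in the notebook, `model` is the
# pretrained GPT model and BASE_CONFIG comes from the model configuration
BASE_CONFIG = {"emb_dim": 768}  # 768 for "gpt2-small (124M)"
model = torch.nn.Module()

num_classes = 2  # "spam" and "not spam"

# Replace the vocabulary-sized output head (50,257 outputs) with a small,
# trainable linear layer that produces one logit per class
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes,
)
print(model.out_head)
```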
@@ -1538,7 +1538,7 @@
"- Hence, instead, we minimize the cross entropy loss as a proxy for maximizing the classification accuracy (you can learn more about this topic in lecture 8 of my freely available [Introduction to Deep Learning](https://sebastianraschka.com/blog/2021/dl-course.html#l08-multinomial-logistic-regression--softmax-regression) class.\n",
"\n",
"- Note that in chapter 5, we calculated the cross entropy loss for the next predicted token over the 50,257 token IDs in the vocabulary\n",
"- Here, we calculate the cross entropy in a similar fashion; the only difference is that instead of 50,257 token IDs, we now have only two choices: spam (label 1) or ham (label 0).\n",
"- Here, we calculate the cross entropy in a similar fashion; the only difference is that instead of 50,257 token IDs, we now have only two choices: \"spam\" (label 1) or \"not spam\" (label 0).\n",
"- In other words, the loss calculation training code is practically identical to the one in chapter 5, but we now only have two labels instead of 50,257 labels (token IDs).\n",
"\n",
"\n",
@@ -2071,7 +2071,7 @@
"id": "a74d9ad7-3ec1-450e-8c9f-4fc46d3d5bb0",
"metadata": {},
"source": [
"## 6.8 Using the LLM as a SPAM classifier"
"## 6.8 Using the LLM as a spam classifier"
]
},
{
@@ -2284,7 +2284,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.4"
}
},
"nbformat": 4,