"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"# Evaluating Instruction Responses Using the OpenAI API"
]
},
{
"cell_type": "markdown",
"id": "a128651b-f326-4232-a994-42f38b7ed520",
"metadata": {},
"source": [
"- This notebook uses OpenAI's GPT-4 API to evaluate responses by a instruction finetuned LLMs based on an dataset in JSON format that includes the generated model responses, for example:\n",
"\n",
"\n",
"\n",
"```python\n",
"{\n",
" \"instruction\": \"What is the atomic number of helium?\",\n",
" \"input\": \"\",\n",
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
"- First, let's test if the OpenAI API is correctly set up\n",
"- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
"- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
"- Running the experiments and creating the ~200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing"
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm should be done using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Thus, \\( c = \\sqrt{6^2 + 8^2} = \\sqrt{36 + 64} = \\sqrt{100} = 10 \\) cm.\n",
"The model response provides a hypotenuse of 3 cm, which is not only numerically incorrect but also logically inconsistent because the hypotenuse is the longest side of a right triangle and cannot be shorter than either of the other two sides (6 cm and 8 cm in this case).\n",
"\n",
"Given the scale from 0 to 100, where 100 is the best score:\n",
"- Accuracy: The response is completely inaccurate.\n",
"- Relevance: The response addresses the task of calculating the hypotenuse but fails to do so correctly.\n",
">> The model response lists six animals, three of which are repeated, and includes animals not specifically known for being diurnal (active during the day). The instruction specifically asked for three different animals that are active during the day. Here's the breakdown:\n",
"1. **Squirrel** - Correct, squirrels are diurnal.\n",
"2. **Tiger** - Generally, tigers are crepuscular (active during dawn and dusk) rather than strictly diurnal, but they can be active during the day, especially in cooler weather.\n",
"3. **Eagle** - Correct, eagles are diurnal.\n",
"4. **Cobra** - Incorrect, cobras are generally not diurnal; they are more active during the evening and early morning.\n",
"5. **Tiger** - Repeated, and as noted, not strictly diurnal.\n",
"- **Relevance to the task**: The task was to name three different animals active during the day. The response included two correct diurnal animals (squirrel, eagle) but also included incorrect and repeated entries.\n",
"- **Accuracy**: Including animals not known for being diurnal (cobra) and repeating animals reduces the accuracy.\n",
"- **Adherence to instruction**: The instruction asked for three animals, but six were provided, with repetitions.\n",
"**Reasoning**: The response partially meets the criteria by including some correct animals but fails in terms of accuracy (inclusion of non-diurnal animals), repetition of animals, and not adhering to the instruction of listing only three animals.\n",
">> The model response \"What is incorrect?\" would score relatively low on the scale for the given task. The original instruction was to rewrite the sentence \"I need to find out what's wrong\" in a more formal way. The correct output provided, \"I must ascertain what is incorrect,\" effectively increases the formality of the original sentence by using more formal vocabulary (\"must\" instead of \"need\" and \"ascertain\" instead of \"find out\") and adjusting the phrasing (\"what is incorrect\" instead of \"what's wrong\").\n",
"The model response, however, only addresses part of the sentence and does not maintain the original meaning or structure. It changes the sentence into a question and omits the aspect of needing to discover or investigate the issue, which is a critical component of the original sentence. Additionally, it does not enhance the formality significantly.\n",
"Given these considerations, I would score the model response around 20 out of 100. It recognizes the need to adjust the formality slightly but fails to maintain the original sentence's intent and structure, and does not fully meet the requirement of rewriting the sentence in a more formal way.\n",
">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection. Therefore, the response should be scored 100 out of 100.\n",
">> The model response \"The type of sentence is exclamatory.\" is incorrect. The correct type of the sentence \"Did you finish the report?\" is interrogative, as it is a question. An exclamatory sentence would express strong emotion and typically ends with an exclamation mark.\n",
"Given the incorrect identification of the sentence type, the score for the model response should be low. However, the response does correctly format the answer by stating \"The type of sentence is...\" which shows an understanding of the task's requirements but fails in accuracy.\n",