"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"# Evaluating Instruction Responses Using the OpenAI API"
]
},
{
"cell_type": "markdown",
"id": "a128651b-f326-4232-a994-42f38b7ed520",
"metadata": {},
"source": [
"- This notebook uses OpenAI's GPT-4 API to evaluate responses by a instruction finetuned LLMs based on an dataset in JSON format that includes the generated model responses, for example:\n",
"\n",
"\n",
"\n",
"```python\n",
"{\n",
" \"instruction\": \"What is the atomic number of helium?\",\n",
" \"input\": \"\",\n",
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
"- First, let's test if the OpenAI API is correctly set up\n",
"- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
"- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
"- Running the experiments and creating the ~200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing"
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm can be found using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Mathematically, this is expressed as:\n",
"The correct answer should be 10 cm. The response given as 3 cm is not only incorrect but also significantly off from the correct value. This error could lead to misunderstandings or incorrect applications in practical scenarios where precise measurements are crucial.\n",
"Given the scale from 0 to 100, where 100 is the best score, the response would score very low due to its inaccuracy. However, since the response format is correct (stating the measurement and unit), it does not score the absolute minimum.\n",
">> The model response lists six animals, three of which (squirrel, tiger, eagle) are indeed active during the day, making them correct responses to the instruction. However, the instruction specifically asked for three different animals, and the model response includes repetitions (tiger and cobra are each listed twice) and also exceeds the requested number of animals.\n",
"\n",
"The inclusion of \"cobra\" is incorrect as most cobras are not diurnal (active during the day); they are generally more active during the early morning and late evening, which can be considered crepuscular rather than diurnal.\n",
"- **Relevance to the task**: The response correctly identifies three diurnal animals but also includes additional animals, which was not requested.\n",
"- **Accuracy**: Including animals not active during the day (cobra) and repeating animals reduces the accuracy.\n",
"- **Adherence to instructions**: The task was to name three different animals, but the response included six names with repetitions.\n",
"Given these points, the response partially meets the requirements but also deviates significantly in terms of the number of animals and the inclusion of incorrect and repeated entries.\n",
"This score reflects that while the response did include three correct animals, it failed to strictly follow the instructions by listing only three different animals and included incorrect information.\n",
">> The model response \"What is incorrect?\" scores low in terms of fulfilling the instruction to rewrite the sentence in a more formal way. The original sentence \"I need to find out what's wrong.\" expresses a personal obligation and a process of discovery, which is not captured in the model response. The model response turns the sentence into a direct question and loses the nuance of needing to discover or investigate the issue.\n",
"- **Formality:** The response is slightly more formal than casual speech but does not elevate the formality significantly or appropriately. It does use \"incorrect\" which is slightly more formal than \"wrong.\"\n",
"- **Completeness:** The response fails to include the aspect of needing to find out or ascertain, which is a critical part of the original sentence.\n",
"- **Accuracy:** The response changes the structure and intent by converting it into a direct question, which does not align with the instruction to rewrite the statement while maintaining its original intent.\n",
"Overall, the response does not adequately meet the requirements of the task as it significantly alters the meaning and omits key elements of the original sentence.\n",
">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection, which is used to express surprise or admiration, fitting the context of the sentence. Therefore, the response is fully correct and meets all the requirements of the task.\n",
">> The model response \"The type of sentence is exclamatory.\" is incorrect. The input sentence \"Did you finish the report?\" is clearly an interrogative sentence as it is asking a question, indicated by the question mark at the end and the structure of the sentence.\n",
"Given the scoring criteria where 100 is the best score and should be awarded to a correct and precise response, the model's response should receive a low score because it incorrectly identifies the type of sentence. An exclamatory sentence typically expresses strong emotion and ends with an exclamation mark, which is not the case here.\n",
"Therefore, the score for the model response would be 0 out of 100, as it completely misidentifies the type of sentence, providing incorrect information.\n",