Merge pull request #180 from rasbt/model-eval

Add openai model eval utility code
2025-10-30 01:10:33 +00:00 · 2024-05-26 12:02:36 -04:00 · 2024-05-26 12:02:36 -04:00 · 8f362634b8
commit 8f362634b8
parent fd6d0d13e4 91339535e1
4 changed files with 1282 additions and 0 deletions
--- a/ch07/03_model-evaluation/README.md
+++ b/ch07/03_model-evaluation/README.md
@@ -0,0 +1,25 @@
+# Chapter 7: Instruction and Preference Finetuning
+
+This folder contains utility code that can be used for model evaluation.
+
+Install the additional package requirements via:
+
+```bash
+pip install -r requirements-extra.txt
+```
+
+
+&nbsp;
+## Evaluating Instruction Responses Using the OpenAI API
+
+- The [llm-instruction-eval-openai.ipynb](llm-instruction-eval-openai.ipynb) notebook uses OpenAI's GPT-4 to evaluate responses generated by instruction-finetuned models. It expects a JSON file containing a list of entries in the following format:
+
+```python
+{
+    "instruction": "What is the atomic number of helium?",
+    "input": "",
+    "output": "The atomic number of helium is 2.",               # <-- The target given in the test set
+    "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM
+    "model 2 response": "\nThe atomic number of helium is 3."    # <-- Response by a 2nd LLM
+},
+```
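
Note on the entry format above (not part of the commit): before sending a file to the evaluation notebook, it may help to confirm that every entry carries the five keys the README lists. The sketch below is a minimal check under that assumption. The helper names `load_eval_entries` and `exact_match_rate` are hypothetical and do not come from the repository, and the exact-match comparison is only a crude local baseline, not the GPT-4 scoring the notebook performs.

```python
import json

# Keys taken from the entry format shown in the README hunk above.
REQUIRED_KEYS = {
    "instruction",
    "input",
    "output",
    "model 1 response",
    "model 2 response",
}


def load_eval_entries(path):
    """Load the evaluation file and verify each entry has the expected keys.

    Hypothetical helper for illustration; not part of the repository.
    """
    with open(path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of entries")
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
    return entries


def exact_match_rate(entries, response_key):
    """Fraction of responses identical to the target after stripping whitespace.

    A rough sanity check only; it cannot credit correct paraphrases.
    """
    if not entries:
        return 0.0
    hits = sum(e[response_key].strip() == e["output"].strip() for e in entries)
    return hits / len(entries)
```

For example, `exact_match_rate(load_eval_entries("eval-example-data.json"), "model 1 response")` would report how often the first model reproduces the target verbatim. Responses in the example data start with a newline, which is why both sides are stripped before comparison.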
--- a/ch07/03_model-evaluation/eval-example-data.json
+++ b/ch07/03_model-evaluation/eval-example-data.json
@@ -0,0 +1,702 @@
+[
+    {
+        "instruction": "Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.",
+        "input": "",
+        "output": "The hypotenuse of the triangle is 10 cm.",
+        "model 1 response": "\nThe hypotenuse of the triangle is 3 cm.",
+        "model 2 response": "\nThe hypotenuse of the triangle is 12 cm."
+    },
+    {
+        "instruction": "Name 3 different animals that are active during the day.",
+        "input": "",
+        "output": "1. Squirrel\n2. Eagle\n3. Tiger",
+        "model 1 response": "\n1. Squirrel\n2. Tiger\n3. Eagle\n4. Cobra\n5. Tiger\n6. Cobra",
+        "model 2 response": "\n1. Squirrel\n2. Eagle\n3. Tiger"
+    },
+    {
+        "instruction": "Rewrite the following sentence in a more formal way:\n\"I need to find out what's wrong.\"",
+        "input": "",
+        "output": "I must ascertain what is incorrect.",
+        "model 1 response": "\nWhat is incorrect?",
+        "model 2 response": "\nI need to find out what's wrong."
+    },
+    {
+        "instruction": "Identify the interjection in the sentence.",
+        "input": "Wow, that was an amazing trick!",
+        "output": "The interjection in the sentence is 'Wow'.",
+        "model 1 response": "\nThe interjection in the sentence is 'Wow'.",
+        "model 2 response": "\nThe interjection in the sentence is 'Wow'."
+    },
+    {
+        "instruction": "Identify the type of sentence.",
+        "input": "Did you finish the report?",
+        "output": "The type of sentence is interrogative.",
+        "model 1 response": "\nThe type of sentence is exclamatory.",
+        "model 2 response": "\nThe type of sentence is exclamatory."
+    },
+    {
+        "instruction": "Convert the following sentence into future tense.",
+        "input": "She walks to school.",
+        "output": "She will walk to school.",
+        "model 1 response": "\nShe will walk to school.",
+        "model 2 response": "\nShe will walk to school."
+    },
+    {
+        "instruction": "Convert 500 grams to ounces.",
+        "input": "",
+        "output": "500 grams is approximately 17.64 ounces.",
+        "model 1 response": "\n500 grams is approximately 99.21 ounces.",
+        "model 2 response": "\n500 grams is approximately 0.05 ounces."
+    },
+    {
+        "instruction": "Edit the following sentence to remove all passive voice.",
+        "input": "The song was composed by the artist.",
+        "output": "The artist composed the song.",
+        "model 1 response": "\nThe artist composed the song.",
+        "model 2 response": "\nThe artist composed the song."
+    },
+    {
+        "instruction": "Provide a word that rhymes with \"care.\"",
+        "input": "",
+        "output": "A word that rhymes with \"care\" is \"fare.\"",
+        "model 1 response": "\nA word that rhymes with \"care\" is \"bare.\"",
+        "model 2 response": "\nA word that rhymes with \"care\" is \"care-free.\""
+    },
+    {
+        "instruction": "Rewrite the given sentence to describe the same thing in a positive way.",
+        "input": "The meal was not delicious.",
+        "output": "The meal could use some improvement.",
+        "model 1 response": "\nThe meal could use some improvement.",
+        "model 2 response": "\nThe meal was delicious."
+    },
+    {
+        "instruction": "Generate a sentence using the word 'elusive'.",
+        "input": "",
+        "output": "The meaning of the poem was elusive, escaping easy interpretation.",
+        "model 1 response": "\nThe poem was elusive, elusive, and elusive.",
+        "model 2 response": "\nThe elusive object that is so elusive that it is impossible to describe is the elusive cat."
+    },
+    {
+        "instruction": "Arrange the given numbers in descending order.",
+        "input": "7, 10, 3, 6, 9",
+        "output": "10, 9, 7, 6, 3.",
+        "model 1 response": "\n10 numbers in descending order: 3, 6, 3.",
+        "model 2 response": "\nThe numbers in descending order are: 3, 6, 9."
+    },
+    {
+        "instruction": "Convert this direct speech to indirect speech.",
+        "input": "'I am tired,' he said.",
+        "output": "He said that he was tired.",
+        "model 1 response": "\nHe said that he was tired.",
+        "model 2 response": "\nHe said that he was tired."
+    },
+    {
+        "instruction": "What is the capital of Turkey?",
+        "input": "",
+        "output": "The capital of Turkey is Ankara.",
+        "model 1 response": "\nThe capital of Turkey is Ankara.",
+        "model 2 response": "\nThe capital of Turkey is Ankara."
+    },
+    {
+        "instruction": "Convert the sentence to future tense.",
+        "input": "They play football every Sunday.",
+        "output": "They will play football every Sunday.",
+        "model 1 response": "\nThey will play football every Sunday.",
+        "model 2 response": "\nThey will play football every Sunday."
+    },
+    {
+        "instruction": "Provide a synonym for 'quick'.",
+        "input": "",
+        "output": "A synonym for 'quick' is 'fast'.",
+        "model 1 response": "\nA synonym for 'quick' is 'fast'.",
+        "model 2 response": "\nA synonym for 'quick' is 'fast'."
+    },
+    {
+        "instruction": "What is a synonym of 'luminous'?",
+        "input": "",
+        "output": "A synonym for 'luminous' is 'bright'.",
+        "model 1 response": "\nA synonym for 'luminous' is 'bright'.",
+        "model 2 response": "\nA synonym for 'luminous' is 'bright'."
+    },
+    {
+        "instruction": "Name the process by which plants lose water vapor through their leaves.",
+        "input": "",
+        "output": "The process by which plants lose water vapor through their leaves is called transpiration.",
+        "model 1 response": "\nThe process is called evaporation.",
+        "model 2 response": "\nThe process by which plants lose water vapor through their leaves is called transpiration."
+    },
+    {
+        "instruction": "Identify the verb in the sentence.",
+        "input": "Birds fly south for the winter.",
+        "output": "The verb in the sentence is 'fly'.",
+        "model 1 response": "\nThe verb in the sentence is \"fly.\"",
+        "model 2 response": "\nThe verb in the sentence is 'fly'."
+    },
+    {
+        "instruction": "What type of bond is formed when one atom donates an electron to another atom?",
+        "input": "",
+        "output": "The type of bond formed when one atom donates an electron to another atom is an ionic bond.",
+        "model 1 response": "\nThe type of bond formed when one atom donates an electron to another atom is an ionic bond.",
+        "model 2 response": "\nThe type of bond formed when one atom donates an electron to another atom is ionic."
+    },
+    {
+        "instruction": "Rewrite the following sentence to include a simile.",
+        "input": "The car moved fast.",
+        "output": "The car moved as fast as a cheetah.",
+        "model 1 response": "\nThe car moved as fast as a cheetah.",
+        "model 2 response": "\nThe car moved at a fast pace."
+    },
+    {
+        "instruction": "Identify the correct spelling of the following word.",
+        "input": "Necesary",
+        "output": "The correct spelling is \"Necessary\".",
+        "model 1 response": "\nThe correct spelling is \"Necesary\".",
+        "model 2 response": "\nThe correct spelling of the word \"Necessary\" is \"Necessary\"."
+    },
+    {
+        "instruction": "Rewrite this sentence in the active voice.",
+        "input": "A novel was written by Mary.",
+        "output": "Mary wrote a novel.",
+        "model 1 response": "\nMary wrote a novel.",
+        "model 2 response": "\nMary wrote a novel."
+    },
+    {
+        "instruction": "rewrite this sentence as a question.",
+        "input": "Your brother went to London last year.",
+        "output": "Did your brother go to London last year?",
+        "model 1 response": "\nDid your brother go to London last year?",
+        "model 2 response": "\nDid your brother go to London last year?"
+    },
+    {
+        "instruction": "Explain what a haiku is.",
+        "input": "",
+        "output": "A haiku is a form of traditional Japanese poetry that consists of three lines with a syllable pattern of 5-7-5.",
+        "model 1 response": "\nA haiku is a form of traditional Japanese poetry that consists of three lines with a syllable pattern of 5-7-5.",
+        "model 2 response": "\nA haiku is a form of traditional Japanese poetry that consists of three lines that are composed of three syllables."
+    },
+    {
+        "instruction": "Split the following sentence into two declarative sentences: \"The book is interesting but the cover is dull\".",
+        "input": "",
+        "output": "The book is interesting.\nThe cover is dull.",
+        "model 1 response": "\nThe book is interesting.\nThe cover is dull.",
+        "model 2 response": "\nThe book is dull."
+    },
+    {
+        "instruction": "What is the antonym of 'sparse'?",
+        "input": "",
+        "output": "The antonym of 'sparse' is 'dense'.",
+        "model 1 response": "\nThe antonym of 'sparse' is 'dense'.",
+        "model 2 response": "\nThe antonym of 'sparse' is 'dense'."
+    },
+    {
+        "instruction": "Identify the type of sentence.",
+        "input": "What a beautiful view!",
+        "output": "The type of sentence is exclamatory.",
+        "model 1 response": "\nThe type of sentence is exclamatory.",
+        "model 2 response": "\nThe type of sentence is subjunctive."
+    },
+    {
+        "instruction": "What is the boiling point of mercury in Celsius?",
+        "input": "",
+        "output": "The boiling point of mercury is 356.7 degrees Celsius.",
+        "model 1 response": "\nThe boiling point of mercury is approximately 100 degrees Celsius.",
+        "model 2 response": "\nThe boiling point of mercury is 100 degrees Celsius."
+    },
+    {
+        "instruction": "Define the term 'kinetic energy'.",
+        "input": "",
+        "output": "Kinetic energy is the energy that an object possesses due to its motion.",
+        "model 1 response": "\nKinetic energy is the energy that an object possesses due to its motion.",
+        "model 2 response": "\nKinetic energy is the energy that an object possesses due to its motion."
+    },
+    {
+        "instruction": "Change the following sentence to past perfect tense.",
+        "input": "They finish the game.",
+        "output": "They had finished the game.",
+        "model 1 response": "\nThey had finished the game.",
+        "model 2 response": "\nThey will finish the game."
+    },
+    {
+        "instruction": "Classify the following items as either solid, liquid, or gas.",
+        "input": "Mercury, oxygen, wood",
+        "output": "Mercury - Liquid\nOxygen - Gas\nWood - Solid",
+        "model 1 response": "\nMercury: Solid\nOxygen: Gas\nWood: Solid\nPlants: Vegetable\nMinerals: Iron\nMinerals: Gold\nMinerals: Coal\nMinerals: Bacteria: E. coli\nFruits: Cherries\nMinerals: Wheat\nMinerals: Vinegar\nMinerals: Fish\nMinerals: Mercury\nMinerals: Oxygen\nMinerals: Carbon\nMinerals: Wood\nMinerals: Gas\nMinerals: Carbon\nMinerals: Wood\nMinerals: Solid\nMinerals: Gas\nMinerals: Solid\nMinerals: Gas\nMinerals: Solid\nMinerals: Solid\nMinerals: Gas\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\n",
+        "model 2 response": "\nMercury is a solid. Oxygen is a liquid. Wood is a gas."
+    },
+    {
+        "instruction": "What is the synonym for 'enormous'?",
+        "input": "",
+        "output": "A synonym for 'enormous' is 'gigantic'.",
+        "model 1 response": "\nThe synonym for 'enormous' is 'gigantic'.",
+        "model 2 response": "\nThe synonym for 'enormous' is 'gigantic'."
+    },
+    {
+        "instruction": "Translate the term 'sustainable development' into German.",
+        "input": "",
+        "output": "The German translation of 'sustainable development' is 'nachhaltige Entwicklung'.",
+        "model 1 response": "\nThe German translation of 'sustainable development' is 'nachhaltige Entwicklung'.",
+        "model 2 response": "\nThe German translation of 'sustainable development' is 'sustainable development f\u00fcr das Entwicklung'."
+    },
+    {
+        "instruction": "Name three essential elements found in nature.",
+        "input": "",
+        "output": "1. Carbon\n2. Oxygen\n3. Iron",
+        "model 1 response": "\n1. Carbon\n2. Oxygen\n3. Oxygen\n### Response:\n1. Carbon\n2. Oxygen\n3. Oxygen",
+        "model 2 response": "\n1. Carbon\n2. Iron\n3. Titanium"
+    },
+    {
+        "instruction": "Rewrite the following phrase as a question.",
+        "input": "We must finish the project by tomorrow.",
+        "output": "Must we finish the project by tomorrow?",
+        "model 1 response": "\nMust we finish the project by tomorrow?",
+        "model 2 response": "\nWe must finish the project by tomorrow."
+    },
+    {
+        "instruction": "Combine the given sentences into one cohesive sentence.",
+        "input": "It is raining. I cannot go to the park.",
+        "output": "Since it is raining, I cannot go to the park.",
+        "model 1 response": "\nIt is raining. I cannot go to the park.",
+        "model 2 response": "\nIt is raining. I cannot go to the park."
+    },
+    {
+        "instruction": "Calculate the volume of a cube with side length of 3 meters.",
+        "input": "",
+        "output": "The volume of the cube is 27 cubic meters.",
+        "model 1 response": "\nThe volume of the cube is approximately 3 cubic meters.",
+        "model 2 response": "\nThe volume of the cube is approximately 3.5 cubic meters."
+    },
+    {
+        "instruction": "Re-word this sentence using an active voice.",
+        "input": "The decision was made by the team.",
+        "output": "The team made the decision.",
+        "model 1 response": "\nThe team made the decision.",
+        "model 2 response": "\nThe team made the decision."
+    },
+    {
+        "instruction": "Convert the sentence to use an indefinite pronoun.",
+        "input": "Someone left a message.",
+        "output": "A message was left by someone.",
+        "model 1 response": "\nA message was left by someone.",
+        "model 2 response": "\nA message was left by someone."
+    },
+    {
+        "instruction": "Classify the following statement into one of these labels: [positive, negative, neutral]",
+        "input": "My computer crashed.",
+        "output": "The statement \"My computer crashed\" can be classified as negative.",
+        "model 1 response": "\nThe statement \"My computer crashed\" can be classified as negative.",
+        "model 2 response": "\nThe statement \"My computer crashed\" is classified as negative."
+    },
+    {
+        "instruction": "What is the derivative of x^2?",
+        "input": "",
+        "output": "The derivative of x^2 is 2x.",
+        "model 1 response": "\nThe derivative of x^2 is 2x.",
+        "model 2 response": "\nThe derivative of x^2 is 2x."
+    },
+    {
+        "instruction": "Fill in the blanks with appropriate words.",
+        "input": "The sky was filled with colorful ________ at sunset.",
+        "output": "The sky was filled with colorful clouds at sunset.",
+        "model 1 response": "\nThe sky was filled with colorful clouds at sunset.",
+        "model 2 response": "\nThe sky was filled with colorful clouds at sunset."
+    },
+    {
+        "instruction": "Change the sentence from passive to active voice.",
+        "input": "The song was sung by the choir.",
+        "output": "The choir sang the song.",
+        "model 1 response": "\nThe choir sang the song.",
+        "model 2 response": "\nThe choir sang the song."
+    },
+    {
+        "instruction": "What is the atomic mass of carbon?",
+        "input": "",
+        "output": "The atomic mass of carbon is approximately 12.01 atomic mass units.",
+        "model 1 response": "\nThe atomic mass of carbon is approximately 3.01 atomic mass units.",
+        "model 2 response": "\nThe atomic mass of carbon is approximately 3.0 atomic mass units."
+    },
+    {
+        "instruction": "Convert the sentence to present perfect tense: 'He wins the race.'",
+        "input": "He wins the race.",
+        "output": "He has won the race.",
+        "model 1 response": "\nHe has won the race.",
+        "model 2 response": "\nHe will win the race."
+    },
+    {
+        "instruction": "Rewrite the following sentence to include an idiom.",
+        "input": "It started raining suddenly.",
+        "output": "It started raining cats and dogs.",
+        "model 1 response": "\nIt started raining cats and dogs.",
+        "model 2 response": "\nIt started raining suddenly."
+    },
+    {
+        "instruction": "What is the atomic number of helium?",
+        "input": "",
+        "output": "The atomic number of helium is 2.",
+        "model 1 response": "\nThe atomic number of helium is 2.0.",
+        "model 2 response": "\nThe atomic number of helium is 3."
+    },
+    {
+        "instruction": "Name three types of clouds.",
+        "input": "",
+        "output": "Three types of clouds are cumulus, stratus, and cirrus.",
+        "model 1 response": "\nThree types of clouds are cirrus, cirrus-like, and cirrus-like-ring.",
+        "model 2 response": "\nThree types of clouds are cumulus, stratus, and cirrus."
+    },
+    {
+        "instruction": "Identify the onomatopoeia in the sentence.",
+        "input": "The bees buzzed as they flew past.",
+        "output": "The onomatopoeia in the sentence is 'buzzed'.",
+        "model 1 response": "\nThe onomatopoeia in the sentence is that the bees buzzed as they flew past.",
+        "model 2 response": "\nThe onomatopoeia in the sentence is 'buzzed'."
+    },
+    {
+        "instruction": "Name the process by which water changes from liquid to gas.",
+        "input": "",
+        "output": "The process by which water changes from liquid to gas is called evaporation.",
+        "model 1 response": "\nThe process is called evaporation.",
+        "model 2 response": "\nThe process by which water changes from liquid to gas is called evaporation."
+    },
+    {
+        "instruction": "Find the synonym for the given verb.",
+        "input": "Begin",
+        "output": "Commence",
+        "model 1 response": "\nCommence",
+        "model 2 response": "\nBegin\n\n### Input:\nbegin\n\n### Response:\nBegin\n\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin"
+    },
+    {
+        "instruction": "Classify the following numbers as prime or composite.",
+        "input": ": 11, 14, 19.",
+        "output": "Prime numbers: 11, 19\nComposite numbers: 14",
+        "model 1 response": "\nPrime numbers: 11, 19\n\nComposite numbers: 14\n\n### Response:\nPrime numbers: 11, 19\nComposite numbers: 14",
+        "model 2 response": "\nPrime numbers: 11, 19, 19. Composite numbers: 14, 19, 19."
+    },
+    {
+        "instruction": "Classify the item as either a grain or a fruit.",
+        "input": "Wheat",
+        "output": "Grain",
+        "model 1 response": "\nGrain\n\n### Response:\nGrain\n### Response:\nWheat\n### Response:\nGrain",
+        "model 2 response": "\nGrain"
+    },
+    {
+        "instruction": "Provide the plural form of 'thesis'.",
+        "input": "",
+        "output": "The plural form of 'thesis' is 'theses'.",
+        "model 1 response": "\nThe plural form of 'thesis' is 'theses'.",
+        "model 2 response": "\nThe plural form of 'thesis' is 'thesis'."
+    },
+    {
+        "instruction": "Generate three verbs that mean the same as \"to run.\"",
+        "input": "",
+        "output": "1. To sprint\n2. To dash\n3. To jog",
+        "model 1 response": "\n1. To sprint\n2. To dash\n3. To dash\n4. To dash\n5. To dash\n6. To dash",
+        "model 2 response": "\n1. To sprint\n2. To dash\n3. To jog"
+    },
+    {
+        "instruction": "Generate a question that could logically follow the statement below.",
+        "input": "I just returned from a vacation in Spain.",
+        "output": "How was your vacation in Spain?",
+        "model 1 response": "\nHow was your vacation in Spain?",
+        "model 2 response": "\nWhat was your vacation in Spain?"
+    },
+    {
+        "instruction": "What is the normal boiling point of water?",
+        "input": "",
+        "output": "The normal boiling point of water is 100 degrees Celsius.",
+        "model 1 response": "\nThe normal boiling point of water is 100 degrees Celsius.",
+        "model 2 response": "\nThe normal boiling point of water is 100 degrees Celsius."
+    },
+    {
+        "instruction": "Convert the sentence to active voice.",
+        "input": "The book was read by many.",
+        "output": "Many people read the book.",
+        "model 1 response": "\nMany people read the book.",
+        "model 2 response": "\nMany people read the book."
+    },
+    {
+        "instruction": "Identify the tense used in the sentence.",
+        "input": "They are watching a movie.",
+        "output": "The tense used in the sentence \"They are watching a movie\" is the present continuous tense.",
+        "model 1 response": "\nThe tense used in the sentence is future continuous.",
+        "model 2 response": "\nThe tense used in the sentence is future continuous."
+    },
+    {
+        "instruction": "Explain what GDP stands for and what it measures.",
+        "input": "",
+        "output": "GDP stands for Gross Domestic Product, which measures the total value of all goods and services produced over a specific time period within a country's borders.",
+        "model 1 response": "\nGDP stands for Gross Domestic Product, which measures the total value of all goods and services produced over a specific time period within a country's borders.",
+        "model 2 response": "\nGDP stands for Gross Domestic Product, which measures the total value of all goods and services produced over a specific time period within a country's borders."
+    },
+    {
+        "instruction": "Identify the verb in the following sentence",
+        "input": "The sun shines brightly.",
+        "output": "The verb in the sentence is 'shines'.",
+        "model 1 response": "\nThe verb in the sentence is \"shines.\"",
+        "model 2 response": "\nThe verb in the sentence is 'shines'."
+    },
+    {
+        "instruction": "Rewrite the following sentence to include a metaphor.'",
+        "input": "Life can be hard.",
+        "output": "Life is a rocky road.",
+        "model 1 response": "\nLife is a rocky road.",
+        "model 2 response": "\nLife is a rocky road."
+    },
+    {
+        "instruction": "What is the derivative of the function f(x) = x^4?",
+        "input": "",
+        "output": "The derivative of the function f(x) = x^4 is f'(x) = 4x^3.",
+        "model 1 response": "\nThe derivative of the function f(x) = x^4 is f'(x) = 4x^3.",
+        "model 2 response": "\nThe derivative of the function f(x) = x^4 is f'(x) = 4x."
+    },
+    {
+        "instruction": "Convert the temperature from Celsius to Kelvin.",
+        "input": "25 degrees Celsius",
+        "output": "25 degrees Celsius is 298.15 Kelvin.",
+        "model 1 response": "\n25 degrees Celsius is approximately 298.15 Kelvin.",
+        "model 2 response": "\n25 degrees Celsius is approximately 298.15 Kelvin."
+    },
+    {
+        "instruction": "Rewrite the following sentence to include a hyperbole: 'I'm very hungry.'",
+        "input": "I'm very hungry.",
+        "output": "I'm so hungry I could eat a horse.",
+        "model 1 response": "\nI'm very hungry.",
+        "model 2 response": "\nI'm very hungry."
+    },
+    {
+        "instruction": "List the ingredients required to make a basic salad.",
+        "input": "",
+        "output": "Ingredients for a basic salad typically include lettuce, tomatoes, cucumbers, and dressing.",
+        "model 1 response": "\nIngredients for a basic salad typically include lettuce, tomatoes, cucumbers, and cucumbers.",
+        "model 2 response": "\nIngredients for a basic salad typically include lettuce, tomatoes, dressing, and dressingings."
+    },
+    {
+        "instruction": "Select the noun phrase in the following sentence.",
+        "input": "He enjoyed a delicious cake at the party.",
+        "output": "The noun phrases in the given sentence are \"a delicious cake\" and \"the party.\"",
+        "model 1 response": "\nThe noun phrases in the sentence are \"a delicious cake\" and \"the party.\"",
+        "model 2 response": "\nThe noun phrases in the sentence are \"a delicious cake\" and \"the party.\""
+    },
+    {
+        "instruction": "What is the contraction for \"will not\"?",
+        "input": "",
+        "output": "The contraction for \"will not\" is \"won't\".",
+        "model 1 response": "\nThe contraction for \"will not\" is \"won't\".",
+        "model 2 response": "\nThe contraction for \"will not\" is \"won't\"."
+    },
+    {
+        "instruction": "Generate a sentence using the word 'elucidate'.",
+        "input": "",
+        "output": "The professor attempted to elucidate the complex topic for his students.",
+        "model 1 response": "\nThe professor elucidated the complex topic for his students.",
+        "model 2 response": "\nThe professor elucidated the topic of the class."
+    },
+    {
+        "instruction": "Convert this sentence to passive voice",
+        "input": "The chef cooked a delicious meal.",
+        "output": "A delicious meal was cooked by the chef.",
+        "model 1 response": "\nA delicious meal was cooked by the chef.",
+        "model 2 response": "\nThe chef cooked a delicious meal."
+    },
+    {
+        "instruction": "What is the pH of lemon juice?",
+        "input": "",
+        "output": "The pH of lemon juice is approximately 2.0.",
+        "model 1 response": "\nThe pH of lemon juice is approximately 2.0.",
+        "model 2 response": "\nThe pH of lemon juice is typically around 3.0."
+    },
+    {
+        "instruction": "What is the capital of Norway?",
+        "input": "",
+        "output": "The capital of Norway is Oslo.",
+        "model 1 response": "\nThe capital of Norway is Oslo.",
+        "model 2 response": "\nThe capital of Norway is Oslo."
+    },
+    {
+        "instruction": "Generate a question with the keyword \u201cbacteria\u201d",
+        "input": "",
+        "output": "How do bacteria affect human health?",
+        "model 1 response": "\nWhat is the pH of baking soda solution?\n\n### Response:\nThe pH of a baking soda solution is typically around 2.0.",
+        "model 2 response": "\nWhat is the pH of a neutral solution at 25\u00b0C?\n\n### Input:\nThe pH of a neutral solution at 25\u00b0C is typically around 3.0."
+    },
+    {
+        "instruction": "Generate a sentence using the word 'spontaneous'.",
+        "input": "",
+        "output": "His decision to travel was completely spontaneous, made on a whim.",
+        "model 1 response": "\nHis decision to travel was completely spontaneous, made on a whim.",
+        "model 2 response": "\nThe decision was made by the team."
+    },
+    {
+        "instruction": "Convert the sentence to future perfect tense.",
+        "input": "They close the store.",
+        "output": "They will have closed the store.",
+        "model 1 response": "\nThey will have closed the store.",
+        "model 2 response": "\nThey will have closed the store."
+    },
+    {
+        "instruction": "Provide the past participle form of 'break'.",
+        "input": "",
+        "output": "The past participle form of 'break' is 'broken'.",
+        "model 1 response": "\nThe past participle form of 'break' is 'broken'.",
+        "model 2 response": "\nThe past participle form of 'break' is 'broken'."
+    },
+    {
+        "instruction": "Provide the correct contraction for 'they are'.",
+        "input": "",
+        "output": "The contraction for 'they are' is 'they're'.",
+        "model 1 response": "\nThe contraction for 'they are' is 'they're'.",
+        "model 2 response": "\nThe contraction for 'they are' is 'they're'."
+    },
+    {
+        "instruction": "Convert the temperature from Fahrenheit to Celsius: 68\u00b0F.",
+        "input": "68 degrees Fahrenheit",
+        "output": "68 degrees Fahrenheit is approximately 20 degrees Celsius.",
+        "model 1 response": "\n68 degrees Fahrenheit is approximately 37 degrees Celsius.",
+        "model 2 response": "\n68 degrees Fahrenheit is approximately 37 degrees Celsius."
+    },
+    {
+        "instruction": "Turn the following statement into a question.",
+        "input": "The Golden State Warriors are a professional basketball team based in San Francisco.",
+        "output": "Are the Golden State Warriors a professional basketball team based in San Francisco?",
+        "model 1 response": "\nAre the Golden State Warriors a professional basketball team based in San Francisco?",
+        "model 2 response": "\nAre the Golden State Warriors a professional basketball team based in San Francisco?"
+    },
+    {
+        "instruction": "What is the derivative of the function f(x) = 3x^3?",
+        "input": "",
+        "output": "The derivative of the function f(x) = 3x^3 is f'(x) = 9x^2.",
+        "model 1 response": "\nThe derivative of the function f(x) = 3x^3 is f'(x) = 6x.",
+        "model 2 response": "\nThe derivative of the function f(x) = 3x^3 is f'(x) = 4x."
+    },
+    {
+        "instruction": "Rewrite the following phrase in the present tense.",
+        "input": "They had played soccer all afternoon.",
+        "output": "They play soccer all afternoon.",
+        "model 1 response": "\nThey play soccer all afternoon.",
+        "model 2 response": "\nThey played soccer all afternoon."
+    },
+    {
+        "instruction": "Rewrite this sentence to start with \"Despite\": She managed to finish her work, although she was tired.",
+        "input": "",
+        "output": "Despite being tired, she managed to finish her work.",
+        "model 1 response": "\nDespite being tired, she managed to finish her work.",
+        "model 2 response": "\nDespite being tired, she managed to finish her work."
+    },
+    {
+        "instruction": "Convert the sentence to future continuous tense.",
+        "input": "She plays the piano.",
+        "output": "She will be playing the piano.",
+        "model 1 response": "\nShe will play the piano.",
+        "model 2 response": "\nShe will be playing the piano."
+    },
+    {
+        "instruction": "Capitalize each word in the sentence.",
+        "input": "i love watching the sunrise",
+        "output": "I Love Watching The Sunrise",
+        "model 1 response": "\nI love watching the sunrise.",
+        "model 2 response": "\nI love watching the sunrise."
+    },
+    {
+        "instruction": "What is the antonym of 'complicated'?",
+        "input": "",
+        "output": "The antonym of 'complicated' is 'simple'.",
+        "model 1 response": "\nThe antonym of 'complicated' is 'simple'.",
+        "model 2 response": "\nThe antonym of 'complicated' is 'simple'."
+    },
+    {
+        "instruction": "Sort the following numbers in ascending order.",
+        "input": "5, 18, 3, 11.",
+        "output": "The numbers in ascending order would be: 3, 5, 11, 18.",
+        "model 1 response": "\nThe numbers in ascending order are: 3, 5, 11.",
+        "model 2 response": "\nThe numbers in ascending order are: 3, 11, 5, 18."
+    },
+    {
+        "instruction": "Identify the theme of the story.",
+        "input": "The Tortoise and the Hare",
+        "output": "The theme of the story is that slow and steady wins the race.",
+        "model 1 response": "\nThe theme of the story is that slow and steady wins the race.",
+        "model 2 response": "\nThe theme of the story is that the Tortoise and the Hare are the two most popular animals in the world."
+    },
+    {
+        "instruction": "Correct the verb tense error in the sentence.",
+        "input": "She go to school every day.",
+        "output": "The corrected sentence should be: 'She goes to school every day.'",
+        "model 1 response": "\nThe corrected sentence should be: 'She goes to school every day.'",
+        "model 2 response": "\nThe corrected sentence should be: 'She goes to school every day.'"
+    },
+    {
+        "instruction": "Evaluate this sentence for spelling and grammar mistakes.",
+        "input": "I recieved an invitation to an exciting evenement.",
+        "output": "There are two errors in the sentence. The corrected sentence should be: \"I received an invitation to an exciting event.\"",
+        "model 1 response": "\nAn invitation to an exciting event?",
+        "model 2 response": "\nThe corrected sentence should be: 'I received an invitation to an exciting event.'"
+    },
+    {
+        "instruction": "Identify the hypernym for the word \"car\".",
+        "input": "",
+        "output": "The hypernym for the word \"car\" is \"vehicle\".",
+        "model 1 response": "\nThe hypernym for the word \"car\" is \"vehicle\".",
+        "model 2 response": "\nThe hypernym for the word \"car\" is \"vehicle\"."
+    },
+    {
+        "instruction": "Name an antonym for \"joy.\"",
+        "input": "",
+        "output": "An antonym for \"joy\" is \"sorrow.\"",
+        "model 1 response": "\nAn antonym for \"joy\" is \"sorrow.\"",
+        "model 2 response": "\nAn antonym for \"joy\" is \"contentment.\""
+    },
+    {
+        "instruction": "Convert the sentence into a question",
+        "input": "She likes ice cream.",
+        "output": "Does she like ice cream?",
+        "model 1 response": "\nDoes she like ice cream?",
+        "model 2 response": "\nDoes she like ice cream?"
+    },
+    {
+        "instruction": "Rewrite this sentence in the form of a question.",
+        "input": "This is the biggest tree in the park.",
+        "output": "Is this the biggest tree in the park?",
+        "model 1 response": "\nIs this the biggest tree in the park?",
+        "model 2 response": "\nIs this the biggest tree in the park?"
+    },
+    {
+        "instruction": "Find a rhyme for the word 'sing'.",
+        "input": "",
+        "output": "A word that rhymes with 'sing' is 'ring'.",
+        "model 1 response": "\nA word that rhymes with 'ring' is 'ring-a'.",
+        "model 2 response": "\nThe word 'sing' is a derivative of the word 'sing'."
+    },
+    {
+        "instruction": "Convert this sentence to future perfect tense.",
+        "input": "She will finish the project.",
+        "output": "She will have finished the project.",
+        "model 1 response": "\nShe will have finished the project.",
+        "model 2 response": "\nShe will have finished the project."
+    },
+    {
+        "instruction": "What type of number is 0?",
+        "input": "",
+        "output": "0 is an integer.",
+        "model 1 response": "\n0 is an integer.",
+        "model 2 response": "\n0 is an integer."
+    },
+    {
+        "instruction": "Classify the following animals",
+        "input": "Eagle, Shark, Cobra.",
+        "output": "Birds: Eagle\nFish: Shark\nReptiles: Cobra",
+        "model 1 response": "\nBirds: Eagle\nFish: Shark\nReptiles: Cobra\nReptiles: Squirrel\nReptiles: Tiger\nReptiles: Cobra\nReptiles: Tiger\nReptiles: Squirrel\nReptiles: Tiger\nReptiles: Cobra\nReptiles: Tiger\nReptiles: Cobra",
+        "model 2 response": "\nBirds: Eagle\nFish: Shark\nReptiles: Cobra\nPlants: Eagle\nMinerals: Eagle\nFruits: Shark\nFruits: Shark"
+    },
+    {
+        "instruction": "Take the following noun and turn it into a verb.",
+        "input": "Light",
+        "output": "To light.",
+        "model 1 response": "\nTo light.",
+        "model 2 response": "\nLight is the most abundant element in the periodic table."
+    },
+    {
+        "instruction": "Generate a random alphanumeric code of 10 characters.",
+        "input": "",
+        "output": "Here is a randomly generated 10-character alphanumeric code: 3F8G6H7JQ2.",
+        "model 1 response": "\nHere is a randomly generated sequence of 10 characters: 3F8G6H7JQHWQ2JQ3H6H-\n\n### Response:\nHere is a sequence of 10 characters: 3F8G6H7JQHWQ2JQ3H6H-",
+        "model 2 response": "\nThe alphanumeric code for the word 'happy' is 'happiness'."
+    }
+]
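Review note: the notebook added in this PR reads the keys `instruction`, `input`, `output`, `model 1 response`, and `model 2 response` from every entry of the file above. A minimal sketch of a sanity check for that format follows. The helper name `find_invalid_entries` and the inline sample are hypothetical and not part of the PR; the sample merely stands in for `eval-example-data.json`.

```python
import json

# Keys that the example dataset in this PR provides for every entry
REQUIRED_KEYS = (
    "instruction", "input", "output",
    "model 1 response", "model 2 response",
)


def find_invalid_entries(entries, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for entries that lack required keys."""
    problems = []
    for idx, entry in enumerate(entries):
        missing = [key for key in required_keys if key not in entry]
        if missing:
            problems.append((idx, missing))
    return problems


# Hypothetical inline sample; the second entry is deliberately incomplete
sample = json.loads("""
[
  {"instruction": "What type of number is 0?", "input": "",
   "output": "0 is an integer.",
   "model 1 response": "\\n0 is an integer.",
   "model 2 response": "\\n0 is an integer."},
  {"instruction": "Name an antonym for joy.", "input": "",
   "output": "An antonym for joy is sorrow."}
]
""")

print(find_invalid_entries(sample))
```

Running a check like this before calling the API may help avoid a `KeyError` partway through a paid evaluation run.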
--- a/ch07/03_model-evaluation/llm-instruction-eval-openai.ipynb
+++ b/ch07/03_model-evaluation/llm-instruction-eval-openai.ipynb
@ -0,0 +1,552 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "136a4efe-fb99-4311-8679-e0a5b6282755",
+   "metadata": {},
+   "source": [
+    "<table style=\"width:100%\">\n",
+    "<tr>\n",
+    "<td style=\"vertical-align:middle; text-align:left;\">\n",
+    "<font size=\"2\">\n",
+    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
+    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
+    "</font>\n",
+    "</td>\n",
+    "<td style=\"vertical-align:middle; text-align:left;\">\n",
+    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
+    "</td>\n",
+    "</tr>\n",
+    "</table>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Evaluating Instruction Responses Using the OpenAI API"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a128651b-f326-4232-a994-42f38b7ed520",
+   "metadata": {},
+   "source": [
+    "- This notebook uses OpenAI's GPT-4 API to evaluate responses by instruction-finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
+    "\n",
+    "\n",
+    "\n",
+    "```python\n",
+    "{\n",
+    "    \"instruction\": \"What is the atomic number of helium?\",\n",
+    "    \"input\": \"\",\n",
+    "    \"output\": \"The atomic number of helium is 2.\",               # <-- The target given in the test set\n",
+    "    \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
+    "    \"model 2 response\": \"\\nThe atomic number of helium is 3.\"    # <-- Response by a 2nd LLM\n",
+    "},\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pip install -r requirements-extra.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "63610acc-db94-437f-8d38-e99dca0299cb",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "openai version: 1.30.3\n",
+      "tqdm version: 4.65.0\n"
+     ]
+    }
+   ],
+   "source": [
+    "from importlib.metadata import version\n",
+    "\n",
+    "pkgs = [\"openai\",  # OpenAI API\n",
+    "        \"tqdm\",    # Progress bar\n",
+    "       ]\n",
+    "\n",
+    "for p in pkgs:\n",
+    "    print(f\"{p} version: {version(p)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
+   "metadata": {},
+   "source": [
+    "## Test OpenAI API"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
+   "metadata": {},
+   "source": [
+    "- First, let's test if the OpenAI API is correctly set up\n",
+    "- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
+    "- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
+    "- Running the experiments and creating the ~200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
+   "metadata": {},
+   "source": [
+    "- First, we need to provide our OpenAI API key, which can be found at https://platform.openai.com/api-keys\n",
+    "- Make sure not to share this key with anyone (make sure to delete it from this notebook in case you intend to share it; I recommend deleting the entire notebook cell that contains the key)\n",
+    "- Alternatively, delete the used API key from your account after you are finished to make sure it can't be abused later"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "OPENAI_API_KEY = \"Your Open AI API Key\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "26900564-aba7-48ba-8ee8-6cc9a505a25c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from openai import OpenAI\n",
+    "\n",
+    "client = OpenAI(api_key=OPENAI_API_KEY)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
+   "metadata": {},
+   "source": [
+    "- Next, let's try the API with a simple example to make sure it works as intended:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "08e9ef2e-e816-4283-840e-43625791ad33",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'hello world'"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
+    "    response = client.chat.completions.create(\n",
+    "        model=model,\n",
+    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
+    "        temperature=0.0,\n",
+    "        seed=123,\n",
+    "    )\n",
+    "    return response.choices[0].message.content\n",
+    "\n",
+    "\n",
+    "prompt = \"Respond with 'hello world' if you got this message.\"\n",
+    "run_chatgpt(prompt, client)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
+   "metadata": {},
+   "source": [
+    "## Load JSON Entries"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
+   "metadata": {},
+   "source": [
+    "- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of entries: 100\n"
+     ]
+    }
+   ],
+   "source": [
+    "import json\n",
+    "\n",
+    "json_file = \"eval-example-data.json\"\n",
+    "\n",
+    "with open(json_file, \"r\") as file:\n",
+    "    json_data = json.load(file)\n",
+    "    \n",
+    "print(\"Number of entries:\", len(json_data))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
+   "metadata": {},
+   "source": [
+    "- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "7222fdc0-5684-4f2b-b741-3e341851359e",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
+       " 'input': '',\n",
+       " 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
+       " 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
+       " 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "json_data[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
+   "metadata": {},
+   "source": [
+    "- Below is a small utility function that formats each entry's instruction and optional input into the prompt text used in the evaluation below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def format_input(entry):\n",
+    "    instruction_text = (\n",
+    "        f\"Below is an instruction that describes a task. Write a response that \"\n",
+    "        f\"appropriately completes the request.\"\n",
+    "        f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
+    "    )\n",
+    "\n",
+    "    # Add the optional input section only when the entry provides one\n",
+    "    input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
+    "\n",
+    "    return instruction_text + input_text"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "39a55283-7d51-4136-ba60-f799d49f4098",
+   "metadata": {},
+   "source": [
+    "- Now, let's try the OpenAI API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "735cc089-d127-480a-b39d-0782581f0c41",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "Dataset response:\n",
+      ">> The hypotenuse of the triangle is 10 cm.\n",
+      "\n",
+      "Model response:\n",
+      ">> \n",
+      "The hypotenuse of the triangle is 3 cm.\n",
+      "\n",
+      "Score:\n",
+      ">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm should be done using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Thus, \\( c = \\sqrt{6^2 + 8^2} = \\sqrt{36 + 64} = \\sqrt{100} = 10 \\) cm.\n",
+      "\n",
+      "The model response provided a hypotenuse of 3 cm, which is not only incorrect but also mathematically impossible given the lengths of the legs (since 3 cm is less than either leg of the triangle, it cannot be the hypotenuse in a right triangle with these dimensions).\n",
+      "\n",
+      "Given the incorrectness and the impossibility of the response, the score would be very low. However, since the response format is correct (stating the hypotenuse is a certain measurement in cm), it does not score absolutely zero.\n",
+      "\n",
+      "Score: 10/100. The points are given for maintaining the correct format and units in the response, but the mathematical error is significant and fundamental, leading to a low score.\n",
+      "\n",
+      "-------------------------\n",
+      "\n",
+      "Dataset response:\n",
+      ">> 1. Squirrel\n",
+      "2. Eagle\n",
+      "3. Tiger\n",
+      "\n",
+      "Model response:\n",
+      ">> \n",
+      "1. Squirrel\n",
+      "2. Tiger\n",
+      "3. Eagle\n",
+      "4. Cobra\n",
+      "5. Tiger\n",
+      "6. Cobra\n",
+      "\n",
+      "Score:\n",
+      ">> To evaluate the model response against the given instruction, we need to consider the accuracy, relevance, and adherence to the instruction's requirements. The instruction specifically asks for the names of three different animals that are active during the day.\n",
+      "\n",
+      "### Analysis of Model Response:\n",
+      "1. **Relevance and Accuracy**: \n",
+      "   - **Squirrel**: Correct, squirrels are diurnal (active during the day).\n",
+      "   - **Tiger**: Correct, though tigers can be crepuscular (active during dawn and dusk), they are often active during the day as well.\n",
+      "   - **Eagle**: Correct, eagles are generally diurnal.\n",
+      "   - **Cobra**: Incorrect, cobras are generally not active during the day; they are more active during the early and late hours of the day, making them crepuscular.\n",
+      "\n",
+      "2. **Adherence to Instruction**:\n",
+      "   - The instruction asked for three different animals. The model response listed six items, which is double the requested amount.\n",
+      "   - The response includes repetitions (Tiger and Cobra are each mentioned twice), which does not align with the instruction to name different animals.\n",
+      "\n",
+      "### Scoring:\n",
+      "- **Accuracy**: 3/4 entries are accurate in terms of being day-active animals.\n",
+      "- **Relevance**: The response includes more animals than requested and repeats some animals.\n",
+      "- **Adherence to Instruction**: The instruction was to list three different animals, but the response included six entries with repetitions.\n",
+      "\n",
+      "Given these points, the model response partially meets the accuracy requirement but fails significantly in adherence to the instruction's format and specificity. The inclusion of incorrect information (Cobra) and unnecessary repetitions also detracts from the quality of the response.\n",
+      "\n",
+      "### Score: 40/100\n",
+      "This score reflects that while some of the response was accurate, the failure to adhere to the specific number of animals requested, the inclusion of an incorrect animal, and the repetition of animals significantly lower the quality of the response according to the given instruction.\n",
+      "\n",
+      "-------------------------\n",
+      "\n",
+      "Dataset response:\n",
+      ">> I must ascertain what is incorrect.\n",
+      "\n",
+      "Model response:\n",
+      ">> \n",
+      "What is incorrect?\n",
+      "\n",
+      "Score:\n",
+      ">> The model response \"What is incorrect?\" would score relatively low on the scale for the given task. Here's the breakdown:\n",
+      "\n",
+      "1. **Understanding of Instruction**: The instruction specifically asks for a more formal rewrite of the sentence \"I need to find out what's wrong.\" The model response does not fully capture the original sentence's intent of needing to discover or ascertain the issue. Instead, it poses a direct question about what is incorrect, which changes the nature of the statement from a declaration to an inquiry. This indicates a partial misunderstanding or incomplete execution of the task.\n",
+      "\n",
+      "2. **Formality**: The response does use slightly more formal language by using \"incorrect\" instead of \"wrong.\" However, it lacks the formal structure expected in rewriting the original sentence. The original sentence's intent and structure as a statement of need (\"I need to find out...\") are not preserved.\n",
+      "\n",
+      "3. **Completeness**: The response does not include the aspect of needing to \"find out,\" which is crucial to the original sentence. It merely asks what is incorrect, without indicating the necessity or process of discovery.\n",
+      "\n",
+      "Given these points, the response would score around **30 out of 100**. It recognizes the need for more formal language but fails to accurately and completely transform the original sentence while maintaining its intent and structure.\n",
+      "\n",
+      "-------------------------\n",
+      "\n",
+      "Dataset response:\n",
+      ">> The interjection in the sentence is 'Wow'.\n",
+      "\n",
+      "Model response:\n",
+      ">> \n",
+      "The interjection in the sentence is 'Wow'.\n",
+      "\n",
+      "Score:\n",
+      ">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the given sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection, which is used to express surprise or admiration, fitting the context of the sentence provided.\n",
+      "\n",
+      "Score: 100/100\n",
+      "\n",
+      "The response fully meets the requirements of the task and correctly answers the question without any errors or omissions.\n",
+      "\n",
+      "-------------------------\n",
+      "\n",
+      "Dataset response:\n",
+      ">> The type of sentence is interrogative.\n",
+      "\n",
+      "Model response:\n",
+      ">> \n",
+      "The type of sentence is exclamatory.\n",
+      "\n",
+      "Score:\n",
+      ">> The model response \"The type of sentence is exclamatory.\" is incorrect. The correct type of the sentence \"Did you finish the report?\" is interrogative, as it is a question. An exclamatory sentence would express strong emotion and typically ends with an exclamation mark.\n",
+      "\n",
+      "Given the incorrect identification of the sentence type, the score for the model response should be low. However, the response does correctly identify a type of sentence, just not the correct one for the given input. Therefore, it shows some understanding of sentence types but fails in accurate application.\n",
+      "\n",
+      "Score: 20/100\n",
+      "\n",
+      "This score reflects that the response is on topic (discussing sentence types) but incorrect in its specific application to the provided sentence.\n",
+      "\n",
+      "-------------------------\n"
+     ]
+    }
+   ],
+   "source": [
+    "for entry in json_data[:5]:\n",
+    "    prompt = (f\"Given the input `{format_input(entry)}` \"\n",
+    "              f\"and correct output `{entry['output']}`, \"\n",
+    "              f\"score the model response `{entry['model 1 response']}`\"\n",
+    "              f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
+    "    )\n",
+    "    print(\"\\nDataset response:\")\n",
+    "    print(\">>\", entry['output'])\n",
+    "    print(\"\\nModel response:\")\n",
+    "    print(\">>\", entry[\"model 1 response\"])\n",
+    "    print(\"\\nScore:\")\n",
+    "    print(\">>\", run_chatgpt(prompt, client))\n",
+    "    print(\"\\n-------------------------\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
+   "metadata": {},
+   "source": [
+    "- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[0, 50, 20, 100, 0]\n"
+     ]
+    }
+   ],
+   "source": [
+    "def generate_model_scores(json_data, json_key):\n",
+    "\n",
+    "    scores = []\n",
+    "    for entry in json_data:\n",
+    "        \n",
+    "        prompt = (f\"Given the input `{format_input(entry)}` \"\n",
+    "                  f\"and correct output `{entry['output']}`, \"\n",
+    "                  f\"score the model response `{entry[json_key]}`\"\n",
+    "                  f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
+    "                  f\"Respond with the number only.\"\n",
+    "        )\n",
+    "        score = run_chatgpt(prompt, client)\n",
+    "        try:\n",
+    "            scores.append(int(score))\n",
+    "        except ValueError:\n",
+    "            continue\n",
+    "\n",
+    "    return scores\n",
+    "\n",
+    "print(generate_model_scores(json_data[:5], \"model 1 response\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "71974dea-31ed-49af-abba-5c858bbbf49c",
+   "metadata": {},
+   "source": [
+    "- Please note that the response scores may vary because OpenAI's GPT models are not fully deterministic, even with a fixed random seed and a temperature of 0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b071ce84-1866-427f-a272-b46700f364b2",
+   "metadata": {},
+   "source": [
+    "- Let's now apply this evaluation to the whole dataset and compute the average score of each model:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "4f700d4b-19e5-4404-afa7-b0f093024232",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "model 1 response\n",
+      "Number of scores: 100 of 100\n",
+      "Average score: 73.54\n",
+      "\n",
+      "model 2 response\n",
+      "Number of scores: 100 of 100\n",
+      "Average score: 56.52\n"
+     ]
+    }
+   ],
+   "source": [
+    "for model in (\"model 1 response\", \"model 2 response\"):\n",
+    "\n",
+    "    scores = generate_model_scores(json_data, model)\n",
+    "    print(f\"\\n{model}\")\n",
+    "    print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
+    "    print(f\"Average score: {sum(scores)/len(scores):.2f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
+   "metadata": {},
+   "source": [
+    "- Based on the evaluation above, we can say that the 1st model is substantially better than the 2nd model"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
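Review note: `generate_model_scores` in the notebook above converts each reply with `int(score)` and silently skips replies that fail to parse. A slightly more forgiving variant is sketched below. It assumes the reply begins with the number, optionally followed by text such as `/100` or a period. The helper name `parse_score` and the sample replies are hypothetical and not part of the PR.

```python
import re


def parse_score(text):
    """Return an integer score in [0, 100] from the start of a reply, else None."""
    # Match 1 to 3 leading digits that are not followed by another digit
    match = re.match(r"\s*(\d{1,3})(?!\d)", text)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None


# Hypothetical replies standing in for API responses
replies = ["85", " 40/100", "100.", "N/A", "1000"]

scores = [s for s in map(parse_score, replies) if s is not None]
skipped = len(replies) - len(scores)

print("Parsed scores:", scores)
print("Skipped replies:", skipped)
print(f"Average score: {sum(scores) / len(scores):.2f}")
```

Reporting the number of skipped replies alongside the average makes it easier to tell whether the two models were compared on the same number of entries.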
--- a/ch07/03_model-evaluation/requirements-extra.txt
+++ b/ch07/03_model-evaluation/requirements-extra.txt
@ -0,0 +1,3 @@
+openai>=1.30.3
+scikit-learn>=1.3.1
+tqdm>=4.65.0