{
"cells": [
{
"cell_type": "markdown",
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
"metadata": {},
"source": [
"<table style=\"width:100%\">\n",
"<tr>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<font size=\"2\">\n",
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
"</font>\n",
"</td>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
"</td>\n",
"</tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
"metadata": {
"tags": []
},
"source": [
"# Evaluating Instruction Responses Using the OpenAI API"
]
},
{
"cell_type": "markdown",
"id": "a128651b-f326-4232-a994-42f38b7ed520",
"metadata": {},
"source": [
"- This notebook uses OpenAI's GPT-4 API to evaluate responses by instruction-finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
"\n",
"\n",
"\n",
"```python\n",
"{\n",
" \"instruction\": \"What is the atomic number of helium?\",\n",
" \"input\": \"\",\n",
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
"},\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
"metadata": {},
"outputs": [],
"source": [
"# pip install -r requirements-extra.txt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"openai version: 1.30.3\n",
"tqdm version: 4.65.0\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"pkgs = [\"openai\",  # OpenAI API\n",
"        \"tqdm\",    # Progress bar\n",
"        ]\n",
"\n",
"for p in pkgs:\n",
"    print(f\"{p} version: {version(p)}\")"
]
},
{
"cell_type": "markdown",
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
"metadata": {},
"source": [
"## Test OpenAI API"
]
},
{
"cell_type": "markdown",
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
"metadata": {},
"source": [
"- First, let's test if the OpenAI API is correctly set up\n",
"- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
"- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
"- Running the experiments and creating the ~200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing"
]
},
{
"cell_type": "markdown",
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
"metadata": {},
"source": [
"- First, we need to provide our OpenAI API key, which can be found at https://platform.openai.com/api-keys\n",
"- Make sure not to share this key with anyone (make sure to delete it from this notebook in case you intend to share it; I recommend deleting the entire notebook cell that contains the key)\n",
"- Alternatively, delete the used API key from your account after you are finished to make sure it can't be abused later"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
"metadata": {},
"outputs": [],
"source": [
"OPENAI_API_KEY = \"Your Open AI API Key\""
]
},
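{
"cell_type": "markdown",
"id": "env-key-note",
"metadata": {},
"source": [
"- Instead of hardcoding the key, you can optionally read it from an `OPENAI_API_KEY` environment variable; the cell below is a minimal sketch of this approach (it assumes you exported such a variable yourself and keeps the placeholder above otherwise)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "env-key-sketch",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Optional sketch: read the API key from an environment variable instead of\n",
"# hardcoding it in the notebook (assumes you exported OPENAI_API_KEY yourself)\n",
"env_key = os.getenv(\"OPENAI_API_KEY\")\n",
"if env_key:\n",
"    OPENAI_API_KEY = env_key"
]
},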
{
"cell_type": "code",
"execution_count": 4,
"id": "26900564-aba7-48ba-8ee8-6cc9a505a25c",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(api_key=OPENAI_API_KEY)"
]
},
{
"cell_type": "markdown",
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
"metadata": {},
"source": [
"- First, let's try the API with a simple example to make sure it works as intended:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "08e9ef2e-e816-4283-840e-43625791ad33",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'hello world'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
"    response = client.chat.completions.create(\n",
"        model=model,\n",
"        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
"        temperature=0.0,\n",
"        seed=123,\n",
"    )\n",
"    return response.choices[0].message.content\n",
"\n",
"\n",
"prompt = \"Respond with 'hello world' if you got this message.\"\n",
"run_chatgpt(prompt, client)"
]
},
{
"cell_type": "markdown",
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
"metadata": {},
"source": [
"## Load JSON Entries"
]
},
{
"cell_type": "markdown",
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
"metadata": {},
"source": [
"- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of entries: 100\n"
]
}
],
"source": [
"import json\n",
"\n",
"json_file = \"eval-example-data.json\"\n",
"\n",
"with open(json_file, \"r\") as file:\n",
"    json_data = json.load(file)\n",
"\n",
"print(\"Number of entries:\", len(json_data))"
]
},
{
"cell_type": "markdown",
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
"metadata": {},
"source": [
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
" 'input': '',\n",
" 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
" 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
" 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"json_data[0]"
]
},
{
"cell_type": "markdown",
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
"metadata": {},
"source": [
"- Below is a small utility function that formats the input for visualization purposes later:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
"metadata": {},
"outputs": [],
"source": [
"def format_input(entry):\n",
"    instruction_text = (\n",
"        f\"Below is an instruction that describes a task. Write a response that \"\n",
"        f\"appropriately completes the request.\"\n",
"        f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
"    )\n",
"\n",
"    input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
"\n",
"    return instruction_text + input_text"
]
},
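{
"cell_type": "markdown",
"id": "format-input-demo-note",
"metadata": {},
"source": [
"- For illustration, the optional cell below prints the formatted prompt for the first dataset entry so you can see what `format_input` produces (no API call involved)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "format-input-demo",
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the prompt built from the first dataset entry\n",
"print(format_input(json_data[0]))"
]
},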
{
"cell_type": "markdown",
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
"metadata": {},
"source": [
"- Now, let's try the OpenAI API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "735cc089-d127-480a-b39d-0782581f0c41",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Dataset response:\n",
">> The hypotenuse of the triangle is 10 cm.\n",
"\n",
"Model response:\n",
">> \n",
"The hypotenuse of the triangle is 3 cm.\n",
"\n",
"Score:\n",
">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm should be done using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Thus, \\( c = \\sqrt{6^2 + 8^2} = \\sqrt{36 + 64} = \\sqrt{100} = 10 \\) cm.\n",
"\n",
"The model response provided a hypotenuse of 3 cm, which is not only incorrect but also mathematically impossible given the lengths of the legs (since 3 cm is less than either leg of the triangle, it cannot be the hypotenuse in a right triangle with these dimensions).\n",
"\n",
"Given the incorrectness and the impossibility of the response, the score would be very low. However, since the response format is correct (stating the hypotenuse is a certain measurement in cm), it does not score absolutely zero.\n",
"\n",
"Score: 10/100. The points are given for maintaining the correct format and units in the response, but the mathematical error is significant and fundamental, leading to a low score.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> 1. Squirrel\n",
"2. Eagle\n",
"3. Tiger\n",
"\n",
"Model response:\n",
">> \n",
"1. Squirrel\n",
"2. Tiger\n",
"3. Eagle\n",
"4. Cobra\n",
"5. Tiger\n",
"6. Cobra\n",
"\n",
"Score:\n",
">> To evaluate the model response against the given instruction, we need to consider the accuracy, relevance, and adherence to the instruction's requirements. The instruction specifically asks for the names of three different animals that are active during the day.\n",
"\n",
"### Analysis of Model Response:\n",
"1. **Relevance and Accuracy**: \n",
" - **Squirrel**: Correct, squirrels are diurnal (active during the day).\n",
" - **Tiger**: Correct, though tigers can be crepuscular (active during dawn and dusk), they are often active during the day as well.\n",
" - **Eagle**: Correct, eagles are generally diurnal.\n",
" - **Cobra**: Incorrect, cobras are generally not active during the day; they are more active during the early and late hours of the day, making them crepuscular.\n",
"\n",
"2. **Adherence to Instruction**:\n",
" - The instruction asked for three different animals. The model response listed six items, which is double the requested amount.\n",
" - The response includes repetitions (Tiger and Cobra are each mentioned twice), which does not align with the instruction to name different animals.\n",
"\n",
"### Scoring:\n",
"- **Accuracy**: 3/4 entries are accurate in terms of being day-active animals.\n",
"- **Relevance**: The response includes more animals than requested and repeats some animals.\n",
"- **Adherence to Instruction**: The instruction was to list three different animals, but the response included six entries with repetitions.\n",
"\n",
"Given these points, the model response partially meets the accuracy requirement but fails significantly in adherence to the instruction's format and specificity. The inclusion of incorrect information (Cobra) and unnecessary repetitions also detracts from the quality of the response.\n",
"\n",
"### Score: 40/100\n",
"This score reflects that while some of the response was accurate, the failure to adhere to the specific number of animals requested, the inclusion of an incorrect animal, and the repetition of animals significantly lower the quality of the response according to the given instruction.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> I must ascertain what is incorrect.\n",
"\n",
"Model response:\n",
">> \n",
"What is incorrect?\n",
"\n",
"Score:\n",
">> The model response \"What is incorrect?\" would score relatively low on the scale for the given task. Here's the breakdown:\n",
"\n",
"1. **Understanding of Instruction**: The instruction specifically asks for a more formal rewrite of the sentence \"I need to find out what's wrong.\" The model response does not fully capture the original sentence's intent of needing to discover or ascertain the issue. Instead, it poses a direct question about what is incorrect, which changes the nature of the statement from a declaration to an inquiry. This indicates a partial misunderstanding or incomplete execution of the task.\n",
"\n",
"2. **Formality**: The response does use slightly more formal language by using \"incorrect\" instead of \"wrong.\" However, it lacks the formal structure expected in rewriting the original sentence. The original sentence's intent and structure as a statement of need (\"I need to find out...\") are not preserved.\n",
"\n",
"3. **Completeness**: The response does not include the aspect of needing to \"find out,\" which is crucial to the original sentence. It merely asks what is incorrect, without indicating the necessity or process of discovery.\n",
"\n",
"Given these points, the response would score around **30 out of 100**. It recognizes the need for more formal language but fails to accurately and completely transform the original sentence while maintaining its intent and structure.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> The interjection in the sentence is 'Wow'.\n",
"\n",
"Model response:\n",
">> \n",
"The interjection in the sentence is 'Wow'.\n",
"\n",
"Score:\n",
">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the given sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection, which is used to express surprise or admiration, fitting the context of the sentence provided.\n",
"\n",
"Score: 100/100\n",
"\n",
"The response fully meets the requirements of the task and correctly answers the question without any errors or omissions.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> The type of sentence is interrogative.\n",
"\n",
"Model response:\n",
">> \n",
"The type of sentence is exclamatory.\n",
"\n",
"Score:\n",
">> The model response \"The type of sentence is exclamatory.\" is incorrect. The correct type of the sentence \"Did you finish the report?\" is interrogative, as it is a question. An exclamatory sentence would express strong emotion and typically ends with an exclamation mark.\n",
"\n",
"Given the incorrect identification of the sentence type, the score for the model response should be low. However, the response does correctly identify a type of sentence, just not the correct one for the given input. Therefore, it shows some understanding of sentence types but fails in accurate application.\n",
"\n",
"Score: 20/100\n",
"\n",
"This score reflects that the response is on topic (discussing sentence types) but incorrect in its specific application to the provided sentence.\n",
"\n",
"-------------------------\n"
]
}
],
"source": [
"for entry in json_data[:5]:\n",
"    prompt = (f\"Given the input `{format_input(entry)}` \"\n",
"              f\"and correct output `{entry['output']}`, \"\n",
"              f\"score the model response `{entry['model 1 response']}`\"\n",
"              f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
"              )\n",
"    print(\"\\nDataset response:\")\n",
"    print(\">>\", entry['output'])\n",
"    print(\"\\nModel response:\")\n",
"    print(\">>\", entry[\"model 1 response\"])\n",
"    print(\"\\nScore:\")\n",
"    print(\">>\", run_chatgpt(prompt, client))\n",
"    print(\"\\n-------------------------\")"
]
},
{
"cell_type": "markdown",
"id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
"metadata": {},
"source": [
"- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 50, 20, 100, 0]\n"
]
}
],
"source": [
"def generate_model_scores(json_data, json_key):\n",
"\n",
"    scores = []\n",
"    for entry in json_data:\n",
"\n",
"        prompt = (f\"Given the input `{format_input(entry)}` \"\n",
"                  f\"and correct output `{entry['output']}`, \"\n",
"                  f\"score the model response `{entry[json_key]}`\"\n",
"                  f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
"                  f\"Respond with the number only.\"\n",
"                  )\n",
"        score = run_chatgpt(prompt, client)\n",
"        try:\n",
"            scores.append(int(score))\n",
"        except ValueError:\n",
"            # Skip entries where the model did not return a plain integer\n",
"            continue\n",
"\n",
"    return scores\n",
"\n",
"print(generate_model_scores(json_data[:5], \"model 1 response\"))"
]
},
{
"cell_type": "markdown",
"id": "71974dea-31ed-49af-abba-5c858bbbf49c",
"metadata": {},
"source": [
"- Please note that the response scores may vary because OpenAI's GPT models are not deterministic despite setting a random number seed; to get a feel for this variation, you can repeat the evaluation a few times, as sketched below"
]
},
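{
"cell_type": "markdown",
"id": "score-variance-note",
"metadata": {},
"source": [
"- The optional cell below is a minimal sketch that re-scores a small subset a few times and reports the average per run, which gives a rough feel for the run-to-run variation (the subset size and number of repeats are arbitrary choices, and each repeat incurs additional API costs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "score-variance-sketch",
"metadata": {},
"outputs": [],
"source": [
"from statistics import mean, stdev\n",
"\n",
"# Optional sketch: re-score a small subset a few times to gauge variability;\n",
"# the subset size (5) and number of repeats (3) are arbitrary assumptions,\n",
"# and each repeat triggers additional (paid) API calls\n",
"repeated_averages = []\n",
"for _ in range(3):\n",
"    subset_scores = generate_model_scores(json_data[:5], \"model 1 response\")\n",
"    repeated_averages.append(mean(subset_scores))\n",
"\n",
"print(\"Average per run:\", repeated_averages)\n",
"print(\"Std. dev. across runs:\", stdev(repeated_averages))"
]
},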
{
"cell_type": "markdown",
"id": "b071ce84-1866-427f-a272-b46700f364b2",
"metadata": {},
"source": [
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"model 1 response\n",
"Number of scores: 100 of 100\n",
"Average score: 73.54\n",
"\n",
"model 2 response\n",
"Number of scores: 100 of 100\n",
"Average score: 56.52\n"
]
}
],
"source": [
"for model in (\"model 1 response\", \"model 2 response\"):\n",
"\n",
"    scores = generate_model_scores(json_data, model)\n",
"    print(f\"\\n{model}\")\n",
"    print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
"    print(f\"Average score: {sum(scores)/len(scores):.2f}\")"
]
},
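{
"cell_type": "markdown",
"id": "save-scores-note",
"metadata": {},
"source": [
"- Optionally, the scores can be written to disk so they don't have to be recomputed later; the cell below is a minimal sketch of this (the file name is an arbitrary choice, and running it repeats the API calls unless you fold the saving into the loop above)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "save-scores-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: persist the scores for later analysis; note that this cell\n",
"# re-runs the (paid) evaluation, so you may prefer to collect the scores\n",
"# inside the loop above instead; the file name is an arbitrary choice\n",
"all_scores = {}\n",
"for model in (\"model 1 response\", \"model 2 response\"):\n",
"    all_scores[model] = generate_model_scores(json_data, model)\n",
"\n",
"with open(\"gpt4-model-scores.json\", \"w\") as f:\n",
"    json.dump(all_scores, f, indent=2)"
]
},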
{
"cell_type": "markdown",
"id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
"metadata": {},
"source": [
"- Based on the evaluation above, we can say that the 1st model is substantially better than the 2nd model"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}