LLMs-from-scratch/ch07/03_model-evaluation/llm-instruction-eval-openai.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
"metadata": {},
"source": [
"<table style=\"width:100%\">\n",
"<tr>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<font size=\"2\">\n",
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
"</font>\n",
"</td>\n",
"<td style=\"vertical-align:middle; text-align:left;\">\n",
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
"</td>\n",
"</tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
"metadata": {
"tags": []
},
"source": [
"# Evaluating Instruction Responses Using the OpenAI API"
]
},
{
"cell_type": "markdown",
"id": "a128651b-f326-4232-a994-42f38b7ed520",
"metadata": {},
"source": [
"- This notebook uses OpenAI's GPT-4 API to evaluate responses by a instruction finetuned LLMs based on an dataset in JSON format that includes the generated model responses, for example:\n",
"\n",
"\n",
"\n",
"```python\n",
"{\n",
" \"instruction\": \"What is the atomic number of helium?\",\n",
" \"input\": \"\",\n",
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
"},\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
"metadata": {},
"outputs": [],
"source": [
"# pip install -r requirements-extra.txt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"openai version: 1.30.3\n",
"tqdm version: 4.65.0\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"pkgs = [\"openai\", # OpenAI API\n",
" \"tqdm\", # Progress bar\n",
" ]\n",
"\n",
"for p in pkgs:\n",
" print(f\"{p} version: {version(p)}\")"
]
},
{
"cell_type": "markdown",
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
"metadata": {},
"source": [
"## Test OpenAI API"
]
},
{
"cell_type": "markdown",
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
"metadata": {},
"source": [
"- First, let's test if the OpenAI API is correctly set up\n",
"- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
"- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
"- Running the experiments and creating the ~200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing"
]
},
{
"cell_type": "markdown",
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
"metadata": {},
"source": [
"- First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys\n",
"- Make sure not to share this key with anyone\n",
"- Add this secret key (`\"sk-...\"`) to the `config.json` file in this folder"
]
},
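{
"cell_type": "markdown",
"id": "config-json-example",
"metadata": {},
"source": [
"- The `config.json` file is expected to contain a single `\"OPENAI_API_KEY\"` entry, matching the key name read by the code below; as a minimal sketch, you could create the file like this (replace the placeholder with your actual key and keep the file out of version control):\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Writes a minimal config.json with the expected key name (placeholder value shown)\n",
"with open(\"config.json\", \"w\") as config_file:\n",
"    json.dump({\"OPENAI_API_KEY\": \"sk-...\"}, config_file)\n",
"```"
]
},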
{
"cell_type": "code",
"execution_count": 3,
"id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from openai import OpenAI\n",
"\n",
"# Load API key from a JSON file. \n",
"# Make sure to replace \"sk-...\" with your actual API key from https://platform.openai.com/api-keys\n",
"with open(\"config.json\", \"r\") as config_file:\n",
" config = json.load(config_file)\n",
" api_key = config[\"OPENAI_API_KEY\"]\n",
"\n",
"client = OpenAI(api_key=api_key)"
]
},
{
"cell_type": "markdown",
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
"metadata": {},
"source": [
"- First, let's try the API with a simple example to make sure it works as intended:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "08e9ef2e-e816-4283-840e-43625791ad33",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'hello world'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
" response = client.chat.completions.create(\n",
" model=model,\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=0.0,\n",
" seed=123,\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
"\n",
"prompt = f\"Respond with 'hello world' if you got this message.\"\n",
"run_chatgpt(prompt, client)"
]
},
{
"cell_type": "markdown",
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
"metadata": {},
"source": [
"## Load JSON Entries"
]
},
{
"cell_type": "markdown",
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
"metadata": {},
"source": [
"- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of entries: 100\n"
]
}
],
"source": [
"import json\n",
"\n",
"json_file = \"eval-example-data.json\"\n",
"\n",
"with open(json_file, \"r\") as file:\n",
" json_data = json.load(file)\n",
" \n",
"print(\"Number of entries:\", len(json_data))"
]
},
{
"cell_type": "markdown",
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
"metadata": {},
"source": [
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
" 'input': '',\n",
" 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
" 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
" 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"json_data[0]"
]
},
{
"cell_type": "markdown",
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
"metadata": {},
"source": [
"- Below is a small utility function that formats the input for visualization purposes later:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
"metadata": {},
"outputs": [],
"source": [
"def format_input(entry):\n",
" instruction_text = (\n",
" f\"Below is an instruction that describes a task. Write a response that \"\n",
" f\"appropriately completes the request.\"\n",
" f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
" )\n",
"\n",
" input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
" instruction_text + input_text\n",
"\n",
" return instruction_text + input_text"
]
},
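{
"cell_type": "markdown",
"id": "format-input-usage-example",
"metadata": {},
"source": [
"- For example, applied to the first entry shown earlier, `format_input` produces the following prompt text (reconstructed here from the function and the entry above; the entry has an empty `'input'` field, so no `### Input` section is appended):\n",
"\n",
"```python\n",
"print(format_input(json_data[0]))\n",
"# Below is an instruction that describes a task. Write a response that appropriately completes the request.\n",
"#\n",
"# ### Instruction:\n",
"# Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.\n",
"```"
]
},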
{
"cell_type": "markdown",
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
"metadata": {},
"source": [
"- Now, let's try the OpenAI API to compare the model responses (we only evalyate the first 5 responses for a visual comparison):"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "735cc089-d127-480a-b39d-0782581f0c41",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Dataset response:\n",
">> The hypotenuse of the triangle is 10 cm.\n",
"\n",
"Model response:\n",
">> \n",
"The hypotenuse of the triangle is 3 cm.\n",
"\n",
"Score:\n",
">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm should be done using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Thus, \\( c = \\sqrt{6^2 + 8^2} = \\sqrt{36 + 64} = \\sqrt{100} = 10 \\) cm.\n",
"\n",
"The model response provides a hypotenuse of 3 cm, which is not only numerically incorrect but also logically inconsistent because the hypotenuse is the longest side of a right triangle and cannot be shorter than either of the other two sides (6 cm and 8 cm in this case).\n",
"\n",
"Given the scale from 0 to 100, where 100 is the best score:\n",
"- Accuracy: The response is completely inaccurate.\n",
"- Relevance: The response addresses the task of calculating the hypotenuse but fails to do so correctly.\n",
"\n",
"Score: 0\n",
"\n",
"The score is 0 because the response is factually incorrect and provides misleading information that does not fulfill the task as required.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> 1. Squirrel\n",
"2. Eagle\n",
"3. Tiger\n",
"\n",
"Model response:\n",
">> \n",
"1. Squirrel\n",
"2. Tiger\n",
"3. Eagle\n",
"4. Cobra\n",
"5. Tiger\n",
"6. Cobra\n",
"\n",
"Score:\n",
">> The model response lists six animals, three of which are repeated, and includes animals not specifically known for being diurnal (active during the day). The instruction specifically asked for three different animals that are active during the day. Here's the breakdown:\n",
"\n",
"1. **Squirrel** - Correct, squirrels are diurnal.\n",
"2. **Tiger** - Generally, tigers are crepuscular (active during dawn and dusk) rather than strictly diurnal, but they can be active during the day, especially in cooler weather.\n",
"3. **Eagle** - Correct, eagles are diurnal.\n",
"4. **Cobra** - Incorrect, cobras are generally not diurnal; they are more active during the evening and early morning.\n",
"5. **Tiger** - Repeated, and as noted, not strictly diurnal.\n",
"6. **Cobra** - Repeated and incorrect.\n",
"\n",
"### Scoring:\n",
"- **Relevance to the task**: The task was to name three different animals active during the day. The response included two correct diurnal animals (squirrel, eagle) but also included incorrect and repeated entries.\n",
"- **Accuracy**: Including animals not known for being diurnal (cobra) and repeating animals reduces the accuracy.\n",
"- **Adherence to instruction**: The instruction asked for three animals, but six were provided, with repetitions.\n",
"\n",
"### Score: 40/100\n",
"**Reasoning**: The response partially meets the criteria by including some correct animals but fails in terms of accuracy (inclusion of non-diurnal animals), repetition of animals, and not adhering to the instruction of listing only three animals.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> I must ascertain what is incorrect.\n",
"\n",
"Model response:\n",
">> \n",
"What is incorrect?\n",
"\n",
"Score:\n",
">> The model response \"What is incorrect?\" would score relatively low on the scale for the given task. The original instruction was to rewrite the sentence \"I need to find out what's wrong\" in a more formal way. The correct output provided, \"I must ascertain what is incorrect,\" effectively increases the formality of the original sentence by using more formal vocabulary (\"must\" instead of \"need\" and \"ascertain\" instead of \"find out\") and adjusting the phrasing (\"what is incorrect\" instead of \"what's wrong\").\n",
"\n",
"The model response, however, only addresses part of the sentence and does not maintain the original meaning or structure. It changes the sentence into a question and omits the aspect of needing to discover or investigate the issue, which is a critical component of the original sentence. Additionally, it does not enhance the formality significantly.\n",
"\n",
"Given these considerations, I would score the model response around 20 out of 100. It recognizes the need to adjust the formality slightly but fails to maintain the original sentence's intent and structure, and does not fully meet the requirement of rewriting the sentence in a more formal way.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> The interjection in the sentence is 'Wow'.\n",
"\n",
"Model response:\n",
">> \n",
"The interjection in the sentence is 'Wow'.\n",
"\n",
"Score:\n",
">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection. Therefore, the response should be scored 100 out of 100.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> The type of sentence is interrogative.\n",
"\n",
"Model response:\n",
">> \n",
"The type of sentence is exclamatory.\n",
"\n",
"Score:\n",
">> The model response \"The type of sentence is exclamatory.\" is incorrect. The correct type of the sentence \"Did you finish the report?\" is interrogative, as it is a question. An exclamatory sentence would express strong emotion and typically ends with an exclamation mark.\n",
"\n",
"Given the incorrect identification of the sentence type, the score for the model response should be low. However, the response does correctly format the answer by stating \"The type of sentence is...\" which shows an understanding of the task's requirements but fails in accuracy.\n",
"\n",
"Score: 10/100\n",
"\n",
"The score reflects that the response is well-structured but fundamentally incorrect in identifying the sentence type.\n",
"\n",
"-------------------------\n"
]
}
],
"source": [
"for entry in json_data[:5]:\n",
" prompt = (f\"Given the input `{format_input(entry)}` \"\n",
" f\"and correct output `{entry['output']}`, \"\n",
" f\"score the model response `{entry['model 1 response']}`\"\n",
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
" )\n",
" print(\"\\nDataset response:\")\n",
" print(\">>\", entry['output'])\n",
" print(\"\\nModel response:\")\n",
" print(\">>\", entry[\"model 1 response\"])\n",
" print(\"\\nScore:\")\n",
" print(\">>\", run_chatgpt(prompt, client))\n",
" print(\"\\n-------------------------\")"
]
},
{
"cell_type": "markdown",
"id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
"metadata": {},
"source": [
"- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
"metadata": {},
"outputs": [],
"source": [
"from tqdm import tqdm\n",
"\n",
"def generate_model_scores(json_data, json_key, client):\n",
" scores = []\n",
" for entry in tqdm(json_data, desc=\"Scoring entries\"):\n",
" prompt = (\n",
" f\"Given the input `{format_input(entry)}` \"\n",
" f\"and correct output `{entry['output']}`, \"\n",
" f\"score the model response `{entry[json_key]}`\"\n",
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
" f\"Respond with the number only.\"\n",
" )\n",
" score = run_chatgpt(prompt, client)\n",
" try:\n",
" scores.append(int(score))\n",
" except:\n",
" continue\n",
"\n",
" return scores"
]
},
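{
"cell_type": "markdown",
"id": "score-parsing-note",
"metadata": {},
"source": [
"- Since the prompt asks the model to respond with the number only, the `int()` conversion above usually succeeds; if some responses nonetheless contain surrounding text, a slightly more forgiving parser could extract the first integer instead (a sketch, not used in this notebook):\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def extract_score(response_text):\n",
"    # Pull the first integer out of a free-form reply; return None if there is none\n",
"    match = re.search(r\"\\d+\", response_text)\n",
"    return int(match.group()) if match else None\n",
"```"
]
},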
{
"cell_type": "markdown",
"id": "71974dea-31ed-49af-abba-5c858bbbf49c",
"metadata": {},
"source": [
"- Please note that the response scores may vary because OpenAI's GPT models are not deterministic despite setting a random number seed, etc."
]
},
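{
"cell_type": "markdown",
"id": "variance-averaging-note",
"metadata": {},
"source": [
"- If you want a more stable estimate despite this variability, one option is to score the dataset several times and average the per-run results (a sketch, assuming the additional API cost is acceptable; not used in this notebook):\n",
"\n",
"```python\n",
"def average_model_scores(json_data, json_key, client, num_runs=3):\n",
"    # Repeat the evaluation and average the per-run mean scores\n",
"    run_means = []\n",
"    for _ in range(num_runs):\n",
"        scores = generate_model_scores(json_data, json_key, client)\n",
"        run_means.append(sum(scores) / len(scores))\n",
"    return sum(run_means) / len(run_means)\n",
"```"
]
},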
{
"cell_type": "markdown",
"id": "b071ce84-1866-427f-a272-b46700f364b2",
"metadata": {},
"source": [
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Scoring entries: 100%|█████████████████████████████████████████████████| 100/100 [01:09<00:00, 1.44it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"model 1 response\n",
"Number of scores: 100 of 100\n",
"Average score: 74.04\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Scoring entries: 100%|█████████████████████████████████████████████████| 100/100 [01:08<00:00, 1.46it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"model 2 response\n",
"Number of scores: 100 of 100\n",
"Average score: 56.72\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"for model in (\"model 1 response\", \"model 2 response\"):\n",
"\n",
" scores = generate_model_scores(json_data, model, client)\n",
" print(f\"\\n{model}\")\n",
" print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")"
]
},
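{
"cell_type": "markdown",
"id": "save-scores-note",
"metadata": {},
"source": [
"- Optionally, the scores can be saved for later inspection so the (paid) API calls don't have to be repeated; a minimal sketch, where the file name is an assumption:\n",
"\n",
"```python\n",
"# Inside the loop above, after computing `scores` for each model:\n",
"with open(f\"gpt4-scores-{model.replace(' ', '-')}.json\", \"w\") as file:\n",
"    json.dump(scores, file)\n",
"```"
]
},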
{
"cell_type": "markdown",
"id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
"metadata": {},
"source": [
"- Based on the evaluation above, we can say that the 1st model is substantially better than the 2nd model"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}