From f42290e83be8de200cfdf24b4b13cb98853cbfd8 Mon Sep 17 00:00:00 2001
From: rasbt
Date: Fri, 7 Jun 2024 08:37:41 -0500
Subject: [PATCH] remove redundant file

---
 .../llm-instruction-eval-prometheus.ipynb | 673 ------------------
 1 file changed, 673 deletions(-)
 delete mode 100644 ch07/03_model-evaluation/llm-instruction-eval-prometheus.ipynb

diff --git a/ch07/03_model-evaluation/llm-instruction-eval-prometheus.ipynb b/ch07/03_model-evaluation/llm-instruction-eval-prometheus.ipynb
deleted file mode 100644
index 022be2a..0000000
--- a/ch07/03_model-evaluation/llm-instruction-eval-prometheus.ipynb
+++ /dev/null
@@ -1,673 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "136a4efe-fb99-4311-8679-e0a5b6282755",
-   "metadata": {},
-   "source": [
-    "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka\n",
-    "\n",
-    "Code repository: https://github.com/rasbt/LLMs-from-scratch"
" - ] - }, - { - "cell_type": "markdown", - "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4", - "metadata": { - "tags": [] - }, - "source": [ - "# Evaluating Instruction Responses Locally Using the Prometheus Evaluator LLM" - ] - }, - { - "cell_type": "markdown", - "id": "a128651b-f326-4232-a994-42f38b7ed520", - "metadata": {}, - "source": [ - "- This notebook uses an 7 billion parameter LLM that has been specifically developed for evaluating other LLMs; for more information, see the [Prometheus 2 paper](https://arxiv.org/abs/2405.01535)\n", - "- We will use Prometheus 2 via the [prometheus-eval](https://github.com/prometheus-eval/prometheus-eval) Python package, which in turn is based on [vllm](https://github.com/vllm-project/vllm), which is an efficient LLM inference tool that runs locally\n", - "- Specifically, in this notebook, we will use Prometheus 2 to evaluate responses of instruction finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n", - "\n", - "\n", - "\n", - "```python\n", - "{\n", - " \"instruction\": \"What is the atomic number of helium?\",\n", - " \"input\": \"\",\n", - " \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n", - " \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n", - " \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n", - "},\n", - "```\n", - "\n", - "
\n", - " Note: The code in this notebook requires installing , which currently only supports Linux.\n", - "
\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "2c10ef46-4dd5-4a20-a949-afc15a18498d", - "metadata": {}, - "outputs": [], - "source": [ - "# pip install -r requirements-extra.txt\n", - "# pip install vllm # only supports Linux" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "63610acc-db94-437f-8d38-e99dca0299cb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "prometheus-eval version: 0.1.15\n", - "tqdm version: 4.66.4\n" - ] - } - ], - "source": [ - "from importlib.metadata import version\n", - "\n", - "pkgs = [\n", - " \"prometheus-eval\",\n", - " \"tqdm\", # Progress bar,\n", - " \"vllm\"\n", - "]\n", - "\n", - "for p in pkgs:\n", - " print(f\"{p} version: {version(p)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5", - "metadata": {}, - "source": [ - "## Installing Ollama and Downloading Llama 3" - ] - }, - { - "cell_type": "markdown", - "id": "5a092280-5462-4709-a3fe-8669a4a8a0a6", - "metadata": {}, - "source": [ - "- Ollama is an application to run LLMs efficiently\n", - "- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n", - "- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n", - "- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)" - ] - }, - { - "cell_type": "markdown", - "id": "9558a522-650d-401a-84fc-9fd7b1f39da7", - "metadata": {}, - "source": [ - "- Now let's test if ollama is set up correctly\n", - "- For this, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n", - "- Next, on the command line, execute the following command to try out the 8 billion parameters Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n", - "\n", - "```bash\n", - "# 8B model\n", - "ollama run llama3\n", - "```\n", - "\n", - "The output looks like as follows:\n", - "\n", - "```\n", - "$ ollama run llama3\n", - "pulling manifest \n", - "pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB                         \n", - "pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB                         \n", - "pulling 8ab4849b038c... 100% ▕████████████████▏  254 B                         \n", - "pulling 577073ffcc6c... 100% ▕████████████████▏  110 B                         \n", - "pulling 3f8eb4da87fa... 
100% ▕████████████████▏  485 B                         \n", - "verifying sha256 digest \n", - "writing manifest \n", - "removing any unused layers \n", - "success \n", - "```\n", - "\n", - "- Note that `llama3` refers to the instruction finetuned 8 billion Llama 3 model\n", - "\n", - "- Alternatively, you can also use the larger 70 billion parameters Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n", - "\n", - "- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n", - "\n", - "- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n", - "\n", - "```\n", - ">>> What do llamas eat?\n", - "Llamas are ruminant animals, which means they have a four-chambered \n", - "stomach and eat plants that are high in fiber. In the wild, llamas \n", - "typically feed on:\n", - "1. Grasses: They love to graze on various types of grasses, including tall \n", - "grasses, wheat, oats, and barley.\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "0b5addcb-fc7d-455d-bee9-6cc7a0d684c7", - "metadata": {}, - "source": [ - "- You can end this session using the input `/bye`" - ] - }, - { - "cell_type": "markdown", - "id": "dda155ee-cf36-44d3-b634-20ba8e1ca38a", - "metadata": {}, - "source": [ - "## Using Ollama's REST API" - ] - }, - { - "cell_type": "markdown", - "id": "89343a84-0ddc-42fc-bf50-298a342b93c0", - "metadata": {}, - "source": [ - "- Now, an alternative way to interact with the model is via its REST API in Python via the following function\n", - "- First, in your terminal, start a local ollama server via `ollama serve` (after executing the code in this notebook, you can later stop this session by simply closing the terminal)\n", - "- Next, run the following code cell to query the model" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Llamas are ruminant animals, which means they have a four-chambered stomach and eat plants. Their diet typically consists of:\n", - "\n", - "1. Grasses: Llamas love to graze on grasses, including tall grasses, meadow grasses, and wheat.\n", - "2. Hay: High-quality hay is a staple in an llama's diet. They enjoy timothy hay, alfalfa hay, and other types of hay.\n", - "3. Grains: Whole grains like oats, barley, and corn are also part of their diet.\n", - "4. Fruits and vegetables: Llamas will eat fruits like apples, carrots, and sweet potatoes as a treat or to supplement their diet.\n", - "5. Minerals: They need access to loose minerals like salt, calcium, and phosphorus to stay healthy.\n", - "\n", - "In the wild, llamas might also eat:\n", - "\n", - "* Leaves from shrubs and trees\n", - "* Bark (in some cases)\n", - "* Seeds\n", - "* Fungi\n", - "\n", - "Domesticated llamas usually have a more controlled diet, as their owners provide them with specific foods and supplements to ensure they receive the nutrients they need. 
A balanced diet for an llama typically includes 15-20% hay, 10-15% grains, and 5-10% fruits and vegetables.\n", - "\n", - "Remember, always consult with a veterinarian or experienced llama breeder to determine the best diet for your individual llama!\n" - ] - } - ], - "source": [ - "import urllib.request\n", - "import json\n", - "\n", - "def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\"):\n", - " # Create the data payload as a dictionary\n", - " data = {\n", - " \"model\": model,\n", - " \"seed\":123, # for deterministic responses\n", - " \"temperature\":0, # for deterministic responses\n", - " \"messages\": [\n", - " {\"role\": \"user\", \"content\": prompt}\n", - " ]\n", - " }\n", - "\n", - " # Convert the dictionary to a JSON formatted string and encode it to bytes\n", - " payload = json.dumps(data).encode(\"utf-8\")\n", - "\n", - " # Create a request object, setting the method to POST and adding necessary headers\n", - " request = urllib.request.Request(url, data=payload, method=\"POST\")\n", - " request.add_header(\"Content-Type\", \"application/json\")\n", - "\n", - " # Send the request and capture the response\n", - " response_data = \"\"\n", - " with urllib.request.urlopen(request) as response:\n", - " # Read and decode the response\n", - " while True:\n", - " line = response.readline().decode(\"utf-8\")\n", - " if not line:\n", - " break\n", - " response_json = json.loads(line)\n", - " response_data += response_json[\"message\"][\"content\"]\n", - "\n", - " return response_data\n", - "\n", - "\n", - "result = query_model(\"What do Llamas eat?\")\n", - "print(result)" - ] - }, - { - "cell_type": "markdown", - "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876", - "metadata": {}, - "source": [ - "- First, let's try the API with a simple example to make sure it works as intended:" - ] - }, - { - "cell_type": "markdown", - "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d", - "metadata": {}, - "source": [ - "## Load JSON Entries" - ] - }, - { - "cell_type": "markdown", - "id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a", - "metadata": {}, - "source": [ - "- Now, let's get to the data evaluation part\n", - "- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "8b2d393a-aa92-4190-9d44-44326a6f699b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of entries: 100\n" - ] - } - ], - "source": [ - "import json\n", - "\n", - "json_file = \"eval-example-data.json\"\n", - "\n", - "with open(json_file, \"r\") as file:\n", - " json_data = json.load(file)\n", - " \n", - "print(\"Number of entries:\", len(json_data))" - ] - }, - { - "cell_type": "markdown", - "id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66", - "metadata": {}, - "source": [ - "- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "7222fdc0-5684-4f2b-b741-3e341851359e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n", - " 'input': '',\n", - " 'output': 'The hypotenuse of the triangle is 10 cm.',\n", - " 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n", - " 'model 2 response': '\\nThe 
hypotenuse of the triangle is 12 cm.'}" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "json_data[0]" - ] - }, - { - "cell_type": "markdown", - "id": "fcf0331b-6024-4bba-89a9-a088b14a1046", - "metadata": {}, - "source": [ - "- Below is a small utility function that formats the input for visualization purposes later:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c", - "metadata": {}, - "outputs": [], - "source": [ - "def format_input(entry):\n", - " instruction_text = (\n", - " f\"Below is an instruction that describes a task. Write a response that \"\n", - " f\"appropriately completes the request.\"\n", - " f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n", - " )\n", - "\n", - " input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n", - " instruction_text + input_text\n", - "\n", - " return instruction_text + input_text" - ] - }, - { - "cell_type": "markdown", - "id": "39a55283-7d51-4136-ba60-f799d49f4098", - "metadata": {}, - "source": [ - "- Now, let's try the ollama API to compare the model responses (we only evalyate the first 5 responses for a visual comparison):" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "735cc089-d127-480a-b39d-0782581f0c41", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Dataset response:\n", - ">> The hypotenuse of the triangle is 10 cm.\n", - "\n", - "Model response:\n", - ">> \n", - "The hypotenuse of the triangle is 3 cm.\n", - "\n", - "Score:\n", - ">> To evaluate the model response, I'll compare it to the correct output.\n", - "\n", - "Correct output: The hypotenuse of the triangle is 10 cm.\n", - "Model response: The hypotenuse of the triangle is 3 cm.\n", - "\n", - "The model response is incorrect, as the calculated value (3 cm) does not match the actual value (10 cm). Therefore, I would score this response a 0 out of 100.\n", - "\n", - "-------------------------\n", - "\n", - "Dataset response:\n", - ">> 1. Squirrel\n", - "2. Eagle\n", - "3. Tiger\n", - "\n", - "Model response:\n", - ">> \n", - "1. Squirrel\n", - "2. Tiger\n", - "3. Eagle\n", - "4. Cobra\n", - "5. Tiger\n", - "6. Cobra\n", - "\n", - "Score:\n", - ">> To complete the request, I will provide a response that names three different animals that are active during the day.\n", - "\n", - "### Response:\n", - "1. Squirrel\n", - "2. Eagle\n", - "3. Tiger\n", - "\n", - "Now, let's evaluate the model response based on the provided options. Here's how it scores:\n", - "\n", - "1. Squirrel (Match)\n", - "2. Tiger (Match)\n", - "3. Eagle (Match)\n", - "\n", - "The model response correctly identifies three animals that are active during the day: squirrel, tiger, and eagle.\n", - "\n", - "On a scale from 0 to 100, I would score this response as **80**. The model accurately completes the request and provides relevant information. However, it does not fully utilize all available options (4-6), which is why the score is not higher.\n", - "\n", - "Corrected output: 1. Squirrel\n", - "2. Eagle\n", - "3. 
Tiger\n", - "\n", - "-------------------------\n", - "\n", - "Dataset response:\n", - ">> I must ascertain what is incorrect.\n", - "\n", - "Model response:\n", - ">> \n", - "What is incorrect?\n", - "\n", - "Score:\n", - ">> The task is to rewrite a sentence in a more formal way.\n", - "\n", - "### Original Sentence:\n", - "\"I need to find out what's wrong.\"\n", - "\n", - "### Formal Rewrite:\n", - "\"I must ascertain what is incorrect.\"\n", - "\n", - "Score: **90**\n", - "\n", - "The model response accurately captures the original sentence's meaning while adopting a more formal tone. The words \"ascertain\" and \"incorrect\" effectively convey a sense of professionalism and precision, making it suitable for a formal setting.\n", - "\n", - "Note: I scored the model response 90 out of 100 because it successfully transformed the informal sentence into a more formal one, but there is room for improvement in terms of style and nuance.\n", - "\n", - "-------------------------\n", - "\n", - "Dataset response:\n", - ">> The interjection in the sentence is 'Wow'.\n", - "\n", - "Model response:\n", - ">> \n", - "The interjection in the sentence is 'Wow'.\n", - "\n", - "Score:\n", - ">> A scoring question!\n", - "\n", - "I'd rate the model response as **98** out of 100.\n", - "\n", - "Here's why:\n", - "\n", - "* The model correctly identifies \"Wow\" as the interjection in the sentence.\n", - "* The response is concise and directly answers the instruction.\n", - "* There are no grammatical errors, typos, or inaccuracies in the response.\n", - "\n", - "The only reason I wouldn't give it a perfect score (100) is that it's possible for an even more precise or detailed response to be given, such as \"The sentence contains a single interjection: 'Wow', which is used to express surprise and enthusiasm.\" However, the model's response is still very good, and 98 out of 100 is a strong score.\n", - "\n", - "-------------------------\n", - "\n", - "Dataset response:\n", - ">> The type of sentence is interrogative.\n", - "\n", - "Model response:\n", - ">> \n", - "The type of sentence is exclamatory.\n", - "\n", - "Score:\n", - ">> A nice simple task!\n", - "\n", - "To score my response, I'll compare it with the correct output.\n", - "\n", - "Correct output: The type of sentence is interrogative.\n", - "My response: The type of sentence is exclamatory.\n", - "\n", - "The correct answer is an interrogative sentence (asking a question), while my response suggests it's an exclamatory sentence (expressing strong emotions). Oops!\n", - "\n", - "So, I'd score my response as follows:\n", - "\n", - "* Correctness: 0/10\n", - "* Relevance: 0/10 (my response doesn't even match the input)\n", - "* Overall quality: 0/100\n", - "\n", - "The lowest possible score is 0. Unfortunately, that's where my response falls. Better luck next time!\n", - "\n", - "-------------------------\n" - ] - } - ], - "source": [ - "for entry in json_data[:5]:\n", - " prompt = (f\"Given the input `{format_input(entry)}` \"\n", - " f\"and correct output `{entry['output']}`, \"\n", - " f\"score the model response `{entry['model 1 response']}`\"\n", - " f\" on a scale from 0 to 100, where 100 is the best score. 
\"\n", - " )\n", - " print(\"\\nDataset response:\")\n", - " print(\">>\", entry['output'])\n", - " print(\"\\nModel response:\")\n", - " print(\">>\", entry[\"model 1 response\"])\n", - " print(\"\\nScore:\")\n", - " print(\">>\", query_model(prompt))\n", - " print(\"\\n-------------------------\")" - ] - }, - { - "cell_type": "markdown", - "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a", - "metadata": {}, - "source": [ - "- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "3552bdfb-7511-42ac-a9ec-da672e2a5468", - "metadata": {}, - "outputs": [], - "source": [ - "from tqdm import tqdm\n", - "\n", - "def generate_model_scores(json_data, json_key):\n", - " scores = []\n", - " for entry in tqdm(json_data, desc=\"Scoring entries\"):\n", - " prompt = (\n", - " f\"Given the input `{format_input(entry)}` \"\n", - " f\"and correct output `{entry['output']}`, \"\n", - " f\"score the model response `{entry[json_key]}`\"\n", - " f\" on a scale from 0 to 100, where 100 is the best score. \"\n", - " f\"Respond with the integer number only.\"\n", - " )\n", - " score = query_model(prompt)\n", - " try:\n", - " scores.append(int(score))\n", - " except:\n", - " continue\n", - "\n", - " return scores" - ] - }, - { - "cell_type": "markdown", - "id": "b071ce84-1866-427f-a272-b46700f364b2", - "metadata": {}, - "source": [ - "- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 min per model on a M3 MacBook Air laptop)\n", - "- Note that ollama is not fully deterministic (as of this writing) so the numbers you are getting might slightly differ from the ones shown below" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "4f700d4b-19e5-4404-afa7-b0f093024232", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00, 1.50it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "model 1 response\n", - "Number of scores: 100 of 100\n", - "Average score: 78.02\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00, 1.41it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "model 2 response\n", - "Number of scores: 99 of 100\n", - "Average score: 66.56\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - } - ], - "source": [ - "for model in (\"model 1 response\", \"model 2 response\"):\n", - "\n", - " scores = generate_model_scores(json_data, model)\n", - " print(f\"\\n{model}\")\n", - " print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n", - " print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "8169d534-1fec-43c4-9550-5cb701ff7f05", - "metadata": {}, - "source": [ - "- Based on the evaluation above, we can say that the 1st model is better than the 2nd model" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - 
"version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}