{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "136a4efe-fb99-4311-8679-e0a5b6282755",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Evaluating Instruction Responses Locally Using a Llama 3 Model Via Ollama"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a128651b-f326-4232-a994-42f38b7ed520",
   "metadata": {},
   "source": [
    "- This notebook uses an 8-billion-parameter Llama 3 model through ollama to evaluate the responses of instruction-finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
    "\n",
    "```python\n",
    "{\n",
    "    \"instruction\": \"What is the atomic number of helium?\",\n",
    "    \"input\": \"\",\n",
    "    \"output\": \"The atomic number of helium is 2.\",  # <-- The target given in the test set\n",
    "    \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\",  # <-- Response by an LLM\n",
    "    \"model 2 response\": \"\\nThe atomic number of helium is 3.\"  # <-- Response by a 2nd LLM\n",
    "},\n",
    "```\n",
    "\n",
    "- The code doesn't require a GPU and runs on a laptop (it was tested on an M3 MacBook Air)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "63610acc-db94-437f-8d38-e99dca0299cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tqdm version: 4.66.2\n"
     ]
    }
   ],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "pkgs = [\"tqdm\",  # Progress bar\n",
    "        ]\n",
    "\n",
    "for p in pkgs:\n",
    "    print(f\"{p} version: {version(p)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
   "metadata": {},
   "source": [
    "## Installing Ollama and Downloading Llama 3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
   "metadata": {},
   "source": [
    "- Ollama is an application to run LLMs efficiently\n",
    "- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
    "- Note that it is a tool for using LLMs to generate text (inference), not for training or finetuning them\n",
    "- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
   "metadata": {},
   "source": [
    "- Now let's test if ollama is set up correctly\n",
    "- For this, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
    "- Next, on the command line, execute the following command to try out the 8-billion-parameter Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n",
    "\n",
    "```bash\n",
    "# 8B model\n",
    "ollama run llama3\n",
    "```\n",
    "\n",
    "The output looks as follows:\n",
    "\n",
    "```\n",
    "$ ollama run llama3\n",
    "pulling manifest\n",
    "pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB\n",
    "pulling 4fa551d4f938... 100% ▕████████████████▏ 12 KB\n",
    "pulling 8ab4849b038c... 100% ▕████████████████▏ 254 B\n",
    "pulling 577073ffcc6c... 100% ▕████████████████▏ 110 B\n",
    "pulling 3f8eb4da87fa... 100% ▕████████████████▏ 485 B\n",
    "verifying sha256 digest\n",
    "writing manifest\n",
    "removing any unused layers\n",
    "success\n",
    "```\n",
    "\n",
    "- Note that `llama3` refers to the instruction-finetuned 8-billion-parameter Llama 3 model\n",
    "\n",
    "- Alternatively, you can also use the larger 70-billion-parameter Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n",
    "\n",
    "- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
    "\n",
    "- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
    "\n",
    "```\n",
    ">>> What do llamas eat?\n",
    "Llamas are ruminant animals, which means they have a four-chambered\n",
    "stomach and eat plants that are high in fiber. In the wild, llamas\n",
    "typically feed on:\n",
    "1. Grasses: They love to graze on various types of grasses, including tall\n",
    "grasses, wheat, oats, and barley.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b5addcb-fc7d-455d-bee9-6cc7a0d684c7",
   "metadata": {},
   "source": [
    "- Leave this terminal and the `ollama` session running for the next steps (you can exit the `ollama` session later by entering the prompt `/bye`)"
   ]
  },
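  {
   "cell_type": "markdown",
   "id": "added-note-ollama-serve",
   "metadata": {},
   "source": [
    "- Tip (optional): instead of keeping the interactive chat session open, you should also be able to start just the background server by running `ollama serve` in a separate terminal and leaving it running while executing the cells below"
   ]
  },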
  {
   "cell_type": "markdown",
   "id": "dda155ee-cf36-44d3-b634-20ba8e1ca38a",
   "metadata": {},
   "source": [
    "## Using Ollama's REST API"
   ]
  },
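  {
   "cell_type": "markdown",
   "id": "added-server-check-note",
   "metadata": {},
   "source": [
    "- Before querying the model from Python, we can optionally verify that the local Ollama server is reachable; the cell below is a minimal sketch that assumes the default address `http://localhost:11434` (the same address used by the `query_model` function defined next):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-server-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "\n",
    "\n",
    "def ollama_reachable(url=\"http://localhost:11434\"):\n",
    "    # Send a plain GET request to the server root; any successful\n",
    "    # HTTP response means the ollama application is up and listening\n",
    "    try:\n",
    "        with urllib.request.urlopen(url) as response:\n",
    "            return response.status == 200\n",
    "    except OSError:  # connection errors (URLError is a subclass of OSError)\n",
    "        return False\n",
    "\n",
    "\n",
    "print(\"Ollama server reachable:\", ollama_reachable())"
   ]
  },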
  {
   "cell_type": "markdown",
   "id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
   "metadata": {},
   "source": [
    "- Now, an alternative way to interact with the model is through its REST API in Python, using the following function (the URL is the `ollama` default):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Llamas are ruminant animals, which means they have a four-chambered stomach and are designed to eat plant-based foods. In the wild, llamas primarily feed on:\n",
      "\n",
      "1. Grasses: They love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.\n",
      "2. Leaves: Llamas enjoy munching on leaves from trees and shrubs, like willow, alder, and birch.\n",
      "3. Fruits: They'll eat fruits like apples, berries, and corn, as long as they're not too ripe or rotten.\n",
      "4. Hay: In captivity, llamas are often fed hay, such as timothy grass, alfalfa, or oat hay, which is a staple in their diet.\n",
      "5. Grains: Some llamas may be given grains like oats, barley, or corn as treats or supplements.\n",
      "\n",
      "In general, llamas are browsers and grazers, meaning they prefer to eat plants that grow above ground (like leaves and fruits) rather than those growing below ground (like roots). They have a unique digestive system that allows them to break down cellulose in plant cell walls, making them efficient at converting plant material into energy.\n",
      "\n",
      "It's worth noting that the specific diet of llamas can vary depending on factors like age, breed, climate, and availability of food sources. If you're considering keeping llamas as pets or using them for packing, it's essential to provide a balanced and nutritious diet that meets their nutritional needs. Consult with a veterinarian or an experienced llama breeder for guidance on creating the best diet plan for your llamas!\n"
     ]
    }
   ],
   "source": [
    "import urllib.request\n",
    "import json\n",
    "\n",
    "\n",
    "def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\"):\n",
    "    # Create the data payload as a dictionary\n",
    "    data = {\n",
    "        \"model\": model,\n",
    "        \"messages\": [\n",
    "            {\"role\": \"user\", \"content\": prompt}\n",
    "        ],\n",
    "        \"options\": {  # settings below are meant to make responses (mostly) deterministic\n",
    "            \"seed\": 123,\n",
    "            \"temperature\": 0\n",
    "        }\n",
    "    }\n",
    "\n",
    "    # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
    "    payload = json.dumps(data).encode(\"utf-8\")\n",
    "\n",
    "    # Create a request object, setting the method to POST and adding necessary headers\n",
    "    request = urllib.request.Request(url, data=payload, method=\"POST\")\n",
    "    request.add_header(\"Content-Type\", \"application/json\")\n",
    "\n",
    "    # Send the request and capture the response\n",
    "    response_data = \"\"\n",
    "    with urllib.request.urlopen(request) as response:\n",
    "        # Read and decode the streamed response line by line\n",
    "        while True:\n",
    "            line = response.readline().decode(\"utf-8\")\n",
    "            if not line:\n",
    "                break\n",
    "            response_json = json.loads(line)\n",
    "            response_data += response_json[\"message\"][\"content\"]\n",
    "\n",
    "    return response_data\n",
    "\n",
    "\n",
    "result = query_model(\"What do Llamas eat?\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
   "metadata": {},
   "source": [
    "- The simple \"What do Llamas eat?\" example above serves as a quick check that the API works as intended"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
   "metadata": {},
   "source": [
    "## Load JSON Entries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
   "metadata": {},
   "source": [
    "- Now, let's get to the data evaluation part\n",
    "- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of entries: 100\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "json_file = \"eval-example-data.json\"\n",
    "\n",
    "with open(json_file, \"r\") as file:\n",
    "    json_data = json.load(file)\n",
    "\n",
    "print(\"Number of entries:\", len(json_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
   "metadata": {},
   "source": [
    "- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7222fdc0-5684-4f2b-b741-3e341851359e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
       " 'input': '',\n",
       " 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
       " 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
       " 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
   "metadata": {},
   "source": [
    "- Below is a small utility function that formats the input for visualization purposes later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_input(entry):\n",
    "    instruction_text = (\n",
    "        f\"Below is an instruction that describes a task. Write a response that \"\n",
    "        f\"appropriately completes the request.\"\n",
    "        f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
    "    )\n",
    "\n",
    "    input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
    "\n",
    "    return instruction_text + input_text"
   ]
  },
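  {
   "cell_type": "markdown",
   "id": "added-format-input-demo-note",
   "metadata": {},
   "source": [
    "- For illustration, we can apply it to the first entry as a quick sanity check (the exact output depends on the dataset loaded above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-format-input-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the formatted prompt for the first dataset entry\n",
    "print(format_input(json_data[0]))"
   ]
  },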
  {
   "cell_type": "markdown",
   "id": "39a55283-7d51-4136-ba60-f799d49f4098",
   "metadata": {},
   "source": [
    "- Now, let's use the ollama API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "735cc089-d127-480a-b39d-0782581f0c41",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Dataset response:\n",
      ">> The hypotenuse of the triangle is 10 cm.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The hypotenuse of the triangle is 3 cm.\n",
      "\n",
      "Score:\n",
      ">> To evaluate the model response, I'll compare it to the correct output.\n",
      "\n",
      "Correct output: The hypotenuse of the triangle is 10 cm.\n",
      "Model response: The hypotenuse of the triangle is 3 cm.\n",
      "\n",
      "The model response is incorrect, as the calculated value (3 cm) does not match the actual value (10 cm). Therefore, I would score this response a 0 out of 100.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> 1. Squirrel\n",
      "2. Eagle\n",
      "3. Tiger\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "1. Squirrel\n",
      "2. Tiger\n",
      "3. Eagle\n",
      "4. Cobra\n",
      "5. Tiger\n",
      "6. Cobra\n",
      "\n",
      "Score:\n",
      ">> To complete the request, I will provide a response that names three different animals that are active during the day.\n",
      "\n",
      "### Response:\n",
      "1. Squirrel\n",
      "2. Eagle\n",
      "3. Tiger\n",
      "\n",
      "Now, let's evaluate the model response based on the provided options. Here's how it scores:\n",
      "\n",
      "1. Squirrel (Match)\n",
      "2. Tiger (Match)\n",
      "3. Eagle (Match)\n",
      "\n",
      "The model response correctly identifies three animals that are active during the day: squirrel, tiger, and eagle.\n",
      "\n",
      "On a scale from 0 to 100, I would score this response as **80**. The model accurately completes the request and provides relevant information. However, it does not fully utilize all available options (4-6), which is why the score is not higher.\n",
      "\n",
      "Corrected output: 1. Squirrel\n",
      "2. Eagle\n",
      "3. Tiger\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> I must ascertain what is incorrect.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "What is incorrect?\n",
      "\n",
      "Score:\n",
      ">> The task is to rewrite a sentence in a more formal way.\n",
      "\n",
      "### Original Sentence:\n",
      "\"I need to find out what's wrong.\"\n",
      "\n",
      "### Formal Rewrite:\n",
      "\"I must ascertain what is incorrect.\"\n",
      "\n",
      "Score: **90**\n",
      "\n",
      "The model response accurately captures the original sentence's meaning while adopting a more formal tone. The words \"ascertain\" and \"incorrect\" effectively convey a sense of professionalism and precision, making it suitable for a formal setting.\n",
      "\n",
      "Note: I scored the model response 90 out of 100 because it successfully transformed the informal sentence into a more formal one, but there is room for improvement in terms of style and nuance.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> The interjection in the sentence is 'Wow'.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The interjection in the sentence is 'Wow'.\n",
      "\n",
      "Score:\n",
      ">> A scoring question!\n",
      "\n",
      "I'd rate the model response as **98** out of 100.\n",
      "\n",
      "Here's why:\n",
      "\n",
      "* The model correctly identifies \"Wow\" as the interjection in the sentence.\n",
      "* The response is concise and directly answers the instruction.\n",
      "* There are no grammatical errors, typos, or inaccuracies in the response.\n",
      "\n",
      "The only reason I wouldn't give it a perfect score (100) is that it's possible for an even more precise or detailed response to be given, such as \"The sentence contains a single interjection: 'Wow', which is used to express surprise and enthusiasm.\" However, the model's response is still very good, and 98 out of 100 is a strong score.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> The type of sentence is interrogative.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The type of sentence is exclamatory.\n",
      "\n",
      "Score:\n",
      ">> A nice simple task!\n",
      "\n",
      "To score my response, I'll compare it with the correct output.\n",
      "\n",
      "Correct output: The type of sentence is interrogative.\n",
      "My response: The type of sentence is exclamatory.\n",
      "\n",
      "The correct answer is an interrogative sentence (asking a question), while my response suggests it's an exclamatory sentence (expressing strong emotions). Oops!\n",
      "\n",
      "So, I'd score my response as follows:\n",
      "\n",
      "* Correctness: 0/10\n",
      "* Relevance: 0/10 (my response doesn't even match the input)\n",
      "* Overall quality: 0/100\n",
      "\n",
      "The lowest possible score is 0. Unfortunately, that's where my response falls. Better luck next time!\n",
      "\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "for entry in json_data[:5]:\n",
    "    prompt = (f\"Given the input `{format_input(entry)}` \"\n",
    "              f\"and correct output `{entry['output']}`, \"\n",
    "              f\"score the model response `{entry['model 1 response']}`\"\n",
    "              f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
    "              )\n",
    "    print(\"\\nDataset response:\")\n",
    "    print(\">>\", entry['output'])\n",
    "    print(\"\\nModel response:\")\n",
    "    print(\">>\", entry[\"model 1 response\"])\n",
    "    print(\"\\nScore:\")\n",
    "    print(\">>\", query_model(prompt))\n",
    "    print(\"\\n-------------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
   "metadata": {},
   "source": [
    "- Note that the responses above are very verbose; to quantify which model is better, we modify the prompt so that only an integer score is returned:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "\n",
    "def generate_model_scores(json_data, json_key):\n",
    "    scores = []\n",
    "    for entry in tqdm(json_data, desc=\"Scoring entries\"):\n",
    "        prompt = (\n",
    "            f\"Given the input `{format_input(entry)}` \"\n",
    "            f\"and correct output `{entry['output']}`, \"\n",
    "            f\"score the model response `{entry[json_key]}`\"\n",
    "            f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
    "            f\"Respond with the integer number only.\"\n",
    "        )\n",
    "        score = query_model(prompt)\n",
    "        try:\n",
    "            scores.append(int(score))\n",
    "        except ValueError:\n",
    "            # Skip entries where the model did not return a clean integer\n",
    "            continue\n",
    "\n",
    "    return scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b071ce84-1866-427f-a272-b46700f364b2",
   "metadata": {},
   "source": [
    "- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on an M3 MacBook Air laptop)\n",
    "- Note that ollama is not fully deterministic (as of this writing), so the numbers you are getting might differ slightly from the ones shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4f700d4b-19e5-4404-afa7-b0f093024232",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00, 1.50it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "model 1 response\n",
      "Number of scores: 100 of 100\n",
      "Average score: 78.02\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00, 1.41it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "model 2 response\n",
      "Number of scores: 99 of 100\n",
      "Average score: 66.56\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "for model in (\"model 1 response\", \"model 2 response\"):\n",
    "\n",
    "    scores = generate_model_scores(json_data, model)\n",
    "    print(f\"\\n{model}\")\n",
    "    print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
    "    print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86d6fc51-1d67-4ff8-ba53-1b0eaf02ab46",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
   "metadata": {},
   "source": [
    "- Based on the average scores above (78.02 vs. 66.56), we can say that the first model performs better than the second model"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}