LLMs-from-scratch/ch07/03_model-evaluation/llm-instruction-eval-ollama.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "136a4efe-fb99-4311-8679-e0a5b6282755",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Evaluating Instruction Responses Locally Using a Llama 3 Model Via Ollama"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a128651b-f326-4232-a994-42f38b7ed520",
   "metadata": {},
   "source": [
    "- This notebook uses an 8 billion parameter Llama 3 model through ollama to evaluate responses of instruction finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
    "\n",
    "\n",
    "\n",
    "```python\n",
    "{\n",
    "    \"instruction\": \"What is the atomic number of helium?\",\n",
    "    \"input\": \"\",\n",
    "    \"output\": \"The atomic number of helium is 2.\",               # <-- The target given in the test set\n",
    "    \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
    "    \"model 2 response\": \"\\nThe atomic number of helium is 3.\"    # <-- Response by a 2nd LLM\n",
    "},\n",
    "```\n",
    "\n",
    "- The code doesn't require a GPU and runs on a laptop (it was tested on a M3 MacBook Air)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "63610acc-db94-437f-8d38-e99dca0299cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tqdm version: 4.66.4\n"
     ]
    }
   ],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "pkgs = [\"tqdm\",    # Progress bar\n",
    "        ]\n",
    "\n",
    "for p in pkgs:\n",
    "    print(f\"{p} version: {version(p)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
   "metadata": {},
   "source": [
    "## Installing Ollama and Downloading Llama 3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
   "metadata": {},
   "source": [
    "- Ollama is an application to run LLMs efficiently\n",
    "- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
    "- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n",
    "- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
   "metadata": {},
   "source": [
    "- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
    "- Linux users can use the installation command provided on the ollama website\n",
    "\n",
    "- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1\">\n",
    "\n",
    "\n",
    "- With the ollama application or `ollama serve` running, in a different terminal, on the command line, execute the following command to try out the 8 billion parameters Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n",
    "\n",
    "```bash\n",
    "# 8B model\n",
    "ollama run llama3\n",
    "```\n",
    "\n",
    "\n",
    "The output looks like as follows:\n",
    "\n",
    "```\n",
    "$ ollama run llama3\n",
    "pulling manifest \n",
    "pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB                         \n",
    "pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB                         \n",
    "pulling 8ab4849b038c... 100% ▕████████████████▏  254 B                         \n",
    "pulling 577073ffcc6c... 100% ▕████████████████▏  110 B                         \n",
    "pulling 3f8eb4da87fa... 100% ▕████████████████▏  485 B                         \n",
    "verifying sha256 digest \n",
    "writing manifest \n",
    "removing any unused layers \n",
    "success \n",
    "```\n",
    "\n",
    "- Note that `llama3` refers to the instruction finetuned 8 billion Llama 3 model\n",
    "\n",
    "- Alternatively, you can also use the larger 70 billion parameters Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n",
    "\n",
    "- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
    "\n",
    "- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
    "\n",
    "```\n",
    ">>> What do llamas eat?\n",
    "Llamas are ruminant animals, which means they have a four-chambered \n",
    "stomach and eat plants that are high in fiber. In the wild, llamas \n",
    "typically feed on:\n",
    "1. Grasses: They love to graze on various types of grasses, including tall \n",
    "grasses, wheat, oats, and barley.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b5addcb-fc7d-455d-bee9-6cc7a0d684c7",
   "metadata": {},
   "source": [
    "- You can end this session using the input `/bye`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dda155ee-cf36-44d3-b634-20ba8e1ca38a",
   "metadata": {},
   "source": [
    "## Using Ollama's REST API"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
   "metadata": {},
   "source": [
    "- Now, an alternative way to interact with the model is via its REST API in Python via the following function\n",
    "- Before you run the next cells in this notebook, make sure that ollama is still running, as described above, via\n",
    "  - `ollama serve` in a terminal\n",
    "  - the ollama application\n",
    "- Next, run the following code cell to query the model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
   "metadata": {},
   "source": [
    "- First, let's try the API with a simple example to make sure it works as intended:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:\n",
      "\n",
      "1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.\n",
      "2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.\n",
      "3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.\n",
      "4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.\n",
      "5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.\n",
      "\n",
      "In the wild, llamas might also eat:\n",
      "\n",
      "1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.\n",
      "2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or cottonwood.\n",
      "3. Mosses and lichens: These non-vascular plants can be a tasty snack for llamas.\n",
      "\n",
      "In captivity, llama owners typically provide a balanced diet that includes a mix of hay, grains, and fruits/vegetables. It's essential to consult with a veterinarian or experienced llama breeder to determine the best feeding plan for your llama.\n"
     ]
    }
   ],
   "source": [
    "import urllib.request\n",
    "import json\n",
    "\n",
    "\n",
    "def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\"):\n",
    "    # Create the data payload as a dictionary\n",
    "    data = {\n",
    "        \"model\": model,\n",
    "        \"messages\": [\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": prompt\n",
    "            }\n",
    "        ],\n",
    "        \"options\": {     # Settings below are required for deterministic responses\n",
    "            \"seed\": 123,\n",
    "            \"temperature\": 0,\n",
    "            \"num_ctx\": 2048\n",
    "        }\n",
    "    }\n",
    "\n",
    "    # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
    "    payload = json.dumps(data).encode(\"utf-8\")\n",
    "\n",
    "    # Create a request object, setting the method to POST and adding necessary headers\n",
    "    request = urllib.request.Request(url, data=payload, method=\"POST\")\n",
    "    request.add_header(\"Content-Type\", \"application/json\")\n",
    "\n",
    "    # Send the request and capture the response\n",
    "    response_data = \"\"\n",
    "    with urllib.request.urlopen(request) as response:\n",
    "        # Read and decode the response\n",
    "        while True:\n",
    "            line = response.readline().decode(\"utf-8\")\n",
    "            if not line:\n",
    "                break\n",
    "            response_json = json.loads(line)\n",
    "            response_data += response_json[\"message\"][\"content\"]\n",
    "\n",
    "    return response_data\n",
    "\n",
    "\n",
    "result = query_model(\"What do Llamas eat?\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
   "metadata": {},
   "source": [
    "## Load JSON Entries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
   "metadata": {},
   "source": [
    "- Now, let's get to the data evaluation part\n",
    "- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of entries: 100\n"
     ]
    }
   ],
   "source": [
    "json_file = \"eval-example-data.json\"\n",
    "\n",
    "with open(json_file, \"r\") as file:\n",
    "    json_data = json.load(file)\n",
    "\n",
    "print(\"Number of entries:\", len(json_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
   "metadata": {},
   "source": [
    "- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7222fdc0-5684-4f2b-b741-3e341851359e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
       " 'input': '',\n",
       " 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
       " 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
       " 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
   "metadata": {},
   "source": [
    "- Below is a small utility function that formats the input for visualization purposes later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_input(entry):\n",
    "    instruction_text = (\n",
    "        f\"Below is an instruction that describes a task. Write a response that \"\n",
    "        f\"appropriately completes the request.\"\n",
    "        f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
    "    )\n",
    "\n",
    "    input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
    "    instruction_text + input_text\n",
    "\n",
    "    return instruction_text + input_text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39a55283-7d51-4136-ba60-f799d49f4098",
   "metadata": {},
   "source": [
    "- Now, let's try the ollama API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "735cc089-d127-480a-b39d-0782581f0c41",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Dataset response:\n",
      ">> The hypotenuse of the triangle is 10 cm.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The hypotenuse of the triangle is 3 cm.\n",
      "\n",
      "Score:\n",
      ">> I'd score this response as 0 out of 100.\n",
      "\n",
      "The correct answer is \"The hypotenuse of the triangle is 10 cm.\", not \"3 cm.\". The model failed to accurately calculate the length of the hypotenuse, which is a fundamental concept in geometry and trigonometry.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> 1. Squirrel\n",
      "2. Eagle\n",
      "3. Tiger\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "1. Squirrel\n",
      "2. Tiger\n",
      "3. Eagle\n",
      "4. Cobra\n",
      "5. Tiger\n",
      "6. Cobra\n",
      "\n",
      "Score:\n",
      ">> I'd rate this model response as 60 out of 100.\n",
      "\n",
      "Here's why:\n",
      "\n",
      "* The model correctly identifies two animals that are active during the day: Squirrel and Eagle.\n",
      "* However, it incorrectly includes Tiger twice, which is not a different animal from the original list.\n",
      "* Cobra is also an incorrect answer, as it is typically nocturnal or crepuscular (active at twilight).\n",
      "* The response does not meet the instruction to provide three different animals that are active during the day.\n",
      "\n",
      "To achieve a higher score, the model should have provided three unique and correct answers that fit the instruction.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> I must ascertain what is incorrect.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "What is incorrect?\n",
      "\n",
      "Score:\n",
      ">> A clever test!\n",
      "\n",
      "Here's my attempt at rewriting the sentence in a more formal way:\n",
      "\n",
      "\"I require an identification of the issue.\"\n",
      "\n",
      "Now, let's evaluate the model response \"What is incorrect?\" against the correct output \"I must ascertain what is incorrect.\".\n",
      "\n",
      "To me, this seems like a completely different question being asked. The original instruction was to rewrite the sentence in a more formal way, and the model response doesn't even attempt to do that. It's asking a new question altogether!\n",
      "\n",
      "So, I'd score this response a 0 out of 100.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> The interjection in the sentence is 'Wow'.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The interjection in the sentence is 'Wow'.\n",
      "\n",
      "Score:\n",
      ">> I'd score this model response as 100.\n",
      "\n",
      "Here's why:\n",
      "\n",
      "1. The instruction asks to identify the interjection in the sentence.\n",
      "2. The input sentence is provided: \"Wow, that was an amazing trick!\"\n",
      "3. The model correctly identifies the interjection as \"Wow\", which is a common English interjection used to express surprise or excitement.\n",
      "4. The response accurately answers the question and provides the correct information.\n",
      "\n",
      "Overall, the model's response perfectly completes the request, making it a 100% accurate answer!\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> The type of sentence is interrogative.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The type of sentence is exclamatory.\n",
      "\n",
      "Score:\n",
      ">> I'd rate this model response as 20 out of 100.\n",
      "\n",
      "Here's why:\n",
      "\n",
      "* The input sentence \"Did you finish the report?\" is indeed an interrogative sentence, which asks a question.\n",
      "* The model response says it's exclamatory, which is incorrect. Exclamatory sentences are typically marked by an exclamation mark (!) and express strong emotions or emphasis, whereas this sentence is simply asking a question.\n",
      "\n",
      "The correct output \"The type of sentence is interrogative.\" is the best possible score (100), while the model response is significantly off the mark, hence the low score.\n",
      "\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "for entry in json_data[:5]:\n",
    "    prompt = (f\"Given the input `{format_input(entry)}` \"\n",
    "              f\"and correct output `{entry['output']}`, \"\n",
    "              f\"score the model response `{entry['model 1 response']}`\"\n",
    "              f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
    "              )\n",
    "    print(\"\\nDataset response:\")\n",
    "    print(\">>\", entry['output'])\n",
    "    print(\"\\nModel response:\")\n",
    "    print(\">>\", entry[\"model 1 response\"])\n",
    "    print(\"\\nScore:\")\n",
    "    print(\">>\", query_model(prompt))\n",
    "    print(\"\\n-------------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
   "metadata": {},
   "source": [
    "- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "\n",
    "def generate_model_scores(json_data, json_key):\n",
    "    scores = []\n",
    "    for entry in tqdm(json_data, desc=\"Scoring entries\"):\n",
    "        prompt = (\n",
    "            f\"Given the input `{format_input(entry)}` \"\n",
    "            f\"and correct output `{entry['output']}`, \"\n",
    "            f\"score the model response `{entry[json_key]}`\"\n",
    "            f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
    "            f\"Respond with the integer number only.\"\n",
    "        )\n",
    "        score = query_model(prompt)\n",
    "        try:\n",
    "            scores.append(int(score))\n",
    "        except ValueError:\n",
    "            continue\n",
    "\n",
    "    return scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b071ce84-1866-427f-a272-b46700f364b2",
   "metadata": {},
   "source": [
    "- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on an M3 MacBook Air laptop)\n",
    "- Note that ollama is not fully deterministic across operating systems (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4f700d4b-19e5-4404-afa7-b0f093024232",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Scoring entries: 100%|████████████████████████| 100/100 [01:02<00:00,  1.59it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "model 1 response\n",
      "Number of scores: 100 of 100\n",
      "Average score: 78.48\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00,  1.42it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "model 2 response\n",
      "Number of scores: 99 of 100\n",
      "Average score: 64.98\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "for model in (\"model 1 response\", \"model 2 response\"):\n",
    "\n",
    "    scores = generate_model_scores(json_data, model)\n",
    "    print(f\"\\n{model}\")\n",
    "    print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
    "    print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")\n",
    "\n",
    "    # Optionally save the scores\n",
    "    save_path = Path(\"scores\") / f\"llama3-8b-{model.replace(' ', '-')}.json\"\n",
    "    with open(save_path, \"w\") as file:\n",
    "        json.dump(scores, file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
   "metadata": {},
   "source": [
    "- Based on the evaluation above, we can say that the 1st model is better than the 2nd model"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}