"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"# Evaluating Instruction Responses Locally Using a Llama 3 Model Via Ollama"
]
},
{
"cell_type": "markdown",
"id": "a128651b-f326-4232-a994-42f38b7ed520",
"metadata": {},
"source": [
"- This notebook uses an 8 billion parameter Llama 3 model through ollama to evaluate responses of instruction finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
"\n",
"\n",
"\n",
"```python\n",
"{\n",
" \"instruction\": \"What is the atomic number of helium?\",\n",
" \"input\": \"\",\n",
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
"},\n",
"```\n",
"\n",
"- The code doesn't require a GPU and runs on a laptop (it was tested on a M3 MacBook Air)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tqdm version: 4.66.2\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"pkgs = [\"tqdm\", # Progress bar\n",
" ]\n",
"\n",
"for p in pkgs:\n",
" print(f\"{p} version: {version(p)}\")"
]
},
{
"cell_type": "markdown",
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
"metadata": {},
"source": [
"## Installing Ollama and Downloading Llama 3"
]
},
{
"cell_type": "markdown",
"id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
"metadata": {},
"source": [
"- Ollama is an application to run LLMs efficiently\n",
"- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
"- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n",
"- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
]
},
{
"cell_type": "markdown",
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
"metadata": {},
"source": [
"- Now let's test if ollama is set up correctly\n",
"- For this, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
"- Next, on the command line, execute the following command to try out the 8 billion parameters Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n",
"- Note that `llama3` refers to the instruction finetuned 8 billion Llama 3 model\n",
"\n",
"- Alternatively, you can also use the larger 70 billion parameters Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n",
"\n",
"- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
"\n",
"- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
"\n",
"```\n",
">>> What do llamas eat?\n",
"Llamas are ruminant animals, which means they have a four-chambered\n",
"stomach and eat plants that are high in fiber. In the wild, llamas\n",
"typically feed on:\n",
"1. Grasses: They love to graze on various types of grasses, including tall\n",
"- Now, an alternative way to interact with the model is via its REST API in Python via the following function\n",
"- First, in your terminal, start a local ollama server via `ollama serve` (after executing the code in this notebook, you can later stop this session by simply closing the terminal)\n",
"- Next, run the following code cell to query the model"
"Domesticated llamas usually have a more controlled diet, as their owners provide them with specific foods and supplements to ensure they receive the nutrients they need. A balanced diet for an llama typically includes 15-20% hay, 10-15% grains, and 5-10% fruits and vegetables.\n",
"\n",
"Remember, always consult with a veterinarian or experienced llama breeder to determine the best diet for your individual llama!\n"
"result = query_model(\"What do Llamas eat?\")\n",
"print(result)"
]
},
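{
"cell_type": "markdown",
"id": "e2b1e9f0-6c55-4f1a-9c3b-2a7a1d2f3b4c",
"metadata": {},
"source": [
"- The `query_model` function below is a minimal sketch of such a helper; it assumes the default local endpoint `http://localhost:11434/api/chat` and collects ollama's streamed JSON response lines into a single string"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4bda6d30-6fa8-4e61-9a0f-1c2d3e4f5a6b",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import urllib.request\n",
"\n",
"\n",
"def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\"):\n",
"    # Build the chat payload; the options below ask for (mostly) deterministic output\n",
"    data = {\n",
"        \"model\": model,\n",
"        \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n",
"        \"options\": {\"seed\": 123, \"temperature\": 0},\n",
"    }\n",
"\n",
"    # Encode the payload and send it as a POST request to the local ollama server\n",
"    payload = json.dumps(data).encode(\"utf-8\")\n",
"    request = urllib.request.Request(url, data=payload, method=\"POST\")\n",
"    request.add_header(\"Content-Type\", \"application/json\")\n",
"\n",
"    # Ollama streams the response as one JSON object per line;\n",
"    # concatenate the message chunks into a single string\n",
"    response_data = \"\"\n",
"    with urllib.request.urlopen(request) as response:\n",
"        while True:\n",
"            line = response.readline().decode(\"utf-8\")\n",
"            if not line:\n",
"                break\n",
"            response_json = json.loads(line)\n",
"            response_data += response_json[\"message\"][\"content\"]\n",
"\n",
"    return response_data"
]
},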
{
"cell_type": "markdown",
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
"metadata": {},
"source": [
"- First, let's try the API with a simple example to make sure it works as intended:"
]
},
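{
"cell_type": "code",
"execution_count": null,
"id": "9f6c2b18-3d5e-4a7b-8c11-0a2b4c6d8e90",
"metadata": {},
"outputs": [],
"source": [
"result = query_model(\"What do Llamas eat?\")\n",
"print(result)"
]
},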
{
"cell_type": "markdown",
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
"metadata": {},
"source": [
"## Load JSON Entries"
]
},
{
"cell_type": "markdown",
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
"metadata": {},
"source": [
"- Now, let's get to the data evaluation part\n",
"- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of entries: 100\n"
]
}
],
"source": [
"import json\n",
"\n",
"json_file = \"eval-example-data.json\"\n",
"\n",
"with open(json_file, \"r\") as file:\n",
" json_data = json.load(file)\n",
" \n",
"print(\"Number of entries:\", len(json_data))"
]
},
{
"cell_type": "markdown",
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
"metadata": {},
"source": [
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
" 'input': '',\n",
" 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
" 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
" 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"json_data[0]"
]
},
{
"cell_type": "markdown",
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
"metadata": {},
"source": [
"- Below is a small utility function that formats the input for visualization purposes later:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
"metadata": {},
"outputs": [],
"source": [
"def format_input(entry):\n",
" instruction_text = (\n",
" f\"Below is an instruction that describes a task. Write a response that \"\n",
" input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
" instruction_text + input_text\n",
"\n",
" return instruction_text + input_text"
]
},
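{
"cell_type": "markdown",
"id": "2f7d8a44-91b2-4c3e-a5f6-7b8c9d0e1f23",
"metadata": {},
"source": [
"- For illustration, printing the formatted prompt for the first entry shows the layout that the scoring prompts below will embed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a1b2c3d-4e5f-4a6b-9c8d-7e6f5a4b3c2d",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: inspect the formatted prompt for the first entry\n",
"print(format_input(json_data[0]))"
]
},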
{
"cell_type": "markdown",
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
"metadata": {},
"source": [
"- Now, let's try the ollama API to compare the model responses (we only evalyate the first 5 responses for a visual comparison):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "735cc089-d127-480a-b39d-0782581f0c41",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Dataset response:\n",
">> The hypotenuse of the triangle is 10 cm.\n",
"\n",
"Model response:\n",
">> \n",
"The hypotenuse of the triangle is 3 cm.\n",
"\n",
"Score:\n",
">> To evaluate the model response, I'll compare it to the correct output.\n",
"\n",
"Correct output: The hypotenuse of the triangle is 10 cm.\n",
"Model response: The hypotenuse of the triangle is 3 cm.\n",
"\n",
"The model response is incorrect, as the calculated value (3 cm) does not match the actual value (10 cm). Therefore, I would score this response a 0 out of 100.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> 1. Squirrel\n",
"2. Eagle\n",
"3. Tiger\n",
"\n",
"Model response:\n",
">> \n",
"1. Squirrel\n",
"2. Tiger\n",
"3. Eagle\n",
"4. Cobra\n",
"5. Tiger\n",
"6. Cobra\n",
"\n",
"Score:\n",
">> To complete the request, I will provide a response that names three different animals that are active during the day.\n",
"\n",
"### Response:\n",
"1. Squirrel\n",
"2. Eagle\n",
"3. Tiger\n",
"\n",
"Now, let's evaluate the model response based on the provided options. Here's how it scores:\n",
"\n",
"1. Squirrel (Match)\n",
"2. Tiger (Match)\n",
"3. Eagle (Match)\n",
"\n",
"The model response correctly identifies three animals that are active during the day: squirrel, tiger, and eagle.\n",
"\n",
"On a scale from 0 to 100, I would score this response as **80**. The model accurately completes the request and provides relevant information. However, it does not fully utilize all available options (4-6), which is why the score is not higher.\n",
"\n",
"Corrected output: 1. Squirrel\n",
"2. Eagle\n",
"3. Tiger\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> I must ascertain what is incorrect.\n",
"\n",
"Model response:\n",
">> \n",
"What is incorrect?\n",
"\n",
"Score:\n",
">> The task is to rewrite a sentence in a more formal way.\n",
"\n",
"### Original Sentence:\n",
"\"I need to find out what's wrong.\"\n",
"\n",
"### Formal Rewrite:\n",
"\"I must ascertain what is incorrect.\"\n",
"\n",
"Score: **90**\n",
"\n",
"The model response accurately captures the original sentence's meaning while adopting a more formal tone. The words \"ascertain\" and \"incorrect\" effectively convey a sense of professionalism and precision, making it suitable for a formal setting.\n",
"\n",
"Note: I scored the model response 90 out of 100 because it successfully transformed the informal sentence into a more formal one, but there is room for improvement in terms of style and nuance.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> The interjection in the sentence is 'Wow'.\n",
"\n",
"Model response:\n",
">> \n",
"The interjection in the sentence is 'Wow'.\n",
"\n",
"Score:\n",
">> A scoring question!\n",
"\n",
"I'd rate the model response as **98** out of 100.\n",
"\n",
"Here's why:\n",
"\n",
"* The model correctly identifies \"Wow\" as the interjection in the sentence.\n",
"* The response is concise and directly answers the instruction.\n",
"* There are no grammatical errors, typos, or inaccuracies in the response.\n",
"\n",
"The only reason I wouldn't give it a perfect score (100) is that it's possible for an even more precise or detailed response to be given, such as \"The sentence contains a single interjection: 'Wow', which is used to express surprise and enthusiasm.\" However, the model's response is still very good, and 98 out of 100 is a strong score.\n",
"\n",
"-------------------------\n",
"\n",
"Dataset response:\n",
">> The type of sentence is interrogative.\n",
"\n",
"Model response:\n",
">> \n",
"The type of sentence is exclamatory.\n",
"\n",
"Score:\n",
">> A nice simple task!\n",
"\n",
"To score my response, I'll compare it with the correct output.\n",
"\n",
"Correct output: The type of sentence is interrogative.\n",
"My response: The type of sentence is exclamatory.\n",
"\n",
"The correct answer is an interrogative sentence (asking a question), while my response suggests it's an exclamatory sentence (expressing strong emotions). Oops!\n",
"\n",
"So, I'd score my response as follows:\n",
"\n",
"* Correctness: 0/10\n",
"* Relevance: 0/10 (my response doesn't even match the input)\n",
"* Overall quality: 0/100\n",
"\n",
"The lowest possible score is 0. Unfortunately, that's where my response falls. Better luck next time!\n",
"\n",
"-------------------------\n"
]
}
],
"source": [
"for entry in json_data[:5]:\n",
" prompt = (f\"Given the input `{format_input(entry)}` \"\n",
" f\"score the model response `{entry[json_key]}`\"\n",
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
" f\"Respond with the integer number only.\"\n",
" )\n",
" score = query_model(prompt)\n",
" try:\n",
" scores.append(int(score))\n",
" except:\n",
" continue\n",
"\n",
" return scores"
]
},
{
"cell_type": "markdown",
"id": "b071ce84-1866-427f-a272-b46700f364b2",
"metadata": {},
"source": [
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 min per model on a M3 MacBook Air laptop)\n",
"- Note that ollama is not fully deterministic (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"