"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
"# Evaluating Instruction Responses Locally Using a Llama 3 Model Via Ollama"
]
},
{
"cell_type": "markdown",
"id": "a128651b-f326-4232-a994-42f38b7ed520",
"metadata": {},
"source": [
"- This notebook uses an 8 billion parameter Llama 3 model through ollama to evaluate responses of instruction finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
"\n",
"\n",
"\n",
"```python\n",
"{\n",
" \"instruction\": \"What is the atomic number of helium?\",\n",
" \"input\": \"\",\n",
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
"},\n",
"```\n",
"\n",
"- The code doesn't require a GPU and runs on a laptop (it was tested on a M3 MacBook Air)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tqdm version: 4.66.2\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
"pkgs = [\"tqdm\", # Progress bar\n",
" ]\n",
"\n",
"for p in pkgs:\n",
" print(f\"{p} version: {version(p)}\")"
]
},
{
"cell_type": "markdown",
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
"metadata": {},
"source": [
"## Installing Ollama and Downloading Llama 3"
]
},
{
"cell_type": "markdown",
"id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
"metadata": {},
"source": [
"- Ollama is an application to run LLMs efficiently\n",
"- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
"- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n",
"- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
"- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
"- Linux users can use the installation command provided on the ollama website\n",
"\n",
"- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal\n",
"- With the ollama application or `ollama serve` running, in a different terminal, on the command line, execute the following command to try out the 8 billion parameters Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n",
"- Note that `llama3` refers to the instruction finetuned 8 billion Llama 3 model\n",
"\n",
"- Alternatively, you can also use the larger 70 billion parameters Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n",
"\n",
"- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
"\n",
"- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
"\n",
"```\n",
">>> What do llamas eat?\n",
"Llamas are ruminant animals, which means they have a four-chambered\n",
"stomach and eat plants that are high in fiber. In the wild, llamas\n",
"typically feed on:\n",
"1. Grasses: They love to graze on various types of grasses, including tall\n",
"1. Grasses: Llamas love to graze on various types of grasses, including tall grasses and short grasses.\n",
"2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in many llama diets.\n",
"3. Grains: Oats, corn, and barley are common grains used in llama feed.\n",
"4. Fruits and vegetables: Fresh fruits and vegetables can be offered as treats or added to their regular diet. Favorites include apples, carrots, and sweet potatoes.\n",
"5. Minerals: Llamas require access to mineral supplements, such as salt licks or loose minerals, to ensure they get the necessary nutrients.\n",
"It's essential to provide a balanced diet for llamas, as they have specific nutritional needs. A good quality commercial llama feed or a veterinarian-recommended feeding plan can help ensure your llama stays healthy and happy!\n"
"result = query_model(\"What do Llamas eat?\")\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
"metadata": {},
"source": [
"- First, let's try the API with a simple example to make sure it works as intended:"
]
},
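{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The `query_model` function below is a minimal sketch of how to call ollama's REST API (`http://localhost:11434/api/chat`) from Python using only the standard library; the payload options (a fixed seed and a temperature of 0 for more deterministic answers) are assumptions, not requirements\n",
"- Make sure that the ollama application or `ollama serve` is still running before executing this cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import urllib.request\n",
"\n",
"\n",
"def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\"):\n",
"    # Build the chat payload; a fixed seed and temperature of 0 make the\n",
"    # responses as deterministic as ollama allows\n",
"    data = {\n",
"        \"model\": model,\n",
"        \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n",
"        \"options\": {\"seed\": 123, \"temperature\": 0},\n",
"    }\n",
"\n",
"    # Send the request to the locally running ollama server\n",
"    request = urllib.request.Request(\n",
"        url,\n",
"        data=json.dumps(data).encode(\"utf-8\"),\n",
"        headers={\"Content-Type\": \"application/json\"},\n",
"        method=\"POST\",\n",
"    )\n",
"\n",
"    # ollama streams the answer as one JSON object per line;\n",
"    # concatenate the message chunks into a single string\n",
"    response_data = \"\"\n",
"    with urllib.request.urlopen(request) as response:\n",
"        for line in response:\n",
"            response_data += json.loads(line.decode(\"utf-8\"))[\"message\"][\"content\"]\n",
"\n",
"    return response_data\n",
"\n",
"\n",
"result = query_model(\"What do llamas eat?\")\n",
"print(result)"
]
},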
{
"cell_type": "markdown",
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
"metadata": {},
"source": [
"## Load JSON Entries"
]
},
{
"cell_type": "markdown",
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
"metadata": {},
"source": [
"- Now, let's get to the data evaluation part\n",
"- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of entries: 100\n"
]
}
],
"source": [
"json_file = \"eval-example-data.json\"\n",
"\n",
"with open(json_file, \"r\") as file:\n",
" json_data = json.load(file)\n",
" \n",
"print(\"Number of entries:\", len(json_data))"
]
},
{
"cell_type": "markdown",
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
"metadata": {},
"source": [
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
" 'input': '',\n",
" 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
" 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
" 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"json_data[0]"
]
},
{
"cell_type": "markdown",
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
"metadata": {},
"source": [
"- Below is a small utility function that formats the input for visualization purposes later:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
"metadata": {},
"outputs": [],
"source": [
"def format_input(entry):\n",
" instruction_text = (\n",
" f\"Below is an instruction that describes a task. Write a response that \"\n",
"To evaluate the model response, I'll compare it with the correct output. Here's my analysis:\n",
"\n",
"* Correct output: The hypotenuse of the triangle is 10 cm.\n",
"* Model response: The hypotenuse of the triangle is 3 cm.\n",
"\n",
"The model response has a significant error. The correct value for the hypotenuse is 10 cm, but the model response suggests it's only 3 cm. This indicates a lack of understanding or application of mathematical concepts in this specific problem.\n",
"* Grammar: 80 (while the sentence is grammatically correct, it's not as polished as my rewritten sentence)\n",
"* Clarity: 60 (the original instruction asked for a more formal way of expressing the thought, and while \"What is incorrect?\" gets the point across, it's not as elegant as my response)\n",
"* Formality: 40 (my response `I must determine what is amiss` has a more formal tone than the model response)\n",
"1. The model correctly identifies the interjection in the sentence, which is indeed \"Wow\".\n",
"2. The response is concise and directly answers the instruction.\n",
"3. There are no grammatical errors or typos in the response.\n",
"\n",
"The only thing that keeps me from giving it a perfect score of 100 is that the response could be slightly more explicit or detailed. For example, the model could have explained why \"Wow\" is considered an interjection (e.g., because it expresses strong emotions) to provide further context and clarity.\n",
"The two responses differ in their identification of the sentence type. The correct answer is \"interrogative\" (a question), while the model's response incorrectly says it's \"exclamatory\" (an exclamation).\n",
"So, the model's response scores 20 out of 100. This indicates that it has a significant mistake in identifying the sentence type, which means its performance is quite poor on this specific task.\n",
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on a M3 MacBook Air laptop)\n",