mirror of https://github.com/rasbt/LLMs-from-scratch.git (synced 2025-09-25 08:05:45 +00:00)

correlation analysis (#196)

This commit is contained in:
parent 9e257212b2
commit de36026e5a
@ -179,24 +179,21 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llamas are ruminant animals, which means they have a four-chambered stomach and eat plants. Their diet typically consists of:\n",
|
||||
"Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:\n",
|
||||
"\n",
|
||||
"1. Grasses: Llamas love to graze on grasses, including tall grasses, meadow grasses, and wheat.\n",
|
||||
"2. Hay: High-quality hay is a staple in an llama's diet. They enjoy timothy hay, alfalfa hay, and other types of hay.\n",
|
||||
"3. Grains: Whole grains like oats, barley, and corn are also part of their diet.\n",
|
||||
"4. Fruits and vegetables: Llamas will eat fruits like apples, carrots, and sweet potatoes as a treat or to supplement their diet.\n",
|
||||
"5. Minerals: They need access to loose minerals like salt, calcium, and phosphorus to stay healthy.\n",
|
||||
"1. Grasses: Llamas love to graze on various types of grasses, including tall grasses and short grasses.\n",
|
||||
"2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in many llama diets.\n",
|
||||
"3. Grains: Oats, corn, and barley are common grains used in llama feed.\n",
|
||||
"4. Fruits and vegetables: Fresh fruits and vegetables can be offered as treats or added to their regular diet. Favorites include apples, carrots, and sweet potatoes.\n",
|
||||
"5. Minerals: Llamas require access to mineral supplements, such as salt licks or loose minerals, to ensure they get the necessary nutrients.\n",
|
||||
"\n",
|
||||
"In the wild, llamas might also eat:\n",
|
||||
"In the wild, llamas will also eat:\n",
|
||||
"\n",
|
||||
"* Leaves from shrubs and trees\n",
|
||||
"* Bark (in some cases)\n",
|
||||
"* Seeds\n",
|
||||
"* Fungi\n",
|
||||
"1. Leaves: They'll munch on leaves from shrubs and trees, like willow or cedar.\n",
|
||||
"2. Bark: In some cases, llamas might eat the bark of certain trees, like aspen or cottonwood.\n",
|
||||
"3. Mosses: Llamas may consume various types of mosses that grow in their environment.\n",
|
||||
"\n",
|
||||
"Domesticated llamas usually have a more controlled diet, as their owners provide them with specific foods and supplements to ensure they receive the nutrients they need. A balanced diet for an llama typically includes 15-20% hay, 10-15% grains, and 5-10% fruits and vegetables.\n",
|
||||
"\n",
|
||||
"Remember, always consult with a veterinarian or experienced llama breeder to determine the best diet for your individual llama!\n"
|
||||
"It's essential to provide a balanced diet for llamas, as they have specific nutritional needs. A good quality commercial llama feed or a veterinarian-recommended feeding plan can help ensure your llama stays healthy and happy!\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -378,12 +375,22 @@
|
||||
"The hypotenuse of the triangle is 3 cm.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> To evaluate the model response, I'll compare it to the correct output.\n",
|
||||
">> I'd be happy to help!\n",
|
||||
"\n",
|
||||
"Correct output: The hypotenuse of the triangle is 10 cm.\n",
|
||||
"Model response: The hypotenuse of the triangle is 3 cm.\n",
|
||||
"To evaluate the model response, I'll compare it with the correct output. Here's my analysis:\n",
|
||||
"\n",
|
||||
"The model response is incorrect, as the calculated value (3 cm) does not match the actual value (10 cm). Therefore, I would score this response a 0 out of 100.\n",
|
||||
"* Correct output: The hypotenuse of the triangle is 10 cm.\n",
|
||||
"* Model response: The hypotenuse of the triangle is 3 cm.\n",
|
||||
"\n",
|
||||
"The model response has a significant error. The correct value for the hypotenuse is 10 cm, but the model response suggests it's only 3 cm. This indicates a lack of understanding or application of mathematical concepts in this specific problem.\n",
|
||||
"\n",
|
||||
"Based on this analysis, I'd score the model response as follows:\n",
|
||||
"\n",
|
||||
"* Accuracy: 0/100 (The model response has a significant error and doesn't match the correct solution.)\n",
|
||||
"* Understanding: 20/100 (The model response shows some misunderstanding of the concept or calculation.)\n",
|
||||
"* Overall score: 10/100\n",
|
||||
"\n",
|
||||
"This score indicates that the model response is not accurate and demonstrates limited understanding of the problem.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -402,26 +409,21 @@
|
||||
"6. Cobra\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> To complete the request, I will provide a response that names three different animals that are active during the day.\n",
|
||||
">> To evaluate the model's response, I'll compare it to the expected output.\n",
|
||||
"\n",
|
||||
"### Response:\n",
|
||||
"1. Squirrel\n",
|
||||
"2. Eagle\n",
|
||||
"3. Tiger\n",
|
||||
"Expected output: 1. Squirrel, 2. Eagle, 3. Tiger\n",
|
||||
"Model's response: 1. Squirrel, 2. Tiger, 3. Eagle\n",
|
||||
"\n",
|
||||
"Now, let's evaluate the model response based on the provided options. Here's how it scores:\n",
|
||||
"The model got two out of three animals correct (Squirrel and Tiger), which means it scored 66.67% on this task.\n",
|
||||
"\n",
|
||||
"1. Squirrel (Match)\n",
|
||||
"2. Tiger (Match)\n",
|
||||
"3. Eagle (Match)\n",
|
||||
"To score the model's response on a scale from 0 to 100, I'll use the following formula:\n",
|
||||
"\n",
|
||||
"The model response correctly identifies three animals that are active during the day: squirrel, tiger, and eagle.\n",
|
||||
"Score = (Number of correct answers / Total number of answers) * 100\n",
|
||||
"\n",
|
||||
"On a scale from 0 to 100, I would score this response as **80**. The model accurately completes the request and provides relevant information. However, it does not fully utilize all available options (4-6), which is why the score is not higher.\n",
|
||||
"In this case:\n",
|
||||
"Score = (2/3) * 100 = 66.67%\n",
|
||||
"\n",
|
||||
"Corrected output: 1. Squirrel\n",
|
||||
"2. Eagle\n",
|
||||
"3. Tiger\n",
|
||||
"So, the model's response scores 66.67% on a scale from 0 to 100.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -433,19 +435,23 @@
|
||||
"What is incorrect?\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The task is to rewrite a sentence in a more formal way.\n",
|
||||
">> A clever task!\n",
|
||||
"\n",
|
||||
"### Original Sentence:\n",
|
||||
"\"I need to find out what's wrong.\"\n",
|
||||
"Here's my response:\n",
|
||||
"\n",
|
||||
"### Formal Rewrite:\n",
|
||||
"\"I must ascertain what is incorrect.\"\n",
|
||||
"**Rewritten sentence:** I must determine what is amiss.\n",
|
||||
"\n",
|
||||
"Score: **90**\n",
|
||||
"And now, let me evaluate the model response `What is incorrect?`\n",
|
||||
"\n",
|
||||
"The model response accurately captures the original sentence's meaning while adopting a more formal tone. The words \"ascertain\" and \"incorrect\" effectively convey a sense of professionalism and precision, making it suitable for a formal setting.\n",
|
||||
"I would score it as follows:\n",
|
||||
"\n",
|
||||
"Note: I scored the model response 90 out of 100 because it successfully transformed the informal sentence into a more formal one, but there is room for improvement in terms of style and nuance.\n",
|
||||
"* Grammar: 80 (while the sentence is grammatically correct, it's not as polished as my rewritten sentence)\n",
|
||||
"* Clarity: 60 (the original instruction asked for a more formal way of expressing the thought, and while \"What is incorrect?\" gets the point across, it's not as elegant as my response)\n",
|
||||
"* Formality: 40 (my response `I must determine what is amiss` has a more formal tone than the model response)\n",
|
||||
"\n",
|
||||
"Overall score: **60**\n",
|
||||
"\n",
|
||||
"So, I'd say the model response scores a decent 60 out of 100. Not bad!\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -457,17 +463,17 @@
|
||||
"The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> A scoring question!\n",
|
||||
"\n",
|
||||
"I'd rate the model response as **98** out of 100.\n",
|
||||
">> I'd score this response as 90 out of 100.\n",
|
||||
"\n",
|
||||
"Here's why:\n",
|
||||
"\n",
|
||||
"* The model correctly identifies \"Wow\" as the interjection in the sentence.\n",
|
||||
"* The response is concise and directly answers the instruction.\n",
|
||||
"* There are no grammatical errors, typos, or inaccuracies in the response.\n",
|
||||
"1. The model correctly identifies the interjection in the sentence, which is indeed \"Wow\".\n",
|
||||
"2. The response is concise and directly answers the instruction.\n",
|
||||
"3. There are no grammatical errors or typos in the response.\n",
|
||||
"\n",
|
||||
"The only reason I wouldn't give it a perfect score (100) is that it's possible for an even more precise or detailed response to be given, such as \"The sentence contains a single interjection: 'Wow', which is used to express surprise and enthusiasm.\" However, the model's response is still very good, and 98 out of 100 is a strong score.\n",
|
||||
"The only thing that keeps me from giving it a perfect score of 100 is that the response could be slightly more explicit or detailed. For example, the model could have explained why \"Wow\" is considered an interjection (e.g., because it expresses strong emotions) to provide further context and clarity.\n",
|
||||
"\n",
|
||||
"However, overall, this is a well-crafted and accurate response that effectively completes the request!\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -479,22 +485,23 @@
|
||||
"The type of sentence is exclamatory.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> A nice simple task!\n",
|
||||
">> A language evaluation task!\n",
|
||||
"\n",
|
||||
"To score my response, I'll compare it with the correct output.\n",
|
||||
"To evaluate the model's response, I'll compare it to the correct output. Here's how:\n",
|
||||
"\n",
|
||||
"Correct output: The type of sentence is interrogative.\n",
|
||||
"My response: The type of sentence is exclamatory.\n",
|
||||
"1. Correct output: The type of sentence is interrogative.\n",
|
||||
"2. Model's response: The type of sentence is exclamatory.\n",
|
||||
"\n",
|
||||
"The correct answer is an interrogative sentence (asking a question), while my response suggests it's an exclamatory sentence (expressing strong emotions). Oops!\n",
|
||||
"The two responses differ in their identification of the sentence type. The correct answer is \"interrogative\" (a question), while the model's response incorrectly says it's \"exclamatory\" (an exclamation).\n",
|
||||
"\n",
|
||||
"So, I'd score my response as follows:\n",
|
||||
"Now, let's score the model's response:\n",
|
||||
"\n",
|
||||
"* Correctness: 0/10\n",
|
||||
"* Relevance: 0/10 (my response doesn't even match the input)\n",
|
||||
"* Overall quality: 0/100\n",
|
||||
"* Correctness: 0/100 (the response is entirely incorrect)\n",
|
||||
"* Similarity: 20/100 (the word \"sentence\" is correct, but the type identification is wrong)\n",
|
||||
"\n",
|
||||
"The lowest possible score is 0. Unfortunately, that's where my response falls. Better luck next time!\n",
|
||||
"Total score: 20/100\n",
|
||||
"\n",
|
||||
"So, the model's response scores 20 out of 100. This indicates that it has a significant mistake in identifying the sentence type, which means its performance is quite poor on this specific task.\n",
|
||||
"\n",
|
||||
"-------------------------\n"
|
||||
]
|
||||
@ -557,7 +564,7 @@
|
||||
"id": "b071ce84-1866-427f-a272-b46700f364b2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 min per model on a M3 MacBook Air laptop)\n",
|
||||
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on a M3 MacBook Air laptop)\n",
|
||||
"- Note that ollama is not fully deterministic (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"
|
||||
]
|
||||
},
|
||||
@ -571,7 +578,7 @@
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00, 1.50it/s]\n"
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:37<00:00, 1.02it/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -581,7 +588,7 @@
|
||||
"\n",
|
||||
"model 1 response\n",
|
||||
"Number of scores: 100 of 100\n",
|
||||
"Average score: 78.02\n",
|
||||
"Average score: 77.05\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
@ -589,7 +596,7 @@
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00, 1.41it/s]"
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:16<00:00, 1.31it/s]"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -599,7 +606,7 @@
|
||||
"\n",
|
||||
"model 2 response\n",
|
||||
"Number of scores: 99 of 100\n",
|
||||
"Average score: 66.56\n",
|
||||
"Average score: 67.37\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
@ -617,7 +624,13 @@
|
||||
" scores = generate_model_scores(json_data, model)\n",
|
||||
" print(f\"\\n{model}\")\n",
|
||||
" print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
|
||||
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")"
|
||||
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")\n",
|
||||
"\n",
|
||||
" # Optionally save the scores\n",
|
||||
" from pathlib import Path\n",
|
||||
" save_path = (Path(\"scores\")/f\"llama3-8b-{model}.json\").replace(\" \", \"-\")\n",
|
||||
" with open(save_path, \"w\") as file:\n",
|
||||
" json.dump(scores, file)"
|
||||
]
|
||||
},
|
||||
{
|
||||
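Once the loop above has written the score files, they can be re-loaded later without re-querying the judge model. Below is a minimal sketch (not part of the committed notebook); the file name is an assumption based on the `save_path` pattern added in the diff above.

```python
# Hypothetical re-loading of previously saved scores; the exact file name
# depends on how save_path was built in the notebook cell above.
import json
from pathlib import Path

score_file = Path("scores") / "llama3-8b-model-1-response.json"  # assumed name
with open(score_file) as f:
    scores = json.load(f)

print(f"Number of scores: {len(scores)}")
print(f"Average score: {sum(scores) / len(scores):.2f}")
```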
|
@ -71,7 +71,7 @@
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"openai version: 1.30.3\n",
|
||||
"tqdm version: 4.65.0\n"
|
||||
"tqdm version: 4.66.2\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -303,17 +303,21 @@
|
||||
"The hypotenuse of the triangle is 3 cm.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm should be done using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Thus, \\( c = \\sqrt{6^2 + 8^2} = \\sqrt{36 + 64} = \\sqrt{100} = 10 \\) cm.\n",
|
||||
">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm can be found using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Mathematically, this is expressed as:\n",
|
||||
"\n",
|
||||
"The model response provides a hypotenuse of 3 cm, which is not only numerically incorrect but also logically inconsistent because the hypotenuse is the longest side of a right triangle and cannot be shorter than either of the other two sides (6 cm and 8 cm in this case).\n",
|
||||
"\\[ c = \\sqrt{a^2 + b^2} \\]\n",
|
||||
"\\[ c = \\sqrt{6^2 + 8^2} \\]\n",
|
||||
"\\[ c = \\sqrt{36 + 64} \\]\n",
|
||||
"\\[ c = \\sqrt{100} \\]\n",
|
||||
"\\[ c = 10 \\text{ cm} \\]\n",
|
||||
"\n",
|
||||
"Given the scale from 0 to 100, where 100 is the best score:\n",
|
||||
"- Accuracy: The response is completely inaccurate.\n",
|
||||
"- Relevance: The response addresses the task of calculating the hypotenuse but fails to do so correctly.\n",
|
||||
"The correct answer should be 10 cm. The response given as 3 cm is not only incorrect but also significantly off from the correct value. This error could lead to misunderstandings or incorrect applications in practical scenarios where precise measurements are crucial.\n",
|
||||
"\n",
|
||||
"Score: 0\n",
|
||||
"Given the scale from 0 to 100, where 100 is the best score, the response would score very low due to its inaccuracy. However, since the response format is correct (stating the measurement and unit), it does not score the absolute minimum.\n",
|
||||
"\n",
|
||||
"The score is 0 because the response is factually incorrect and provides misleading information that does not fulfill the task as required.\n",
|
||||
"**Score: 10/100**\n",
|
||||
"\n",
|
||||
"This score reflects that while the format of the response is correct, the content is highly inaccurate.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -332,22 +336,19 @@
|
||||
"6. Cobra\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response lists six animals, three of which are repeated, and includes animals not specifically known for being diurnal (active during the day). The instruction specifically asked for three different animals that are active during the day. Here's the breakdown:\n",
|
||||
">> The model response lists six animals, three of which (squirrel, tiger, eagle) are indeed active during the day, making them correct responses to the instruction. However, the instruction specifically asked for three different animals, and the model response includes repetitions (tiger and cobra are each listed twice) and also exceeds the requested number of animals.\n",
|
||||
"\n",
|
||||
"1. **Squirrel** - Correct, squirrels are diurnal.\n",
|
||||
"2. **Tiger** - Generally, tigers are crepuscular (active during dawn and dusk) rather than strictly diurnal, but they can be active during the day, especially in cooler weather.\n",
|
||||
"3. **Eagle** - Correct, eagles are diurnal.\n",
|
||||
"4. **Cobra** - Incorrect, cobras are generally not diurnal; they are more active during the evening and early morning.\n",
|
||||
"5. **Tiger** - Repeated, and as noted, not strictly diurnal.\n",
|
||||
"6. **Cobra** - Repeated and incorrect.\n",
|
||||
"The inclusion of \"cobra\" is incorrect as most cobras are not diurnal (active during the day); they are generally more active during the early morning and late evening, which can be considered crepuscular rather than diurnal.\n",
|
||||
"\n",
|
||||
"### Scoring:\n",
|
||||
"- **Relevance to the task**: The task was to name three different animals active during the day. The response included two correct diurnal animals (squirrel, eagle) but also included incorrect and repeated entries.\n",
|
||||
"- **Accuracy**: Including animals not known for being diurnal (cobra) and repeating animals reduces the accuracy.\n",
|
||||
"- **Adherence to instruction**: The instruction asked for three animals, but six were provided, with repetitions.\n",
|
||||
"### Scoring Breakdown:\n",
|
||||
"- **Relevance to the task**: The response correctly identifies three diurnal animals but also includes additional animals, which was not requested.\n",
|
||||
"- **Accuracy**: Including animals not active during the day (cobra) and repeating animals reduces the accuracy.\n",
|
||||
"- **Adherence to instructions**: The task was to name three different animals, but the response included six names with repetitions.\n",
|
||||
"\n",
|
||||
"### Score: 40/100\n",
|
||||
"**Reasoning**: The response partially meets the criteria by including some correct animals but fails in terms of accuracy (inclusion of non-diurnal animals), repetition of animals, and not adhering to the instruction of listing only three animals.\n",
|
||||
"Given these points, the response partially meets the requirements but also deviates significantly in terms of the number of animals and the inclusion of incorrect and repeated entries.\n",
|
||||
"\n",
|
||||
"### Score: 50/100\n",
|
||||
"This score reflects that while the response did include three correct animals, it failed to strictly follow the instructions by listing only three different animals and included incorrect information.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -359,11 +360,16 @@
|
||||
"What is incorrect?\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response \"What is incorrect?\" would score relatively low on the scale for the given task. The original instruction was to rewrite the sentence \"I need to find out what's wrong\" in a more formal way. The correct output provided, \"I must ascertain what is incorrect,\" effectively increases the formality of the original sentence by using more formal vocabulary (\"must\" instead of \"need\" and \"ascertain\" instead of \"find out\") and adjusting the phrasing (\"what is incorrect\" instead of \"what's wrong\").\n",
|
||||
">> The model response \"What is incorrect?\" scores low in terms of fulfilling the instruction to rewrite the sentence in a more formal way. The original sentence \"I need to find out what's wrong.\" expresses a personal obligation and a process of discovery, which is not captured in the model response. The model response turns the sentence into a direct question and loses the nuance of needing to discover or investigate the issue.\n",
|
||||
"\n",
|
||||
"The model response, however, only addresses part of the sentence and does not maintain the original meaning or structure. It changes the sentence into a question and omits the aspect of needing to discover or investigate the issue, which is a critical component of the original sentence. Additionally, it does not enhance the formality significantly.\n",
|
||||
"**Score: 20/100**\n",
|
||||
"\n",
|
||||
"Given these considerations, I would score the model response around 20 out of 100. It recognizes the need to adjust the formality slightly but fails to maintain the original sentence's intent and structure, and does not fully meet the requirement of rewriting the sentence in a more formal way.\n",
|
||||
"**Reasoning:**\n",
|
||||
"- **Formality:** The response is slightly more formal than casual speech but does not elevate the formality significantly or appropriately. It does use \"incorrect\" which is slightly more formal than \"wrong.\"\n",
|
||||
"- **Completeness:** The response fails to include the aspect of needing to find out or ascertain, which is a critical part of the original sentence.\n",
|
||||
"- **Accuracy:** The response changes the structure and intent by converting it into a direct question, which does not align with the instruction to rewrite the statement while maintaining its original intent.\n",
|
||||
"\n",
|
||||
"Overall, the response does not adequately meet the requirements of the task as it significantly alters the meaning and omits key elements of the original sentence.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -375,7 +381,9 @@
|
||||
"The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection. Therefore, the response should be scored 100 out of 100.\n",
|
||||
">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection, which is used to express surprise or admiration, fitting the context of the sentence. Therefore, the response is fully correct and meets all the requirements of the task.\n",
|
||||
"\n",
|
||||
"Score: 100/100\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@ -387,13 +395,11 @@
|
||||
"The type of sentence is exclamatory.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response \"The type of sentence is exclamatory.\" is incorrect. The correct type of the sentence \"Did you finish the report?\" is interrogative, as it is a question. An exclamatory sentence would express strong emotion and typically ends with an exclamation mark.\n",
|
||||
">> The model response \"The type of sentence is exclamatory.\" is incorrect. The input sentence \"Did you finish the report?\" is clearly an interrogative sentence as it is asking a question, indicated by the question mark at the end and the structure of the sentence.\n",
|
||||
"\n",
|
||||
"Given the incorrect identification of the sentence type, the score for the model response should be low. However, the response does correctly format the answer by stating \"The type of sentence is...\" which shows an understanding of the task's requirements but fails in accuracy.\n",
|
||||
"Given the scoring criteria where 100 is the best score and should be awarded to a correct and precise response, the model's response should receive a low score because it incorrectly identifies the type of sentence. An exclamatory sentence typically expresses strong emotion and ends with an exclamation mark, which is not the case here.\n",
|
||||
"\n",
|
||||
"Score: 10/100\n",
|
||||
"\n",
|
||||
"The score reflects that the response is well-structured but fundamentally incorrect in identifying the sentence type.\n",
|
||||
"Therefore, the score for the model response would be 0 out of 100, as it completely misidentifies the type of sentence, providing incorrect information.\n",
|
||||
"\n",
|
||||
"-------------------------\n"
|
||||
]
|
||||
@ -469,7 +475,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": 11,
|
||||
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -477,7 +483,7 @@
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|█████████████████████████████████████████████████| 100/100 [01:09<00:00, 1.44it/s]\n"
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:03<00:00, 1.56it/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -487,7 +493,7 @@
|
||||
"\n",
|
||||
"model 1 response\n",
|
||||
"Number of scores: 100 of 100\n",
|
||||
"Average score: 74.04\n",
|
||||
"Average score: 74.09\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
@ -495,7 +501,7 @@
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|█████████████████████████████████████████████████| 100/100 [01:08<00:00, 1.46it/s]"
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00, 1.50it/s]"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -505,7 +511,7 @@
|
||||
"\n",
|
||||
"model 2 response\n",
|
||||
"Number of scores: 100 of 100\n",
|
||||
"Average score: 56.72\n",
|
||||
"Average score: 56.57\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
@ -523,7 +529,13 @@
|
||||
" scores = generate_model_scores(json_data, model, client)\n",
|
||||
" print(f\"\\n{model}\")\n",
|
||||
" print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
|
||||
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")"
|
||||
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")\n",
|
||||
"\n",
|
||||
" # Optionally save the scores\n",
|
||||
" from pathlib import Path\n",
|
||||
" save_path = (Path(\"scores\")/f\"gpt4-{model}.json\").replace(\" \", \"-\")\n",
|
||||
" with open(save_path, \"w\") as file:\n",
|
||||
" json.dump(scores, file)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -551,7 +563,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
"version": "3.11.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
673
ch07/03_model-evaluation/llm-instruction-eval-prometheus.ipynb
Normal file
@ -0,0 +1,673 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"# Evaluating Instruction Responses Locally Using the Prometheus Evaluator LLM"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a128651b-f326-4232-a994-42f38b7ed520",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This notebook uses an 7 billion parameter LLM that has been specifically developed for evaluating other LLMs; for more information, see the [Prometheus 2 paper](https://arxiv.org/abs/2405.01535)\n",
|
||||
"- We will use Prometheus 2 via the [prometheus-eval](https://github.com/prometheus-eval/prometheus-eval) Python package, which in turn is based on [vllm](https://github.com/vllm-project/vllm), which is an efficient LLM inference tool that runs locally\n",
|
||||
"- Specifically, in this notebook, we will use Prometheus 2 to evaluate responses of instruction finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"{\n",
|
||||
" \"instruction\": \"What is the atomic number of helium?\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
|
||||
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
|
||||
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
|
||||
"},\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"<div style=\"background-color: #ffdddd; border-left: 6px solid #f44336; padding: 10px;\">\n",
|
||||
" <strong>Note:</strong> The code in this notebook requires installing <a href=\"https://github.com/vllm-project/vllm\"><vllm>, which currently only supports Linux.\n",
|
||||
"</div>\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "2c10ef46-4dd5-4a20-a949-afc15a18498d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# pip install -r requirements-extra.txt\n",
|
||||
"# pip install vllm # only supports Linux"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"prometheus-eval version: 0.1.15\n",
|
||||
"tqdm version: 4.66.4\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from importlib.metadata import version\n",
|
||||
"\n",
|
||||
"pkgs = [\n",
|
||||
" \"prometheus-eval\",\n",
|
||||
" \"tqdm\", # Progress bar,\n",
|
||||
" \"vllm\"\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"for p in pkgs:\n",
|
||||
" print(f\"{p} version: {version(p)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Installing Ollama and Downloading Llama 3"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Ollama is an application to run LLMs efficiently\n",
|
||||
"- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
|
||||
"- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n",
|
||||
"- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now let's test if ollama is set up correctly\n",
|
||||
"- For this, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
|
||||
"- Next, on the command line, execute the following command to try out the 8 billion parameters Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n",
|
||||
"\n",
|
||||
"```bash\n",
|
||||
"# 8B model\n",
|
||||
"ollama run llama3\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"The output looks like as follows:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"$ ollama run llama3\n",
|
||||
"pulling manifest \n",
|
||||
"pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB \n",
|
||||
"pulling 4fa551d4f938... 100% ▕████████████████▏ 12 KB \n",
|
||||
"pulling 8ab4849b038c... 100% ▕████████████████▏ 254 B \n",
|
||||
"pulling 577073ffcc6c... 100% ▕████████████████▏ 110 B \n",
|
||||
"pulling 3f8eb4da87fa... 100% ▕████████████████▏ 485 B \n",
|
||||
"verifying sha256 digest \n",
|
||||
"writing manifest \n",
|
||||
"removing any unused layers \n",
|
||||
"success \n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"- Note that `llama3` refers to the instruction finetuned 8 billion Llama 3 model\n",
|
||||
"\n",
|
||||
"- Alternatively, you can also use the larger 70 billion parameters Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n",
|
||||
"\n",
|
||||
"- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
|
||||
"\n",
|
||||
"- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
">>> What do llamas eat?\n",
|
||||
"Llamas are ruminant animals, which means they have a four-chambered \n",
|
||||
"stomach and eat plants that are high in fiber. In the wild, llamas \n",
|
||||
"typically feed on:\n",
|
||||
"1. Grasses: They love to graze on various types of grasses, including tall \n",
|
||||
"grasses, wheat, oats, and barley.\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0b5addcb-fc7d-455d-bee9-6cc7a0d684c7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- You can end this session using the input `/bye`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dda155ee-cf36-44d3-b634-20ba8e1ca38a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Ollama's REST API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, an alternative way to interact with the model is via its REST API in Python via the following function\n",
|
||||
"- First, in your terminal, start a local ollama server via `ollama serve` (after executing the code in this notebook, you can later stop this session by simply closing the terminal)\n",
|
||||
"- Next, run the following code cell to query the model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llamas are ruminant animals, which means they have a four-chambered stomach and eat plants. Their diet typically consists of:\n",
|
||||
"\n",
|
||||
"1. Grasses: Llamas love to graze on grasses, including tall grasses, meadow grasses, and wheat.\n",
|
||||
"2. Hay: High-quality hay is a staple in an llama's diet. They enjoy timothy hay, alfalfa hay, and other types of hay.\n",
|
||||
"3. Grains: Whole grains like oats, barley, and corn are also part of their diet.\n",
|
||||
"4. Fruits and vegetables: Llamas will eat fruits like apples, carrots, and sweet potatoes as a treat or to supplement their diet.\n",
|
||||
"5. Minerals: They need access to loose minerals like salt, calcium, and phosphorus to stay healthy.\n",
|
||||
"\n",
|
||||
"In the wild, llamas might also eat:\n",
|
||||
"\n",
|
||||
"* Leaves from shrubs and trees\n",
|
||||
"* Bark (in some cases)\n",
|
||||
"* Seeds\n",
|
||||
"* Fungi\n",
|
||||
"\n",
|
||||
"Domesticated llamas usually have a more controlled diet, as their owners provide them with specific foods and supplements to ensure they receive the nutrients they need. A balanced diet for an llama typically includes 15-20% hay, 10-15% grains, and 5-10% fruits and vegetables.\n",
|
||||
"\n",
|
||||
"Remember, always consult with a veterinarian or experienced llama breeder to determine the best diet for your individual llama!\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\"):\n",
|
||||
" # Create the data payload as a dictionary\n",
|
||||
" data = {\n",
|
||||
" \"model\": model,\n",
|
||||
" \"seed\":123, # for deterministic responses\n",
|
||||
" \"temperature\":0, # for deterministic responses\n",
|
||||
" \"messages\": [\n",
|
||||
" {\"role\": \"user\", \"content\": prompt}\n",
|
||||
" ]\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
|
||||
" payload = json.dumps(data).encode(\"utf-8\")\n",
|
||||
"\n",
|
||||
" # Create a request object, setting the method to POST and adding necessary headers\n",
|
||||
" request = urllib.request.Request(url, data=payload, method=\"POST\")\n",
|
||||
" request.add_header(\"Content-Type\", \"application/json\")\n",
|
||||
"\n",
|
||||
" # Send the request and capture the response\n",
|
||||
" response_data = \"\"\n",
|
||||
" with urllib.request.urlopen(request) as response:\n",
|
||||
" # Read and decode the response\n",
|
||||
" while True:\n",
|
||||
" line = response.readline().decode(\"utf-8\")\n",
|
||||
" if not line:\n",
|
||||
" break\n",
|
||||
" response_json = json.loads(line)\n",
|
||||
" response_data += response_json[\"message\"][\"content\"]\n",
|
||||
"\n",
|
||||
" return response_data\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"result = query_model(\"What do Llamas eat?\")\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's try the API with a simple example to make sure it works as intended:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load JSON Entries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's get to the data evaluation part\n",
|
||||
"- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of entries: 100\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"\n",
|
||||
"json_file = \"eval-example-data.json\"\n",
|
||||
"\n",
|
||||
"with open(json_file, \"r\") as file:\n",
|
||||
" json_data = json.load(file)\n",
|
||||
" \n",
|
||||
"print(\"Number of entries:\", len(json_data))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
|
||||
" 'input': '',\n",
|
||||
" 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
|
||||
" 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
|
||||
" 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"json_data[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Below is a small utility function that formats the input for visualization purposes later:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def format_input(entry):\n",
|
||||
" instruction_text = (\n",
|
||||
" f\"Below is an instruction that describes a task. Write a response that \"\n",
|
||||
" f\"appropriately completes the request.\"\n",
|
||||
" f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
|
||||
" instruction_text + input_text\n",
|
||||
"\n",
|
||||
" return instruction_text + input_text"
|
||||
]
|
||||
},
|
||||
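The `format_input` helper above builds the Alpaca-style prompt used throughout the notebook. As a quick illustration (a sketch, not a cell from the committed notebook), applying it to the first test entry shown earlier produces:

```python
# Hypothetical usage of format_input with the first dataset entry from above.
sample = {
    "instruction": "Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.",
    "input": "",
}
print(format_input(sample))
# Below is an instruction that describes a task. Write a response that appropriately completes the request.
#
# ### Instruction:
# Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.
```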
{
|
||||
"cell_type": "markdown",
|
||||
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's try the ollama API to compare the model responses (we only evalyate the first 5 responses for a visual comparison):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "735cc089-d127-480a-b39d-0782581f0c41",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The hypotenuse of the triangle is 10 cm.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The hypotenuse of the triangle is 3 cm.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> To evaluate the model response, I'll compare it to the correct output.\n",
|
||||
"\n",
|
||||
"Correct output: The hypotenuse of the triangle is 10 cm.\n",
|
||||
"Model response: The hypotenuse of the triangle is 3 cm.\n",
|
||||
"\n",
|
||||
"The model response is incorrect, as the calculated value (3 cm) does not match the actual value (10 cm). Therefore, I would score this response a 0 out of 100.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> 1. Squirrel\n",
|
||||
"2. Eagle\n",
|
||||
"3. Tiger\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"1. Squirrel\n",
|
||||
"2. Tiger\n",
|
||||
"3. Eagle\n",
|
||||
"4. Cobra\n",
|
||||
"5. Tiger\n",
|
||||
"6. Cobra\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> To complete the request, I will provide a response that names three different animals that are active during the day.\n",
|
||||
"\n",
|
||||
"### Response:\n",
|
||||
"1. Squirrel\n",
|
||||
"2. Eagle\n",
|
||||
"3. Tiger\n",
|
||||
"\n",
|
||||
"Now, let's evaluate the model response based on the provided options. Here's how it scores:\n",
|
||||
"\n",
|
||||
"1. Squirrel (Match)\n",
|
||||
"2. Tiger (Match)\n",
|
||||
"3. Eagle (Match)\n",
|
||||
"\n",
|
||||
"The model response correctly identifies three animals that are active during the day: squirrel, tiger, and eagle.\n",
|
||||
"\n",
|
||||
"On a scale from 0 to 100, I would score this response as **80**. The model accurately completes the request and provides relevant information. However, it does not fully utilize all available options (4-6), which is why the score is not higher.\n",
|
||||
"\n",
|
||||
"Corrected output: 1. Squirrel\n",
|
||||
"2. Eagle\n",
|
||||
"3. Tiger\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> I must ascertain what is incorrect.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"What is incorrect?\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The task is to rewrite a sentence in a more formal way.\n",
|
||||
"\n",
|
||||
"### Original Sentence:\n",
|
||||
"\"I need to find out what's wrong.\"\n",
|
||||
"\n",
|
||||
"### Formal Rewrite:\n",
|
||||
"\"I must ascertain what is incorrect.\"\n",
|
||||
"\n",
|
||||
"Score: **90**\n",
|
||||
"\n",
|
||||
"The model response accurately captures the original sentence's meaning while adopting a more formal tone. The words \"ascertain\" and \"incorrect\" effectively convey a sense of professionalism and precision, making it suitable for a formal setting.\n",
|
||||
"\n",
|
||||
"Note: I scored the model response 90 out of 100 because it successfully transformed the informal sentence into a more formal one, but there is room for improvement in terms of style and nuance.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> A scoring question!\n",
|
||||
"\n",
|
||||
"I'd rate the model response as **98** out of 100.\n",
|
||||
"\n",
|
||||
"Here's why:\n",
|
||||
"\n",
|
||||
"* The model correctly identifies \"Wow\" as the interjection in the sentence.\n",
|
||||
"* The response is concise and directly answers the instruction.\n",
|
||||
"* There are no grammatical errors, typos, or inaccuracies in the response.\n",
|
||||
"\n",
|
||||
"The only reason I wouldn't give it a perfect score (100) is that it's possible for an even more precise or detailed response to be given, such as \"The sentence contains a single interjection: 'Wow', which is used to express surprise and enthusiasm.\" However, the model's response is still very good, and 98 out of 100 is a strong score.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The type of sentence is interrogative.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The type of sentence is exclamatory.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> A nice simple task!\n",
|
||||
"\n",
|
||||
"To score my response, I'll compare it with the correct output.\n",
|
||||
"\n",
|
||||
"Correct output: The type of sentence is interrogative.\n",
|
||||
"My response: The type of sentence is exclamatory.\n",
|
||||
"\n",
|
||||
"The correct answer is an interrogative sentence (asking a question), while my response suggests it's an exclamatory sentence (expressing strong emotions). Oops!\n",
|
||||
"\n",
|
||||
"So, I'd score my response as follows:\n",
|
||||
"\n",
|
||||
"* Correctness: 0/10\n",
|
||||
"* Relevance: 0/10 (my response doesn't even match the input)\n",
|
||||
"* Overall quality: 0/100\n",
|
||||
"\n",
|
||||
"The lowest possible score is 0. Unfortunately, that's where my response falls. Better luck next time!\n",
|
||||
"\n",
|
||||
"-------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for entry in json_data[:5]:\n",
|
||||
" prompt = (f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"score the model response `{entry['model 1 response']}`\"\n",
|
||||
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
|
||||
" )\n",
|
||||
" print(\"\\nDataset response:\")\n",
|
||||
" print(\">>\", entry['output'])\n",
|
||||
" print(\"\\nModel response:\")\n",
|
||||
" print(\">>\", entry[\"model 1 response\"])\n",
|
||||
" print(\"\\nScore:\")\n",
|
||||
" print(\">>\", query_model(prompt))\n",
|
||||
" print(\"\\n-------------------------\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from tqdm import tqdm\n",
|
||||
"\n",
|
||||
"def generate_model_scores(json_data, json_key):\n",
|
||||
" scores = []\n",
|
||||
" for entry in tqdm(json_data, desc=\"Scoring entries\"):\n",
|
||||
" prompt = (\n",
|
||||
" f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"score the model response `{entry[json_key]}`\"\n",
|
||||
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
|
||||
" f\"Respond with the integer number only.\"\n",
|
||||
" )\n",
|
||||
" score = query_model(prompt)\n",
|
||||
" try:\n",
|
||||
" scores.append(int(score))\n",
|
||||
" except:\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" return scores"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b071ce84-1866-427f-a272-b46700f364b2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 min per model on a M3 MacBook Air laptop)\n",
|
||||
"- Note that ollama is not fully deterministic (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00, 1.50it/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"model 1 response\n",
|
||||
"Number of scores: 100 of 100\n",
|
||||
"Average score: 78.02\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00, 1.41it/s]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"model 2 response\n",
|
||||
"Number of scores: 99 of 100\n",
|
||||
"Average score: 66.56\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for model in (\"model 1 response\", \"model 2 response\"):\n",
|
||||
"\n",
|
||||
" scores = generate_model_scores(json_data, model)\n",
|
||||
" print(f\"\\n{model}\")\n",
|
||||
" print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
|
||||
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Based on the evaluation above, we can say that the 1st model is better than the 2nd model"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
135
ch07/03_model-evaluation/scores/correlation-analysis.ipynb
Normal file
File diff suppressed because one or more lines are too long
@ -0,0 +1 @@
[0, 50, 20, 100, 0, 100, 0, 100, 100, 100, 55, 0, 100, 100, 100, 100, 100, 0, 98, 100, 100, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 0, 100, 100, 85, 100, 0, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 95, 20, 50, 85, 100, 100, 100, 100, 55, 100, 100, 100, 0, 100, 98, 100, 100, 100, 0, 85, 100, 100, 98, 100, 100, 100, 0, 100, 100, 100, 100, 0, 100, 0, 100, 100, 0, 0, 100, 50, 100, 100, 10, 100, 100, 100, 100, 0, 100, 100, 25, 100, 30]
@ -0,0 +1 @@
[0, 100, 0, 100, 0, 100, 0, 100, 0, 0, 50, 0, 100, 100, 100, 100, 100, 100, 100, 95, 0, 50, 100, 100, 0, 0, 100, 0, 0, 100, 0, 0, 100, 0, 67, 0, 0, 0, 100, 100, 95, 100, 100, 100, 0, 0, 0, 0, 100, 100, 100, 0, 55, 100, 0, 100, 65, 100, 100, 0, 100, 100, 100, 0, 100, 0, 85, 100, 100, 85, 0, 75, 100, 0, 0, 100, 100, 100, 0, 100, 0, 50, 100, 100, 0, 100, 0, 0, 100, 85, 100, 0, 100, 100, 0, 100, 100, 0, 0, 0]
@ -0,0 +1 @@
[20, 92, 85, 90, 20, 90, 22, 97, 60, 96, 20, 20, 98, 95, 90, 98, 95, 20, 98, 98, 92, 20, 96, 96, 100, 98, 98, 95, 20, 95, 98, 20, 85, 95, 80, 97, 40, 21, 100, 85, 95, 98, 92, 98, 69, 98, 80, 60, 60, 20, 80, 68, 80, 96, 96, 68, 80, 95, 80, 20, 95, 98, 80, 98, 94, 20, 40, 98, 100, 85, 98, 90, 95, 85, 95, 80, 98, 98, 25, 98, 40, 92, 95, 82, 87, 98, 80, 90, 95, 4, 90, 90, 80, 98, 20, 98, 98, 40, 92, 98]
@ -0,0 +1 @@
[76, 85, 67, 90, 20, 98, 22, 96, 40, 80, 40, 20, 90, 98, 80, 92, 98, 98, 95, 99, 55, 99, 80, 90, 20, 4, 98, 4, 40, 95, 14, 44, 95, 44, 80, 4, 4, 40, 95, 80, 98, 95, 92, 98, 68, 20, 20, 60, 95, 90, 98, 0, 20, 80, 20, 80, 92, 98, 98, 20, 95, 100, 95, 85, 98, 4, 40, 98, 98, 65, 20, 76, 100, 67, 44, 92, 75, 97, 27, 98, 20, 60, 90, 96, 67, 98, 80, 10, 80, 98, 100, 40, 92, 98, 20, 98, 98, 20, 20]
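The correlation-analysis notebook added by this commit is suppressed in the diff above, but the four per-entry score lists it works with are shown. A rough sketch of the analysis follows; the file names used here are assumptions and should be adjusted to whatever the evaluation notebooks actually saved.

```python
# Hedged sketch: correlate the scores two judge models assigned to the same
# model responses. File names below are hypothetical.
import json

import numpy as np
from scipy import stats

with open("gpt4-model-1-response.json") as f:
    gpt4_scores = json.load(f)
with open("llama3-8b-model-1-response.json") as f:
    llama3_scores = json.load(f)

# The lists can differ in length when a judge failed to return a usable number;
# a careful analysis would align entries by index, here we simply truncate.
n = min(len(gpt4_scores), len(llama3_scores))
a, b = np.array(gpt4_scores[:n]), np.array(llama3_scores[:n])

pearson_r, _ = stats.pearsonr(a, b)
spearman_rho, _ = stats.spearmanr(a, b)
print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```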