Mirror of https://github.com/rasbt/LLMs-from-scratch.git
Synced 2025-08-30 03:20:51 +00:00

Use deterministic ollama settings (#250)

* deterministic ollama settings
* add missing file
This commit is contained in:
parent 99058c3d07
commit 65d68097ee
@@ -2301,7 +2301,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 1,
 "id": "026e8570-071e-48a2-aa38-64d7be35f288",
 "metadata": {
 "colab": {
@@ -2340,7 +2340,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 2,
 "id": "723c9b00-e3cd-4092-83c3-6e48b5cf65b0",
 "metadata": {
 "id": "723c9b00-e3cd-4092-83c3-6e48b5cf65b0"
@@ -2384,7 +2384,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 3,
 "id": "e3ae0e10-2b28-42ce-8ea2-d9366a58088f",
 "metadata": {
 "id": "e3ae0e10-2b28-42ce-8ea2-d9366a58088f",
@@ -2395,25 +2395,21 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Llamas are ruminant animals, which means they have a four-chambered stomach that allows them to digest plant-based foods. Their diet typically consists of:\n",
+"Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:\n",
 "\n",
-"1. Grasses: Llamas love to graze on grasses, including tall grasses, short grasses, and even weeds.\n",
-"2. Hay: Hay is a common staple in a llama's diet. They enjoy high-quality hay like timothy hay, alfalfa hay, or oat hay.\n",
-"3. Fruits and vegetables: Llamas will eat fruits and veggies as treats or as part of their regular diet. Favorites include apples, carrots, sweet potatoes, and leafy greens like kale or spinach.\n",
-"4. Grains: Whole grains like oats, barley, and corn can be fed to llamas as a supplement.\n",
-"5. Minerals: Llamas need access to minerals like calcium, phosphorus, and salt to stay healthy.\n",
+"1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.\n",
+"2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.\n",
+"3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.\n",
+"4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.\n",
+"5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.\n",
 "\n",
-"In the wild, llamas might eat:\n",
+"In the wild, llamas might also eat:\n",
 "\n",
-"* Leaves from shrubs and trees\n",
-"* Bark\n",
-"* Twigs\n",
-"* Fruits\n",
-"* Roots\n",
+"1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.\n",
+"2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or cottonwood.\n",
+"3. Mosses and lichens: These non-vascular plants can be a tasty snack for llamas.\n",
 "\n",
-"Domesticated llamas, on the other hand, are usually fed a diet of hay, grains, and fruits/veggies. Their nutritional needs can be met with a balanced feed that includes essential vitamins and minerals.\n",
-"\n",
-"Keep in mind that llamas have specific dietary requirements, and their food should be tailored to their individual needs. It's always best to consult with a veterinarian or experienced llama breeder to determine the best diet for your llama.\n"
+"In captivity, llama owners typically provide a balanced diet that includes a mix of hay, grains, and fruits/vegetables. It's essential to consult with a veterinarian or experienced llama breeder to determine the best feeding plan for your llama.\n"
 ]
 }
 ],
@@ -2424,13 +2420,17 @@
 " # Create the data payload as a dictionary\n",
 " data = {\n",
 " \"model\": model,\n",
-" \"seed\": 123, # for deterministic responses\n",
-" \"temperature\": 0, # for deterministic responses\n",
 " \"messages\": [\n",
 " {\"role\": \"user\", \"content\": prompt}\n",
-" ]\n",
+" ],\n",
+" \"options\": { # Settings below are required for deterministic responses\n",
+" \"seed\": 123,\n",
+" \"temperature\": 0,\n",
+" \"num_ctx\": 2048\n",
+" }\n",
 " }\n",
 "\n",
+"\n",
 " # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
 " payload = json.dumps(data).encode(\"utf-8\")\n",
 "\n",
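For context, the payload change above sits inside the notebook's `query_model` helper. A self-contained sketch of what that function looks like after this commit is shown below; it is an illustration rather than the verbatim notebook code, and it assumes ollama's line-delimited JSON streaming on the default endpoint `http://localhost:11434/api/chat`:

```python
import json
import urllib.request


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Build the chat payload; per this commit, the sampling settings that make
    # responses (mostly) deterministic must live under "options", not at the
    # top level of the payload.
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {  # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048,
        },
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # ollama streams one JSON object per line; concatenate the message chunks
    response_data = ""
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]
    return response_data
```

Calling `query_model("What do llamas eat?")` requires a running local ollama server; only the payload construction is exercised here.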
@@ -2469,7 +2469,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 4,
 "id": "86b839d4-064d-4178-b2d7-01691b452e5e",
 "metadata": {
 "id": "86b839d4-064d-4178-b2d7-01691b452e5e",
@@ -2488,17 +2488,17 @@
 ">> The car is as fast as a bullet.\n",
 "\n",
 "Score:\n",
-">> A scoring task!\n",
+">> I'd rate the model response \"The car is as fast as a bullet.\" an 85 out of 100.\n",
 "\n",
-"To evaluate the model response \"The car is as fast as a bullet.\", I'll consider how well it follows the instruction and uses a simile that's coherent, natural-sounding, and effective in conveying the idea of speed.\n",
+"Here's why:\n",
 "\n",
-"Here are some factors to consider:\n",
-"* The response uses a simile correctly, comparing the speed of the car to something else (in this case, a bullet).\n",
-"* The comparison is relevant and makes sense, as bullets are known for their high velocity.\n",
-"* The phrase \"as fast as\" is used correctly to introduce the simile.\n",
 "\n",
+"1. **Follows instruction**: Yes, the model uses a simile to rewrite the sentence.\n",
+"2. **Coherence and naturalness**: The comparison between the car's speed and a bullet is common and easy to understand. It's a good choice for a simile that conveys the idea of rapid movement.\n",
+"3. **Effectiveness in conveying idea of speed**: A bullet is known for its high velocity, which makes it an excellent choice to describe a fast-moving car.\n",
+"The only reason I wouldn't give it a perfect score is that some people might find the comparison slightly less vivid or evocative than others. For example, comparing something to lightning (as in the original response) can be more dramatic and attention-grabbing. However, \"as fast as a bullet\" is still a strong and effective simile that effectively conveys the idea of the car's speed.\n",
 "\n",
-"Considering these factors, I'd score the model response \"The car is as fast as a bullet.\" around 85 out of 100. The simile is well-chosen, coherent, and effectively conveys the idea of speed. Well done, model!\n",
+"Overall, I think the model did a great job!\n",
 "\n",
 "-------------------------\n",
 "\n",
@@ -2509,15 +2509,15 @@
 ">> The type of cloud associated with thunderstorms is a cumulus cloud.\n",
 "\n",
 "Score:\n",
-">> A scoring task!\n",
+">> I'd score this model response as 40 out of 100.\n",
 "\n",
-"I'll evaluate the model's response based on its accuracy and relevance to the original instruction.\n",
+"Here's why:\n",
 "\n",
-"**Accuracy:** The model's response is partially correct. Cumulus clouds are indeed associated with fair weather and not typically linked to thunderstorms. The correct answer, cumulonimbus, is a type of cloud that is closely tied to thunderstorm formation.\n",
+"* The model correctly identifies that thunderstorms are related to clouds (correctly identifying the type of phenomenon).\n",
+"* However, it incorrectly specifies the type of cloud associated with thunderstorms. Cumulus clouds are not typically associated with thunderstorms; cumulonimbus clouds are.\n",
+"* The response lacks precision and accuracy in its description.\n",
 "\n",
-"**Relevance:** The model's response is somewhat relevant, as it mentions clouds in the context of thunderstorms. However, the specific type of cloud mentioned (cumulus) is not directly related to thunderstorms.\n",
-"\n",
-"Considering these factors, I would score the model response a **40 out of 100**. While the response attempts to address the instruction, it provides an incorrect answer and lacks relevance to the original question.\n",
+"Overall, while the model attempts to address the instruction, it provides an incorrect answer, which is a significant error.\n",
 "\n",
 "-------------------------\n",
 "\n",
@@ -2528,19 +2528,13 @@
 ">> The author of 'Pride and Prejudice' is Jane Austen.\n",
 "\n",
 "Score:\n",
-">> A simple one!\n",
+">> I'd rate my own response as 95 out of 100. Here's why:\n",
 "\n",
-"My model response: \"The author of 'Pride and Prejudice' is Jane Austen.\"\n",
+"* The response accurately answers the question by naming the author of 'Pride and Prejudice' as Jane Austen.\n",
+"* The response is concise and clear, making it easy to understand.\n",
+"* There are no grammatical errors or ambiguities that could lead to confusion.\n",
 "\n",
-"Score: **99**\n",
-"\n",
-"Reasoning:\n",
-"\n",
-"* The response directly answers the question, providing the correct name of the author.\n",
-"* The sentence structure is clear and easy to understand.\n",
-"* There's no room for misinterpretation or ambiguity.\n",
-"\n",
-"Overall, a perfect score!\n",
+"The only reason I wouldn't give myself a perfect score is that the response is slightly redundant - it's not necessary to rephrase the question in the answer. A more concise response would be simply \"Jane Austen.\"\n",
 "\n",
 "-------------------------\n"
 ]
@@ -2577,7 +2571,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 5,
 "id": "9d7bca69-97c4-47a5-9aa0-32f116fa37eb",
 "metadata": {
 "id": "9d7bca69-97c4-47a5-9aa0-32f116fa37eb",
@@ -2588,7 +2582,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"Scoring entries: 100%|████████████████████████| 110/110 [01:10<00:00, 1.56it/s]"
+"Scoring entries: 100%|████████████████████████| 110/110 [01:08<00:00, 1.60it/s]"
 ]
 },
 {
@@ -2596,7 +2590,7 @@
 "output_type": "stream",
 "text": [
 "Number of scores: 110 of 110\n",
-"Average score: 54.16\n",
+"Average score: 50.32\n",
 "\n"
 ]
 },
@@ -2642,7 +2636,7 @@
 },
 "source": [
 "- Our model achieves an average score of above 50, which we can use as a reference point to compare the model to other models or to try out other training settings that may improve the model\n",
-"- Note that ollama is not fully deterministic (as of this writing), so the numbers you are getting might slightly differ from the ones shown above"
+"- Note that ollama is not fully deterministic across operating systems (as of this writing), so the numbers you are getting might slightly differ from the ones shown above"
 ]
 },
 {
@@ -2733,7 +2727,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.11"
+"version": "3.11.4"
 }
 },
 "nbformat": 4,
@@ -15,11 +15,14 @@ def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
 # Create the data payload as a dictionary
 data = {
 "model": model,
-"seed": 123, # for deterministic responses
-"temperature": 0, # for deterministic responses
 "messages": [
 {"role": "user", "content": prompt}
-]
+],
+"options": { # Settings below are required for deterministic responses
+"seed": 123,
+"temperature": 0,
+"num_ctx": 2048
+}
 }
 
 # Convert the dictionary to a JSON formatted string and encode it to bytes
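The substance of this hunk is where the sampling settings live: the old code placed `seed` and `temperature` at the top level of the request body, where (as the commit title suggests) they do not reliably take effect; the fix nests them, plus a fixed `num_ctx` context length, inside the `"options"` object that ollama's chat API reads model parameters from. A minimal before/after sketch of the two payload shapes:

```python
import json

prompt = "What do llamas eat?"

# Old payload: seed/temperature sit at the top level of the request body,
# outside the "options" object that ollama reads model parameters from
old_data = {
    "model": "llama3",
    "seed": 123,
    "temperature": 0,
    "messages": [{"role": "user", "content": prompt}],
}

# New payload: sampling settings nested under "options", with a fixed
# context window so runs are reproducible
new_data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": prompt}],
    "options": {"seed": 123, "temperature": 0, "num_ctx": 2048},
}

print(json.dumps(new_data, indent=2))
```

Note that even with these settings, the markdown cells in this commit caution that ollama is not fully deterministic across operating systems.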
@@ -62,7 +62,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"tqdm version: 4.66.2\n"
+"tqdm version: 4.66.4\n"
 ]
 }
 ],
@@ -198,19 +198,19 @@
 "text": [
 "Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:\n",
 "\n",
-"1. Grasses: Llamas love to graze on various types of grasses, including tall grasses and short grasses.\n",
-"2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in many llama diets.\n",
-"3. Grains: Oats, corn, and barley are common grains used in llama feed.\n",
-"4. Fruits and vegetables: Fresh fruits and vegetables can be offered as treats or added to their regular diet. Favorites include apples, carrots, and sweet potatoes.\n",
-"5. Minerals: Llamas require access to mineral supplements, such as salt licks or loose minerals, to ensure they get the necessary nutrients.\n",
+"1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.\n",
+"2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.\n",
+"3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.\n",
+"4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.\n",
+"5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.\n",
 "\n",
-"In the wild, llamas will also eat:\n",
+"In the wild, llamas might also eat:\n",
 "\n",
-"1. Leaves: They'll munch on leaves from shrubs and trees, like willow or cedar.\n",
-"2. Bark: In some cases, llamas might eat the bark of certain trees, like aspen or cottonwood.\n",
-"3. Mosses: Llamas may consume various types of mosses that grow in their environment.\n",
+"1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.\n",
+"2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or cottonwood.\n",
+"3. Mosses and lichens: These non-vascular plants can be a tasty snack for llamas.\n",
 "\n",
-"It's essential to provide a balanced diet for llamas, as they have specific nutritional needs. A good quality commercial llama feed or a veterinarian-recommended feeding plan can help ensure your llama stays healthy and happy!\n"
+"In captivity, llama owners typically provide a balanced diet that includes a mix of hay, grains, and fruits/vegetables. It's essential to consult with a veterinarian or experienced llama breeder to determine the best feeding plan for your llama.\n"
 ]
 }
 ],
@@ -223,11 +223,17 @@
 " # Create the data payload as a dictionary\n",
 " data = {\n",
 " \"model\": model,\n",
-" \"seed\": 123, # for deterministic responses\n",
-" \"temperature\": 0, # for deterministic responses\n",
 " \"messages\": [\n",
-" {\"role\": \"user\", \"content\": prompt}\n",
-" ]\n",
+" {\n",
+" \"role\": \"user\",\n",
+" \"content\": prompt\n",
+" }\n",
+" ],\n",
+" \"options\": { # Settings below are required for deterministic responses\n",
+" \"seed\": 123,\n",
+" \"temperature\": 0,\n",
+" \"num_ctx\": 2048\n",
+" }\n",
 " }\n",
 "\n",
 " # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
@@ -383,22 +389,9 @@
 "The hypotenuse of the triangle is 3 cm.\n",
 "\n",
 "Score:\n",
-">> I'd be happy to help!\n",
+">> I'd score this response as 0 out of 100.\n",
 "\n",
-"To evaluate the model response, I'll compare it with the correct output. Here's my analysis:\n",
-"\n",
-"* Correct output: The hypotenuse of the triangle is 10 cm.\n",
-"* Model response: The hypotenuse of the triangle is 3 cm.\n",
-"\n",
-"The model response has a significant error. The correct value for the hypotenuse is 10 cm, but the model response suggests it's only 3 cm. This indicates a lack of understanding or application of mathematical concepts in this specific problem.\n",
-"\n",
-"Based on this analysis, I'd score the model response as follows:\n",
-"\n",
-"* Accuracy: 0/100 (The model response has a significant error and doesn't match the correct solution.)\n",
-"* Understanding: 20/100 (The model response shows some misunderstanding of the concept or calculation.)\n",
-"* Overall score: 10/100\n",
-"\n",
-"This score indicates that the model response is not accurate and demonstrates limited understanding of the problem.\n",
+"The correct answer is \"The hypotenuse of the triangle is 10 cm.\", not \"3 cm.\". The model failed to accurately calculate the length of the hypotenuse, which is a fundamental concept in geometry and trigonometry.\n",
 "\n",
 "-------------------------\n",
 "\n",
||||
@ -417,21 +410,16 @@
|
||||
"6. Cobra\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> To evaluate the model's response, I'll compare it to the expected output.\n",
|
||||
">> I'd rate this model response as 60 out of 100.\n",
|
||||
"\n",
|
||||
"Expected output: 1. Squirrel, 2. Eagle, 3. Tiger\n",
|
||||
"Model's response: 1. Squirrel, 2. Tiger, 3. Eagle\n",
|
||||
"Here's why:\n",
|
||||
"\n",
|
||||
"The model got two out of three animals correct (Squirrel and Tiger), which means it scored 66.67% on this task.\n",
|
||||
"* The model correctly identifies two animals that are active during the day: Squirrel and Eagle.\n",
|
||||
"* However, it incorrectly includes Tiger twice, which is not a different animal from the original list.\n",
|
||||
"* Cobra is also an incorrect answer, as it is typically nocturnal or crepuscular (active at twilight).\n",
|
||||
"* The response does not meet the instruction to provide three different animals that are active during the day.\n",
|
||||
"\n",
|
||||
"To score the model's response on a scale from 0 to 100, I'll use the following formula:\n",
|
||||
"\n",
|
||||
"Score = (Number of correct answers / Total number of answers) * 100\n",
|
||||
"\n",
|
||||
"In this case:\n",
|
||||
"Score = (2/3) * 100 = 66.67%\n",
|
||||
"\n",
|
||||
"So, the model's response scores 66.67% on a scale from 0 to 100.\n",
|
||||
"To achieve a higher score, the model should have provided three unique and correct answers that fit the instruction.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
@@ -443,23 +431,17 @@
 "What is incorrect?\n",
 "\n",
 "Score:\n",
-">> A clever task!\n",
+">> A clever test!\n",
 "\n",
-"Here's my response:\n",
+"Here's my attempt at rewriting the sentence in a more formal way:\n",
 "\n",
-"**Rewritten sentence:** I must determine what is amiss.\n",
+"\"I require an identification of the issue.\"\n",
 "\n",
-"And now, let me evaluate the model response `What is incorrect?`\n",
+"Now, let's evaluate the model response \"What is incorrect?\" against the correct output \"I must ascertain what is incorrect.\".\n",
 "\n",
-"I would score it as follows:\n",
+"To me, this seems like a completely different question being asked. The original instruction was to rewrite the sentence in a more formal way, and the model response doesn't even attempt to do that. It's asking a new question altogether!\n",
 "\n",
-"* Grammar: 80 (while the sentence is grammatically correct, it's not as polished as my rewritten sentence)\n",
-"* Clarity: 60 (the original instruction asked for a more formal way of expressing the thought, and while \"What is incorrect?\" gets the point across, it's not as elegant as my response)\n",
-"* Formality: 40 (my response `I must determine what is amiss` has a more formal tone than the model response)\n",
-"\n",
-"Overall score: **60**\n",
-"\n",
-"So, I'd say the model response scores a decent 60 out of 100. Not bad!\n",
+"So, I'd score this response a 0 out of 100.\n",
 "\n",
 "-------------------------\n",
 "\n",
@@ -471,17 +453,16 @@
 "The interjection in the sentence is 'Wow'.\n",
 "\n",
 "Score:\n",
-">> I'd score this response as 90 out of 100.\n",
+">> I'd score this model response as 100.\n",
 "\n",
 "Here's why:\n",
 "\n",
-"1. The model correctly identifies the interjection in the sentence, which is indeed \"Wow\".\n",
-"2. The response is concise and directly answers the instruction.\n",
-"3. There are no grammatical errors or typos in the response.\n",
+"1. The instruction asks to identify the interjection in the sentence.\n",
+"2. The input sentence is provided: \"Wow, that was an amazing trick!\"\n",
+"3. The model correctly identifies the interjection as \"Wow\", which is a common English interjection used to express surprise or excitement.\n",
+"4. The response accurately answers the question and provides the correct information.\n",
 "\n",
-"The only thing that keeps me from giving it a perfect score of 100 is that the response could be slightly more explicit or detailed. For example, the model could have explained why \"Wow\" is considered an interjection (e.g., because it expresses strong emotions) to provide further context and clarity.\n",
-"\n",
-"However, overall, this is a well-crafted and accurate response that effectively completes the request!\n",
+"Overall, the model's response perfectly completes the request, making it a 100% accurate answer!\n",
 "\n",
 "-------------------------\n",
 "\n",
@@ -493,23 +474,14 @@
 "The type of sentence is exclamatory.\n",
 "\n",
 "Score:\n",
-">> A language evaluation task!\n",
+">> I'd rate this model response as 20 out of 100.\n",
 "\n",
-"To evaluate the model's response, I'll compare it to the correct output. Here's how:\n",
+"Here's why:\n",
 "\n",
-"1. Correct output: The type of sentence is interrogative.\n",
-"2. Model's response: The type of sentence is exclamatory.\n",
+"* The input sentence \"Did you finish the report?\" is indeed an interrogative sentence, which asks a question.\n",
+"* The model response says it's exclamatory, which is incorrect. Exclamatory sentences are typically marked by an exclamation mark (!) and express strong emotions or emphasis, whereas this sentence is simply asking a question.\n",
 "\n",
-"The two responses differ in their identification of the sentence type. The correct answer is \"interrogative\" (a question), while the model's response incorrectly says it's \"exclamatory\" (an exclamation).\n",
-"\n",
-"Now, let's score the model's response:\n",
-"\n",
-"* Correctness: 0/100 (the response is entirely incorrect)\n",
-"* Similarity: 20/100 (the word \"sentence\" is correct, but the type identification is wrong)\n",
-"\n",
-"Total score: 20/100\n",
-"\n",
-"So, the model's response scores 20 out of 100. This indicates that it has a significant mistake in identifying the sentence type, which means its performance is quite poor on this specific task.\n",
+"The correct output \"The type of sentence is interrogative.\" is the best possible score (100), while the model response is significantly off the mark, hence the low score.\n",
 "\n",
 "-------------------------\n"
 ]
@@ -574,7 +546,7 @@
 "metadata": {},
 "source": [
 "- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on an M3 MacBook Air laptop)\n",
-"- Note that ollama is not fully deterministic (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"
+"- Note that ollama is not fully deterministic across operating systems (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"
 ]
 },
 {
@@ -587,7 +559,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"Scoring entries: 100%|████████████████████████| 100/100 [01:37<00:00, 1.02it/s]\n"
+"Scoring entries: 100%|████████████████████████| 100/100 [01:02<00:00, 1.59it/s]\n"
 ]
 },
 {
@@ -597,7 +569,7 @@
 "\n",
 "model 1 response\n",
 "Number of scores: 100 of 100\n",
-"Average score: 77.05\n",
+"Average score: 78.48\n",
 "\n"
 ]
 },
@@ -605,7 +577,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"Scoring entries: 100%|████████████████████████| 100/100 [01:16<00:00, 1.31it/s]"
+"Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00, 1.42it/s]"
 ]
 },
 {
@@ -615,7 +587,7 @@
 "\n",
 "model 2 response\n",
 "Number of scores: 99 of 100\n",
-"Average score: 67.37\n",
+"Average score: 64.98\n",
 "\n"
 ]
 },
@@ -668,7 +640,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.11"
+"version": "3.11.4"
 }
 },
 "nbformat": 4,