{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "NPUGFpKP1_BH"
},
"source": [
"# Demonstrating the `AgentEval` framework using the task of solving math problems as an example\n",
"\n",
"This notebook aims to demonstrate how to `AgentEval` implemented through [AutoGen](https://github.com/microsoft/autogen) works, where we use a math problem-solving task as an example. \n",
"`AgentEval` consists of two key components:\n",
"\n",
"- `CriticAgent`: This is an LLM-based agent that generates a list criteria $(c_1, \\dots, c_n)$ to help to evaluate a utility given task.\n",
"\n",
"- `QuantifierAgent`: This agent quantifies the performance of any sample task based on the criteria designed by the `CriticAgent` in the following way: $(c_1=a_1, \\dots, c_n=a_n)$\n",
"\n",
"\n",
"\n",
"For more detailed explanations, please refer to the accompanying [blog post](https://https://microsoft.github.io/autogen/blog/2023/11/11/AgentEval)\n",
"\n",
"## Requirements\n",
"\n",
"AutoGen requires `Python>=3.8`. To run this notebook example, please install pyautogen, Docker, and OpenAI:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.317406Z",
"iopub.status.busy": "2023-02-13T23:40:52.316561Z",
"iopub.status.idle": "2023-02-13T23:40:52.321193Z",
"shell.execute_reply": "2023-02-13T23:40:52.320628Z"
},
"id": "68lTZZyJ1_BI",
"outputId": "15a55fab-e13a-4654-b8cb-ae117478d6d8"
},
"outputs": [],
"source": [
"%pip install pyautogen~=0.2.0b5 docker\n",
"%pip install scipy\n",
"%pip install matplotlib"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HxgqKJrd1_BJ"
},
"source": [
"## Set your API Endpoint\n",
"\n",
"* The [`config_list_openai_aoai`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_openai_aoai) function tries to create a list of configurations using Azure OpenAI endpoints and OpenAI endpoints. It assumes the api keys and api bases are stored in the corresponding environment variables or local txt files:\n",
" - OpenAI API key: os.environ[\"OPENAI_API_KEY\"] or `openai_api_key_file=\"key_openai.txt\"`.\n",
" - Azure OpenAI API key: os.environ[\"AZURE_OPENAI_API_KEY\"] or `aoai_api_key_file=\"key_aoai.txt\"`. Multiple keys can be stored, one per line.\n",
" - Azure OpenAI API base: os.environ[\"AZURE_OPENAI_API_BASE\"] or `aoai_api_base_file=\"base_aoai.txt\"`. Multiple bases can be stored, one per line.\n",
"* The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a json file. It first looks for an environment variable with a specified name. The value of the environment variable needs to be a valid json string. If that variable is not found, it looks for a json file with the same name. It filters the configs by filter_dict.\n",
"\n",
"You can set the value of config_list in any way you prefer. Please refer to this [notebook](https://github.com/microsoft/autogen/blob/main/notebook/oai_openai_utils.ipynb) for full code examples of the different methods.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YRycFEDJ1_BJ"
},
"outputs": [],
"source": [
"import autogen\n",
"\n",
"config_list = autogen.config_list_from_json(\n",
" \"OAI_CONFIG_LIST\",\n",
" filter_dict={\n",
" \"model\": [\"gpt-4\"],\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fBZ-XFXy1_BJ"
},
"source": [
"\n",
"## Construct `CriticAgent`\n",
"\n",
"We construct the planning agent named `critic` and a user proxy agent for the critic named `critic_user`. We specify `human_input_mode` as \"NEVER\" in the user proxy agent, ensuring that it will never ask for human feedback. Additionally, we define the `ask_critic` function to send a message to the critic and retrieve the criteria from the critic.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "9XAeyjd11_BK"
},
"outputs": [],
"source": [
"\n",
"critic = autogen.AssistantAgent(\n",
" name = \"critic\",\n",
" llm_config = {\"config_list\": config_list},\n",
" system_message = \"\"\"You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be dinstinguishable, quantifieable and not redundant.\n",
" Convert the evaluation criteria into a dictionary where the keys are the criteria.\n",
" The value of each key is a dictionary as follows {\"description\": criteria description , \"accepted_values\": possible accepted inputs for this key}\n",
" Make sure the keys are criteria for assessing the given task. \"accepted_values\" include the acceptable inputs for each key that are fine-grained and preferrably mlti-graded levels. \"description\" includes the criterion description.\n",
" Return the dictionary.\"\"\"\n",
")\n",
"\n",
"critic_user = autogen.UserProxyAgent(\n",
" name = \"critic_user\",\n",
" max_consecutive_auto_reply = 0, # terminate without auto-reply\n",
" human_input_mode = \"NEVER\",\n",
")\n",
"\n",
"def ask_critic(message):\n",
" \"\"\"\n",
" Initiate a chat with the critic user and return the last message received from the planner.\n",
"\n",
" Args:\n",
" - message (str): The message to be sent to the critic user.\n",
"\n",
" Returns:\n",
" - str: The content of the last message received.\n",
" \"\"\"\n",
" critic_user.initiate_chat(critic, message=message)\n",
" # return the last received from the planner\n",
" return critic_user.messagelast_message()[\"content\"]\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6vPTtNkhk2V1"
},
"source": [
"# Run the Critic\n",
"\n",
"To run the critic, we need a couple of math problem examples. One of them failed to solve the problem successfully, given in `agenteval-in-out/response_failed.txt`, and the other one was solved successfully, i.e., `agenteval-in-out/response_successful.txt`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "5H1WRs_wkiK0"
},
"outputs": [],
"source": [
"def read_without_groundtruth(file_name):\n",
" \"\"\"\n",
" Read the mathproblem logs - bypassing any information about the ground truths.\n",
"\n",
" Args:\n",
" - file_name (str): The single log file that wants to get evaluated.\n",
"\n",
" Returns:\n",
" - str: The log file without any information about the ground truth answer of the problem.\n",
" \"\"\"\n",
" f = open( file_name,\"r\").readlines()\n",
" output_dictionary = \"\"\n",
" for line in f:\n",
" if \"is_correct\" not in line and \"correct_ans\" not in line and \"check_result\" not in line:\n",
" output_dictionary += line\n",
" elif \"is_correct\" in line:\n",
" correctness = line.replace(\",\",\"\").split(\":\")[-1].rstrip().strip()\n",
" return [output_dictionary,correctness]\n",
"\n",
"# Reading one successful and one failed example of the task\n",
"response_successful = read_without_groundtruth(\"../test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt\")[0]\n",
"response_failed = read_without_groundtruth(\"../test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt\")[0]\n",
"\n",
"task = {\"name\": \"Math problem solving\",\n",
" \"description\": \"Given any question, the system needs to solve the problem as consisely and acccurately as possible\",\n",
" \"successful_response\" : response_successful,\n",
" \"failed_response\" : response_failed}\n",
"\n",
"sys_msg = f\"\"\"Task: {task[\"name\"]}.\n",
"Task description: {task[\"description\"]}\n",
"Task successfull example: {task[\"successful_response\"]}\n",
"Task failed example: {task[\"failed_response\"]}\n",
"\"\"\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Vu70o024lenI"
},
"source": [
"# The Criteria\n",
"Now, we print the designed criteria for assessing math problems. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "k9DsDB5hqvtG",
"outputId": "0edd7a0c-b031-4f67-efc6-1a1e77066921"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mcritic_user\u001b[0m (to critic):\n",
"\n",
"Task: Math problem solving.\n",
"Task description: Given any question, the system needs to solve the problem as consisely and acccurately as possible\n",
"Task successfull example: {\n",
" \"problem\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n",
" \"level\": \"Level 5\",\n",
" \"type\": \"Number Theory\",\n",
" \"solution\": \"Prime factorize $144=2^4\\\\cdot3^2$. The sum of the positive two-digit factors of 144 is $2^4+2\\\\cdot3^2+2^2\\\\cdot3+2^2\\\\cdot3^2+2^3\\\\cdot3+2^3\\\\cdot3^2+2^4\\\\cdot3=\\\\boxed{226}.$\",\n",
" \"problem_id\": \"0\",\n",
" \"response_with_ans\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n",
" \"round\": 0,\n",
" \"messages\": [\n",
" {\n",
" \"content\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n",
" \"role\": \"user\"\n",
" },\n",
" {\n",
" \"content\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n",
" \"role\": \"assistant\"\n",
" }\n",
" ],\n",
" \"time\": 11.140539407730103,\n",
" \"trial\": -1\n",
"}\n",
"Task failed example: {\n",
" \"problem\": \"Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.\",\n",
" \"level\": \"Level 5\",\n",
" \"type\": \"Algebra\",\n",
" \"solution\": \"We have \\\\begin{align*} (2x+10)(x+3)&<(3x+9)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)&<3(x+3)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)-3(x+3)(x+8)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (2x+10-(3x+24))(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (-x-14)(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (x+14)(x+3)&>0.\\n\\\\end{align*} This inequality is satisfied if and only if $(x+14)$ and $(x+3)$ are either both positive or both negative. Both factors are positive for $x>-3$ and both factors are negative for $x<-14$. When $-14 0$ and then find intervals where the inequality holds true.\\n\\nLet's first find the factorized form of the quadratic:\",\n",
" \"role\": \"assistant\"\n",
" },\n",
" {\n",
" \"content\": \"\",\n",
" \"role\": \"user\"\n",
" },\n",
" {\n",
" \"content\": \"I apologize for the confusion in my previous message. Let's find the factorized form of the quadratic using Python code.\\n\\n```python\\nfrom sympy import symbols, factor\\r\\nx = symbols('x')\\r\\nfactorized = factor(x**2 + 17*x + 42)\\r\\nfactorized\\r\\n```\\n\\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.\",\n",
" \"role\": \"assistant\"\n",
" }\n",
" ],\n",
" \"time\": 24.91333508491516,\n",
" \"trial\": -1\n",
"}\n",
"\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mcritic\u001b[0m (to critic_user):\n",
"\n",
"{\n",
" \"accuracy\": {\n",
" \"description\": \"Measures the correctness of the problem solution\",\n",
" \"accepted_values\": [\"correct\", \"incorrect\"]\n",
" },\n",
" \"conciseness\": {\n",
" \"description\": \"Determines if the solution is provided in the most concise way possible\",\n",
" \"accepted_values\": [\"not concise\", \"fairly concise\", \"very concise\"]\n",
" },\n",
" \"problem_solving_time\": {\n",
" \"description\": \"Measures the time taken by the system to solve the problem\",\n",
" \"accepted_values\": [\"0-1 minutes\", \"1-5 minutes\", \"5-10 minutes\", \"10+ minutes\"]\n",
" },\n",
" \"clarity\": {\n",
" \"description\": \"Assesses the clarity of the solution provided by the system\",\n",
" \"accepted_values\": [\"not clear\", \"fairly clear\", \"very clear\"]\n",
" },\n",
" \"difficulty_level\": {\n",
" \"description\": \"The level of difficulty of the problem\",\n",
" \"accepted_values\": [\"level 1\", \"level 2\", \"level 3\", \"level 4\", \"level 5\"]\n",
" },\n",
" \"mathematics_discipline\": {\n",
" \"description\": \"The branch of mathematics the problem belongs to e.g algebra, number theory\",\n",
" \"accepted_values\": [\"algebra\", \"number theory\", \"calculus\", \"geometry\", \"statistics\", \"combinatorics\"]\n",
" }\n",
"}\n",
"\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"\n",
"current_task_name = '_'.join(task[\"name\"].split()).lower()\n",
"gen_criteria = critic_user.initiate_chat(critic, message=sys_msg)\n",
"criteria = critic_user.last_message()\n",
"cr_file = open(f\"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json\",\"w\")\n",
"cr_file.write(criteria[\"content\"])\n",
"cr_file.close()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PETPZluOEGCR"
},
"source": [
"*Note :* You can also define and use your own criteria by editing `criteria.txt`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SmpUZv_ylo9U"
},
"source": [
"# The `QuantifierAgent`\n",
"\n",
"Once we have the criteria, we need to quantify a new sample based on the designed criteria and its accepted values. This will be done through `QuantifierAgent` agent as follows. \n",
"We note that can skip the designed creteria by the agent and use your own defined criteria in `criteria_file`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "4uUkZJh_subA"
},
"outputs": [],
"source": [
"import json\n",
"\n",
"criteria_file = f\"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json\"\n",
"quantifier = autogen.AssistantAgent(\n",
" name = \"quantifier\",\n",
" llm_config = {\"config_list\": config_list},\n",
" system_message = \"\"\"You are a helpful assistant. You quantify the output of different tasks based on the given criteria.\n",
" The criterion is given in a dictionary format where each key is a dintinct criteria.\n",
" The value of each key is a dictionary as follows {\"description\": criteria description , \"accepted_values\": possible accepted inputs for this key}\n",
" You are going to quantify each of the crieria for a given task based on the task decription.\n",
" Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criteria.\n",
" Return only the dictionary.\"\"\"\n",
")\n",
"\n",
"quantifier_user = autogen.UserProxyAgent(\n",
" name = \"quantifier_user\",\n",
" max_consecutive_auto_reply = 0, # terminate without auto-reply\n",
" human_input_mode = \"NEVER\",\n",
")\n",
"\n",
"dictionary_for_eval = open(criteria_file,\"r\").read()\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "64rRJfB2l6lO"
},
"source": [
"## Running the quantifier on a single test case"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "zQ0H3sy8l-Ai"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"def get_quantifier(file,criteria_file):\n",
" \"\"\"\n",
" Running quantifier agent on individual log.\n",
"\n",
" Args:\n",
" - file (str): The log path.\n",
" - file (str): The criteria jason file path\n",
" Returns:\n",
" - dict: A dictionary including the actual success of the problem as well as estimated performance by the agent eval.\n",
" {\"actual_success\":actual_label, \"estimated_performance\" : a dictionary of all the criteria and their quantified estimated performance.} }\n",
" \"\"\"\n",
" dictionary_for_eval = open(criteria_file,\"r\").read()\n",
"\n",
" test_case , actual_label = read_without_groundtruth(file)\n",
" print(\"actual label for this case: \", actual_label)\n",
" cq_results = quantifier_user.initiate_chat(quantifier, message = sys_msg + \\\n",
" \"Evaluation dictionary: \" + str(dictionary_for_eval) +\\\n",
" \"actual test case to evaluate: \" + test_case)\n",
" quantified_results = quantifier_user.last_message()\n",
" return {\"actual_success\": actual_label, \"estimated_performance\": quantified_results[\"content\"]}\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we run the quantifier on a single math problem test case, `sample_test_case.json`, for demonstration."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Pf623aNbHZTG",
"outputId": "0031871b-a438-43f5-d2b2-c99fa1ad0dbd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"actual label for this case: true\n",
"\u001b[33mquantifier_user\u001b[0m (to quantifier):\n",
"\n",
"Task: Math problem solving.\n",
"Task description: Given any question, the system needs to solve the problem as consisely and acccurately as possible\n",
"Task successfull example: {\n",
" \"problem\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n",
" \"level\": \"Level 5\",\n",
" \"type\": \"Number Theory\",\n",
" \"solution\": \"Prime factorize $144=2^4\\\\cdot3^2$. The sum of the positive two-digit factors of 144 is $2^4+2\\\\cdot3^2+2^2\\\\cdot3+2^2\\\\cdot3^2+2^3\\\\cdot3+2^3\\\\cdot3^2+2^4\\\\cdot3=\\\\boxed{226}.$\",\n",
" \"problem_id\": \"0\",\n",
" \"response_with_ans\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n",
" \"round\": 0,\n",
" \"messages\": [\n",
" {\n",
" \"content\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n",
" \"role\": \"user\"\n",
" },\n",
" {\n",
" \"content\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n",
" \"role\": \"assistant\"\n",
" }\n",
" ],\n",
" \"time\": 11.140539407730103,\n",
" \"trial\": -1\n",
"}\n",
"Task failed example: {\n",
" \"problem\": \"Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.\",\n",
" \"level\": \"Level 5\",\n",
" \"type\": \"Algebra\",\n",
" \"solution\": \"We have \\\\begin{align*} (2x+10)(x+3)&<(3x+9)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)&<3(x+3)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)-3(x+3)(x+8)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (2x+10-(3x+24))(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (-x-14)(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (x+14)(x+3)&>0.\\n\\\\end{align*} This inequality is satisfied if and only if $(x+14)$ and $(x+3)$ are either both positive or both negative. Both factors are positive for $x>-3$ and both factors are negative for $x<-14$. When $-14 0$ and then find intervals where the inequality holds true.\\n\\nLet's first find the factorized form of the quadratic:\",\n",
" \"role\": \"assistant\"\n",
" },\n",
" {\n",
" \"content\": \"\",\n",
" \"role\": \"user\"\n",
" },\n",
" {\n",
" \"content\": \"I apologize for the confusion in my previous message. Let's find the factorized form of the quadratic using Python code.\\n\\n```python\\nfrom sympy import symbols, factor\\r\\nx = symbols('x')\\r\\nfactorized = factor(x**2 + 17*x + 42)\\r\\nfactorized\\r\\n```\\n\\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.\",\n",
" \"role\": \"assistant\"\n",
" }\n",
" ],\n",
" \"time\": 24.91333508491516,\n",
" \"trial\": -1\n",
"}\n",
"Evaluation dictionary: {\n",
" \"accuracy\": {\n",
" \"description\": \"Measures the correctness of the problem solution\",\n",
" \"accepted_values\": [\"correct\", \"incorrect\"]\n",
" },\n",
" \"conciseness\": {\n",
" \"description\": \"Determines if the solution is provided in the most concise way possible\",\n",
" \"accepted_values\": [\"not concise\", \"fairly concise\", \"very concise\"]\n",
" },\n",
" \"problem_solving_time\": {\n",
" \"description\": \"Measures the time taken by the system to solve the problem\",\n",
" \"accepted_values\": [\"0-1 minutes\", \"1-5 minutes\", \"5-10 minutes\", \"10+ minutes\"]\n",
" },\n",
" \"clarity\": {\n",
" \"description\": \"Assesses the clarity of the solution provided by the system\",\n",
" \"accepted_values\": [\"not clear\", \"fairly clear\", \"very clear\"]\n",
" },\n",
" \"difficulty_level\": {\n",
" \"description\": \"The level of difficulty of the problem\",\n",
" \"accepted_values\": [\"level 1\", \"level 2\", \"level 3\", \"level 4\", \"level 5\"]\n",
" },\n",
" \"mathematics_discipline\": {\n",
" \"description\": \"The branch of mathematics the problem belongs to e.g algebra, number theory\",\n",
" \"accepted_values\": [\"algebra\", \"number theory\", \"calculus\", \"geometry\", \"statistics\", \"combinatorics\"]\n",
" }\n",
"}actual test case to evaluate: {\n",
" \"problem\": \"Find $24^{-1} \\\\pmod{11^2}$. That is, find the residue $b$ for which $24b \\\\equiv 1\\\\pmod{11^2}$.\\n\\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.\",\n",
" \"level\": \"Level 5\",\n",
" \"type\": \"Number Theory\",\n",
" \"solution\": \"Since $5 \\\\times 24 = 120 = 121 - 1$, it follows that $-5 \\\\times 24 \\\\equiv 1 \\\\pmod{121}$. Adding 121 to $-5$ to make it positive, we find $(-5 + 121) \\\\times 24 \\\\equiv 116 \\\\times 24 \\\\equiv 1 \\\\pmod{121}$, so it follows that the modular inverse of $24$ is $\\\\boxed{116}$ when taken modulo $121$.\",\n",
" \"problem_id\": \"5\",\n",
" \"response_with_ans\": \"To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\\n\\n```python\\ndef mod_inverse(a, m):\\n g, x, _ = extended_gcd(a, m)\\n if g != 1:\\n raise Exception(f\\\"{a} and {m} are not coprime.\\\")\\n return x % m\\n\\ndef extended_gcd(a, b):\\n if a == 0:\\n return b, 0, 1\\n else:\\n g, x, y = extended_gcd(b % a, a)\\n return g, y - (b // a) * x, x\\n```\\n\\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\\n\\n```python\\na = 24\\nm = 11**2\\nmod_inverse(a, m)\\n```\\n\\nI will execute the above code to find the modular inverse of 24 modulo 11^2.\",\n",
" \"round\": 0,\n",
" \"messages\": [\n",
" {\n",
" \"content\": \"Find $24^{-1} \\\\pmod{11^2}$. That is, find the residue $b$ for which $24b \\\\equiv 1\\\\pmod{11^2}$.\\n\\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.\",\n",
" \"role\": \"user\"\n",
" },\n",
" {\n",
" \"content\": \"To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\\n\\n```python\\ndef mod_inverse(a, m):\\n g, x, _ = extended_gcd(a, m)\\n if g != 1:\\n raise Exception(f\\\"{a} and {m} are not coprime.\\\")\\n return x % m\\n\\ndef extended_gcd(a, b):\\n if a == 0:\\n return b, 0, 1\\n else:\\n g, x, y = extended_gcd(b % a, a)\\n return g, y - (b // a) * x, x\\n```\\n\\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\\n\\n```python\\na = 24\\nm = 11**2\\nmod_inverse(a, m)\\n```\\n\\nI will execute the above code to find the modular inverse of 24 modulo 11^2.\",\n",
" \"role\": \"assistant\"\n",
" }\n",
" ],\n",
" \"time\": 13.481226921081543,\n",
" \"trial\": -1\n",
"}\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mquantifier\u001b[0m (to quantifier_user):\n",
"\n",
"{\n",
" \"accuracy\": \"correct\",\n",
" \"conciseness\": \"very concise\",\n",
" \"problem_solving_time\": \"10+ minutes\",\n",
" \"clarity\": \"very clear\",\n",
" \"difficulty_level\": \"level 5\",\n",
" \"mathematics_discipline\": \"number theory\"\n",
"}\n",
"\n",
"--------------------------------------------------------------------------------\n",
"actual correctness: true\n",
"predicted coprrectness:\n",
" {\n",
" \"accuracy\": \"correct\",\n",
" \"conciseness\": \"very concise\",\n",
" \"problem_solving_time\": \"10+ minutes\",\n",
" \"clarity\": \"very clear\",\n",
" \"difficulty_level\": \"level 5\",\n",
" \"mathematics_discipline\": \"number theory\"\n",
"}\n"
]
}
],
"source": [
"test_case = \"../test/test_files/agenteval-in-out/samples/sample_test_case.json\"\n",
"quantifier_output = get_quantifier(test_case,criteria_file)\n",
"print(\"actual correctness:\" , quantifier_output[\"actual_success\"])\n",
"print(\"predicted coprrectness:\\n\" , quantifier_output[\"estimated_performance\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2VtdM44WEGCS"
},
"source": [
"# Run `AgentEval` on the logs\n",
"\n",
"In the example below, log_path points to the sample logs folder to run the quantifier. The current sample belongs to the prealgebra category which will be downloaded from [here](https://github.com/julianakiseleva/autogen/tree/agenteval/test/test_files/agenteval-in-out/samples).\n",
"In case you want to replicate the results described in the blog post, you can download all the logs for math problems using the following [link](https://github.com/julianakiseleva/autogen/tree/agenteval/model-logs/math-problems/agentchat). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget https://github.com/julianakiseleva/autogen/blob/agenteval/test/test_files/agenteval-in-out/samples/prealgebra.zip \n",
"!unzip z.zip -d ../test/test_files/agenteval-in-out/agentchat_results\n",
"!rm prealgebra.zip "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "dZdIbHPFEGCS",
"outputId": "83c0a51b-f184-494b-81a0-d4b4a3667319"
},
"outputs": [],
"source": [
"# You can set your own log path - we also limited the number of samples to avoid additional costs.\n",
"# By removing the condition about limitations on the number of samples per category, you can run it on all 120 problems\n",
"\n",
"log_path = \"../test/test_files/agenteval-in-out/agentchat_results/\"\n",
"criteria_file = \"../test/test_files/agenteval-in-out/samples/sample_math_criteria.json\"\n",
"outcome = {}\n",
"\n",
"for prefix in os.listdir(log_path):\n",
" for file_name in os.listdir(log_path+\"/\"+prefix):\n",
" gameid = prefix+\"_\"+file_name\n",
" if file_name.split('.')[-1]=='json':\n",
" outcome[gameid] = get_quantifier(log_path+\"/\"+prefix+\"/\"+file_name,criteria_file)\n",
"\n",
"# store the evaluated problems\n",
"with open(\"../test/test_files/agenteval-in-out/evaluated_problems.json\",\"w\") as file:\n",
" json.dump(outcome,file,indent=2) # use `json.loads` to do the reverse"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qbrRRiP_EGCT"
},
"source": [
"## Plotting the estimated performance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here you can find an example of how to visualize the obtained result in the histogram form (similar to the one in the blog post)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LKu2xZJcEGCT",
"outputId": "7780bc7c-382f-4ad3-b8c6-ac6051302303"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import scipy.stats as stats\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# computing average and 95% interval for failed and successful cases on all criteria\n",
"try:\n",
" # convert the criteria to dict type if it is already not\n",
" dictionary_for_eval = eval(open(criteria_file,'r').read())\n",
"except:\n",
" pass\n",
"\n",
"criteria = list(dictionary_for_eval.keys())\n",
"nl2int = {}\n",
"for criterion in dictionary_for_eval:\n",
" score = 0\n",
" for v in dictionary_for_eval[criterion][\"accepted_values\"]:\n",
" nl2int[v] = score\n",
" score += 1\n",
"print(nl2int)\n",
"\n",
"average_s = {}\n",
"average_f = {}\n",
"\n",
"conf_interval_s = {}\n",
"conf_interval_f = {}\n",
"\n",
"for criterion in criteria:\n",
" task={\"s\": [] , \"f\" : []}\n",
"\n",
" for game in outcome:\n",
" try:\n",
" tmp_dic = eval(outcome[game][\"estimated_performance\"])\n",
" if outcome[game][\"actual_success\"] == \"false\":\n",
" task[\"f\"].append(nl2int[tmp_dic[criterion]])\n",
" else:\n",
" task[\"s\"].append(nl2int[tmp_dic[criterion]])\n",
" except:\n",
" pass\n",
"\n",
" \n",
" average_f[criterion] = np.mean(task['f'])\n",
" average_s[criterion] = np.mean(task['s'])\n",
"\n",
" conf_interval_s[criterion] = stats.norm.interval(0.95, loc=np.mean(task['s']), scale=stats.sem(task['s']))\n",
" conf_interval_f[criterion] = stats.norm.interval(0.95, loc=np.mean(task['f']), scale=stats.sem(task['f']))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final plot would be saved in `../test/test_files/agenteval-in-out/estimated_performance.png`"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 695
},
"id": "zqa86vwgEGCT",
"outputId": "248cd0bc-0927-4d9f-b911-088bd76acf5d"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create a bar plot with error bars for the average values of \"s\" and \"f\" for each criterion\n",
"\n",
"plt.figure(figsize=(12, 8))\n",
"bar_width = 0.1\n",
"index = np.arange(len(criteria))\n",
"\n",
"\n",
"plt.bar(index , list(average_s.values()), bar_width, label=f\"success ({len(task['s'])} samples)\", color=\"darkblue\", yerr=[(avg - conf_interval_s[key][0]) for key, avg in average_s.items()], capsize=5)\n",
"plt.bar(index + bar_width, list(average_f.values()), bar_width, label=f\"failed ({len(task['f'])} samples)\", color=\"lightblue\", yerr=[(avg - conf_interval_f[key][0]) for key, avg in average_f.items()], capsize=5)\n",
"\n",
"plt.xlabel(\"Criteria\", fontsize=16)\n",
"plt.ylabel(\"Average Value\", fontsize=16)\n",
"plt.title(\"Average Values of 3 different baselines cases with 95% Confidence Intervals - math problems \", fontsize=12, pad=10) # Adjust titlepad to move the title further above\n",
"plt.xticks(index + bar_width / 2, criteria, rotation=45, fontsize=14)\n",
"plt.legend(loc=\"upper center\", fontsize=14, bbox_to_anchor=(0.5, 1), ncol=3) # Adjust legend placement and ncol\n",
"plt.tight_layout() # Adjust subplot parameters to fit the labels\n",
"plt.ylim(0,5)\n",
"plt.savefig(\"../test/test_files/agenteval-in-out/estimated_performance.png\")\n",
"plt.show()\n"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"vscode": {
"interpreter": {
"hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}