diff --git a/.gitignore b/.gitignore index 479178234..622fc88ff 100644 --- a/.gitignore +++ b/.gitignore @@ -167,3 +167,6 @@ wolfram.txt # DB on disk for TeachableAgent tmp/ + +# Storage for the AgentEval output +test/test_files/agenteval-in-out/out/ diff --git a/notebook/agenteval_cq_math.ipynb b/notebook/agenteval_cq_math.ipynb new file mode 100644 index 000000000..d834a56c8 --- /dev/null +++ b/notebook/agenteval_cq_math.ipynb @@ -0,0 +1,789 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "-pftZ-ZF1_BA" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NPUGFpKP1_BH" + }, + "source": [ + "# Demonstrating the `AgentEval` framework using the task of solving math problems as an example\n", + "\n", + "This notebook demonstrates how `AgentEval`, implemented through [AutoGen](https://github.com/microsoft/autogen), works, using a math problem-solving task as an example. \n", + "`AgentEval` consists of two key components:\n", + "\n", + "- `CriticAgent`: This is an LLM-based agent that generates a list of criteria $(c_1, \\dots, c_n)$ to help evaluate the utility of a given task.\n", + "\n", + "- `QuantifierAgent`: This agent quantifies the performance of any sample task based on the criteria designed by the `CriticAgent` in the following way: $(c_1=a_1, \\dots, c_n=a_n)$\n", + "\n", + "![AgentEval](../website/blog/2023-11-11-AgentEval/img/agenteval-CQ.png)\n", + "\n", + "For more detailed explanations, please refer to the accompanying [blog post](https://microsoft.github.io/autogen/blog/2023/11/11/AgentEval)\n", + "\n", + "## Requirements\n", + "\n", + "AutoGen requires `Python>=3.8`. 
To run this notebook example, please install pyautogen, Docker, and OpenAI:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2023-02-13T23:40:52.317406Z", + "iopub.status.busy": "2023-02-13T23:40:52.316561Z", + "iopub.status.idle": "2023-02-13T23:40:52.321193Z", + "shell.execute_reply": "2023-02-13T23:40:52.320628Z" + }, + "id": "68lTZZyJ1_BI", + "outputId": "15a55fab-e13a-4654-b8cb-ae117478d6d8" + }, + "outputs": [], + "source": [ + "%pip install pyautogen~=0.2.0b5 docker\n", + "%pip install scipy\n", + "%pip install matplotlib" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HxgqKJrd1_BJ" + }, + "source": [ + "## Set your API Endpoint\n", + "\n", + "* The [`config_list_openai_aoai`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_openai_aoai) function tries to create a list of configurations using Azure OpenAI endpoints and OpenAI endpoints. It assumes the api keys and api bases are stored in the corresponding environment variables or local txt files:\n", + " - OpenAI API key: os.environ[\"OPENAI_API_KEY\"] or `openai_api_key_file=\"key_openai.txt\"`.\n", + " - Azure OpenAI API key: os.environ[\"AZURE_OPENAI_API_KEY\"] or `aoai_api_key_file=\"key_aoai.txt\"`. Multiple keys can be stored, one per line.\n", + " - Azure OpenAI API base: os.environ[\"AZURE_OPENAI_API_BASE\"] or `aoai_api_base_file=\"base_aoai.txt\"`. Multiple bases can be stored, one per line.\n", + "* The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a json file. It first looks for an environment variable with a specified name. The value of the environment variable needs to be a valid json string. If that variable is not found, it looks for a json file with the same name. 
It filters the configs by filter_dict.\n", + "\n", + "You can set the value of config_list in any way you prefer. Please refer to this [notebook](https://github.com/microsoft/autogen/blob/main/notebook/oai_openai_utils.ipynb) for full code examples of the different methods.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YRycFEDJ1_BJ" + }, + "outputs": [], + "source": [ + "import autogen\n", + "\n", + "config_list = autogen.config_list_from_json(\n", + " \"OAI_CONFIG_LIST\",\n", + " filter_dict={\n", + " \"model\": [\"gpt-4\"],\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fBZ-XFXy1_BJ" + }, + "source": [ + "\n", + "## Construct `CriticAgent`\n", + "\n", + "We construct the critic agent named `critic` and a user proxy agent for the critic named `critic_user`. We specify `human_input_mode` as \"NEVER\" in the user proxy agent, ensuring that it will never ask for human feedback. Additionally, we define the `ask_critic` function to send a message to the critic and retrieve the criteria from the critic.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "9XAeyjd11_BK" + }, + "outputs": [], + "source": [ + "\n", + "critic = autogen.AssistantAgent(\n", + " name = \"critic\",\n", + " llm_config = {\"config_list\": config_list},\n", + " system_message = \"\"\"You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be distinguishable, quantifiable and not redundant.\n", + " Convert the evaluation criteria into a dictionary where the keys are the criteria.\n", + " The value of each key is a dictionary as follows {\"description\": criteria description , \"accepted_values\": possible accepted inputs for this key}\n", + " Make sure the keys are criteria for assessing the given task. \"accepted_values\" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels. 
\"description\" includes the criterion description.\n", + " Return the dictionary.\"\"\"\n", + ")\n", + "\n", + "critic_user = autogen.UserProxyAgent(\n", + " name = \"critic_user\",\n", + " max_consecutive_auto_reply = 0, # terminate without auto-reply\n", + " human_input_mode = \"NEVER\",\n", + ")\n", + "\n", + "def ask_critic(message):\n", + " \"\"\"\n", + " Initiate a chat between `critic_user` and the critic, and return the last message received from the critic.\n", + "\n", + " Args:\n", + " - message (str): The message to be sent to the critic.\n", + "\n", + " Returns:\n", + " - str: The content of the last message received.\n", + " \"\"\"\n", + " critic_user.initiate_chat(critic, message=message)\n", + " # return the last message received from the critic\n", + " return critic_user.last_message()[\"content\"]\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6vPTtNkhk2V1" + }, + "source": [ + "# Run the Critic\n", + "\n", + "To run the critic, we need a couple of math problem examples. One of them was not solved successfully, given in `agenteval-in-out/samples/sample_math_response_failed.txt`, and the other one was solved successfully, given in `agenteval-in-out/samples/sample_math_response_successful.txt`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "5H1WRs_wkiK0" + }, + "outputs": [], + "source": [ + "def read_without_groundtruth(file_name):\n", + " \"\"\"\n", + " Read a math problem log, dropping any information about the ground truth.\n", + "\n", + " Args:\n", + " - file_name (str): The path of the single log file to be evaluated.\n", + "\n", + " Returns:\n", + " - list: [log, correctness], i.e. the log content without any ground truth information and the value of its `is_correct` field.\n", + " \"\"\"\n", + " f = open( file_name,\"r\").readlines()\n", + " output_dictionary = \"\"\n", + " for line in f:\n", + " if \"is_correct\" not in line and \"correct_ans\" not in line and \"check_result\" not in line:\n", + " output_dictionary += line\n", + " elif \"is_correct\" in line:\n", + " correctness = line.replace(\",\",\"\").split(\":\")[-1].rstrip().strip()\n", + " return [output_dictionary,correctness]\n", + "\n", + "# Reading one successful and one failed example of the task\n", + "response_successful = read_without_groundtruth(\"../test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt\")[0]\n", + "response_failed = read_without_groundtruth(\"../test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt\")[0]\n", + "\n", + "task = {\"name\": \"Math problem solving\",\n", + " \"description\": \"Given any question, the system needs to solve the problem as consisely and acccurately as possible\",\n", + " \"successful_response\" : response_successful,\n", + " \"failed_response\" : response_failed}\n", + "\n", + "sys_msg = f\"\"\"Task: {task[\"name\"]}.\n", + "Task description: {task[\"description\"]}\n", + "Task successfull example: {task[\"successful_response\"]}\n", + "Task failed example: {task[\"failed_response\"]}\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vu70o024lenI" + }, + "source": [ + "# The Criteria\n", + "Now, we print the designed criteria for 
assessing math problems. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "k9DsDB5hqvtG", + "outputId": "0edd7a0c-b031-4f67-efc6-1a1e77066921" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[33mcritic_user\u001b[0m (to critic):\n", + "\n", + "Task: Math problem solving.\n", + "Task description: Given any question, the system needs to solve the problem as consisely and acccurately as possible\n", + "Task successfull example: {\n", + " \"problem\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n", + " \"level\": \"Level 5\",\n", + " \"type\": \"Number Theory\",\n", + " \"solution\": \"Prime factorize $144=2^4\\\\cdot3^2$. The sum of the positive two-digit factors of 144 is $2^4+2\\\\cdot3^2+2^2\\\\cdot3+2^2\\\\cdot3^2+2^3\\\\cdot3+2^3\\\\cdot3^2+2^4\\\\cdot3=\\\\boxed{226}.$\",\n", + " \"problem_id\": \"0\",\n", + " \"response_with_ans\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n", + " \"round\": 0,\n", + " \"messages\": [\n", + " {\n", + " \"content\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n", + " \"role\": \"user\"\n", + " },\n", + " {\n", + " \"content\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. 
We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n", + " \"role\": \"assistant\"\n", + " }\n", + " ],\n", + " \"time\": 11.140539407730103,\n", + " \"trial\": -1\n", + "}\n", + "Task failed example: {\n", + " \"problem\": \"Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.\",\n", + " \"level\": \"Level 5\",\n", + " \"type\": \"Algebra\",\n", + " \"solution\": \"We have \\\\begin{align*} (2x+10)(x+3)&<(3x+9)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)&<3(x+3)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)-3(x+3)(x+8)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (2x+10-(3x+24))(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (-x-14)(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (x+14)(x+3)&>0.\\n\\\\end{align*} This inequality is satisfied if and only if $(x+14)$ and $(x+3)$ are either both positive or both negative. Both factors are positive for $x>-3$ and both factors are negative for $x<-14$. When $-14 0$ and then find intervals where the inequality holds true.\\n\\nLet's first find the factorized form of the quadratic:\",\n", + " \"role\": \"assistant\"\n", + " },\n", + " {\n", + " \"content\": \"\",\n", + " \"role\": \"user\"\n", + " },\n", + " {\n", + " \"content\": \"I apologize for the confusion in my previous message. 
Let's find the factorized form of the quadratic using Python code.\\n\\n```python\\nfrom sympy import symbols, factor\\r\\nx = symbols('x')\\r\\nfactorized = factor(x**2 + 17*x + 42)\\r\\nfactorized\\r\\n```\\n\\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.\",\n", + " \"role\": \"assistant\"\n", + " }\n", + " ],\n", + " \"time\": 24.91333508491516,\n", + " \"trial\": -1\n", + "}\n", + "\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[33mcritic\u001b[0m (to critic_user):\n", + "\n", + "{\n", + " \"accuracy\": {\n", + " \"description\": \"Measures the correctness of the problem solution\",\n", + " \"accepted_values\": [\"correct\", \"incorrect\"]\n", + " },\n", + " \"conciseness\": {\n", + " \"description\": \"Determines if the solution is provided in the most concise way possible\",\n", + " \"accepted_values\": [\"not concise\", \"fairly concise\", \"very concise\"]\n", + " },\n", + " \"problem_solving_time\": {\n", + " \"description\": \"Measures the time taken by the system to solve the problem\",\n", + " \"accepted_values\": [\"0-1 minutes\", \"1-5 minutes\", \"5-10 minutes\", \"10+ minutes\"]\n", + " },\n", + " \"clarity\": {\n", + " \"description\": \"Assesses the clarity of the solution provided by the system\",\n", + " \"accepted_values\": [\"not clear\", \"fairly clear\", \"very clear\"]\n", + " },\n", + " \"difficulty_level\": {\n", + " \"description\": \"The level of difficulty of the problem\",\n", + " \"accepted_values\": [\"level 1\", \"level 2\", \"level 3\", \"level 4\", \"level 5\"]\n", + " },\n", + " \"mathematics_discipline\": {\n", + " \"description\": \"The branch of mathematics the problem belongs to e.g algebra, number theory\",\n", + " \"accepted_values\": [\"algebra\", \"number theory\", \"calculus\", \"geometry\", \"statistics\", \"combinatorics\"]\n", + " }\n", + 
"}\n", + "\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "\n", + "current_task_name = '_'.join(task[\"name\"].split()).lower()\n", + "gen_criteria = critic_user.initiate_chat(critic, message=sys_msg)\n", + "criteria = critic_user.last_message()\n", + "cr_file = open(f\"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json\",\"w\")\n", + "cr_file.write(criteria[\"content\"])\n", + "cr_file.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PETPZluOEGCR" + }, + "source": [ + "*Note:* You can also define and use your own criteria by editing the criteria file written above (`{current_task_name}_criteria.json`)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SmpUZv_ylo9U" + }, + "source": [ + "# The `QuantifierAgent`\n", + "\n", + "Once we have the criteria, we need to quantify a new sample based on the designed criteria and their accepted values. This is done through the `QuantifierAgent` as follows. \n", + "Note that you can skip the criteria designed by the agent and use your own criteria, defined in `criteria_file`." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "4uUkZJh_subA" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "criteria_file = f\"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json\"\n", + "quantifier = autogen.AssistantAgent(\n", + " name = \"quantifier\",\n", + " llm_config = {\"config_list\": config_list},\n", + " system_message = \"\"\"You are a helpful assistant. 
You quantify the output of different tasks based on the given criteria.\n", + " The criteria are given in a dictionary format where each key is a distinct criterion.\n", + " The value of each key is a dictionary as follows {\"description\": criteria description , \"accepted_values\": possible accepted inputs for this key}\n", + " You are going to quantify each of the criteria for a given task based on the task description.\n", + " Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criterion.\n", + " Return only the dictionary.\"\"\"\n", + ")\n", + "\n", + "quantifier_user = autogen.UserProxyAgent(\n", + " name = \"quantifier_user\",\n", + " max_consecutive_auto_reply = 0, # terminate without auto-reply\n", + " human_input_mode = \"NEVER\",\n", + ")\n", + "\n", + "dictionary_for_eval = open(criteria_file,\"r\").read()\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "64rRJfB2l6lO" + }, + "source": [ + "## Running the quantifier on a single test case" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "zQ0H3sy8l-Ai" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "def get_quantifier(file,criteria_file):\n", + " \"\"\"\n", + " Run the quantifier agent on an individual log.\n", + "\n", + " Args:\n", + " - file (str): The log path.\n", + " - criteria_file (str): The criteria JSON file path.\n", + " Returns:\n", + " - dict: A dictionary including the actual success of the problem as well as the performance estimated by `AgentEval`.\n", + " {\"actual_success\":actual_label, \"estimated_performance\" : a dictionary of all the criteria and their quantified estimated performance.}\n", + " \"\"\"\n", + " dictionary_for_eval = open(criteria_file,\"r\").read()\n", + "\n", + " test_case , actual_label = read_without_groundtruth(file)\n", + " print(\"actual label for this case: \", actual_label)\n", + " cq_results = 
quantifier_user.initiate_chat(quantifier, message = sys_msg + \\\n", + " \"Evaluation dictionary: \" + str(dictionary_for_eval) +\\\n", + " \"actual test case to evaluate: \" + test_case)\n", + " quantified_results = quantifier_user.last_message()\n", + " return {\"actual_success\": actual_label, \"estimated_performance\": quantified_results[\"content\"]}\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, we run the quantifier on a single math problem test case, `sample_test_case.json`, for demonstration." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Pf623aNbHZTG", + "outputId": "0031871b-a438-43f5-d2b2-c99fa1ad0dbd" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "actual label for this case: true\n", + "\u001b[33mquantifier_user\u001b[0m (to quantifier):\n", + "\n", + "Task: Math problem solving.\n", + "Task description: Given any question, the system needs to solve the problem as consisely and acccurately as possible\n", + "Task successfull example: {\n", + " \"problem\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n", + " \"level\": \"Level 5\",\n", + " \"type\": \"Number Theory\",\n", + " \"solution\": \"Prime factorize $144=2^4\\\\cdot3^2$. The sum of the positive two-digit factors of 144 is $2^4+2\\\\cdot3^2+2^2\\\\cdot3+2^2\\\\cdot3^2+2^3\\\\cdot3+2^3\\\\cdot3^2+2^4\\\\cdot3=\\\\boxed{226}.$\",\n", + " \"problem_id\": \"0\",\n", + " \"response_with_ans\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. 
Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n", + " \"round\": 0,\n", + " \"messages\": [\n", + " {\n", + " \"content\": \"What is the sum of all the distinct positive two-digit factors of 144?\",\n", + " \"role\": \"user\"\n", + " },\n", + " {\n", + " \"content\": \"To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\\n\\nHere's a Python script to accomplish this:\\n\\n```python\\ntwo_digit_factors = []\\n\\nfor i in range(10, 100):\\n if 144 % i == 0:\\n two_digit_factors.append(i)\\n\\nsum_of_factors = sum(two_digit_factors)\\nprint(\\\"The sum of all the distinct positive two-digit factors of 144 is:\\\", sum_of_factors)\\n```\\n\\nPlease run this script to find the desired sum.\",\n", + " \"role\": \"assistant\"\n", + " }\n", + " ],\n", + " \"time\": 11.140539407730103,\n", + " \"trial\": -1\n", + "}\n", + "Task failed example: {\n", + " \"problem\": \"Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. 
Express your answer in interval notation.\",\n", + " \"level\": \"Level 5\",\n", + " \"type\": \"Algebra\",\n", + " \"solution\": \"We have \\\\begin{align*} (2x+10)(x+3)&<(3x+9)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)&<3(x+3)(x+8) \\\\quad \\\\Rightarrow\\n\\\\\\\\ 2(x+5)(x+3)-3(x+3)(x+8)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (2x+10-(3x+24))(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (-x-14)(x+3)&<0 \\\\quad \\\\Rightarrow\\n\\\\\\\\ (x+14)(x+3)&>0.\\n\\\\end{align*} This inequality is satisfied if and only if $(x+14)$ and $(x+3)$ are either both positive or both negative. Both factors are positive for $x>-3$ and both factors are negative for $x<-14$. When $-14 0$ and then find intervals where the inequality holds true.\\n\\nLet's first find the factorized form of the quadratic:\",\n", + " \"role\": \"assistant\"\n", + " },\n", + " {\n", + " \"content\": \"\",\n", + " \"role\": \"user\"\n", + " },\n", + " {\n", + " \"content\": \"I apologize for the confusion in my previous message. 
Let's find the factorized form of the quadratic using Python code.\\n\\n```python\\nfrom sympy import symbols, factor\\r\\nx = symbols('x')\\r\\nfactorized = factor(x**2 + 17*x + 42)\\r\\nfactorized\\r\\n```\\n\\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.\",\n", + " \"role\": \"assistant\"\n", + " }\n", + " ],\n", + " \"time\": 24.91333508491516,\n", + " \"trial\": -1\n", + "}\n", + "Evaluation dictionary: {\n", + " \"accuracy\": {\n", + " \"description\": \"Measures the correctness of the problem solution\",\n", + " \"accepted_values\": [\"correct\", \"incorrect\"]\n", + " },\n", + " \"conciseness\": {\n", + " \"description\": \"Determines if the solution is provided in the most concise way possible\",\n", + " \"accepted_values\": [\"not concise\", \"fairly concise\", \"very concise\"]\n", + " },\n", + " \"problem_solving_time\": {\n", + " \"description\": \"Measures the time taken by the system to solve the problem\",\n", + " \"accepted_values\": [\"0-1 minutes\", \"1-5 minutes\", \"5-10 minutes\", \"10+ minutes\"]\n", + " },\n", + " \"clarity\": {\n", + " \"description\": \"Assesses the clarity of the solution provided by the system\",\n", + " \"accepted_values\": [\"not clear\", \"fairly clear\", \"very clear\"]\n", + " },\n", + " \"difficulty_level\": {\n", + " \"description\": \"The level of difficulty of the problem\",\n", + " \"accepted_values\": [\"level 1\", \"level 2\", \"level 3\", \"level 4\", \"level 5\"]\n", + " },\n", + " \"mathematics_discipline\": {\n", + " \"description\": \"The branch of mathematics the problem belongs to e.g algebra, number theory\",\n", + " \"accepted_values\": [\"algebra\", \"number theory\", \"calculus\", \"geometry\", \"statistics\", \"combinatorics\"]\n", + " }\n", + "}actual test case to evaluate: {\n", + " \"problem\": \"Find $24^{-1} \\\\pmod{11^2}$. 
That is, find the residue $b$ for which $24b \\\\equiv 1\\\\pmod{11^2}$.\\n\\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.\",\n", + " \"level\": \"Level 5\",\n", + " \"type\": \"Number Theory\",\n", + " \"solution\": \"Since $5 \\\\times 24 = 120 = 121 - 1$, it follows that $-5 \\\\times 24 \\\\equiv 1 \\\\pmod{121}$. Adding 121 to $-5$ to make it positive, we find $(-5 + 121) \\\\times 24 \\\\equiv 116 \\\\times 24 \\\\equiv 1 \\\\pmod{121}$, so it follows that the modular inverse of $24$ is $\\\\boxed{116}$ when taken modulo $121$.\",\n", + " \"problem_id\": \"5\",\n", + " \"response_with_ans\": \"To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\\n\\n```python\\ndef mod_inverse(a, m):\\n g, x, _ = extended_gcd(a, m)\\n if g != 1:\\n raise Exception(f\\\"{a} and {m} are not coprime.\\\")\\n return x % m\\n\\ndef extended_gcd(a, b):\\n if a == 0:\\n return b, 0, 1\\n else:\\n g, x, y = extended_gcd(b % a, a)\\n return g, y - (b // a) * x, x\\n```\\n\\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\\n\\n```python\\na = 24\\nm = 11**2\\nmod_inverse(a, m)\\n```\\n\\nI will execute the above code to find the modular inverse of 24 modulo 11^2.\",\n", + " \"round\": 0,\n", + " \"messages\": [\n", + " {\n", + " \"content\": \"Find $24^{-1} \\\\pmod{11^2}$. That is, find the residue $b$ for which $24b \\\\equiv 1\\\\pmod{11^2}$.\\n\\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.\",\n", + " \"role\": \"user\"\n", + " },\n", + " {\n", + " \"content\": \"To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. 
Here is a Python function to compute the modular inverse using this algorithm:\\n\\n```python\\ndef mod_inverse(a, m):\\n g, x, _ = extended_gcd(a, m)\\n if g != 1:\\n raise Exception(f\\\"{a} and {m} are not coprime.\\\")\\n return x % m\\n\\ndef extended_gcd(a, b):\\n if a == 0:\\n return b, 0, 1\\n else:\\n g, x, y = extended_gcd(b % a, a)\\n return g, y - (b // a) * x, x\\n```\\n\\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\\n\\n```python\\na = 24\\nm = 11**2\\nmod_inverse(a, m)\\n```\\n\\nI will execute the above code to find the modular inverse of 24 modulo 11^2.\",\n", + " \"role\": \"assistant\"\n", + " }\n", + " ],\n", + " \"time\": 13.481226921081543,\n", + " \"trial\": -1\n", + "}\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[33mquantifier\u001b[0m (to quantifier_user):\n", + "\n", + "{\n", + " \"accuracy\": \"correct\",\n", + " \"conciseness\": \"very concise\",\n", + " \"problem_solving_time\": \"10+ minutes\",\n", + " \"clarity\": \"very clear\",\n", + " \"difficulty_level\": \"level 5\",\n", + " \"mathematics_discipline\": \"number theory\"\n", + "}\n", + "\n", + "--------------------------------------------------------------------------------\n", + "actual correctness: true\n", + "predicted correctness:\n", + " {\n", + " \"accuracy\": \"correct\",\n", + " \"conciseness\": \"very concise\",\n", + " \"problem_solving_time\": \"10+ minutes\",\n", + " \"clarity\": \"very clear\",\n", + " \"difficulty_level\": \"level 5\",\n", + " \"mathematics_discipline\": \"number theory\"\n", + "}\n" + ] + } + ], + "source": [ + "test_case = \"../test/test_files/agenteval-in-out/samples/sample_test_case.json\"\n", + "quantifier_output = get_quantifier(test_case,criteria_file)\n", + "print(\"actual correctness:\" , quantifier_output[\"actual_success\"])\n", + "print(\"predicted correctness:\\n\" , quantifier_output[\"estimated_performance\"])" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "2VtdM44WEGCS" + }, + "source": [ + "# Run `AgentEval` on the logs\n", + "\n", + "In the example below, `log_path` points to the folder of sample logs on which the quantifier is run. The current sample belongs to the prealgebra category and is downloaded from [here](https://github.com/julianakiseleva/autogen/tree/agenteval/test/test_files/agenteval-in-out/samples).\n", + "In case you want to replicate the results described in the blog post, you can download all the logs for math problems using the following [link](https://github.com/julianakiseleva/autogen/tree/agenteval/model-logs/math-problems/agentchat). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://github.com/julianakiseleva/autogen/blob/agenteval/test/test_files/agenteval-in-out/samples/prealgebra.zip \n", + "!unzip prealgebra.zip -d ../test/test_files/agenteval-in-out/agentchat_results\n", + "!rm prealgebra.zip " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "dZdIbHPFEGCS", + "outputId": "83c0a51b-f184-494b-81a0-d4b4a3667319" + }, + "outputs": [], + "source": [ + "# You can set your own log path - we also limited the number of samples to avoid additional costs.\n", + "# By removing the condition about limitations on the number of samples per category, you can run it on all 120 problems\n", + "\n", + "log_path = \"../test/test_files/agenteval-in-out/agentchat_results/\"\n", + "criteria_file = \"../test/test_files/agenteval-in-out/samples/sample_math_criteria.json\"\n", + "outcome = {}\n", + "\n", + "for prefix in os.listdir(log_path):\n", + " for file_name in os.listdir(log_path+\"/\"+prefix):\n", + " gameid = prefix+\"_\"+file_name\n", + " if file_name.split('.')[-1]=='json':\n", + " outcome[gameid] = get_quantifier(log_path+\"/\"+prefix+\"/\"+file_name,criteria_file)\n", + "\n", + "# store the evaluated 
problems\n", + "with open(\"../test/test_files/agenteval-in-out/evaluated_problems.json\",\"w\") as file:\n", + " json.dump(outcome,file,indent=2) # use `json.load` to do the reverse" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qbrRRiP_EGCT" + }, + "source": [ + "## Plotting the estimated performance" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here you can find an example of how to visualize the obtained results as a bar plot (similar to the one in the blog post)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "LKu2xZJcEGCT", + "outputId": "7780bc7c-382f-4ad3-b8c6-ac6051302303" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import scipy.stats as stats\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# computing the average and 95% confidence interval for failed and successful cases on all criteria\n", + "try:\n", + " # convert the criteria to dict type if it is not one already\n", + " dictionary_for_eval = eval(open(criteria_file,'r').read())\n", + "except:\n", + " pass\n", + "\n", + "criteria = list(dictionary_for_eval.keys())\n", + "nl2int = {}\n", + "for criterion in dictionary_for_eval:\n", + " score = 0\n", + " for v in dictionary_for_eval[criterion][\"accepted_values\"]:\n", + " nl2int[v] = score\n", + " score += 1\n", + "print(nl2int)\n", + "\n", + "average_s = {}\n", + "average_f = {}\n", + "\n", + "conf_interval_s = {}\n", + "conf_interval_f = {}\n", + "\n", + "for criterion in criteria:\n", + " task={\"s\": [] , \"f\" : []}\n", + "\n", + " for game in outcome:\n", + " try:\n", + " tmp_dic = eval(outcome[game][\"estimated_performance\"])\n", + " if outcome[game][\"actual_success\"] == \"false\":\n", + " task[\"f\"].append(nl2int[tmp_dic[criterion]])\n", + " else:\n", + " task[\"s\"].append(nl2int[tmp_dic[criterion]])\n", + " except:\n", + " pass\n", + "\n", + " \n", + " 
average_f[criterion] = 
np.mean(task['f'])\n", + " average_s[criterion] = np.mean(task['s'])\n", + "\n", + " conf_interval_s[criterion] = stats.norm.interval(0.95, loc=np.mean(task['s']), scale=stats.sem(task['s']))\n", + " conf_interval_f[criterion] = stats.norm.interval(0.95, loc=np.mean(task['f']), scale=stats.sem(task['f']))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The final plot would be saved in `../test/test_files/agenteval-in-out/estimated_performance.png`" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 695 + }, + "id": "zqa86vwgEGCT", + "outputId": "248cd0bc-0927-4d9f-b911-088bd76acf5d" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create a bar plot with error bars for the average values of \"s\" and \"f\" for each criterion\n", + "\n", + "plt.figure(figsize=(12, 8))\n", + "bar_width = 0.1\n", + "index = np.arange(len(criteria))\n", + "\n", + "\n", + "plt.bar(index , list(average_s.values()), bar_width, label=f\"success ({len(task['s'])} samples)\", color=\"darkblue\", yerr=[(avg - conf_interval_s[key][0]) for key, avg in average_s.items()], capsize=5)\n", + "plt.bar(index + bar_width, list(average_f.values()), bar_width, label=f\"failed ({len(task['f'])} samples)\", color=\"lightblue\", yerr=[(avg - conf_interval_f[key][0]) for key, avg in average_f.items()], capsize=5)\n", + "\n", + "plt.xlabel(\"Criteria\", fontsize=16)\n", + "plt.ylabel(\"Average Value\", fontsize=16)\n", + "plt.title(\"Average Values of 3 different baselines cases with 95% Confidence Intervals - math problems \", fontsize=12, pad=10) # Adjust titlepad to move the title further above\n", + "plt.xticks(index + bar_width / 2, criteria, rotation=45, fontsize=14)\n", + "plt.legend(loc=\"upper center\", fontsize=14, bbox_to_anchor=(0.5, 1), ncol=3) # Adjust legend placement and ncol\n", + "plt.tight_layout() # Adjust subplot parameters to fit the labels\n", + "plt.ylim(0,5)\n", + "plt.savefig(\"../test/test_files/agenteval-in-out/estimated_performance.png\")\n", + "plt.show()\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + }, + "vscode": { + "interpreter": { + "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1" + } + } + }, + "nbformat": 4, + 
"nbformat_minor": 0 +} diff --git a/test/test_files/agenteval-in-out/samples/sample_math_criteria.json b/test/test_files/agenteval-in-out/samples/sample_math_criteria.json new file mode 100644 index 000000000..e7ba16546 --- /dev/null +++ b/test/test_files/agenteval-in-out/samples/sample_math_criteria.json @@ -0,0 +1,26 @@ +{ + "Problem Interpretation": { + "description": "Ability to correctly interpret the problem.", + "accepted_values": ["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"] + }, + "Mathematical Methodology": { + "description": "Adequacy of the chosen mathematical or algorithmic methodology for the question", + "accepted_values": ["inappropriate", "barely adequate", "adequate", "mostly effective", "completely effective"] + }, + "Calculation Correctness": { + "description": "Accuracy of calculations made and solutions given", + "accepted_values": ["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"] + }, + "Explanation Clarity": { + "description": "Clarity and comprehensibility of explanations, including language use and structure", + "accepted_values": ["not at all clear", "slightly clear", "moderately clear", "very clear", "completely clear"] + }, + "Code Efficiency": { + "description": "Quality of code in terms of efficiency and elegance", + "accepted_values": ["not at all efficient", "slightly efficient", "moderately efficient", "very efficient", "extremely efficient"] + }, + "Code Correctness": { + "description": "Correctness of the provided code", + "accepted_values": ["completely incorrect", "mostly incorrect", "partly correct", "mostly correct", "completely correct"] + } + } diff --git a/test/test_files/agenteval-in-out/samples/sample_math_evaluated_results.json b/test/test_files/agenteval-in-out/samples/sample_math_evaluated_results.json new file mode 100644 index 000000000..453b8498c --- /dev/null +++ 
b/test/test_files/agenteval-in-out/samples/sample_math_evaluated_results.json @@ -0,0 +1,13 @@ +{ + "sample_algebra.json": { + "actual_success": "true", + "estimated_performance": { + "Problem Interpretation": "completely accurate", + "Mathematical Methodology": "completely effective", + "Calculation Correctness": "completely correct", + "Explanation Clarity": "very clear", + "Code Efficiency": "very efficient", + "Code Correctness": "completely correct" + } + } +} diff --git a/test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt b/test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt new file mode 100644 index 000000000..165448351 --- /dev/null +++ b/test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt @@ -0,0 +1,32 @@ +{ + "problem": "Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.", + "level": "Level 5", + "type": "Algebra", + "solution": "We have \\begin{align*} (2x+10)(x+3)&<(3x+9)(x+8) \\quad \\Rightarrow\n\\\\ 2(x+5)(x+3)&<3(x+3)(x+8) \\quad \\Rightarrow\n\\\\ 2(x+5)(x+3)-3(x+3)(x+8)&<0 \\quad \\Rightarrow\n\\\\ (2x+10-(3x+24))(x+3)&<0 \\quad \\Rightarrow\n\\\\ (-x-14)(x+3)&<0 \\quad \\Rightarrow\n\\\\ (x+14)(x+3)&>0.\n\\end{align*} This inequality is satisfied if and only if $(x+14)$ and $(x+3)$ are either both positive or both negative. Both factors are positive for $x>-3$ and both factors are negative for $x<-14$. When $-14 0$ and then find intervals where the inequality holds true.\n\nLet's first find the factorized form of the quadratic:", + "role": "assistant" + }, + { + "content": "", + "role": "user" + }, + { + "content": "I apologize for the confusion in my previous message. 
Let's find the factorized form of the quadratic using Python code.\n\n```python\nfrom sympy import symbols, factor\r\nx = symbols('x')\r\nfactorized = factor(x**2 + 17*x + 42)\r\nfactorized\r\n```\n\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.", + "role": "assistant" + } + ], + "time": 24.91333508491516, + "check_result": "the reply doesn't contain an answer.", + "is_correct": false, + "trial": -1 +} diff --git a/test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt b/test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt new file mode 100644 index 000000000..89e9ab7b1 --- /dev/null +++ b/test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt @@ -0,0 +1,24 @@ +{ + "problem": "What is the sum of all the distinct positive two-digit factors of 144?", + "level": "Level 5", + "type": "Number Theory", + "solution": "Prime factorize $144=2^4\\cdot3^2$. The sum of the positive two-digit factors of 144 is $2^4+2\\cdot3^2+2^2\\cdot3+2^2\\cdot3^2+2^3\\cdot3+2^3\\cdot3^2+2^4\\cdot3=\\boxed{226}.$", + "correct_ans": "226", + "problem_id": "0", + "response_with_ans": "To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. 
Then, we can sum these factors and print their sum.\n\nHere's a Python script to accomplish this:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n if 144 % i == 0:\n two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"The sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nPlease run this script to find the desired sum.", + "round": 0, + "messages": [ + { + "content": "What is the sum of all the distinct positive two-digit factors of 144?", + "role": "user" + }, + { + "content": "To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\n\nHere's a Python script to accomplish this:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n if 144 % i == 0:\n two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"The sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nPlease run this script to find the desired sum.", + "role": "assistant" + } + ], + "time": 11.140539407730103, + "check_result": "the answer is not directly provided in the reply, but i can extract it by running the given python code:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n if 144 % i == 0:\n two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"the sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nafter running this code, the result is:\nthe sum of all the distinct positive two-digit factors of 144 is: 226\n\nthe answer is 226. the ground truth answer is 226. 
therefore,\n\nthe answer is correct.", + "is_correct": true, + "trial": -1 +} diff --git a/test/test_files/agenteval-in-out/samples/sample_test_case.json b/test/test_files/agenteval-in-out/samples/sample_test_case.json new file mode 100644 index 000000000..08de40476 --- /dev/null +++ b/test/test_files/agenteval-in-out/samples/sample_test_case.json @@ -0,0 +1,24 @@ +{ + "problem": "Find $24^{-1} \\pmod{11^2}$. That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.", + "level": "Level 5", + "type": "Number Theory", + "solution": "Since $5 \\times 24 = 120 = 121 - 1$, it follows that $-5 \\times 24 \\equiv 1 \\pmod{121}$. Adding 121 to $-5$ to make it positive, we find $(-5 + 121) \\times 24 \\equiv 116 \\times 24 \\equiv 1 \\pmod{121}$, so it follows that the modular inverse of $24$ is $\\boxed{116}$ when taken modulo $121$.", + "correct_ans": "116", + "problem_id": "5", + "response_with_ans": "To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\n\n```python\ndef mod_inverse(a, m):\n g, x, _ = extended_gcd(a, m)\n if g != 1:\n raise Exception(f\"{a} and {m} are not coprime.\")\n return x % m\n\ndef extended_gcd(a, b):\n if a == 0:\n return b, 0, 1\n else:\n g, x, y = extended_gcd(b % a, a)\n return g, y - (b // a) * x, x\n```\n\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nI will execute the above code to find the modular inverse of 24 modulo 11^2.", + "round": 0, + "messages": [ + { + "content": "Find $24^{-1} \\pmod{11^2}$. 
That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.", + "role": "user" + }, + { + "content": "To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\n\n```python\ndef mod_inverse(a, m):\n g, x, _ = extended_gcd(a, m)\n if g != 1:\n raise Exception(f\"{a} and {m} are not coprime.\")\n return x % m\n\ndef extended_gcd(a, b):\n if a == 0:\n return b, 0, 1\n else:\n g, x, y = extended_gcd(b % a, a)\n return g, y - (b // a) * x, x\n```\n\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nI will execute the above code to find the modular inverse of 24 modulo 11^2.", + "role": "assistant" + } + ], + "time": 13.481226921081543, + "check_result": "let me verify the modular inverse of 24 modulo 11^2.\n\nafter executing the provided python code:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nthe answer is found to be: 116.\n\nthe answer is correct.", + "is_correct": true, + "trial": -1 +} diff --git a/website/blog/2023-11-20-AgentEval/img/agenteval-CQ.png b/website/blog/2023-11-20-AgentEval/img/agenteval-CQ.png new file mode 100644 index 000000000..efa6a21c0 Binary files /dev/null and b/website/blog/2023-11-20-AgentEval/img/agenteval-CQ.png differ diff --git a/website/blog/2023-11-20-AgentEval/img/math-problems-plot.png b/website/blog/2023-11-20-AgentEval/img/math-problems-plot.png new file mode 100644 index 000000000..88ce6b697 Binary files /dev/null and b/website/blog/2023-11-20-AgentEval/img/math-problems-plot.png differ diff --git a/website/blog/2023-11-20-AgentEval/img/tasks-taxonomy.png b/website/blog/2023-11-20-AgentEval/img/tasks-taxonomy.png new file mode 100644 index 000000000..743c690eb Binary files /dev/null and b/website/blog/2023-11-20-AgentEval/img/tasks-taxonomy.png 
differ diff --git a/website/blog/2023-11-20-AgentEval/index.mdx b/website/blog/2023-11-20-AgentEval/index.mdx new file mode 100644 index 000000000..97d0b010a --- /dev/null +++ b/website/blog/2023-11-20-AgentEval/index.mdx @@ -0,0 +1,151 @@ +--- +title: How to Assess Utility of LLM-powered Applications? +authors: + - julianakiseleva + - narabzad +tags: [LLM, GPT, evaluation, task utility] +--- + + +![Fig.1: A verification framework](img/agenteval-CQ.png) + +

Fig. 1 illustrates the general flow of AgentEval.

+ +**TL;DR:** +* As a developer of an LLM-powered application, how can you assess the utility it brings to end users while helping them with their tasks? +* To shed light on the question above, we introduce `AgentEval`, the first version of a framework to assess the utility of any LLM-powered application crafted to assist users in specific tasks. AgentEval aims to simplify the evaluation process by automatically proposing a set of criteria tailored to the unique purpose of your application. This allows for a comprehensive assessment, quantifying the utility of your application against the suggested criteria. +* We demonstrate how `AgentEval` works using a [math problems dataset](https://microsoft.github.io/autogen/blog/2023/06/28/MathChat) as an example in the [following notebook](https://github.com/microsoft/autogen/blob/main/notebook/agenteval_cq_math.ipynb). Any feedback would be useful for future development. Please contact us on our [Discord](http://aka.ms/autogen-dc). + + +## Introduction + + AutoGen aims to simplify the development of LLM-powered multi-agent systems for various applications, ultimately making end users' lives easier by assisting with their tasks. Naturally, we all yearn to understand how our developed systems perform, their utility for users, and, perhaps most crucially, how we can enhance them. Directly evaluating multi-agent systems poses challenges, as current approaches predominantly rely on success metrics: essentially, whether the agent accomplishes tasks. However, comprehending user interaction with a system involves far more than success alone. Take math problems, for instance; it's not merely about the agent solving the problem. Equally significant is its ability to convey solutions based on various criteria, including completeness, conciseness, and the clarity of the provided explanation. Furthermore, success isn't always clearly defined for every task. 
+ + Rapid advances in LLMs and multi-agent systems have brought forth many emerging capabilities that we're keen on translating into tangible utilities for end users. We introduce the first version of the `AgentEval` framework, a tool crafted to empower developers in swiftly gauging the utility of LLM-powered applications designed to help end users accomplish the desired task. + + +![Fig.2: An overview of the tasks taxonomy](img/tasks-taxonomy.png) +

Fig. 2 provides an overview of the tasks taxonomy

+ + +Let's first look at an overview of the suggested taxonomy of tasks that a multi-agent system can be designed for. In general, the tasks can be split into two types: +* _Success is not clearly defined_ - refers to instances when users utilize a system in an assistive manner, seeking suggestions rather than expecting the system to solve the task. For example, a user might request the system to generate an email. In many cases, this generated content serves as a template that the user will later edit. However, defining success precisely for such tasks is relatively complex. +* _Success is clearly defined_ - refers to instances where we can clearly define whether a system solved the task or not. Consider agents that assist in accomplishing household tasks, where the definition of success is clear and measurable. This category can be further divided into two separate subcategories: + * _The optimal solution exists_ - these are tasks where only one solution is possible. For example, if you ask your assistant to turn on the light, the success of this task is clearly defined, and there is only one way to accomplish it. + * _Multiple solutions exist_ - increasingly, we observe situations where multiple trajectories of agent behavior can lead to either success or failure. In such cases, it is crucial to differentiate between the various successful and unsuccessful trajectories. For example, when you ask the agent to suggest a food recipe or tell you a joke. + +In our `AgentEval` framework, we are currently focusing on tasks where _Success is clearly defined_. Next, we will introduce the suggested framework. + +## `AgentEval` Framework + +Our previous research on [assistive agents in Minecraft](https://github.com/microsoft/iglu-datasets) suggested that the best way to obtain human judgments is to present humans with two agents side by side and ask for preferences. 
In this setup of pairwise comparison, humans can develop criteria to explain why they prefer the behavior of one agent over another. For instance, _'the first agent was faster in execution,'_ or _'the second agent moves more naturally.'_ So, the comparative nature led humans to come up with a list of criteria that helps to infer the utility of the task. With this idea in mind, we designed `AgentEval` (shown in Fig. 1), where we employ LLMs to help us understand, verify, and assess task *utility* for the multi-agent system. Namely: + +* The goal of `CriticAgent` is to suggest the list of criteria (Fig. 1) that can be used to assess task utility. This is an example of how `CriticAgent` is defined using `AutoGen`: + +```python +critic = autogen.AssistantAgent( + name="critic", + llm_config={"config_list": config_list}, + system_message="""You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be distinguishable, quantifiable, and not redundant. + Convert the evaluation criteria into a dictionary where the keys are the criteria. + The value of each key is a dictionary as follows {"description": criteria description, "accepted_values": possible accepted inputs for this key} + Make sure the keys are criteria for assessing the given task. "accepted_values" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels. "description" includes the criterion description. + Return only the dictionary.""" +) +``` + +Next, the critic is given successful and failed examples of the task execution; then, it is able to return a list of criteria (Fig. 1). For reference, use the [following notebook](https://github.com/microsoft/autogen/blob/main/notebook/agenteval_cq_math.ipynb). + +* The goal of `QuantifierAgent` is to quantify each of the suggested criteria (Fig. 1), providing us with an idea of the utility of this system for the given task. 
Here is an example of how it can be defined: + +```python +quantifier = autogen.AssistantAgent( + name="quantifier", + llm_config={"config_list": config_list}, + system_message="""You are a helpful assistant. You quantify the output of different tasks based on the given criteria. + The criterion is given in a dictionary format where each key is a distinct criteria. + The value of each key is a dictionary as follows {"description": criteria description , "accepted_values": possible accepted inputs for this key} + You are going to quantify each of the criteria for a given task based on the task description. + Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criteria. + Return only the dictionary.""" +) +``` + +## `AgentEval` Results based on Math Problems Dataset + + As an example, after running `CriticAgent`, we obtained the following criteria to verify the results for the math problems dataset: + +| Criteria | Description | Accepted Values| +|-----------|-----|----------------| +| Problem Interpretation | Ability to correctly interpret the problem | ["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"]| +| Mathematical Methodology | Adequacy of the chosen mathematical or algorithmic methodology for the question | ["inappropriate", "barely adequate", "adequate", "mostly effective", "completely effective"] | +| Calculation Correctness | Accuracy of calculations made and solutions given | ["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"] | +| Explanation Clarity | Clarity and comprehensibility of explanations, including language use and structure | ["not at all clear", "slightly clear", "moderately clear", "very clear", "completely clear"] | +| Code Efficiency | Quality of code in terms of efficiency and elegance |["not at all efficient", "slightly efficient", "moderately efficient", "very efficient", 
"extremely efficient"] | +| Code Correctness | Correctness of the provided code | ["completely incorrect", "mostly incorrect", "partly correct", "mostly correct", "completely correct"] | + + +Then, after running `QuantifierAgent`, we obtained the results presented in Fig. 3, where you can see three models: +* AgentChat +* ReAct +* GPT-4 Vanilla Solver + +Lighter colors represent estimates for failed cases, and darker colors represent estimates for successful cases. + +![Fig.3: Results based on the overall math problems dataset; `_s` stands for successful cases, `_f` stands for failed cases](img/math-problems-plot.png) +

Fig. 3 presents results based on the overall math problems dataset; `_s` stands for successful cases and `_f` stands for failed cases.

+ +We note that while applying `AgentEval` to math problems, the agent was not exposed to any ground truth information about the problem. As such, this figure illustrates the estimated performance of the three different agents, namely, AutoGen (blue), GPT-4 (red), and ReAct (green). Comparing the performance of any of the three agents in successful cases (dark bars of any color) versus unsuccessful cases (lighter version of the same bar), we observe that AgentEval assigned higher quantification to successful cases than to failed ones. This observation supports AgentEval's ability to predict task utility. Additionally, AgentEval allows us to go beyond just a binary definition of success, enabling a more in-depth comparison between successful and failed cases. + +It's important not only to identify what is not working but also to recognize what went well and why. + +## Limitations and Future Work +The current implementation of `AgentEval` has a number of limitations, which we are planning to overcome in the future: +* The list of criteria varies per run (unless you store a seed). We recommend running `CriticAgent` at least two times and picking the criteria you think are important for your domain. +* The results of the `QuantifierAgent` can vary with each run, so we recommend conducting multiple runs to observe the extent of result variations. + +To mitigate the limitations mentioned above, we are working on `VerifierAgent`, whose goal is to stabilize the results and provide additional explanations. + +## Summary +`CriticAgent` and `QuantifierAgent` can be applied to the logs of any type of application, providing you with an in-depth understanding of the utility your solution brings to the user for a given task. + +We would love to hear about how AgentEval works for your application. Any feedback would be useful for future development. Please contact us on our [Discord](http://aka.ms/autogen-dc). 
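As a rough illustration of the multi-run recommendation in the limitations above, the sketch below maps the labels returned by several quantifier runs onto an ordinal scale and reports their mean with a normal-approximation 95% interval, in the same spirit as the notebook's plotting cell. The helper names (`to_scores`, `mean_with_interval`) and the five run labels are made up for this example and are not part of AutoGen or `AgentEval`; only the accepted values come from the criteria table.

```python
from statistics import NormalDist, mean, stdev

# Ordered accepted values for "Explanation Clarity", copied from the criteria table.
ACCEPTED_VALUES = [
    "not at all clear",
    "slightly clear",
    "moderately clear",
    "very clear",
    "completely clear",
]


def to_scores(labels, accepted_values):
    """Map each label to its index in the ordered accepted_values list."""
    rank = {label: i for i, label in enumerate(accepted_values)}
    return [rank[label] for label in labels]


def mean_with_interval(scores, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variation")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    sem = stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    return mean(scores), z * sem


# Hypothetical labels from five quantifier runs on the same log (illustrative data only).
runs = ["very clear", "very clear", "completely clear", "moderately clear", "very clear"]
scores = to_scores(runs, ACCEPTED_VALUES)
avg, half_width = mean_with_interval(scores)
print(f"mean={avg:.2f} +/- {half_width:.2f}")
```

A wide interval relative to the five-point scale would suggest that the quantifier's judgment for that criterion is unstable and that more runs, or a tighter criterion description, may be needed.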
+ + +## Previous Research + +``` +@InProceedings{pmlr-v176-kiseleva22a, + title = "Interactive Grounded Language Understanding in a Collaborative Environment: IGLU 2021", + author = "Kiseleva, Julia and Li, Ziming and Aliannejadi, Mohammad and Mohanty, Shrestha and ter Hoeve, Maartje and Burtsev, Mikhail and Skrynnik, Alexey and Zholus, Artem and Panov, Aleksandr and Srinet, Kavya and Szlam, Arthur and Sun, Yuxuan and Hofmann, Katja and C{\^o}t{\'e}, Marc-Alexandre and Awadallah, Ahmed and Abdrazakov, Linar and Churin, Igor and Manggala, Putra and Naszadi, Kata and van der Meer, Michiel and Kim, Taewoon", + booktitle = "Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track", + pages = "146--161", + year = 2022, + editor = "Kiela, Douwe and Ciccone, Marco and Caputo, Barbara", + volume = 176, + series = "Proceedings of Machine Learning Research", + month = "06--14 Dec", + publisher = "PMLR", + pdf = {https://proceedings.mlr.press/v176/kiseleva22a/kiseleva22a.pdf}, + url = {https://proceedings.mlr.press/v176/kiseleva22a.html} 
+} +``` + + +``` +@InProceedings{pmlr-v220-kiseleva22a, + title = "Interactive Grounded Language Understanding in a Collaborative Environment: Retrospective on Iglu 2022 Competition", + author = "Kiseleva, Julia and Skrynnik, Alexey and Zholus, Artem and Mohanty, Shrestha and Arabzadeh, Negar and C\^{o}t\'e, Marc-Alexandre and Aliannejadi, Mohammad and Teruel, Milagro and Li, Ziming and Burtsev, Mikhail and ter Hoeve, Maartje and Volovikova, Zoya and Panov, Aleksandr and Sun, Yuxuan and Srinet, Kavya and Szlam, Arthur and Awadallah, Ahmed and Rho, Seungeun and Kwon, Taehwan and Wontae Nam, Daniel and Bivort Haiek, Felipe and Zhang, Edwin and Abdrazakov, Linar and Qingyam, Guo and Zhang, Jason and Guo, Zhibin", + booktitle = "Proceedings of the NeurIPS 2022 Competitions Track", + pages = "204--216", + year = 2022, + editor = "Ciccone, Marco and Stolovitzky, Gustavo and Albrecht, Jacob", + volume = 220, + series = "Proceedings of Machine Learning Research", + month = "28 Nov--09 Dec", + publisher = "PMLR", + pdf = "https://proceedings.mlr.press/v220/kiseleva22a/kiseleva22a.pdf", + url = "https://proceedings.mlr.press/v220/kiseleva22a.html" 
+} +``` diff --git a/website/blog/authors.yml b/website/blog/authors.yml index 67ef6ba9c..a487086cb 100644 --- a/website/blog/authors.yml +++ b/website/blog/authors.yml @@ -1,77 +1,89 @@ -sonichi: - name: Chi Wang - title: Principal Researcher at Microsoft Research - url: https://www.linkedin.com/in/chi-wang-49b15b16/ - image_url: https://github.com/sonichi.png - -qingyunwu: - name: Qingyun Wu - title: Assistant Professor at the Pennsylvania State University - url: https://qingyun-wu.github.io/ - image_url: https://github.com/qingyun-wu.png - -yiranwu: - name: Yiran Wu - title: PhD student at Pennsylvania State University - url: https://github.com/kevin666aa - image_url: https://github.com/kevin666aa.png - -jialeliu: - name: Jiale Liu - title: Undergraduate student at Xidian University - url: https://leoljl.github.io - image_url: https://github.com/LeoLjl/leoljl.github.io/blob/main/profile.jpg?raw=true - -thinkall: - name: Li Jiang - title: Senior Software Engineer at Microsoft - url: https://github.com/thinkall - image_url: https://github.com/thinkall.png - -rickyloynd-microsoft: - name: Ricky Loynd - title: Senior Research Engineer at Microsoft - url: https://github.com/rickyloynd-microsoft - image_url: https://github.com/rickyloynd-microsoft.png - -samershi: - name: Saleema Amershi - title: Senior Principal Research Manager at Microsoft Research - url: https://github.com/samershi - image_url: https://github.com/samershi.png - -pcdeadeasy: - name: Piali Choudhury - title: Principal RSDE at Microsoft Research - url: https://github.com/pcdeadeasy - image_url: https://github.com/pcdeadeasy.png - -victordibia: - name: Victor Dibia - title: Principal RSDE at Microsoft Research - url: https://github.com/victordibia - image_url: https://github.com/victordibia.png - -afourney: - name: Adam Fourney - title: Principal Researcher Microsoft Research - url: https://github.com/afourney - image_url: https://github.com/afourney.png - -beibinli: - name: Beibin Li - title: Senior 
Research Engineer at Microsoft - url: https://github.com/beibinli - image_url: https://github.com/beibinli.png - -gagb: - name: Gagan Bansal - title: Senior Researcher at Microsoft Research - url: https://www.linkedin.com/in/gagan-bansal/ - image_url: https://github.com/gagb.png - -jieyuz2: - name: Jieyu Zhang - title: PhD student at University of Washington - url: https://jieyuz2.github.io/ - image_url: https://github.com/jieyuz2.png +sonichi: + name: Chi Wang + title: Principal Researcher at Microsoft Research + url: https://www.linkedin.com/in/chi-wang-49b15b16/ + image_url: https://github.com/sonichi.png + +qingyunwu: + name: Qingyun Wu + title: Assistant Professor at the Pennsylvania State University + url: https://qingyun-wu.github.io/ + image_url: https://github.com/qingyun-wu.png + +yiranwu: + name: Yiran Wu + title: PhD student at Pennsylvania State University + url: https://github.com/kevin666aa + image_url: https://github.com/kevin666aa.png + +jialeliu: + name: Jiale Liu + title: Undergraduate student at Xidian University + url: https://leoljl.github.io + image_url: https://github.com/LeoLjl/leoljl.github.io/blob/main/profile.jpg?raw=true + +thinkall: + name: Li Jiang + title: Senior Software Engineer at Microsoft + url: https://github.com/thinkall + image_url: https://github.com/thinkall.png + +rickyloynd-microsoft: + name: Ricky Loynd + title: Senior Research Engineer at Microsoft + url: https://github.com/rickyloynd-microsoft + image_url: https://github.com/rickyloynd-microsoft.png + +samershi: + name: Saleema Amershi + title: Senior Principal Research Manager at Microsoft Research + url: https://github.com/samershi + image_url: https://github.com/samershi.png + +pcdeadeasy: + name: Piali Choudhury + title: Principal RSDE at Microsoft Research + url: https://github.com/pcdeadeasy + image_url: https://github.com/pcdeadeasy.png + +victordibia: + name: Victor Dibia + title: Principal RSDE at Microsoft Research + url: https://github.com/victordibia + 
image_url: https://github.com/victordibia.png + +afourney: + name: Adam Fourney + title: Principal Researcher Microsoft Research + url: https://github.com/afourney + image_url: https://github.com/afourney.png + +beibinli: + name: Beibin Li + title: Senior Research Engineer at Microsoft + url: https://github.com/beibinli + image_url: https://github.com/beibinli.png + +gagb: + name: Gagan Bansal + title: Senior Researcher at Microsoft Research + url: https://www.linkedin.com/in/gagan-bansal/ + image_url: https://github.com/gagb.png + +jieyuz2: + name: Jieyu Zhang + title: PhD student at University of Washington + url: https://jieyuz2.github.io/ + image_url: https://github.com/jieyuz2.png + +julianakiseleva: + name: Julia Kiseleva + title: Senior Researcher at Microsoft Research + url: https://github.com/julianakiseleva/ + image_url: https://avatars.githubusercontent.com/u/5908392?v=4 + +narabzad: + name: Negar Arabzadeh + title: PhD student at the University of Waterloo + url: https://www.negara.me/ + image_url: https://github.com/Narabzad.png