MATH Benchmark

This scenario implements the MATH benchmark.

Running the tasks

autogenbench run Tasks/math_two_agents.jsonl
autogenbench tabulate Results/math_two_agents

By default, only a small subset (17 of the 5,000) of MATH problems is exposed. Edit Scripts/init_tasks.py to expose more tasks.
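
As a quick way to see which problems are currently exposed, the following sketch reads the generated task file. It assumes each JSONL line carries an "id" field; the actual task format may differ.

import json

# Inspect which MATH problems are currently exposed in the task file.
with open("Tasks/math_two_agents.jsonl") as f:
    tasks = [json.loads(line) for line in f if line.strip()]
print(f"{len(tasks)} tasks exposed")
for task in tasks:
    # "id" is an assumed field name; fall back to printing the whole record.
    print(task.get("id", task))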

Note on automated evaluation

In this scenario, we adopted an automated evaluation pipeline (from AutoGen evaluation) that uses an LLM to compare the results. Thus, the metric above is only an estimate of the agent's performance on math problems. We also found a similar practice of using an LLM as a judge for the MATH dataset in the Cumulative Reasoning paper (code).

The static checking from the MATH dataset requires an exact match (for example, comparing "2.0" and "2" results in False). We haven't found an established way to compare answers accurately, so human involvement is still needed to confirm the results. In AutoGen, the conversation ends at “TERMINATE” by default. To enable automated answer extraction and evaluation, we prompt an LLM with (1) the given problem, (2) the ground-truth answer, and (3) the last response from the solver, asking it to extract the answer and compare it with the ground truth.
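
The sketch below illustrates this extraction-and-judging step. It is a minimal sketch assuming the OpenAI Python client and the gpt-4-0613 judge model mentioned below; the prompt wording is illustrative and not the exact prompt used by the pipeline.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are given a math problem, its ground-truth answer, and the last message from a solver agent.
1. Extract the final answer from the solver's message.
2. Reply "1" if it is mathematically equivalent to the ground truth (e.g., "2.0" and "2" are equivalent), otherwise reply "0".

Problem: {problem}
Ground-truth answer: {ground_truth}
Solver's last message: {last_response}"""

def judge(problem: str, ground_truth: str, last_response: str) -> bool:
    # Ask the judge model to extract the answer and compare it with the ground truth.
    reply = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            problem=problem, ground_truth=ground_truth, last_response=last_response
        )}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().startswith("1")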

We evaluated the 17 problems 3 times and went through the answers manually. Compared with the automated evaluation (using gpt-4-0613 as the judge), we found that in 2 of the 3 trials the automated evaluation marked 1 correct answer as wrong (a False Negative), meaning 49 of the 51 problems were evaluated correctly. We also went through 200 randomly sampled problems from the whole dataset to check the results, and found 1 False Negative and 2 False Positives.

We note that False Positives are also possible, due to LLM hallucination and the variety of the problems.

References

Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt
https://arxiv.org/abs/2103.03874

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang and Chi Wang
https://arxiv.org/abs/2308.08155

Cumulative Reasoning with Large Language Models
Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao
https://arxiv.org/abs/2308.04371