mirror of
https://github.com/microsoft/autogen.git
synced 2025-08-12 10:41:13 +00:00

GAIA Benchmark
This scenario implements the GAIA agent benchmark.
Running the TwoAgents tasks
Level 1 tasks:
autogenbench run Tasks/gaia_test_level_1__two_agents.jsonl
autogenbench tabulate Results/gaia_test_level_1__two_agents
Level 2 and 3 tasks are executed similarly.
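The run-then-tabulate pattern above can be wrapped in a small shell helper. This is only a sketch: run_levels and DRY_RUN are hypothetical names, not part of autogenbench, and the glob assumes that the task files for the other levels follow the same naming pattern as the level 1 file and that results land in Results/ under the task file's base name, as in the level 1 example.

```shell
# Hypothetical helper (not part of autogenbench): run and tabulate every
# GAIA level whose task file matches the naming pattern shown above.
# Usage: run_levels two_agents
# With DRY_RUN set, the commands are printed instead of executed.
run_levels() {
    suffix="$1"
    for tasks in Tasks/gaia_test_level_*__"$suffix".jsonl; do
        # Skip the unexpanded pattern when no task file matches.
        [ -e "$tasks" ] || continue
        # Assumption: the results directory mirrors the task file name.
        name=$(basename "$tasks" .jsonl)
        ${DRY_RUN:+echo} autogenbench run "$tasks"
        ${DRY_RUN:+echo} autogenbench tabulate "Results/$name"
    done
}
```

Calling it as DRY_RUN=1 run_levels two_agents first is a cheap way to check which files the glob picks up before committing to a full run.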
Running the SocietyOfMind tasks
Running the SocietyOfMind tasks is similar to running the TwoAgents tasks, but requires an ENV.json file containing a working Bing API key. Place this file in the working directory from which you run autogenbench. It should have at least the following contents:
{
    "BING_API_KEY": "Your_API_key"
}
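If you prefer not to type the file by hand, it can be generated from the shell. The sketch below assumes the key is exported in an environment variable named BING_API_KEY (a convention chosen here for illustration, not something autogenbench requires) and that the key contains no quotes or backslashes that would need JSON escaping. If the variable is unset, the placeholder from the example above is written instead.

```shell
# Sketch: write ENV.json into the current working directory.
# Assumption: the real key is exported as BING_API_KEY; otherwise the
# placeholder value from the example above is used.
key="${BING_API_KEY:-Your_API_key}"
printf '{\n    "BING_API_KEY": "%s"\n}\n' "$key" > ENV.json

# Show the result so the contents can be checked before running the tasks.
cat ENV.json
```

Run it from the same directory in which you will invoke autogenbench, since that is where the file is expected.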
Once the file is in place, run:
autogenbench run Tasks/gaia_test_level_1__soc.jsonl
autogenbench tabulate Results/gaia_test_level_1__soc
Levels 2 and 3 are run the same way.
References
GAIA: a benchmark for General AI Assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
https://arxiv.org/abs/2311.12983