mirror of https://github.com/microsoft/autogen.git synced 2026-01-20 19:17:36 +00:00

History

* fix doc on distributed runtime

* Fix references

* Update references

* Fix import paths in user guide notebooks for code executor components

2024-12-06 01:23:05 -08:00

Scripts

Adding Benchmarks to agbench (#3803 )

2024-10-18 06:33:33 +02:00

Templates/MagenticOne

fix docs (#4589 )

2024-12-06 01:23:05 -08:00

.gitignore

Adding Benchmarks to agbench (#3803 )

2024-10-18 06:33:33 +02:00

ENV.json.sample

Adding Benchmarks to agbench (#3803 )

2024-10-18 06:33:33 +02:00

README.md

Adding Benchmarks to agbench (#3803 )

2024-10-18 06:33:33 +02:00

README.md

AssistantBench Benchmark

This scenario implements the AssistantBench agent benchmark. Before you begin, make sure you have followed the instructions in ../README.md to prepare your environment. We modify the evaluation code from AssistantBench in Scripts and retain the license including it here LICENSE. Please find the original AssistantBench evaluation code here https://huggingface.co/spaces/AssistantBench/leaderboard/tree/main/evaluation.

Setup Environment Variables for AgBench

Navigate to AssistantBench

cd benchmarks/AssistantBench

Create a file called ENV.json with the following (required) contents (If you're using MagenticOne)

{
    "BING_API_KEY": "REPLACE_WITH_YOUR_BING_API_KEY",
    "HOMEPAGE": "https://www.bing.com/",
    "WEB_SURFER_DEBUG_DIR": "/autogen/debug",
    "CHAT_COMPLETION_KWARGS_JSON": "{\"api_version\": \"2024-02-15-preview\", \"azure_endpoint\": \"YOUR_ENDPOINT/\", \"model_capabilities\": {\"function_calling\": true, \"json_output\": true, \"vision\": true}, \"azure_ad_token_provider\": \"DEFAULT\", \"model\": \"gpt-4o-2024-05-13\"}",
    "CHAT_COMPLETION_PROVIDER": "azure"
}

You can also use the openai client by replacing the last two entries in the ENV file by:

CHAT_COMPLETION_PROVIDER='openai'
CHAT_COMPLETION_KWARGS_JSON with the following JSON structure:

{
  "api_key": "REPLACE_WITH_YOUR_API",
  "model": "gpt-4o-2024-05-13"
}

Now initialize the tasks.

python Scripts/init_tasks.py

Note: This will attempt to download AssistantBench from Huggingface, but this requires authentication.

After running the script, you should see the new following folders and files:

.
./Downloads
./Downloads/AssistantBench
./Downloads/AssistantBench/assistant_bench_v1.0_dev.jsonl
./Downloads/AssistantBench/assistant_bench_v1.0_dev.jsonl
./Tasks
./Tasks/assistant_bench_v1.0_dev.jsonl
./Tasks/assistant_bench_v1.0_dev.jsonl

Then run Scripts/init_tasks.py again.

Once the script completes, you should now see a folder in your current directory called Tasks that contains one JSONL file per template in Templates.

Running AssistantBench

Now to run a specific subset of AssistantBench use:

agbench run Tasks/assistant_bench_v1.0_dev__MagenticOne.jsonl

You should see the command line print the raw logs that shows the agents in action To see a summary of the results (e.g., task completion rates), in a new terminal run the following:

agbench tabulate Results/assistant_bench_v1.0_dev__MagenticOne

References

Yoran, Ori, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?." arXiv preprint arXiv:2407.15711 (2024). https://arxiv.org/abs/2407.15711