This scenario implements the [GAIA](https://arxiv.org/abs/2311.12983) agent benchmark. Before you begin, make sure you have followed the instructions in `../README.md` to prepare your environment.
Note: This will attempt to download GAIA from Hugging Face, which requires authentication.
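For example, you can log in ahead of time with the Hugging Face CLI (installed as part of the `huggingface_hub` package) so the download can authenticate:

```bash
# Prompts for an access token created at https://huggingface.co/settings/tokens
huggingface-cli login
```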
The resulting folder structure should look like this:
```
.
./Downloads
./Downloads/GAIA
./Downloads/GAIA/2023
./Downloads/GAIA/2023/test
./Downloads/GAIA/2023/validation
./Scripts
./Templates
./Templates/TeamOne
```
Then run `Scripts/init_tasks.py` again.
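For example, from this scenario's directory (assuming `python` points to the environment you prepared earlier):

```bash
python Scripts/init_tasks.py
```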
Once the script completes, you should see a new folder in your current directory called `Tasks` that contains one JSONL file per template in `Templates`.
### Running GAIA
Now, to run a specific subset of GAIA, use:
```bash
agbench run Tasks/gaia_validation_level_1__MagenticOne.jsonl
```
You should see the command line print the raw logs showing the agents in action. To see a summary of the results (e.g., task completion rates), open a new terminal and run the following:
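A typical invocation (assuming results are written to a `Results` directory named after the task file; adjust the path to match your run) is:

```bash
# Summarizes completion rates across the logged runs for this task set
agbench tabulate Results/gaia_validation_level_1__MagenticOne
```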