Added a simple Testbed tool for repeatedly running templated Autogen scenarios with tightly-controlled initial conditions. (#455)

* Initial commit of the autogen testbed environment.

* Fixed some typos in the Testbed README.md

* Added some stricter termination logic to the two_agent scenario, and switched the logo task from finding Autogen's logo to finding Microsoft's (it's easier)

* Added documentation to testbed code in preparation for PR

* Added a variation of HumanEval to the Testbed. It is also a reasonable example of how to integrate other benchmarks.

* Removed ChatCompletion.start_logging and related features. Added an explicit TERMINATE output to HumanEval to save 1 turn in each conversation.

* Added metrics utils script for HumanEval

* Updated the requirements in the README.

* Added documentation for HumanEval csv schemas

* Standardized on how the OAI_CONFIG_LIST is handled.

* Removed dot-slash from 'includes' path for cross-platform compatibility

* Missed a file.

* Updated readme to include known-working versions.
afourney 2023-11-04 03:38:43 -07:00 committed by GitHub
parent c4f8b1c761
commit 1c4a5e6a1a
10 changed files with 920 additions and 0 deletions


@@ -0,0 +1,151 @@
# Autogen Testbed Environment
The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts (see the HumanEval example later in this README). By default, all runs are conducted in freshly-initialized Docker containers, providing the recommended level of consistency and safety.

This Testbed sample has been tested in, and is known to work with, Autogen versions 0.1.14 and 0.2.0b1.
## Setup
Before you begin, you must configure your API keys for use with the Testbed. As with other Autogen applications, the Testbed will look for OpenAI keys in a file in the current working directory, or in an environment variable, named OAI_CONFIG_LIST. This can be overridden using a command-line parameter described later.
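For reference, an OAI_CONFIG_LIST is a JSON list of model configurations. A minimal example (with placeholder keys; adapt the entries to the models you plan to use) might look like:
```
[
    { "model": "gpt-4", "api_key": "<your OpenAI API key>" },
    { "model": "gpt-3.5-turbo-16k", "api_key": "<your OpenAI API key>" }
]
```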
For some scenarios, additional keys may be required (e.g., keys for the Bing Search API). These can be added to an `ENV` file in the `includes` folder. A sample has been provided in ``includes/ENV.example``. Edit ``includes/ENV`` as needed.
The Testbed also requires Docker (Desktop or Engine) and the __python docker__ library. **It will not run in codespaces**, unless you opt for native execution (which is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/). To install the Python library:
``pip install docker``
## Running the Testbed
To run the Testbed, simply execute
``python run_scenarios.py``
The default is to repeat each scenario 10 times. This can be costly. To run each scenario only once, use:
``python run_scenarios.py --repeat 1``
The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options:
```
run_scenarios.py will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.

positional arguments:
  scenario              The JSONL scenario file to run. If a directory is specified,
                        then all JSONL scenarios in the directory are run. (default:
                        ./scenarios)

options:
  -h, --help            show this help message and exit
  -r REPEAT, --repeat REPEAT
                        The number of repetitions to run for each scenario (default: 10).
  -c CONFIG, --config CONFIG
                        The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST).
  --native              Run the scenarios natively rather than in docker.
                        NOTE: This is not advisable, and should be done with great caution.
```
## Results
By default, the Testbed stores results in a folder hierarchy with the following template:
``./results/[scenario]/[instance_id]/[repetition]``
For example, consider the following folders:
``./results/default_two_agents/two_agent_stocks_gpt4/0``
``./results/default_two_agents/two_agent_stocks_gpt4/1``
...
``./results/default_two_agents/two_agent_stocks_gpt4/9``
These folders hold the results for the ``two_agent_stocks_gpt4`` instance of the ``default_two_agents`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on. You can think of the _instance_ as mapping to a prompt, or a unique set of parameters, while the _scenario_ defines the template in which those parameters are input.
Within each folder, you will find the following files:
- *timestamp.txt*: records the date and time of the run, along with the version of the pyautogen library installed
- *console_log.txt*: all console output produced by Docker when running autogen. Read this like you would a regular console.
- *chat_completions.json*: a log of all OpenAI ChatCompletions, as logged by ``autogen.ChatCompletion.start_logging(compact=False)``
- *[agent]_messages.json*: for each Agent, a log of their messages dictionaries
- *./coding*: A directory containing all code written by Autogen, and all artifacts produced by that code.
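As an illustration of how these artifacts can be consumed downstream, each ``[agent]_messages.json`` file is a JSON object mapping the name of each peer agent to the list of message dictionaries exchanged with it, so it can be loaded directly. The following is a minimal, hypothetical sketch (the path and analysis are illustrative, not part of the Testbed):
```
import json
import os

# Hypothetical path to one run's results (see the folder template above)
run_dir = "./results/default_two_agents/two_agent_stocks_gpt4/0"

# Load the assistant's message log and count the messages exchanged with each peer
with open(os.path.join(run_dir, "assistant_messages.json"), "rt") as fh:
    messages = json.load(fh)

for peer, message_list in messages.items():
    print(peer, "-", len(message_list), "messages")
```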
## Scenario Templating
All scenarios are stored in JSONL files in the ``./scenarios`` directory. Each line of a scenario file is a JSON object with the following schema:
```
{
    "id": string,
    "template": filename,
    "values": {
        "field_name1": string,
        "field_name2": string,
        ...
        "field_nameN": string
    }
}
```
For example:
```
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents.py",
    "values": {
        "__MODEL__": "gpt-4",
        "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
    }
}
```
Here, ``id`` is the instance id used when saving results, ``template`` points to a Python file that contains the scenario logic, and ``values`` contains a set of strings to find and replace when expanding the template.
An example templated python file is:
```
from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
import os
import json
import testbed_utils

testbed_utils.init()
##############################

config_list = config_list_from_json(
    "OAI_CONFIG_LIST", filter_dict={"model": ["__MODEL__"]},
)

assistant = AssistantAgent("assistant", llm_config={
    "request_timeout": 180,
    "config_list": config_list}
)
user_proxy = UserProxyAgent("user_proxy",
    human_input_mode="NEVER",
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10)
user_proxy.initiate_chat(assistant, message="__PROMPT__")

##############################
testbed_utils.finalize(agents=[assistant, user_proxy])
```
## (Example) Running HumanEval
One sample Testbed scenario type is a variation of the classic [HumanEval](https://github.com/openai/human-eval) benchmark. In this scenario, agents are given access to the unit test results, and are able to continue to debug their code until the problem is solved or they run out of tokens or turns. We can then count how many turns it took to solve the problem (returning -1 if the problem remains unsolved by the end of the conversation, and "" if the run is missing).
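Concretely, the scoring follows the logic in ``utils/collate_human_eval.py`` (included later in this commit); a simplified sketch of how a single run's ``console_log.txt`` is scored is:
```
import os

def score_run(instance_dir):
    """Simplified sketch of the scoring used in utils/collate_human_eval.py."""
    console_log = os.path.join(instance_dir, "console_log.txt")
    if not os.path.isfile(console_log):
        return ""  # the run is missing
    with open(console_log, "rt") as fh:
        content = fh.read()
    if "ALL TESTS PASSED !#!#" in content:
        # Count assistant replies, i.e., the number of turns needed to pass
        return str(content.count("assistant (to user_proxy):"))
    return "-1"  # the tests never passed
```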
Accessing this scenario type requires downloading and converting the HumanEval dataset, running the Testbed, collating the results, and finally computing the metrics. The following commands will accomplish this, running each test instance 3 times with GPT-3.5-Turbo-16k:
```
python utils/download_humaneval.py
python ./run_scenarios.py --repeat 3 scenarios/human_eval_two_agents_gpt35.jsonl
python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 | python utils/metrics_human_eval.py > human_eval_results_gpt35.csv
cat human_eval_results_gpt35.csv
```


@@ -0,0 +1 @@
export BING_API_KEY=


@@ -0,0 +1,50 @@
from importlib.metadata import version as lib_version
from datetime import datetime
import os
import autogen
import json


def init():
    """Helper function to initialize logging in a testbed scenario.
    Specifically, write timestamp and version information, then
    initialize autogen logging.

    Args:
        None

    Returns:
        None
    """

    # Print some information about the run
    with open("timestamp.txt", "wt") as f:
        f.write("Timestamp: " + datetime.now().isoformat() + "\n")
        f.write("pyautogen version: " + lib_version("pyautogen") + "\n")


def finalize(agents):
    """Helper function to finalize logging in a testbed scenario.
    Calling this function will save all the chat completions logged
    by Autogen to disk, and will save the messages dictionaries of
    all agents passed via the agents argument.

    Args:
        agents (list): a list of the agents whose messages will be logged to disk.

    Returns:
        None
    """
    script_dir = os.path.dirname(os.path.realpath(__file__))

    def messages_to_json(agent):
        messages = dict()
        for item in agent.chat_messages.items():
            messages[item[0].name] = item[1]
        return json.dumps(messages, indent=4)

    for agent in agents:
        fname = agent.name + "_messages.json"
        with open(os.path.join(script_dir, fname), "wt") as fh:
            fh.write(messages_to_json(agent))


@@ -0,0 +1,298 @@
import os
import errno
import shutil
import subprocess
import json
import sys
import time
import pathlib
import argparse
from autogen import config_list_from_json
# Location of the global includes dir. The contents of this directory will be copied to the Docker environment.
INCLUDES_DIR = "includes"


def run_scenarios(scenario, n_repeats, is_native, config_list, results_dir="results"):
    """
    Run a set of testbed scenarios a given number of times.

    Args:
        scenario (path): The file or folder containing the scenario JSONL instances. If given a folder, then
            all JSONL files in the folder will be loaded and run.
        n_repeats (int): The number of times each scenario instance will be repeated
        is_native (bool): True if the scenario should be run locally rather than in Docker (proceed with caution!)
        config_list (list): An Autogen OAI_CONFIG_LIST to be used when running scenarios.
        results_dir (path): The folder where results will be saved.
    """

    files = []

    # Figure out which files or folders we are working with
    if os.path.isfile(scenario):
        files.append(scenario)
    elif os.path.isdir(scenario):
        for f in os.listdir(scenario):
            scenario_file = os.path.join(scenario, f)

            if not os.path.isfile(scenario_file):
                continue

            if not scenario_file.lower().endswith(".jsonl"):
                continue

            files.append(scenario_file)
    else:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), scenario)

    # Run all the scenario files
    for scenario_file in files:
        scenario_name = os.path.basename(scenario_file).split(".")
        scenario_name.pop()
        scenario_name = ".".join(scenario_name)

        scenario_dir = os.path.dirname(os.path.realpath(scenario_file))

        # Each line in the scenario file is an instance. Run it.
        with open(scenario_file) as fh:
            for line in fh:
                instance = json.loads(line)

                # Create a folder to store the results

                # Results base
                if not os.path.isdir(results_dir):
                    os.mkdir(results_dir)

                # Results for the scenario
                results_scenario = os.path.join(results_dir, scenario_name)
                if not os.path.isdir(results_scenario):
                    os.mkdir(results_scenario)

                # Results for the instance
                results_instance = os.path.join(results_scenario, instance["id"])
                if not os.path.isdir(results_instance):
                    os.mkdir(results_instance)

                # Results for the repeats
                for i in range(0, n_repeats):
                    results_repetition = os.path.join(results_instance, str(i))

                    # Skip it if it already exists
                    if os.path.isdir(results_repetition):
                        print(f"Found folder {results_repetition} ... Skipping.")
                        continue
                    print(f"Running scenario {results_repetition}")

                    # Create the folder, and copy the script to a standard name
                    os.mkdir(results_repetition)
                    expand_scenario(scenario_dir, instance, os.path.join(results_repetition, "scenario.py"))

                    # Also copy the contents of INCLUDES_DIR
                    for item in os.listdir(INCLUDES_DIR):
                        if item.endswith(".example"):
                            continue
                        item_path = os.path.join(INCLUDES_DIR, item)
                        if os.path.isfile(item_path):
                            shutil.copyfile(item_path, os.path.join(results_repetition, item))

                    # Append the config list to the ENV file
                    config_list_json = json.dumps(config_list)
                    with open(os.path.join(results_repetition, "ENV"), "at") as fh:
                        fh.write(f"export OAI_CONFIG_LIST='{config_list_json}'\n")

                    # Run the scenario
                    if is_native:
                        run_scenario_natively(results_repetition)
                    else:
                        run_scenario_in_docker(results_repetition)


def expand_scenario(scenario_dir, scenario, output_file):
    template_fh = open(os.path.join(scenario_dir, scenario["template"]), "rt")
    output_fh = open(output_file, "wt")

    for line in template_fh:
        if "values" in scenario:
            for k, v in scenario["values"].items():
                line = line.replace(k, v)
        output_fh.write(line)

    template_fh.close()
    output_fh.close()


def run_scenario_natively(work_dir):
    """
    Run a scenario in the native environment.

    Args:
        work_dir (path): the path to the working directory previously created to house this scenario instance
    """

    # Get the current working directory
    cwd = os.getcwd()

    # Navigate to the scenario
    os.chdir(work_dir)
    print("\n\n" + os.getcwd() + "\n===================================================================")

    # Prepare the run script
    with open(os.path.join("run.sh"), "wt") as f:
        f.write(
            """#
. ./ENV
python scenario.py
echo SCENARIO COMPLETE !#!#
"""
        )

    # Run the script and log the output
    with open("console_log.txt", "wb") as f:
        process = subprocess.Popen(["sh", "run.sh"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        for c in iter(lambda: process.stdout.read(1), b""):
            f.write(c)
            os.write(sys.stdout.fileno(), c)  # Write binary to stdout

    # Return where we started
    os.chdir(cwd)
    return


def run_scenario_in_docker(work_dir, timeout=600):
    """
    Run a scenario in a Docker environment.

    Args:
        work_dir (path): the path to the working directory previously created to house this scenario instance
        timeout (Optional, int): the number of seconds to allow a Docker container to run before timing out
    """

    # Create a docker client
    client = docker.from_env()
    image_name = "python:3.11"

    # Pull a suitable image
    try:
        image = client.images.get(image_name)
    except docker.errors.ImageNotFound:
        # pull the image
        print("Pulling image", image_name)
        try:
            image = client.images.pull(image_name)
        except docker.errors.DockerException:
            print("Failed to pull image", image_name)

    # Prepare the run script
    with open(os.path.join(work_dir, "run.sh"), "wt") as f:
        f.write(
            """#
. ./ENV
pip install pyautogen
python scenario.py
rm ENV
echo SCENARIO COMPLETE !#!#
"""
        )

    print("\n\n" + work_dir + "\n===================================================================")

    # Create and run the container
    abs_path = str(pathlib.Path(work_dir).absolute())
    container = client.containers.run(
        image,
        command=["sh", "run.sh"],
        working_dir="/workspace",
        detach=True,
        # get absolute path to the working directory
        volumes={abs_path: {"bind": "/workspace", "mode": "rw"}},
    )

    # Poll until the container is done, or we've timed out
    start_time = time.time()
    while container.status != "exited" and time.time() - start_time < timeout:
        # Reload the container object
        container.reload()

    if container.status != "exited":
        container.stop()

        logs = container.logs().decode("utf-8").rstrip() + "\nDocker timed out."
        print(logs)
        with open(os.path.join(work_dir, "console_log.txt"), "wt") as f:
            f.write(logs)

        container.remove()
        return

    # get the container logs
    logs = container.logs().decode("utf-8").rstrip()
    container.remove()

    print(logs)
    with open(os.path.join(work_dir, "console_log.txt"), "wt") as f:
        f.write(logs)


###############################################################################
if __name__ == "__main__":
    script_name = os.path.basename(__file__)
    parser = argparse.ArgumentParser(
        description=f"{script_name} will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.".strip()
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="The JSONL scenario file to run. If a directory is specified, then all JSONL scenarios in the directory are run. (default: ./scenarios)",
        default="scenarios",
    )
    parser.add_argument(
        "-c",
        "--config",
        type=str,
        help="The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST).",
        default="OAI_CONFIG_LIST",
    )
    parser.add_argument(
        "-r", "--repeat", type=int, help="The number of repetitions to run for each scenario (default: 10).", default=10
    )
    parser.add_argument(
        "--native",
        action="store_true",
        help="Run the scenarios natively rather than in docker. NOTE: This is not advisable, and should be done with great caution.",
    )

    args = parser.parse_args()

    # Load the OAI_CONFIG_LIST
    config_list = config_list_from_json(env_or_file=args.config)
    if len(config_list) == 0:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), args.config)

    # Warn if running natively
    if args.native:
        choice = input(
            'WARNING: Running natively, without Docker, not only poses the usual risks of executing arbitrary AI generated code on your machine, it also makes it impossible to ensure that each test starts from a known and consistent set of initial conditions. For example, if the agents spend time debugging and installing Python libraries to solve the task, then those libraries will be available to all other runs. In other words, earlier runs can influence later runs, leading to many confounds in testing.\n\nAre you absolutely sure you want to continue with native execution? Type "Yes" exactly, and in full, to proceed: '
        )

        if choice.strip().lower() != "yes":
            print("Received '" + choice + "'. Exiting.")
            sys.exit(1)

    # Import docker if needed
    is_native = True if args.native else False
    if not is_native:
        import docker

    # Warn about a common error
    env_file = os.path.join(INCLUDES_DIR, "ENV")
    example_file = os.path.join(INCLUDES_DIR, "ENV.example")
    if not os.path.isfile(env_file):
        shutil.copyfile(example_file, env_file)
        sys.stderr.write(
            f"The environment file '{env_file}' does not exist (perhaps this is your first time setting up the testbed). A default environment file has been provided, but you may want to edit it to include your API keys and configurations.\n"
        )

    run_scenarios(args.scenario, args.repeat, is_native, config_list)


@@ -0,0 +1,6 @@
{ "id": "two_agent_stocks_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD." } }
{ "id": "two_agent_stocks_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD." } }
{ "id": "two_agent_arxiv_search_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Find 10 papers on explainable or interpretable AI that were submitted to arXiv within the last year. When printing results, include paper titles, authors, dates, and URLs, but not their abstracts." } }
{ "id": "two_agent_arxiv_search_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Find 10 papers on explainable or interpretable AI that were submitted to arXiv within the last year. When printing results, include paper titles, authors, dates, and URLs, but not their abstracts." } }
{ "id": "two_agent_mslogo_search_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Find Microsoft's logo from 1983, and save it to disk. If searching the web, use Bing with API key stored in os.environ['BING_API_KEY']" } }
{ "id": "two_agent_mslogo_search_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Find Microsoft's logo from 1983, and save it to disk. If searching the web, use Bing with the API key stored in os.environ['BING_API_KEY']" } }


@@ -0,0 +1,37 @@
from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
import os
import json
import testbed_utils
testbed_utils.init()
##############################

config_list = config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["__MODEL__"]},
)

assistant = AssistantAgent(
    "assistant",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
    llm_config={
        # "request_timeout": 180, # Remove for autogen version >= 0.2, and OpenAI version >= 1.0
        "config_list": config_list,
    },
)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
    default_auto_reply="TERMINATE",
)
user_proxy.initiate_chat(assistant, message="__PROMPT__")

##############################
testbed_utils.finalize(agents=[assistant, user_proxy])


@@ -0,0 +1,91 @@
from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
import os
import json
import base64
import testbed_utils
# NOTE:
# This scenario runs Human Eval in a slightly unconventional way:
# The agents have access to the unit tests, and can keep trying
# until they pass.
testbed_utils.init()
##############################
work_dir = "coding"
# These come formatted as Base64 to avoid conflicting with the triple-quotes
TESTS = base64.b64decode("__TEST_BASE64__").decode("utf-8")
PROMPT = base64.b64decode("__PROMPT_BASE64__").decode("utf-8")
# Write the tests to a file so that the agents can access them
if not os.path.isdir(work_dir):
    os.mkdir(work_dir)
with open(os.path.join(work_dir, "my_tests.py"), "wt") as fh:
    fh.write(
        TESTS
        + """


def run_tests(candidate):
    check(candidate)

    # We can search for this string in the output
    print("ALL TESTS PASSED !#!#\\nTERMINATE")
"""
    )

# Ok, now get autogen to solve it.
config_list = config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["__MODEL__"]},
)

assistant = AssistantAgent(
    "assistant",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
    llm_config={
        # "request_timeout": 180, # Remove for autogen version >= 0.2, and OpenAI version >= 1.0
        "config_list": config_list,
    },
)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
    code_execution_config={
        "work_dir": work_dir,
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
    default_auto_reply="TERMINATE",
)
user_proxy.initiate_chat(
    assistant,
    message="""
The following python code imports the `run_tests(candidate)` function from my_tests.py, and runs
it on the function `__ENTRY_POINT__`. This will run a set of automated unit tests to verify the
correct implementation of `__ENTRY_POINT__`. However, `__ENTRY_POINT__` is only partially
implemented in the code below. Complete the implementation of `__ENTRY_POINT__` and output
a new stand-alone code block that contains everything needed to run the tests, including: importing
`my_tests`, calling `run_tests(__ENTRY_POINT__)`, as well as __ENTRY_POINT__'s complete definition,
such that this code block can be run directly in Python.

```python
from my_tests import run_tests

"""
    + PROMPT
    + """

# Run the unit tests
run_tests(__ENTRY_POINT__)
```
""",
)

##############################
testbed_utils.finalize(agents=[assistant, user_proxy])


@@ -0,0 +1,103 @@
import os
import errno
import shutil
import subprocess
import json
import sys
import time
import pathlib
import argparse


def collate(results_dir):
    """
    Collate the results of running human eval.

    Args:
        results_dir (path): The folder where results were saved.
    """

    all_results = list()
    max_instances = 0

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        # Collect the results vector
        results = [test_id]

        instance = 0
        instance_dir = os.path.join(test_path, str(instance))
        while os.path.isdir(instance_dir):
            console_log = os.path.join(instance_dir, "console_log.txt")
            if os.path.isfile(console_log):
                with open(console_log, "rt") as fh:
                    content = fh.read()
                    if "ALL TESTS PASSED !#!#" in content:
                        results.append(
                            str(content.count("assistant (to user_proxy):"))
                        )  # The number of assistant replies (which is also equal to the number of GPT calls in this case)
                    else:
                        results.append("-1")
            else:
                # Missing results will appear as blanks
                results.append("")

            instance += 1
            instance_dir = os.path.join(test_path, str(instance))

        max_instances = max(max_instances, instance)

        # Buffer the results
        all_results.append(results)

    # Create a header
    header = "TestId"
    for i in range(0, max_instances):
        header += ",Trial" + str(i)
    print(header)

    # Print a fully-populated table of results
    for r in all_results:
        while len(r) < max_instances + 1:
            r.append("")
        print(",".join(r))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "human_eval_two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the HumanEval scenarios and output them to a CSV. The CSV format is as follows:

TestId,      Trial0, Trial1, ..., TrialN
HumanEval_1, x_10,   x_11,   ..., x_1N
HumanEval_2, x_20,   x_21,   ..., x_2N
...
HumanEval_M, x_M0,   x_M1,   ..., x_MN

Where x_ij is the number of AssistantAgent conversation turns needed to pass all the tests for problem i, in Trial/repetition j. If the agent was not able to pass the tests by the end of the conversation, the value will be -1. If data for the trial is missing, the value will be an empty string "".
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)


@@ -0,0 +1,67 @@
#
# Run this file to download the human_eval dataset, and create corresponding testbed scenarios:
# (default: ../scenarios/human_eval_two_agents_gpt4.jsonl and ../scenarios/human_eval_two_agents_gpt35.jsonl)
#
import requests
import gzip
import io
import json
import os
import base64
script_path = os.path.realpath(__file__)
script_name = os.path.basename(script_path)
script_dir = os.path.dirname(script_path)
# Directory where scenarios are stored
scenarios_dir = os.path.realpath(os.path.join(script_dir, os.path.pardir, "scenarios"))
print("Saving HumanEval scenarios to: " + scenarios_dir)
# URL of the file to download
url = "https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz"
# Send an HTTP request to the URL of the file
response = requests.get(url)
# Ensure we raise an error if the download failed
response.raise_for_status()
# Create a BytesIO object from the response content
buffer = io.BytesIO(response.content)
# Create a scenario file
fh_gpt4 = open(os.path.join(scenarios_dir, "human_eval_two_agents_gpt4.jsonl"), "wt")
fh_gpt35 = open(os.path.join(scenarios_dir, "human_eval_two_agents_gpt35.jsonl"), "wt")
# Open the buffer as a .gz file and read it line by line
with gzip.GzipFile(fileobj=buffer) as f_in:
    for line in f_in:
        # Parse each line as JSON
        data = json.loads(line)
        print("Converting: " + data["task_id"])

        # Write the GPT-4 scenario
        # Prompts and tests are saved in base 64 to greatly simplify escaping them as they
        # move through the various formats and scripts. I welcome a better, more readable, alternative.
        record = {
            "id": data["task_id"].replace("/", "_"),
            "template": "human_eval_two_agents.py",
            "values": {
                "__MODEL__": "gpt-4",
                "__PROMPT_BASE64__": base64.b64encode(data["prompt"].encode("utf-8")).decode("utf-8"),
                "__ENTRY_POINT__": data["entry_point"],
                "__TEST_BASE64__": base64.b64encode(data["test"].encode("utf-8")).decode("utf-8"),
            },
        }
        fh_gpt4.write(json.dumps(record).strip() + "\n")

        # Write the GPT 3.5 Version
        record["values"]["__MODEL__"] = "gpt-3.5-turbo-16k"
        fh_gpt35.write(json.dumps(record).strip() + "\n")

fh_gpt4.close()
fh_gpt35.close()


@@ -0,0 +1,116 @@
import os
import sys
import argparse
import csv


def metrics(results_fh):
    """
    Compute metrics from collated HumanEval results.

    Args:
        results_fh (File Stream): A file stream containing the collated results in CSV.
    """

    reader = csv.reader(results_fh)
    first_row = next(reader)  # Read the first line

    num_trials = len(first_row) - 1  # Don't count the first column (TestId)
    max_turns = 0
    num_rows = 0

    # Load the results. We'll need to iterate over them a few times.
    results = list()
    for row in reader:
        num_rows += 1

        name = row[0]
        trials = [(None if v.strip() == "" else int(v)) for v in row[1:]]
        for v in trials:
            if v is not None:
                max_turns = max(max_turns, v)
        results.append([name, trials])

    # Print the header
    header = ["Trial"]
    for i in range(1, max_turns + 1):
        header.append("cumulative_passes_by_turn_" + str(i))
    header.append("fails")
    header.append("missing")
    print(",".join(header))

    # Compute the metrics
    def _metrics_for_trial(t):
        counts = [None]
        fails = 0
        missing = 0

        # Compute cumulative passes for each conversation turn
        for i in range(1, max_turns + 1):
            counts.append(0)
            assert len(counts) == i + 1

            for r in results:
                v = r[1][t]
                if v is not None:
                    v = int(v)
                    if 0 <= v and v <= i:
                        counts[i] += 1

        # Count missing and failed
        for r in results:
            v = r[1][t]
            if v is None:
                missing += 1
            elif int(v) < 0:
                fails += 1

        # Prepare the row in the format specified by the header
        return str(t) + "," + ",".join([str(v) for v in counts[1:]]) + "," + str(fails) + "," + str(missing)

    # Print each row
    for t in range(0, num_trials):
        print(_metrics_for_trial(t))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will compute metrics on the collated results of the HumanEval scenarios. Use collate_human_eval.py to prepare input to this script.

The output will be formatted as a CSV with the following schema:

Trial, cumulative_passes_by_turn_1, ..., cumulative_passes_by_turn_N, fails, missing
0      x_01, ..., x_0N, y_0, z_0
1      x_11, ..., x_1N, y_1, z_1
...
M      x_M1, ..., x_MN, y_M, z_M

Where:
  x_ij is the number of HumanEval problems in Trial i that achieved a passing result by conversation turn j.
  y_i  is the number of HumanEval problems in Trial i that never achieved a passing result (they failed).
  z_i  is the number of HumanEval problems in Trial i that have missing data.
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to collated results. If '-' or omitted, read from stdin. (default: '-')",
        default="-",
    )
    args = parser.parse_args()

    if args.scenario == "" or args.scenario == "-":
        metrics(sys.stdin)
    else:
        with open(args.scenario, "rt") as fh:
            metrics(fh)