"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"\n",
"Licensed under the MIT License.\n",
"\n",
"# Use FLAML to Tune OpenAI Models\n",
"\n",
"FLAML offers a cost-effective hyperparameter optimization technique, [EcoOptiGen](https://arxiv.org/abs/2303.04673), for tuning large language models (LLMs). Our study finds that tuning hyperparameters can significantly improve the utility of LLMs.\n",
"\n",
"In this notebook, we tune OpenAI models for code generation. We use [the HumanEval benchmark](https://huggingface.co/datasets/openai_humaneval) released by OpenAI for synthesizing programs from docstrings. \n",
"# openai.api_version = \"2023-03-15-preview\" # change if necessary"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load dataset\n",
"\n",
"First, we load the HumanEval dataset. The dataset contains 164 examples. We use the first 20 for tuning the generation hyperparameters and the remaining 144 for evaluation. In each example, the \"prompt\" is the prompt string for eliciting the code generation (renamed into \"definition\"), \"test\" is the Python code of the unit test for the example, and \"entry_point\" is the name of the function to be tested."
"Loading cached shuffled indices for dataset at /home/vscode/.cache/huggingface/datasets/openai_humaneval/openai_humaneval/1.0.0/2955cebd73602e828fa8c0a424c594e5fab4ec863b316ca98f3d8fdb6a626e75/cache-1e8448101c1b32e8.arrow\n"
" assert candidate([1,2,3,4,5,1],[1,2,3,4,2,-2])==[0,0,0,0,3,3], \"This prints if this assert fails 1 (good for debugging!)\"\n",
" assert candidate([0,0,0,0,0,0],[0,0,0,0,0,0])==[0,0,0,0,0,0], \"This prints if this assert fails 1 (good for debugging!)\"\n",
" assert candidate([1,2,3],[-1,-2,-3])==[2,4,6], \"This prints if this assert fails 1 (good for debugging!)\"\n",
" assert candidate([1,2,3,5],[-1,2,3,4])==[2,0,0,1], \"This prints if this assert fails 1 (good for debugging!)\"\n",
"\n",
" # Check some edge cases that are easy to work out by hand.\n",
" assert True, \"This prints if this assert fails 2 (also good for debugging!)\"\n",
"\n",
"\n"
]
}
],
"source": [
"print(tune_data[1][\"test\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Success Metric\n",
"\n",
"Before we start tuning, we need to define the success metric we want to optimize. For each code generation task, we can use the model to generate multiple candidates and then select one of them. If the selected response passes the unit test, we consider the task successfully solved. The metric is then the mean success rate over a collection of tasks."
"This function first generates assertion statements for each problem. Then, it uses the assertions to select among the generated responses.\n",
"\n",
"## Use the tuning data to find a good configuration\n",
"\n",
"### Import the oai and tune subpackages from flaml\n",
"\n",
"FLAML provides `oai.Completion.tune` for hyperparameter optimization of OpenAI models and `oai.Completion.create` for making a request with the tuned config. First, we import oai from flaml:"
"This will create a disk cache in \".cache/{seed}\". You can change `cache_path` in `set_cache()`. The caches for different seeds are stored separately.\n",
"\n",
"### Perform tuning\n",
"\n",
"The tuning will take a while to finish, depending on the optimization budget. It is performed under the following budgets and limits:\n",
"\n",
"* `inference_budget` is the target average inference budget per instance in the benchmark. For example, 0.02 means the target inference budget is 0.02 dollars, which translates to 1000 tokens (input + output combined) if the text Davinci model is used.\n",
"* `optimization_budget` is the total budget allowed to perform the tuning. For example, 5 means 5 dollars are allowed in total, which translates to 250K tokens for the text Davinci model.\n",
"* `num_samples` is the number of different hyperparameter configurations that may be tried. The tuning stops after either `num_samples` trials or after `optimization_budget` dollars are spent, whichever happens first. -1 means no hard restriction on the number of trials; the actual number is then decided by `optimization_budget`.\n",
"\n",
"Users can specify the tuning data, optimization metric, optimization mode, evaluation function, search spaces, etc. The default search space is:\n",
"\n",
"```python\n",
"default_search_space = {\n",
"    \"model\": tune.choice([\n",
"        \"text-ada-001\",\n",
"        \"text-babbage-001\",\n",
"        \"text-davinci-003\",\n",
"        \"gpt-3.5-turbo\",\n",
"        \"gpt-4\",\n",
"    ]),\n",
"    \"temperature_or_top_p\": tune.choice(\n",
"        [\n",
"            {\"temperature\": tune.uniform(0, 1)},\n",
"            {\"top_p\": tune.uniform(0, 1)},\n",
"        ]\n",
"    ),\n",
"    \"max_tokens\": tune.lograndint(50, 1000),\n",
"    \"n\": tune.randint(1, 100),\n",
"    \"prompt\": \"{prompt}\",\n",
"}\n",
"```\n",
"\n",
"The default search space can be overridden by the user's input.\n",
"For example, the following code specifies three choices for the prompt and two choices of stop sequences. For hyperparameters that don't appear in the user's input, the default search space is used. If you don't have access to gpt-4 or would like to modify the choice of models, you can provide a different search space for `model`."
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i]-guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 1,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 2,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 3,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 4,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 5,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 6,\n",
" \"logprobs\": null,\n",
" \"text\": \" results = []\\n for i in range(len(game)):\\n if game[i] == guess[i]:\\n results.append(0)\\n else:\\n results.append(abs(game[i] - guess[i]))\\n return results\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 7,\n",
" \"logprobs\": null,\n",
" \"text\": \" res = []\\n for i in range(len(game)):\\n res.append(abs(game[i] - guess[i]))\\n return res\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 8,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 9,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i]-guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 10,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 11,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 12,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n if game[i] == guess[i]:\\n result.append(0)\\n else:\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 13,\n",
" \"logprobs\": null,\n",
" \"text\": \" #your code here\\n result = []\\n for i in range(len(game)):\\n if game[i] == guess[i]:\\n result.append(0)\\n else:\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 14,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 15,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 16,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i]-guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 17,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 18,\n",
" \"logprobs\": null,\n",
" \"text\": \" # Your code here\\n result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 19,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 20,\n",
" \"logprobs\": null,\n",
" \"text\": \" #create an empty list\\n result = []\\n #iterate over the two lists and compare the values\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 21,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i] - guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 22,\n",
" \"logprobs\": null,\n",
" \"text\": \" # initialize the result array\\n result = []\\n \\n # loop over the arrays and calculate the difference\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n \\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 23,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i]-guess[i]))\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 24,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 25,\n",
" \"logprobs\": null,\n",
" \"text\": \" # Your code here\\n result = []\\n for i in range(len(game)):\\n diff = abs(game[i] - guess[i])\\n result.append(diff)\\n return result\"\n",
" },\n",
" {\n",
" \"finish_reason\": \"stop\",\n",
" \"index\": 26,\n",
" \"logprobs\": null,\n",
" \"text\": \" result = []\\n for i in range(len(game)):\\n result.append(abs(game[i]-guess[i]))\\n return result\"\n",
"### Evaluate the success rate on the test data\n",
"\n",
"You can use flaml's `oai.Completion.test` to evaluate the performance of the tuned config on an entire dataset. The following code will take a while to evaluate all 144 test data instances. The cost is about $6 if you uncomment it and run it."
"performance on test data with the tuned config: {'index_selected': 5.208333333333333, 'succeed_assertions': 0.8402777777777778, 'success': 0.7777777777777778, 'gen_cost': 0.00045375000000000005, 'cost': 5.785519999999999, 'inference_cost': 0.04017722222222222}\n"