autogen/notebook/integrate_openai.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved. \n",
    "\n",
    "Licensed under the MIT License.\n",
    "\n",
    "# Use FLAML to Tune OpenAI Models\n",
    "\n",
    "In this notebook, we tune OpenAI models for code generation. We use [the HumanEval benchmark](https://huggingface.co/datasets/openai_humaneval) released by OpenAI for synthesizing programs from docstrings. \n",
    "\n",
    "## Requirements\n",
    "\n",
    "FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [openai] option:\n",
    "```bash\n",
    "pip install flaml[openai]==1.1.3\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:52.317406Z",
     "iopub.status.busy": "2023-02-13T23:40:52.316561Z",
     "iopub.status.idle": "2023-02-13T23:40:52.321193Z",
     "shell.execute_reply": "2023-02-13T23:40:52.320628Z"
    }
   },
   "outputs": [],
   "source": [
    "# %pip install flaml[openai]==1.1.3 datasets"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set your OpenAI key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:52.324240Z",
     "iopub.status.busy": "2023-02-13T23:40:52.323783Z",
     "iopub.status.idle": "2023-02-13T23:40:52.330570Z",
     "shell.execute_reply": "2023-02-13T23:40:52.329750Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "if \"OPENAI_API_KEY\" not in os.environ:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = \"<your OpenAI API key here>\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you use Azure OpenAI, uncomment the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:52.333547Z",
     "iopub.status.busy": "2023-02-13T23:40:52.333249Z",
     "iopub.status.idle": "2023-02-13T23:40:52.336508Z",
     "shell.execute_reply": "2023-02-13T23:40:52.335858Z"
    }
   },
   "outputs": [],
   "source": [
    "# openai.api_type = \"azure\"\n",
    "# openai.api_base = \"https://<your_endpoint>.openai.azure.com/\"\n",
    "# openai.api_version = \"2022-12-01\"  # change if necessary"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load dataset\n",
    "\n",
    "First, we load the humaneval dataset. The dataset contains 164 examples. We use the first 20 for tuning the generation hyperparameters and the remaining for evaluation. In each example, the \"prompt\" is the prompt string for eliciting the code generation, \"test\" is the Python code for unit test for the example, and \"entry_point\" is the function name to be tested."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:52.339977Z",
     "iopub.status.busy": "2023-02-13T23:40:52.339556Z",
     "iopub.status.idle": "2023-02-13T23:40:54.603349Z",
     "shell.execute_reply": "2023-02-13T23:40:54.602630Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Found cached dataset openai_humaneval (/home/vscode/.cache/huggingface/datasets/openai_humaneval/openai_humaneval/1.0.0/2955cebd73602e828fa8c0a424c594e5fab4ec863b316ca98f3d8fdb6a626e75)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "454146d0f7224f038689031002906e6f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading cached shuffled indices for dataset at /home/vscode/.cache/huggingface/datasets/openai_humaneval/openai_humaneval/1.0.0/2955cebd73602e828fa8c0a424c594e5fab4ec863b316ca98f3d8fdb6a626e75/cache-1e8448101c1b32e8.arrow\n"
     ]
    }
   ],
   "source": [
    "import datasets\n",
    "\n",
    "seed = 41\n",
    "data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
    "n_tune_data = 20\n",
    "tune_data = [\n",
    "    {\n",
    "        \"prompt\": data[x][\"prompt\"],\n",
    "        \"test\": data[x][\"test\"],\n",
    "        \"entry_point\": data[x][\"entry_point\"],\n",
    "    }\n",
    "    for x in range(n_tune_data)\n",
    "]\n",
    "test_data = [\n",
    "    {\n",
    "        \"prompt\": data[x][\"prompt\"],\n",
    "        \"test\": data[x][\"test\"],\n",
    "        \"entry_point\": data[x][\"entry_point\"],\n",
    "    }\n",
    "    for x in range(n_tune_data, len(data))\n",
    "]\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Check a tuning example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:54.607152Z",
     "iopub.status.busy": "2023-02-13T23:40:54.606441Z",
     "iopub.status.idle": "2023-02-13T23:40:54.610504Z",
     "shell.execute_reply": "2023-02-13T23:40:54.609759Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "def compare(game,guess):\n",
      "    \"\"\"I think we all remember that feeling when the result of some long-awaited\n",
      "    event is finally known. The feelings and thoughts you have at that moment are\n",
      "    definitely worth noting down and comparing.\n",
      "    Your task is to determine if a person correctly guessed the results of a number of matches.\n",
      "    You are given two arrays of scores and guesses of equal length, where each index shows a match. \n",
      "    Return an array of the same length denoting how far off each guess was. If they have guessed correctly,\n",
      "    the value is 0, and if not, the value is the absolute difference between the guess and the score.\n",
      "    \n",
      "    \n",
      "    example:\n",
      "\n",
      "    compare([1,2,3,4,5,1],[1,2,3,4,2,-2]) -> [0,0,0,0,3,3]\n",
      "    compare([0,5,0,0,0,4],[4,1,1,0,0,-2]) -> [4,4,1,0,0,6]\n",
      "    \"\"\"\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(tune_data[1][\"prompt\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is one example of the unit test code for verifying the correctness of the generated code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:54.613590Z",
     "iopub.status.busy": "2023-02-13T23:40:54.613168Z",
     "iopub.status.idle": "2023-02-13T23:40:54.616873Z",
     "shell.execute_reply": "2023-02-13T23:40:54.616193Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "def check(candidate):\n",
      "\n",
      "    # Check some simple cases\n",
      "    assert candidate([1,2,3,4,5,1],[1,2,3,4,2,-2])==[0,0,0,0,3,3], \"This prints if this assert fails 1 (good for debugging!)\"\n",
      "    assert candidate([0,0,0,0,0,0],[0,0,0,0,0,0])==[0,0,0,0,0,0], \"This prints if this assert fails 1 (good for debugging!)\"\n",
      "    assert candidate([1,2,3],[-1,-2,-3])==[2,4,6], \"This prints if this assert fails 1 (good for debugging!)\"\n",
      "    assert candidate([1,2,3,5],[-1,2,3,4])==[2,0,0,1], \"This prints if this assert fails 1 (good for debugging!)\"\n",
      "\n",
      "    # Check some edge cases that are easy to work out by hand.\n",
      "    assert True, \"This prints if this assert fails 2 (also good for debugging!)\"\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(tune_data[1][\"test\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Success Metric\n",
    "\n",
    "Before we start tuning, we need to define the success metric we want to opotimize. For each code generation task, if one of the returned responses can pass the test, we consider the task as successfully solved. Then we can define the mean success rate of a collection of tasks.\n",
    "\n",
    "### Define a code executor\n",
    "\n",
    "First, we write a simple code executor. The code executor takes the generated code and the test code as the input, and execute them with a timer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:54.619618Z",
     "iopub.status.busy": "2023-02-13T23:40:54.619218Z",
     "iopub.status.idle": "2023-02-13T23:40:54.624272Z",
     "shell.execute_reply": "2023-02-13T23:40:54.623664Z"
    }
   },
   "outputs": [],
   "source": [
    "import signal\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "def timeout_handler(signum, frame):\n",
    "    raise TimeoutError(\"Timed out!\")\n",
    "\n",
    "signal.signal(signal.SIGALRM, timeout_handler)\n",
    "max_exec_time = 3  # seconds\n",
    "\n",
    "def execute_code(code):\n",
    "    code = code.strip()\n",
    "    with open(\"codetest.py\", \"w\") as fout:\n",
    "        fout.write(code)\n",
    "    try:\n",
    "        signal.alarm(max_exec_time)\n",
    "        result = subprocess.run(\n",
    "            [sys.executable, \"codetest.py\"],\n",
    "            stdout=subprocess.DEVNULL,\n",
    "            stderr=subprocess.PIPE,\n",
    "        )\n",
    "        signal.alarm(0)\n",
    "    except TimeoutError:\n",
    "        return 0\n",
    "    return int(result.returncode == 0)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function will create a temp file \"codetest.py\" and execute it in a separate process. It allows for 3 seconds to finish that code.\n",
    "\n",
    "### Define a function to evaluate the success for a given program synthesis task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:54.626998Z",
     "iopub.status.busy": "2023-02-13T23:40:54.626593Z",
     "iopub.status.idle": "2023-02-13T23:40:54.631383Z",
     "shell.execute_reply": "2023-02-13T23:40:54.630770Z"
    }
   },
   "outputs": [],
   "source": [
    "def success_metrics(responses, prompt, test, entry_point):\n",
    "    \"\"\"Check if the task is successful.\n",
    "\n",
    "    Args:\n",
    "        responses (list): The list of responses.\n",
    "        prompt (str): The input prompt.\n",
    "        test (str): The test code.\n",
    "        entry_point (str): The name of the function.\n",
    "\n",
    "    Returns:\n",
    "        dict: The success metrics.\n",
    "    \"\"\"\n",
    "    success_list = []\n",
    "    n = len(responses)\n",
    "    for i in range(n):\n",
    "        response = responses[i]\n",
    "        code = f\"{prompt}{response}\\n{test}\\ncheck({entry_point})\"\n",
    "        succeed = execute_code(code)\n",
    "        success_list.append(succeed)\n",
    "    return {\n",
    "        \"expected_success\": 1 - pow(1 - sum(success_list) / n, n),\n",
    "        \"success\": any(s for s in success_list),\n",
    "    }\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Use the tuning data to find a good configuration\n",
    "\n",
    "### Import the oai and tune subpackages from flaml.\n",
    "\n",
    "FLAML has provided an API for hyperparameter optimization of OpenAI models: `oai.Completion.tune` and to make a request with the tuned config: `oai.Completion.create`. First, we import oai from flaml:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:54.634335Z",
     "iopub.status.busy": "2023-02-13T23:40:54.633929Z",
     "iopub.status.idle": "2023-02-13T23:40:56.105700Z",
     "shell.execute_reply": "2023-02-13T23:40:56.105085Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "from flaml import oai, tune"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For (local) reproducibility and cost efficiency, we cache responses from OpenAI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:56.109177Z",
     "iopub.status.busy": "2023-02-13T23:40:56.108624Z",
     "iopub.status.idle": "2023-02-13T23:40:56.112651Z",
     "shell.execute_reply": "2023-02-13T23:40:56.112076Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "oai.Completion.set_cache(seed)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will create a disk cache in \".cache/{seed}\". You can change `cache_path` in `set_cache()`. The cache for different seeds are stored separately.\n",
    "\n",
    "### Perform tuning\n",
    "\n",
    "The tuning will take a while to finish, depending on the optimization budget (~1 min for the current budget). The tuning will be performed under the specified optimization budgets.\n",
    "\n",
    "* `inference_budget` is the target average inference budget per instance in the benchmark. For example, 0.02 means the target inference budget is 0.02 dollars, which translates to 1000 tokens (input + output combined) if the text Davinci model is used.\n",
    "* `optimization_budget` is the total budget allowed to perform the tuning. For example, 5 means 5 dollars are allowed in total, which translates to 250K tokens for the text Davinci model.\n",
    "* `num_sumples` is the number of different hyperparameter configurations which is allowed to try. The tuning will stop after either num_samples trials or after optimization_budget dollars spent, whichever happens first. -1 means no hard restriction in the number of trials and the actual number is decided by `optimization_budget`.\n",
    "\n",
    "Users can specify tuning data, optimization metric, optimization mode, evaluation function, search spaces etc.. The default search space is:\n",
    "\n",
    "```python\n",
    "price1K = {\n",
    "    \"text-ada-001\": 0.0004,\n",
    "    \"text-babbage-001\": 0.0005,\n",
    "    \"text-curie-001\": 0.002,\n",
    "    \"code-cushman-001\": 0.024,\n",
    "    \"code-davinci-002\": 0.1,\n",
    "    \"text-davinci-002\": 0.02,\n",
    "    \"text-davinci-003\": 0.02,\n",
    "}\n",
    "\n",
    "default_search_space = {\n",
    "    \"model\": tune.choice(list(price1K.keys())),\n",
    "    \"temperature_or_top_p\": tune.choice(\n",
    "        [\n",
    "            {\"temperature\": tune.uniform(0, 1)},\n",
    "            {\"top_p\": tune.uniform(0, 1)},\n",
    "        ]\n",
    "    ),\n",
    "    \"max_tokens\": tune.lograndint(50, 1000),\n",
    "    \"n\": tune.randint(1, 100),\n",
    "    \"prompt\": \"{prompt}\",\n",
    "}\n",
    "```\n",
    "\n",
    "The default search space can be overriden by users' input.\n",
    "For example, the following code specifies two choices for the model, four choices for the prompt and a fixed list of stop sequences. For hyperparameters which don't appear in users' input, the default search space will be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:40:56.115383Z",
     "iopub.status.busy": "2023-02-13T23:40:56.114975Z",
     "iopub.status.idle": "2023-02-13T23:41:55.045654Z",
     "shell.execute_reply": "2023-02-13T23:41:55.044973Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m[I 2023-02-13 23:40:56,159]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m[I 2023-02-13 23:40:56,161]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:40:56] {806} INFO - trial 1 config: {'model': 'code-davinci-002', 'temperature_or_top_p': {'temperature': 0.36865945026811975}, 'max_tokens': 347, 'n': 1, 'prompt': 1, 'stop': 0}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:40:59] {215} INFO - result: {'expected_success': 0.6, 'success': 0.6, 'total_cost': 0.4624999999999999, 'cost': 0.4624999999999999, 'inference_cost': 0.023125, 'training_iteration': 0, 'config': {'model': 'code-davinci-002', 'temperature_or_top_p': {'temperature': 0.36865945026811975}, 'max_tokens': 347, 'n': 1, 'prompt': 1, 'stop': 0}, 'config/model': 'code-davinci-002', 'config/temperature_or_top_p': {'temperature': 0.36865945026811975}, 'config/max_tokens': 347, 'config/n': 1, 'config/prompt': 1, 'config/stop': 0, 'experiment_tag': 'exp', 'time_total_s': 3.7016141414642334}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:40:59] {806} INFO - trial 2 config: {'model': 'code-cushman-001', 'temperature_or_top_p': {'temperature': 0.36865945026811975}, 'max_tokens': 347, 'n': 1, 'prompt': 1, 'stop': 0}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:00] {215} INFO - result: {'expected_success': 0.35, 'success': 0.35, 'total_cost': 0.5671159999999997, 'cost': 0.104616, 'inference_cost': 0.0052308, 'training_iteration': 0, 'config': {'model': 'code-cushman-001', 'temperature_or_top_p': {'temperature': 0.36865945026811975}, 'max_tokens': 347, 'n': 1, 'prompt': 1, 'stop': 0}, 'config/model': 'code-cushman-001', 'config/temperature_or_top_p': {'temperature': 0.36865945026811975}, 'config/max_tokens': 347, 'config/n': 1, 'config/prompt': 1, 'config/stop': 0, 'experiment_tag': 'exp', 'time_total_s': 0.673302412033081}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:00] {806} INFO - trial 3 config: {'model': 'code-cushman-001', 'temperature_or_top_p': {'top_p': 0.4985070123025904}, 'max_tokens': 97, 'n': 20, 'prompt': 0, 'stop': 0}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:17] {215} INFO - result: {'expected_success': 0.5080706992649381, 'success': 0.55, 'total_cost': 1.1848999999999996, 'cost': 0.617784, 'inference_cost': 0.0287676, 'training_iteration': 0, 'config': {'model': 'code-cushman-001', 'temperature_or_top_p': {'top_p': 0.4985070123025904}, 'max_tokens': 97, 'n': 20, 'prompt': 0, 'stop': 0}, 'config/model': 'code-cushman-001', 'config/temperature_or_top_p': {'top_p': 0.4985070123025904}, 'config/max_tokens': 97, 'config/n': 20, 'config/prompt': 0, 'config/stop': 0, 'experiment_tag': 'exp', 'time_total_s': 16.56331181526184}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:17] {806} INFO - trial 4 config: {'model': 'code-cushman-001', 'temperature_or_top_p': {'top_p': 0.6125260668293881}, 'max_tokens': 433, 'n': 29, 'prompt': 0, 'stop': 0}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:51] {215} INFO - result: {'expected_success': 0.6186627404336135, 'success': 0.65, 'total_cost': 2.4239719999999987, 'cost': 1.2390720000000002, 'inference_cost': 0.059620799999999995, 'training_iteration': 0, 'config': {'model': 'code-cushman-001', 'temperature_or_top_p': {'top_p': 0.6125260668293881}, 'max_tokens': 433, 'n': 29, 'prompt': 0, 'stop': 0}, 'config/model': 'code-cushman-001', 'config/temperature_or_top_p': {'top_p': 0.6125260668293881}, 'config/max_tokens': 433, 'config/n': 29, 'config/prompt': 0, 'config/stop': 0, 'experiment_tag': 'exp', 'time_total_s': 34.57707595825195}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:51] {806} INFO - trial 5 config: {'model': 'code-davinci-002', 'temperature_or_top_p': {'temperature': 0.6177669784693172}, 'max_tokens': 231, 'n': 65, 'prompt': 3, 'stop': 0}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:51] {215} INFO - result: {'expected_success': 0, 'total_cost': 2.6356719999999987, 'cost': 0.2117, 'training_iteration': 0, 'config': {'model': 'code-davinci-002', 'temperature_or_top_p': {'temperature': 0.6177669784693172}, 'max_tokens': 231, 'n': 65, 'prompt': 3, 'stop': 0}, 'config/model': 'code-davinci-002', 'config/temperature_or_top_p': {'temperature': 0.6177669784693172}, 'config/max_tokens': 231, 'config/n': 65, 'config/prompt': 3, 'config/stop': 0, 'experiment_tag': 'exp', 'time_total_s': 0.0022132396697998047}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:51] {806} INFO - trial 6 config: {'model': 'code-davinci-002', 'max_tokens': 263, 'n': 41, 'prompt': 0, 'stop': 0, 'temperature_or_top_p': {'top_p': 0.49834557213253655}}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:54] {215} INFO - result: {'expected_success': 0, 'total_cost': 3.003171999999999, 'cost': 0.3675, 'training_iteration': 0, 'config': {'model': 'code-davinci-002', 'max_tokens': 263, 'n': 41, 'prompt': 0, 'stop': 0, 'temperature_or_top_p': {'top_p': 0.49834557213253655}}, 'config/model': 'code-davinci-002', 'config/max_tokens': 263, 'config/n': 41, 'config/prompt': 0, 'config/stop': 0, 'config/temperature_or_top_p': {'top_p': 0.49834557213253655}, 'experiment_tag': 'exp', 'time_total_s': 3.3002660274505615}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:55] {806} INFO - trial 7 config: {'model': 'code-cushman-001', 'temperature_or_top_p': {'temperature': 0.8286813263076767}, 'max_tokens': 57, 'n': 63, 'prompt': 3, 'stop': 0}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:55] {215} INFO - result: {'expected_success': 0, 'total_cost': 4.046379999999999, 'cost': 1.043208, 'training_iteration': 0, 'config': {'model': 'code-cushman-001', 'temperature_or_top_p': {'temperature': 0.8286813263076767}, 'max_tokens': 57, 'n': 63, 'prompt': 3, 'stop': 0}, 'config/model': 'code-cushman-001', 'config/temperature_or_top_p': {'temperature': 0.8286813263076767}, 'config/max_tokens': 57, 'config/n': 63, 'config/prompt': 3, 'config/stop': 0, 'experiment_tag': 'exp', 'time_total_s': 0.007852792739868164}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[flaml.tune.tune: 02-13 23:41:55] {827} WARNING - fail to sample a trial for 100 times in a row, stopping.\n"
     ]
    }
   ],
   "source": [
    "config, analysis = oai.Completion.tune(\n",
    "    data=tune_data,  # the data for tuning\n",
    "    metric=\"expected_success\",  # the metric to optimize\n",
    "    mode=\"max\",  # the optimization mode\n",
    "    eval_func=success_metrics,  # the evaluation function to return the success metrics\n",
    "    # log_file_name=\"logs/humaneval.log\",  # the log file name\n",
    "    inference_budget=0.1,  # the inference budget (dollar)\n",
    "    optimization_budget=4,  # the optimization budget (dollar)\n",
    "    # num_samples can further limit the number of trials for different hyperparameter configurations;\n",
    "    # -1 means decided by the optimization budget only\n",
    "    num_samples=-1,\n",
    "    model=tune.choice(\n",
    "        [\n",
    "            # These two models are currently free to use from OpenAI,\n",
    "            # so no actual cost will incur. They are not free in Azure OpenAI.\n",
    "            # The optimization is based on the price in Azure OpenAI.\n",
    "            \"code-cushman-001\", \n",
    "            \"code-davinci-002\",\n",
    "        ]\n",
    "    ),\n",
    "    prompt=[\n",
    "        \"{prompt}\",\n",
    "        \"# Python 3{prompt}\",\n",
    "        \"Complete the following Python function:{prompt}\",\n",
    "        \"Complete the following Python function while including necessary import statements inside the function:{prompt}\",\n",
    "    ],  # the prompt templates to choose from\n",
    "    stop=[\"\\nclass\", \"\\ndef\", \"\\nif\", \"\\nprint\"],  # the stop sequence\n",
    ")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Output tuning results\n",
    "\n",
    "After the tuning, we can print out the config and the result found by FLAML:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:41:55.049204Z",
     "iopub.status.busy": "2023-02-13T23:41:55.048871Z",
     "iopub.status.idle": "2023-02-13T23:41:55.053284Z",
     "shell.execute_reply": "2023-02-13T23:41:55.052574Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "optimized config {'model': 'code-cushman-001', 'max_tokens': 433, 'n': 29, 'prompt': '{prompt}', 'stop': ['\\nclass', '\\ndef', '\\nif', '\\nprint'], 'top_p': 0.6125260668293881}\n",
      "best result on tuning data {'expected_success': 0.6186627404336135, 'success': 0.65, 'total_cost': 2.4239719999999987, 'cost': 1.2390720000000002, 'inference_cost': 0.059620799999999995, 'training_iteration': 0, 'config': {'model': 'code-cushman-001', 'temperature_or_top_p': {'top_p': 0.6125260668293881}, 'max_tokens': 433, 'n': 29, 'prompt': 0, 'stop': 0}, 'config/model': 'code-cushman-001', 'config/temperature_or_top_p': {'top_p': 0.6125260668293881}, 'config/max_tokens': 433, 'config/n': 29, 'config/prompt': 0, 'config/stop': 0, 'experiment_tag': 'exp', 'time_total_s': 34.57707595825195}\n"
     ]
    }
   ],
   "source": [
    "print(\"optimized config\", config)\n",
    "print(\"best result on tuning data\", analysis.best_result)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Make a request with the tuned config\n",
    "\n",
    "We can apply the tuned config on the request for an example task:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:41:55.056205Z",
     "iopub.status.busy": "2023-02-13T23:41:55.055631Z",
     "iopub.status.idle": "2023-02-13T23:41:56.039259Z",
     "shell.execute_reply": "2023-02-13T23:41:56.038427Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"choices\": [\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 0,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(guess))]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 1,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(game))]\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 2,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(guess))]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 3,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(game[i]-guess[i]) for i in range(len(game))]\\n\\n#print(compare([1,2,3,4,5,1],[1,2,3,4,2,-2]))\\n#print(compare([0,5,0,0,0,4],[4,1,1,0,0,-2]))\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 4,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    #\\n    # TODO: Define compare\\n    #\\n    return None\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 5,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    # your code here\\n    return [abs(game[i]-guess[i]) for i in range(len(game))]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 6,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(game[i]-guess[i]) for i in range(len(game))]\\n    \\n\\n# Recursion\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"length\",\n",
      "      \"index\": 7,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    #return a list of the absolute difference between guess and score\\n    return [abs(score-guess) for score,guess in zip(game,guess)]\\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 8,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    #Your code goes here\\n    guess = np.array(guess)\\n    score = np.array(game)\\n    res = np.zeros(len(guess))\\n    for i in range(len(guess)):\\n        if guess[i] == score[i]:\\n            res[i] = 0\\n        else:\\n            res[i] = abs(guess[i] - score[i])\\n    return res\\n\\n#print(compare([1,2,3,4,5,1],[1,2,3,4,2,-2]))\\n#print(compare([0,5,0,0,0,4],[4,1,1,0,0,-2]))\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 9,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    if len(game) != len(guess):\\n        return \\\"Your arrays are not of equal length.\\\"\\n    else:\\n        return [abs(guess[i]-game[i]) for i in range(len(guess))]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"length\",\n",
      "      \"index\": 10,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    l = []\\n    for i in range(len(guess)):\\n        if guess[i] == game[i]:\\n            l.append(0)\\n        else:\\n            l.append(abs(guess[i] - game[i]))\\n    return l\\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 11,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(game))]\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 12,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    assert len(game) == len(guess), \\\"the length of game and guess must be equal\\\"\\n    return [abs(guess[i] - game[i]) for i in range(len(game))]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 13,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(a-b) for a,b in zip(game,guess)]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 14,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(guess))]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"length\",\n",
      "      \"index\": 15,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    answer = []\\n    for i in range(len(guess)):\\n        answer.append(guess[i]-game[i])\\n    return answer\\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 16,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(guess))]\\n\\n#%%\\n#EXAMPLE\\n#%%\\ngame = [1,2,3,4,5,1]\\nguess = [1,2,3,4,2,-2]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 17,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(game[i]-guess[i]) for i in range(len(game))]\\n    \\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 18,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(game))]\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"length\",\n",
      "      \"index\": 19,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(guess))]\\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"length\",\n",
      "      \"index\": 20,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    if len(game) != len(guess):\\n        return []\\n    results = []\\n    for i in range(len(game)):\\n        results.append(abs(guess[i] - game[i]))\\n    return results\\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 21,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(game[i]-guess[i]) for i in range(len(game))]\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"length\",\n",
      "      \"index\": 22,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i] - game[i]) for i in range(len(game))]\\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"length\",\n",
      "      \"index\": 23,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i] - game[i]) for i in range(len(guess))]\\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \\n    \"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 24,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i] - game[i]) for i in range(len(guess))]\\n    \"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 25,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(guess))]\\n\\n#or use the following solution\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 26,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(guess[i]-game[i]) for i in range(len(game))]\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 27,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    return [abs(score-guess) for score,guess in zip(game,guess)]\"\n",
      "    },\n",
      "    {\n",
      "      \"finish_reason\": \"stop\",\n",
      "      \"index\": 28,\n",
      "      \"logprobs\": null,\n",
      "      \"text\": \"    results = []\\n    for i in range(len(game)):\\n        if guess[i] == game[i]:\\n            results.append(0)\\n        else:\\n            results.append(abs(guess[i] - game[i]))\\n    return results\\n\"\n",
      "    }\n",
      "  ],\n",
      "  \"created\": 1675617146,\n",
      "  \"id\": \"cmpl-6gcqwCz8JXC5eB62rsjxrcIgL3n4B\",\n",
      "  \"model\": \"code-cushman-001\",\n",
      "  \"object\": \"text_completion\",\n",
      "  \"usage\": {\n",
      "    \"completion_tokens\": 3959,\n",
      "    \"prompt_tokens\": 239,\n",
      "    \"total_tokens\": 4198\n",
      "  }\n",
      "}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'expected_success': 1.0, 'success': True}\n"
     ]
    }
   ],
   "source": [
    "responses = oai.Completion.create(context=tune_data[1], **config)\n",
    "print(responses)\n",
    "print(success_metrics([response[\"text\"].rstrip() for response in responses[\"choices\"]], **tune_data[1]))\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate the success rate on the test data\n",
    "\n",
    "You can use flaml's `oai.Completion.eval` to evaluate the performance of an entire dataset with the tuned config. To do that you need to set `oai.Completion.data` to the data to evaluate. The following code will take a while to evaluate all the 144 test data instances. Compared to the baseline success rate (0.46) on the [HELM benchmark](https://crfm.stanford.edu/helm/latest/?group=code_humaneval), the tuned config has a success rate of 0.68. It can be further improved if the inference budget and optimization budget are further increased."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-13T23:41:56.042764Z",
     "iopub.status.busy": "2023-02-13T23:41:56.042086Z",
     "iopub.status.idle": "2023-02-13T23:53:05.597643Z",
     "shell.execute_reply": "2023-02-13T23:53:05.596603Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'expected_success': 0.6364503360372493, 'success': 0.6805555555555556, 'total_cost': 12.227739999999997, 'cost': 8.181360000000003, 'inference_cost': 0.056815}\n"
     ]
    }
   ],
   "source": [
    "oai.Completion.data = test_data\n",
    "result = oai.Completion.eval(analysis.best_config, prune=False, eval_only=True)\n",
    "print(result)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  },
  "vscode": {
   "interpreter": {
    "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {
     "2d910cfd2d2a4fc49fc30fbbdc5576a7": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "454146d0f7224f038689031002906e6f": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HBoxModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HBoxModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HBoxView",
       "box_style": "",
       "children": [
        "IPY_MODEL_e4ae2b6f5a974fd4bafb6abb9d12ff26",
        "IPY_MODEL_577e1e3cc4db4942b0883577b3b52755",
        "IPY_MODEL_b40bdfb1ac1d4cffb7cefcb870c64d45"
       ],
       "layout": "IPY_MODEL_dc83c7bff2f241309537a8119dfc7555",
       "tabbable": null,
       "tooltip": null
      }
     },
     "577e1e3cc4db4942b0883577b3b52755": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "FloatProgressModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "FloatProgressModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "ProgressView",
       "bar_style": "success",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_2d910cfd2d2a4fc49fc30fbbdc5576a7",
       "max": 1,
       "min": 0,
       "orientation": "horizontal",
       "style": "IPY_MODEL_74a6ba0c3cbc4051be0a83e152fe1e62",
       "tabbable": null,
       "tooltip": null,
       "value": 1
      }
     },
     "6086462a12d54bafa59d3c4566f06cb2": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "74a6ba0c3cbc4051be0a83e152fe1e62": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "ProgressStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "ProgressStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "bar_color": null,
       "description_width": ""
      }
     },
     "7d3f3d9e15894d05a4d188ff4f466554": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "b40bdfb1ac1d4cffb7cefcb870c64d45": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_f1355871cc6f4dd4b50d9df5af20e5c8",
       "placeholder": "",
       "style": "IPY_MODEL_ca245376fd9f4354af6b2befe4af4466",
       "tabbable": null,
       "tooltip": null,
       "value": " 1/1 [00:00&lt;00:00, 44.69it/s]"
      }
     },
     "ca245376fd9f4354af6b2befe4af4466": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "dc83c7bff2f241309537a8119dfc7555": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "e4ae2b6f5a974fd4bafb6abb9d12ff26": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_6086462a12d54bafa59d3c4566f06cb2",
       "placeholder": "",
       "style": "IPY_MODEL_7d3f3d9e15894d05a4d188ff4f466554",
       "tabbable": null,
       "tooltip": null,
       "value": "100%"
      }
     },
     "f1355871cc6f4dd4b50d9df5af20e5c8": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     }
    },
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}