{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved. \n", "\n", "Licensed under the MIT License.\n", "\n", "# Math Study\n", "\n", "In this notebook, we study GPT-4 for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought) for measuring mathematical problem solving on competition math problems with chain-of-thoughts style reasoning. \n", "\n", "## Requirements\n", "\n", "FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [openai] option:\n", "```bash\n", "pip install flaml[openai]==1.2.2\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:52.317406Z", "iopub.status.busy": "2023-02-13T23:40:52.316561Z", "iopub.status.idle": "2023-02-13T23:40:52.321193Z", "shell.execute_reply": "2023-02-13T23:40:52.320628Z" } }, "outputs": [], "source": [ "# %pip install flaml[openai]==1.2.2 datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Set your OpenAI key:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:52.324240Z", "iopub.status.busy": "2023-02-13T23:40:52.323783Z", "iopub.status.idle": "2023-02-13T23:40:52.330570Z", "shell.execute_reply": "2023-02-13T23:40:52.329750Z" } }, "outputs": [], "source": [ "import os\n", "\n", "if \"OPENAI_API_KEY\" not in os.environ:\n", " os.environ[\"OPENAI_API_KEY\"] = \"\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Uncomment the following to use Azure OpenAI:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:52.333547Z", "iopub.status.busy": "2023-02-13T23:40:52.333249Z", "iopub.status.idle": "2023-02-13T23:40:52.336508Z", "shell.execute_reply": "2023-02-13T23:40:52.335858Z" } }, "outputs": [], "source": [ "# import openai\n", "# openai.api_type = \"azure\"\n", "# openai.api_base = \"https://.openai.azure.com/\"\n", "# openai.api_version = \"2023-03-15-preview\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Load dataset\n", "\n", "First, we load the competition_math dataset. We use a random sample of 50 examples for testing." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:52.339977Z", "iopub.status.busy": "2023-02-13T23:40:52.339556Z", "iopub.status.idle": "2023-02-13T23:40:54.603349Z", "shell.execute_reply": "2023-02-13T23:40:54.602630Z" } }, "outputs": [], "source": [ "import datasets\n", "\n", "seed = 41\n", "data = datasets.load_dataset(\"competition_math\")\n", "train_data = data[\"train\"].shuffle(seed=seed)\n", "test_data = data[\"test\"].shuffle(seed=seed)\n", "n_tune_data = 20\n", "tune_data = [\n", " {\n", " \"problem\": train_data[x][\"problem\"],\n", " \"solution\": train_data[x][\"solution\"],\n", " }\n", " for x in range(len(train_data)) if train_data[x][\"level\"] == \"Level 5\" and train_data[x][\"type\"] == \"Counting & Probability\"\n", "][:n_tune_data]\n", "test_data = [\n", " {\n", " \"problem\": test_data[x][\"problem\"],\n", " \"solution\": test_data[x][\"solution\"],\n", " }\n", " for x in range(len(test_data)) if test_data[x][\"level\"] == \"Level 5\" and test_data[x][\"type\"] == \"Counting & Probability\"\n", "]\n", "print(len(tune_data), len(test_data))\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Check a tuning example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:54.607152Z", "iopub.status.busy": "2023-02-13T23:40:54.606441Z", "iopub.status.idle": "2023-02-13T23:40:54.610504Z", "shell.execute_reply": "2023-02-13T23:40:54.609759Z" }, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "print(tune_data[1][\"problem\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is one example of the canonical solution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:54.613590Z", "iopub.status.busy": "2023-02-13T23:40:54.613168Z", "iopub.status.idle": "2023-02-13T23:40:54.616873Z", "shell.execute_reply": "2023-02-13T23:40:54.616193Z" } }, "outputs": [], "source": [ "print(tune_data[1][\"solution\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Import Success Metric\n", "\n", "For each math task, we use voting to select a response with the most common answers out of all the generated responses. If it has an equivalent answer to the canonical solution, we consider the task as successfully solved. Then we can optimize the mean success rate of a collection of tasks." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:54.626998Z", "iopub.status.busy": "2023-02-13T23:40:54.626593Z", "iopub.status.idle": "2023-02-13T23:40:54.631383Z", "shell.execute_reply": "2023-02-13T23:40:54.630770Z" } }, "outputs": [], "source": [ "from flaml.autogen.math_utils import eval_math_responses" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Import the oai and tune subpackages from flaml.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:54.634335Z", "iopub.status.busy": "2023-02-13T23:40:54.633929Z", "iopub.status.idle": "2023-02-13T23:40:56.105700Z", "shell.execute_reply": "2023-02-13T23:40:56.105085Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from flaml import oai" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For (local) reproducibility and cost efficiency, we cache responses from OpenAI." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:56.109177Z", "iopub.status.busy": "2023-02-13T23:40:56.108624Z", "iopub.status.idle": "2023-02-13T23:40:56.112651Z", "shell.execute_reply": "2023-02-13T23:40:56.112076Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "oai.ChatCompletion.set_cache(seed)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This will create a disk cache in \".cache/{seed}\". You can change `cache_path` in `set_cache()`. The cache for different seeds are stored separately." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2023-02-13T23:40:56.115383Z", "iopub.status.busy": "2023-02-13T23:40:56.114975Z", "iopub.status.idle": "2023-02-13T23:41:55.045654Z", "shell.execute_reply": "2023-02-13T23:41:55.044973Z" } }, "outputs": [], "source": [ "prompt = \"{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\\\boxed{{}}.\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the success rate on the test data\n", "\n", "You can use flaml's `oai.ChatCompletion.test` to evaluate the performance of an entire dataset with the tuned config." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "config_n1 = {\"model\": 'gpt-4', \"prompt\": prompt, \"max_tokens\": 600, \"n\": 1}\n", "n1_result = oai.ChatCompletion.test(test_data[:50], config_n1, eval_math_responses)\n", "print(n1_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "oai.ChatCompletion.request_timeout = 120\n", "config_n10 = {\"model\": 'gpt-4', \"prompt\": prompts[0], \"max_tokens\": 600, \"n\": 10}\n", "n10_result = oai.ChatCompletion.test(test_data[:50], config_n10, eval_math_responses, logging_level=logging.INFO)\n", "print(n10_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_n30 = {\"model\": 'gpt-4', \"prompt\": prompts[0], \"max_tokens\": 600, \"n\": 30}\n", "n30_result = oai.ChatCompletion.test(test_data[:50], config_n30, eval_math_responses, logging_level=logging.INFO)\n", "print(n30_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "import matplotlib.pyplot as plt\n", "\n", "prompts = [\"{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\\\boxed{{}}.\"]\n", "markers = [\"o\", \"s\", \"D\", \"v\", \"p\", \"h\", \"d\", \"P\", \"X\", \"H\", \"8\", \"4\", \"3\", \"2\", \"1\", \"x\", \"+\", \">\", \"<\", \"^\", \"v\", \"1\", \"2\", \"3\", \"4\", \"8\", \"s\", \"p\", \"*\", \"h\", \"H\", \"d\", \"D\", \"|\", \"_\"]\n", "for j, n in enumerate([10, 30]):\n", " config = {\"model\": 'gpt-4', \"prompt\": prompts[0], \"max_tokens\": 600, \"n\": n}\n", " metrics = []\n", " x, y = [], []\n", " votes_success = defaultdict(lambda: [0, 0])\n", " for i, data_i in enumerate(test_data[:50]):\n", " response = oai.ChatCompletion.create(context=data_i, **config)\n", " responses = oai.ChatCompletion.extract_text(response)\n", " metrics.append(eval_math_responses(responses, **data_i))\n", " votes = metrics[-1][\"votes\"]\n", " success = metrics[-1][\"success_vote\"]\n", " votes_success[votes][0] += 1\n", " votes_success[votes][1] += success\n", " for votes in votes_success:\n", " x.append(votes)\n", " y.append(votes_success[votes][1] / votes_success[votes][0])\n", "\n", " plt.scatter(x, y, marker=markers[j])\n", " plt.xlabel(\"top vote\")\n", " plt.ylabel(\"success rate\")\n", "plt.legend([\"n=10\", \"n=30\"])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "vscode": { "interpreter": { "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": { "2d910cfd2d2a4fc49fc30fbbdc5576a7": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, 
"bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "454146d0f7224f038689031002906e6f": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_e4ae2b6f5a974fd4bafb6abb9d12ff26", "IPY_MODEL_577e1e3cc4db4942b0883577b3b52755", "IPY_MODEL_b40bdfb1ac1d4cffb7cefcb870c64d45" ], "layout": "IPY_MODEL_dc83c7bff2f241309537a8119dfc7555", "tabbable": null, "tooltip": null } }, "577e1e3cc4db4942b0883577b3b52755": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_2d910cfd2d2a4fc49fc30fbbdc5576a7", "max": 1, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_74a6ba0c3cbc4051be0a83e152fe1e62", "tabbable": null, "tooltip": null, "value": 1 } }, "6086462a12d54bafa59d3c4566f06cb2": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "74a6ba0c3cbc4051be0a83e152fe1e62": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", 
"_view_module_version": "2.0.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "7d3f3d9e15894d05a4d188ff4f466554": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "StyleView", "background": null, "description_width": "", "font_size": null, "text_color": null } }, "b40bdfb1ac1d4cffb7cefcb870c64d45": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HTMLView", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_f1355871cc6f4dd4b50d9df5af20e5c8", "placeholder": "​", "style": "IPY_MODEL_ca245376fd9f4354af6b2befe4af4466", "tabbable": null, "tooltip": null, "value": " 1/1 [00:00<00:00, 44.69it/s]" } }, "ca245376fd9f4354af6b2befe4af4466": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "StyleView", "background": null, "description_width": "", "font_size": null, "text_color": null } }, "dc83c7bff2f241309537a8119dfc7555": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "e4ae2b6f5a974fd4bafb6abb9d12ff26": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HTMLView", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_6086462a12d54bafa59d3c4566f06cb2", "placeholder": "​", "style": "IPY_MODEL_7d3f3d9e15894d05a4d188ff4f466554", "tabbable": null, "tooltip": null, "value": "100%" 
} }, "f1355871cc6f4dd4b50d9df5af20e5c8": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } } }, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 2 }