Copyright (c) Microsoft Corporation. All rights reserved. 

Licensed under the MIT License.

# Math Study

In this notebook, we study GPT-4 for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought), which measures mathematical problem solving on competition math problems with chain-of-thought style reasoning.

## Requirements

FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [openai] option:
```bash
pip install flaml[openai]==1.2.2
```

In [None]:
# %pip install flaml[openai]==1.2.2 datasets

Set your OpenAI key:

In [None]:
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = ""

Uncomment the following to use Azure OpenAI:

In [None]:
# import openai
# openai.api_type = "azure"
# openai.api_base = "https://.openai.azure.com/"
# openai.api_version = "2023-03-15-preview"

## Load dataset

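First, we load the competition_math dataset, shuffle it with a fixed seed, and keep only the Level 5 "Counting & Probability" problems. We set aside 20 training problems as tuning data and later evaluate on the first 50 problems of the filtered test set.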

In [None]:
import datasets

seed = 41
data = datasets.load_dataset("competition_math")
train_data = data["train"].shuffle(seed=seed)
test_data = data["test"].shuffle(seed=seed)
n_tune_data = 20


def is_target(example):
    # Keep only the hardest problems from one subject area.
    return example["level"] == "Level 5" and example["type"] == "Counting & Probability"


tune_data = [
    {"problem": example["problem"], "solution": example["solution"]}
    for example in train_data
    if is_target(example)
][:n_tune_data]
test_data = [
    {"problem": example["problem"], "solution": example["solution"]}
    for example in test_data
    if is_target(example)
]
print(len(tune_data), len(test_data))


Check a tuning example:

In [None]:
print(tune_data[1]["problem"])

Here is the canonical solution for the same example:

In [None]:
print(tune_data[1]["solution"])

## Import Success Metric

For each math task, we use voting to select the most common answer among all generated responses. If the voted answer is equivalent to the answer in the canonical solution, we consider the task successfully solved. We can then optimize the mean success rate over a collection of tasks.

In [None]:
from flaml.autogen.math_utils import eval_math_responses
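
To make the voting idea concrete, here is a minimal sketch of majority voting over boxed answers. It is a simplified illustration, not the implementation of `eval_math_responses`: the helper names `extract_boxed` and `vote_and_check` are hypothetical, the regular expression does not handle nested braces, and answers are compared by exact string match rather than by mathematical equivalence. The result keys `votes` and `success_vote` are chosen to mirror the ones read from the metric later in this notebook.

```python
import re
from collections import Counter


def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None.

    Simplification: nested braces such as \\boxed{\\frac{1}{2}} are not handled.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def vote_and_check(responses, reference_answer):
    """Pick the most common boxed answer and compare it to the reference."""
    answers = [a for a in map(extract_boxed, responses) if a is not None]
    if not answers:
        return {"voted_answer": None, "votes": 0, "success_vote": False}
    voted_answer, votes = Counter(answers).most_common(1)[0]
    return {
        "voted_answer": voted_answer,
        "votes": votes,  # number of responses agreeing with the top answer
        "success_vote": voted_answer == reference_answer,  # exact match only
    }


# Hypothetical responses, made up for illustration.
responses = [
    "... so the answer is \\boxed{12}.",
    "We get \\boxed{12}.",
    "Thus \\boxed{15}.",
]
result = vote_and_check(responses, "12")
print(result)
```

Two of the three made-up responses agree on `12`, so that answer wins the vote with 2 votes and matches the reference.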

### Import the oai subpackage from flaml


In [None]:
from flaml import oai

For (local) reproducibility and cost efficiency, we cache responses from OpenAI.

In [None]:
oai.ChatCompletion.set_cache(seed)

This will create a disk cache in ".cache/{seed}". You can change `cache_path` in `set_cache()`. Caches for different seeds are stored separately.
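
As a rough illustration of why a seed-keyed disk cache helps with reproducibility and cost, the sketch below stores each result in a file under `<cache_root>/<seed>/` and reuses it on repeated requests. This is not how flaml implements its cache; `cached_call`, `fake_model`, and the JSON file layout are all hypothetical and use only the standard library.

```python
import hashlib
import json
import os
import tempfile


def cached_call(request, seed, compute, cache_root):
    """Return compute(request), reusing a result stored under cache_root/seed."""
    cache_dir = os.path.join(cache_root, str(seed))
    os.makedirs(cache_dir, exist_ok=True)
    # Key the entry by a hash of the request so identical requests share a file.
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = compute(request)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


calls = []


def fake_model(request):
    # Stand-in for a paid API call; records how often it is actually invoked.
    calls.append(request)
    return {"text": "stub response for: " + request["prompt"]}


cache_root = tempfile.mkdtemp()
request = {"model": "gpt-4", "prompt": "What is 1 + 1?"}
first = cached_call(request, 41, fake_model, cache_root)
second = cached_call(request, 41, fake_model, cache_root)  # served from disk
third = cached_call(request, 42, fake_model, cache_root)  # separate cache for a new seed
print(len(calls))
```

The second call with seed 41 never reaches `fake_model`, while the call with seed 42 does, because each seed has its own directory.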

In [None]:
prompt = "{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."
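
The doubled braces in `\\boxed{{}}` matter if the template is filled in with Python's `str.format`, which is our assumption about how the `{problem}` placeholder gets substituted from each data instance. The sketch below applies `str.format` directly to a made-up instance to show the resulting message; the variable names `template`, `example`, and `filled` are only for illustration.

```python
template = (
    "{problem} Solve the problem carefully. Simplify your answer as much as possible. "
    "Put the final answer in \\boxed{{}}."
)

# Hypothetical data instance; extra keys such as "solution" are ignored by format().
example = {"problem": "What is 1 + 1?", "solution": "The answer is \\boxed{2}."}

filled = template.format(**example)
print(filled)
```

After formatting, `{{}}` collapses to a literal `{}`, so the model is asked to put its answer in `\boxed{}`.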

### Evaluate the success rate on the test data

You can use flaml's `oai.ChatCompletion.test` to evaluate a config on an entire dataset. Below we compare a single response per problem (`n=1`) against voting over 10 and 30 responses.

In [None]:
import logging

config_n1 = {"model": 'gpt-4', "prompt": prompt, "max_tokens": 600, "n": 1}
n1_result = oai.ChatCompletion.test(test_data[:50], config_n1, eval_math_responses)
print(n1_result)

In [None]:
oai.ChatCompletion.request_timeout = 120
config_n10 = {"model": 'gpt-4', "prompt": prompt, "max_tokens": 600, "n": 10}
n10_result = oai.ChatCompletion.test(test_data[:50], config_n10, eval_math_responses, logging_level=logging.INFO)
print(n10_result)

In [None]:
config_n30 = {"model": 'gpt-4', "prompt": prompt, "max_tokens": 600, "n": 30}
n30_result = oai.ChatCompletion.test(test_data[:50], config_n30, eval_math_responses, logging_level=logging.INFO)
print(n30_result)

In [None]:
from collections import defaultdict
import matplotlib.pyplot as plt

prompts = ["{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."]
markers = ["o", "s", "D", "v", "p", "h", "d", "P", "X", "H", "8", "4", "3", "2", "1", "x", "+", ">", "<", "^", "v", "1", "2", "3", "4", "8", "s", "p", "*", "h", "H", "d", "D", "|", "_"]
for j, n in enumerate([10, 30]):
    config = {"model": 'gpt-4', "prompt": prompts[0], "max_tokens": 600, "n": n}
    metrics = []
    x, y = [], []
    # Maps the top-vote count to [number of problems, number of successes].
    votes_success = defaultdict(lambda: [0, 0])
    for i, data_i in enumerate(test_data[:50]):
        response = oai.ChatCompletion.create(context=data_i, **config)
        responses = oai.ChatCompletion.extract_text(response)
        metrics.append(eval_math_responses(responses, **data_i))
        votes = metrics[-1]["votes"]
        success = metrics[-1]["success_vote"]
        votes_success[votes][0] += 1
        votes_success[votes][1] += success
    # Success rate among problems that share the same top-vote count.
    for votes in votes_success:
        x.append(votes)
        y.append(votes_success[votes][1] / votes_success[votes][0])

    plt.scatter(x, y, marker=markers[j])
    plt.xlabel("top vote")
    plt.ylabel("success rate")
plt.legend(["n=10", "n=30"])
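
The aggregation behind the scatter plot can be tried without calling the API. The sketch below runs the same bucketing logic on a handful of per-problem metrics that are made up purely for illustration (they are not results from GPT-4); `rate_by_votes` is a hypothetical name for the mapping from top-vote count to success rate.

```python
from collections import defaultdict

# Hypothetical per-problem metrics, invented for illustration only.
metrics = [
    {"votes": 10, "success_vote": 1},
    {"votes": 10, "success_vote": 1},
    {"votes": 4, "success_vote": 0},
    {"votes": 4, "success_vote": 1},
    {"votes": 7, "success_vote": 1},
]

# Maps the top-vote count to [number of problems, number of successes].
votes_success = defaultdict(lambda: [0, 0])
for m in metrics:
    votes_success[m["votes"]][0] += 1
    votes_success[m["votes"]][1] += m["success_vote"]

# Success rate for each top-vote count, sorted by vote count.
rate_by_votes = {
    votes: successes / count
    for votes, (count, successes) in sorted(votes_success.items())
}
print(rate_by_votes)
```

Each key is a point on the x-axis ("top vote") and each value is the corresponding y-value ("success rate") in the plot above.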