diff --git a/README.md b/README.md
index 7e199eb2e..a91851b34 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 
-:fire: OpenAI GPT-3 models support in v1.1.3. ChatGPT and GPT-4 support will be added in v1.2.0.
+:fire: v1.2.0 is released with support for ChatGPT and GPT-4.
 
 :fire: A [lab forum](https://github.com/microsoft/FLAML/tree/tutorial-aaai23/tutorial) on FLAML at AAAI 2023.
diff --git a/flaml/autogen/math_utils.py b/flaml/autogen/math_utils.py
index a16b05c0c..b5e0807e7 100644
--- a/flaml/autogen/math_utils.py
+++ b/flaml/autogen/math_utils.py
@@ -290,8 +290,16 @@ def eval_math_responses(responses, solution=None, **args):
     Returns:
         dict: The success metrics.
     """
-    success_list = []
     n = len(responses)
+    if not n:
+        return {
+            "expected_success": 0,
+            "success": False,
+            "success_vote": 0,
+            "voted_answer": None,
+            "votes": 0,
+        }
+    success_list = []
     if solution is not None:
         for i in range(n):
             response = responses[i]
diff --git a/flaml/autogen/oai/completion.py b/flaml/autogen/oai/completion.py
index 962e2f4b7..d38483dd1 100644
--- a/flaml/autogen/oai/completion.py
+++ b/flaml/autogen/oai/completion.py
@@ -843,7 +843,7 @@ class Completion:
         choices = response["choices"]
         if "text" in choices[0]:
             return [choice["text"] for choice in choices]
-        return [choice["message"]["content"] for choice in choices]
+        return [choice["message"].get("content", "") for choice in choices]
 
 
 class ChatCompletion(Completion):
diff --git a/flaml/version.py b/flaml/version.py
index c68196d1c..a955fdae1 100644
--- a/flaml/version.py
+++ b/flaml/version.py
@@ -1 +1 @@
-__version__ = "1.2.0"
+__version__ = "1.2.1"
diff --git a/test/openai/test_completion.py b/test/openai/test_completion.py
index 4eec54a75..3578b66a2 100644
--- a/test/openai/test_completion.py
+++ b/test/openai/test_completion.py
@@ -216,6 +216,7 @@ def test_math(num_samples=-1):
     print("tuned config", config)
     result = oai.ChatCompletion.test(test_data_sample, config)
     print("result from tuned config:", result)
+    print("empty responses", eval_math_responses([], None))
 
 
 if __name__ == "__main__":
diff --git a/website/docs/Examples/AutoGen-OpenAI.md b/website/docs/Examples/AutoGen-OpenAI.md
index 19e35f992..6a9bf9101 100644
--- a/website/docs/Examples/AutoGen-OpenAI.md
+++ b/website/docs/Examples/AutoGen-OpenAI.md
@@ -56,7 +56,7 @@ test_data = [
 ]
 ```
 
-### Defining the metric
+### Define the metric
 
 Before starting tuning, you need to define the metric for the optimization. For each code generation task, we can use the model to generate multiple candidate responses, and then select one from them. If the final selected response can pass a unit test, we consider the task as successfully solved. Then we can define the average success rate on a collection of tasks as the optimization metric.
@@ -69,7 +69,7 @@ eval_with_generated_assertions = partial(eval_function_completions, assertions=g
 
 This function will first generate assertion statements for each problem. Then, it uses the assertions to select the generated responses.
 
-### Tuning Hyperparameters for OpenAI
+### Tune the hyperparameters
 
 The tuning will be performed under the specified optimization budgets.
diff --git a/website/docs/Use-Cases/Auto-Generation.md b/website/docs/Use-Cases/Auto-Generation.md
index 3158ed790..94d9742dc 100644
--- a/website/docs/Use-Cases/Auto-Generation.md
+++ b/website/docs/Use-Cases/Auto-Generation.md
@@ -44,13 +44,13 @@ Collect a diverse set of instances. They can be stored in an iterable of dicts.
 
 The evaluation function should take a list of responses, and other keyword arguments corresponding to the keys in each validation data instance as input, and output a dict of metrics. For example,
 
 ```python
-def success_metrics(responses: List[str], problem: str, solution: str) -> Dict:
+def eval_math_responses(responses: List[str], solution: str, **args) -> Dict:
     # select a response from the list of responses
     # check whether the answer is correct
     return {"success": True or False}
 ```
 
-`flaml.autogen` offers some example evaluation functions for common tasks such as code generation and math problem solving.
+[`flaml.autogen.code_utils`](../reference/autogen/code_utils) and [`flaml.autogen.math_utils`](../reference/autogen/math_utils) offer some example evaluation functions for code generation and math problem solving.
 
 ### Metric to optimize