diff --git a/website/blog/2023-04-21-LLM-tuning-math/index.mdx b/website/blog/2023-04-21-LLM-tuning-math/index.mdx
index 90cbebc45..0e2aa4e54 100644
--- a/website/blog/2023-04-21-LLM-tuning-math/index.mdx
+++ b/website/blog/2023-04-21-LLM-tuning-math/index.mdx
@@ -24,7 +24,7 @@ We will use FLAML to perform model selection and inference parameter tuning. The
 We use FLAML to select between the following models with a target inference budget $0.02 per instance:
 
 - gpt-3.5-turbo, a relatively cheap model that powers the popular ChatGPT app
-- gpt-4, the state of the art LLM that costs more than 100 times of gpt-3.5-turbo
+- gpt-4, the state of the art LLM that costs more than 10 times of gpt-3.5-turbo
 
 We adapt the models using 20 examples in the train set, using the problem statement as the input and generating the solution as the output. We use the following inference parameters:
diff --git a/website/blog/2023-05-18-GPT-adaptive-humaneval/index.mdx b/website/blog/2023-05-18-GPT-adaptive-humaneval/index.mdx
index 934f65432..e519b5827 100644
--- a/website/blog/2023-05-18-GPT-adaptive-humaneval/index.mdx
+++ b/website/blog/2023-05-18-GPT-adaptive-humaneval/index.mdx
@@ -10,7 +10,7 @@ tags: [LLM, GPT, research]
 
 * **A case study using the HumanEval benchmark shows that an adaptive way of using multiple GPT models can achieve both much higher accuracy (from 68% to 90%) and lower inference cost (by 18%) than using GPT-4 for coding.**
 
-GPT-4 is a big upgrade of foundation model capability, e.g., in code and math, accompanied by a much higher (more than 100x) price per token to use over GPT-3.5-Turbo. On a code completion benchmark, [HumanEval](https://huggingface.co/datasets/openai_humaneval), developed by OpenAI, GPT-4 can successfully solve 68% tasks while GPT-3.5-Turbo does 46%. It is possible to increase the success rate of GPT-4 further by generating multiple responses or making multiple calls. However, that will further increase the cost, which is already nearly 20 times of using GPT-3.5-Turbo and with more restricted API call rate limit. Can we achieve more with less?
+GPT-4 is a big upgrade of foundation model capability, e.g., in code and math, accompanied by a much higher (more than 10x) price per token to use over GPT-3.5-Turbo. On a code completion benchmark, [HumanEval](https://huggingface.co/datasets/openai_humaneval), developed by OpenAI, GPT-4 can successfully solve 68% tasks while GPT-3.5-Turbo does 46%. It is possible to increase the success rate of GPT-4 further by generating multiple responses or making multiple calls. However, that will further increase the cost, which is already nearly 20 times of using GPT-3.5-Turbo and with more restricted API call rate limit. Can we achieve more with less?
 
 In this blog post, we will explore a creative, adaptive way of using GPT models which leads to a big leap forward.
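
For context on the corrected figures, a quick sanity check of the arithmetic behind this patch. Assuming OpenAI's spring-2023 price sheet (gpt-3.5-turbo at $0.002 per 1K tokens; 8K-context gpt-4 at $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens; these rates are an assumption, not stated in the patch), the per-token ratio is roughly 15x to 30x. That makes "more than 10 times" accurate and the earlier "100 times" an overstatement, while remaining consistent with the "nearly 20 times" cost quoted in the HumanEval post:

```python
# Sanity check for the 100x -> 10x correction in this patch.
# Prices (USD per 1K tokens) are assumed from OpenAI's spring-2023
# price sheet; they are not part of the patch itself.
GPT35_TURBO = 0.002       # gpt-3.5-turbo, same rate for prompt and completion
GPT4_PROMPT = 0.03        # gpt-4 (8K context), prompt tokens
GPT4_COMPLETION = 0.06    # gpt-4 (8K context), completion tokens

print(GPT4_PROMPT / GPT35_TURBO)      # 15.0 -> more than 10x, far below 100x
print(GPT4_COMPLETION / GPT35_TURBO)  # 30.0

# A prompt-heavy 2:1 token mix lands at the ~20x overall cost
# mentioned in the HumanEval post.
mixed = (2 * GPT4_PROMPT + GPT4_COMPLETION) / 3
print(mixed / GPT35_TURBO)            # 20.0
```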