{ "cells": [ { "cell_type": "markdown", "id": "136a4efe-fb99-4311-8679-e0a5b6282755", "metadata": {}, "source": [ "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka\n", "\n", "Code repository: https://github.com/rasbt/LLMs-from-scratch\n", "
" ] }, { "cell_type": "markdown", "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4", "metadata": { "tags": [] }, "source": [ "# Create \"Passive Voice\" Entries for an Instruction Dataset" ] }, { "cell_type": "markdown", "id": "a128651b-f326-4232-a994-42f38b7ed520", "metadata": {}, "source": [ "- This notebook uses OpenAI's GPT-4 to create \"passive voice\" entries for an instruction dataset, as shown in the example below\n", "\n", "```python\n", "{\n", "    'instruction': 'Identify the verb in the following sentence',\n", "    'input': 'The cat sleeps on the couch.',\n", "    'output': 'The verb in the sentence is \"sleeps.\"',\n", "    'output_2': 'The sentence is \"sleeps.\"'  # <---- Newly created entry\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "267ba0d1-b884-42df-85bd-0be746fd47a5", "metadata": {}, "outputs": [], "source": [ "# pip install -r requirements-extra.txt" ] }, { "cell_type": "code", "execution_count": 2, "id": "63610acc-db94-437f-8d38-e99dca0299cb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "openai version: 1.30.3\n", "tqdm version: 4.65.0\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "pkgs = [\n", "    \"openai\",  # OpenAI API\n", "    \"tqdm\",  # Progress bar\n", "]\n", "\n", "for p in pkgs:\n", "    print(f\"{p} version: {version(p)}\")" ] }, { "cell_type": "markdown", "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5", "metadata": {}, "source": [ "## Test OpenAI API" ] }, { "cell_type": "markdown", "id": "9558a522-650d-401a-84fc-9fd7b1f39da7", "metadata": {}, "source": [ "- First, let's test whether the OpenAI API is set up correctly\n", "- If you don't have an account yet, you need to create one at https://platform.openai.com/\n", "- Note that you will also have to add funds to your account, as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n", "- Creating the ~200 passive voice entries using the code in this notebook costs 
about $0.13 (13 cents)" ] }, { "cell_type": "markdown", "id": "89343a84-0ddc-42fc-bf50-298a342b93c0", "metadata": {}, "source": [ "- Next, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys\n", "- Make sure not to share this key with anyone\n", "- Add this secret key (`\"sk-...\"`) to the `config.json` file in this folder" ] }, { "cell_type": "code", "execution_count": 3, "id": "26900564-aba7-48ba-8ee8-6cc9a505a25c", "metadata": {}, "outputs": [], "source": [ "import json\n", "from openai import OpenAI\n", "\n", "# Load the API key from a JSON file.\n", "# Make sure to replace \"sk-...\" in config.json with your actual API key from https://platform.openai.com/api-keys\n", "with open(\"config.json\", \"r\") as config_file:\n", "    config = json.load(config_file)\n", "    api_key = config[\"OPENAI_API_KEY\"]\n", "\n", "client = OpenAI(api_key=api_key)" ] }, { "cell_type": "markdown", "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876", "metadata": {}, "source": [ "- Now, let's try the API with a simple example to make sure it works as intended:" ] }, { "cell_type": "code", "execution_count": 4, "id": "08e9ef2e-e816-4283-840e-43625791ad33", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Breakfast was eaten by me.'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Send a single user message to the chat completions API and return the reply text\n", "def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n", "    response = client.chat.completions.create(\n", "        model=model,\n", "        messages=[{\"role\": \"user\", \"content\": prompt}],\n", "        temperature=0.0,  # reduce randomness so repeated runs give similar output\n", "    )\n", "    return response.choices[0].message.content\n", "\n", "\n", "# Prepare input\n", "sentence = \"I ate breakfast\"\n", "prompt = f\"Convert the following sentence to passive voice: '{sentence}'\"\n", "run_chatgpt(prompt, client)" ] }, { "cell_type": "markdown", "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d", "metadata": {}, "source": [ "## Create JSON Entries" ] }, { "cell_type": "markdown", "id": 
"ca011a8b-20c5-4101-979e-9b5fccf62f8a", "metadata": {}, "source": [ "- Next, we load the file we want to modify:" ] }, { "cell_type": "code", "execution_count": 5, "id": "8b2d393a-aa92-4190-9d44-44326a6f699b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of entries: 200\n" ] } ], "source": [ "import json\n", "\n", "json_file = \"instruction-examples.json\"\n", "\n", "with open(json_file, \"r\") as file:\n", "    json_data = json.load(file)\n", "\n", "print(\"Number of entries:\", len(json_data))" ] }, { "cell_type": "markdown", "id": "39a55283-7d51-4136-ba60-f799d49f4098", "metadata": {}, "source": [ "- Then, we try the OpenAI chat API on a small sample to make sure it works correctly:" ] }, { "cell_type": "code", "execution_count": 6, "id": "735cc089-d127-480a-b39d-0782581f0c41", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Input:\n", ">> The verb in the sentence is \"sleeps.\"\n", "\n", "Output:\n", ">> The sentence is \"sleeps.\"\n", "\n", "-------------------------\n", "\n", "Input:\n", ">> The plural form of \"goose\" is \"geese.\"\n", "\n", "Output:\n", ">> The plural form of \"goose\" is referred to as \"geese.\"\n", "\n", "-------------------------\n", "\n", "Input:\n", ">> The three primary colors are red, blue, and yellow.\n", "\n", "Output:\n", ">> Red, blue, and yellow are considered the three primary colors.\n", "\n", "-------------------------\n", "\n", "Input:\n", ">> They had finished the game.\n", "\n", "Output:\n", ">> The game had been finished by them.\n", "\n", "-------------------------\n", "\n", "Input:\n", ">> The abbreviation for \"Doctor of Philosophy\" is Ph.D.\n", "\n", "Output:\n", ">> The abbreviation \"Ph.D.\" is used for \"Doctor of Philosophy\".\n", "\n", "-------------------------\n" ] } ], "source": [ "for entry in json_data[:5]:\n", "    text = entry[\"output\"]\n", "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n", "\n", "    print(\"\\nInput:\")\n", "    print(\">>\", text)\n", "    print(\"\\nOutput:\")\n", "    print(\">>\", run_chatgpt(prompt, client))\n", "    print(\"\\n-------------------------\")" ] }, { "cell_type": "markdown", "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a", "metadata": {}, "source": [ "- Let's now extend the code to add the generated entries to `json_data` and show a progress bar:" ] }, { "cell_type": "code", "execution_count": 7, "id": "4f700d4b-19e5-4404-afa7-b0f093024232", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00, 1.23it/s]\n" ] } ], "source": [ "from tqdm import tqdm  # a progress bar tool\n", "\n", "\n", "for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):\n", "    text = entry[\"output\"]\n", "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n", "    json_data[i][\"output_2\"] = run_chatgpt(prompt, client)" ] }, { "cell_type": "markdown", "id": "cd144282-0596-4e9b-9815-322cff34b400", "metadata": {}, "source": [ "- One more time, let's make sure that the new entries (`\"output_2\"`) look OK" ] }, { "cell_type": "code", "execution_count": 8, "id": "5b6eaa87-a86d-42a1-a20a-b764b0d559d4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',\n", " 'input': '',\n", " 'output': 'The verb in the sentence is \"sleeps.\"',\n", " 'output_2': 'The sentence is \"sleeps.\"'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json_data[0]" ] }, { "cell_type": "markdown", "id": "6970e8cf-2b18-4e3d-9f25-e6a4489c39a7", "metadata": {}, "source": [ "- Finally, if everything above looks OK, let's run the conversion to passive voice on our entire JSON dataset (this takes about 3 to 4 
minutes):" ] }, { "cell_type": "code", "execution_count": 9, "id": "eef99407-8ffd-4a63-b7ab-ffe30c0f0677", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████████████████████████████████████| 200/200 [03:43<00:00, 1.12s/it]\n" ] } ], "source": [ "for i, entry in tqdm(enumerate(json_data), total=len(json_data)):\n", "    text = entry[\"output\"]\n", "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n", "    json_data[i][\"output_2\"] = run_chatgpt(prompt, client)" ] }, { "cell_type": "markdown", "id": "ac91ae85-2f0e-456a-be1d-56e1958f30d8", "metadata": {}, "source": [ "- After the conversion is completed, we save the file:" ] }, { "cell_type": "code", "execution_count": 10, "id": "330cc30a-b08e-4bf0-bee2-bec0da4208de", "metadata": {}, "outputs": [], "source": [ "new_json_file = json_file.replace(\".json\", \"-modified.json\")\n", "\n", "\n", "with open(new_json_file, \"w\") as file:\n", "    json.dump(json_data, file, indent=4)  # \"indent\" for pretty-printing" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }