Mirror of https://github.com/rasbt/LLMs-from-scratch.git
Synced 2025-11-10 23:07:28 +00:00

Merge pull request #179 from rasbt/create-json-entries

OpenAI API example to create instruction examples

Commit df44b254dc

.github/workflows/check-links.yml (vendored, 2 changed lines)
@@ -27,4 +27,4 @@ jobs:
     - name: Check links
       run: |
-        pytest --check-links ./
+        pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*"
ch07/02_dataset-utilities/README.md

@@ -11,8 +11,8 @@ pip install -r requirements-extra.txt

-### Finding near duplicates
+## Finding Near-duplicates

The `find-near-duplicates.py` script can be used to identify duplicates and near-duplicates in an instruction dataset. For example,
@@ -23,6 +23,7 @@ python find-near-duplicates.py --json_file instruction-examples.json
```

```
scikit-learn version: 1.3.1


==================================================
@@ -69,3 +70,17 @@ Duplicate pair found with similarity 1.00:

```
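For intuition, the following is a minimal sketch of the TF-IDF plus cosine-similarity approach suggested by the script's scikit-learn imports (visible in the `find-near-duplicates.py` diff further below). It is not the script itself; the similarity threshold, the compared field, and the helper name are illustrative assumptions.

```python
# Minimal sketch only -- not the actual find-near-duplicates.py implementation.
# The TF-IDF / cosine-similarity idea mirrors the scikit-learn imports shown in
# the script's diff below; the 0.9 threshold and function name are assumptions.
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def report_near_duplicates(json_file, key="instruction", threshold=0.9):
    with open(json_file, "r") as f:
        entries = json.load(f)

    texts = [entry[key] for entry in entries]
    tfidf = TfidfVectorizer().fit_transform(texts)  # one TF-IDF vector per entry
    sims = cosine_similarity(tfidf)                 # pairwise similarity matrix

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i, j] >= threshold:
                print(f"Near-duplicate pair (similarity {sims[i, j]:.2f}):")
                print(">>", texts[i])
                print(">>", texts[j])


# Example usage:
# report_near_duplicates("instruction-examples.json")
```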
## Creating Passive Voice Entries

- The [create-passive-voice-entries.ipynb](create-passive-voice-entries.ipynb) notebook uses OpenAI's GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below

```python
{
    'instruction': 'Identify the verb in the following sentence',
    'input': 'The cat sleeps on the couch.',
    'output': 'The verb in the sentence is "sleeps."',
    'output_2': 'The sentence is "sleeps."'  # <---- Newly created entry
}
```
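Because the notebook itself appears below only as raw `.ipynb` JSON in the diff, here is a condensed sketch of its core logic, pieced together from the cells shown there (the prompt wording, the `gpt-4-turbo` model name, and the `output_2` field all come from the notebook; treat this as a summary, not a verbatim excerpt):

```python
# Condensed from create-passive-voice-entries.ipynb (full JSON in the diff below).
import json

from openai import OpenAI
from tqdm import tqdm

client = OpenAI(api_key="your OpenAI API key")  # replace with your own key


def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    # Single chat-completion call with deterministic (temperature=0) output
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


with open("instruction-examples.json", "r") as file:
    json_data = json.load(file)

# Add a passive-voice variant of each entry's output under the "output_2" key
for i, entry in tqdm(enumerate(json_data), total=len(json_data)):
    prompt = ("Without adding any response or explanation, "
              f"convert the following text to passive voice: {entry['output']}")
    json_data[i]["output_2"] = run_chatgpt(prompt, client)

with open("instruction-examples-modified.json", "w") as file:
    json.dump(json_data, file, indent=4)
```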
ch07/02_dataset-utilities/create-passive-voice-entries.ipynb (new file, 429 lines)

@@ -0,0 +1,429 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "136a4efe-fb99-4311-8679-e0a5b6282755",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Create \"Passive Voice\" Entries for an Instruction Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a128651b-f326-4232-a994-42f38b7ed520",
   "metadata": {},
   "source": [
    "- This notebook uses OpenAI's GPT-4 to create \"passive voice\" entries for an instruction dataset, as shown in the example below\n",
    "\n",
    "```python\n",
    "{\n",
    " 'instruction': 'Identify the verb in the following sentence',\n",
    " 'input': 'The cat sleeps on the couch.',\n",
    " 'output': 'The verb in the sentence is \"sleeps.\"',\n",
    " 'output_2': 'The sentence is \"sleeps.\"' # <---- Newly created entry\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pip install -r requirements-extra.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "63610acc-db94-437f-8d38-e99dca0299cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "openai version: 1.30.3\n",
      "tqdm version: 4.65.0\n"
     ]
    }
   ],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "pkgs = [\"openai\",  # OpenAI API\n",
    "        \"tqdm\",    # Progress bar\n",
    "       ]\n",
    "\n",
    "for p in pkgs:\n",
    "    print(f\"{p} version: {version(p)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
   "metadata": {},
   "source": [
    "## Test OpenAI API"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
   "metadata": {},
   "source": [
    "- First, let's test if the OpenAI API is correctly set up\n",
    "- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
    "- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
    "- Creating the ~200 passive voice entries using the code in this notebook costs about $0.13 (13 cents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
   "metadata": {},
   "source": [
    "- First, we need to provide our OpenAI API key, which can be found at https://platform.openai.com/api-keys\n",
    "- Make sure not to share this key with anyone (make sure to delete it from this notebook in case you intend to share it; I recommend deleting the entire notebook cell that contains the key)\n",
    "- Alternatively, delete the used API key from your account after you are finished to make sure it can't be abused later"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ba8760c-1635-43cf-b039-9d1557b664c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "OPENAI_API_KEY = \"your OpenAI API key\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "26900564-aba7-48ba-8ee8-6cc9a505a25c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(api_key=OPENAI_API_KEY)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
   "metadata": {},
   "source": [
    "- First, let's try the API with a simple example to make sure it works as intended:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "08e9ef2e-e816-4283-840e-43625791ad33",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Breakfast was eaten by me.'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "        temperature=0.0,\n",
    "    )\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "\n",
    "# Prepare input\n",
    "sentence = \"I ate breakfast\"\n",
    "prompt = f\"Convert the following sentence to passive voice: '{sentence}'\"\n",
    "run_chatgpt(prompt, client)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
   "metadata": {},
   "source": [
    "## Create JSON Entries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
   "metadata": {},
   "source": [
    "- Next, we load the file we want to modify:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of entries: 200\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "json_file = \"instruction-examples.json\"\n",
    "\n",
    "with open(json_file, \"r\") as file:\n",
    "    json_data = json.load(file)\n",
    "\n",
    "print(\"Number of entries:\", len(json_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39a55283-7d51-4136-ba60-f799d49f4098",
   "metadata": {},
   "source": [
    "- And we try the OpenAI chat API on a small sample first to ensure that it works correctly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "735cc089-d127-480a-b39d-0782581f0c41",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Input:\n",
      ">> The verb in the sentence is \"sleeps.\"\n",
      "\n",
      "Output:\n",
      ">> The sentence is \"sleeps.\"\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> The plural form of \"goose\" is \"geese.\"\n",
      "\n",
      "Output:\n",
      ">> The plural form of \"goose\" is referred to as \"geese.\"\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> The three primary colors are red, blue, and yellow.\n",
      "\n",
      "Output:\n",
      ">> Red, blue, and yellow are the three primary colors.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> They had finished the game.\n",
      "\n",
      "Output:\n",
      ">> The game had been finished by them.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> The abbreviation for \"Doctor of Philosophy\" is Ph.D.\n",
      "\n",
      "Output:\n",
      ">> The abbreviation \"Ph.D.\" is used for \"Doctor of Philosophy\".\n",
      "\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "for entry in json_data[:5]:\n",
    "    text = entry[\"output\"]\n",
    "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
    "\n",
    "    print(\"\\nInput:\")\n",
    "    print(\">>\", text)\n",
    "    print(\"\\nOutput:\")\n",
    "    print(\">>\", run_chatgpt(prompt, client))\n",
    "    print(\"\\n-------------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
   "metadata": {},
   "source": [
    "- Let's now extend the code to add the generated entries to the `json_data` and add a progress bar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4f700d4b-19e5-4404-afa7-b0f093024232",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████| 5/5 [00:05<00:00, 1.12s/it]\n"
     ]
    }
   ],
   "source": [
    "from tqdm import tqdm  # a progress bar tool\n",
    "\n",
    "\n",
    "for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):\n",
    "    text = entry[\"output\"]\n",
    "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
    "    json_data[i][\"output_2\"] = run_chatgpt(prompt, client)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd144282-0596-4e9b-9815-322cff34b400",
   "metadata": {},
   "source": [
    "- One more time, let's make sure that the new entries (`\"output_2\"`) look ok"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "5b6eaa87-a86d-42a1-a20a-b764b0d559d4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',\n",
       " 'input': '',\n",
       " 'output': 'The verb in the sentence is \"sleeps.\"',\n",
       " 'output_2': 'The sentence is \"sleeps.\"'}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6970e8cf-2b18-4e3d-9f25-e6a4489c39a7",
   "metadata": {},
   "source": [
    "- Finally, if everything above looks ok, let's run the conversion to passive voice on our entire json dataset (this takes about 3 minutes):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "eef99407-8ffd-4a63-b7ab-ffe30c0f0677",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████| 200/200 [02:38<00:00, 1.26it/s]\n"
     ]
    }
   ],
   "source": [
    "for i, entry in tqdm(enumerate(json_data), total=len(json_data)):\n",
    "    text = entry[\"output\"]\n",
    "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
    "    json_data[i][\"output_2\"] = run_chatgpt(prompt, client)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac91ae85-2f0e-456a-be1d-56e1958f30d8",
   "metadata": {},
   "source": [
    "- After the conversion is completed, we save the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "330cc30a-b08e-4bf0-bee2-bec0da4208de",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_json_file = json_file.replace(\".json\", \"-modified.json\")\n",
    "\n",
    "\n",
    "with open(new_json_file, \"w\") as file:\n",
    "    json.dump(json_data, file, indent=4)  # \"indent\" for pretty-printing"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
ch07/02_dataset-utilities/find-near-duplicates.py

@@ -6,6 +6,7 @@

import argparse
import json
from sklearn import __version__ as sklearn_version
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

@@ -75,6 +76,7 @@ def find_and_print_new_duplicates(json_data):


if __name__ == "__main__":
    print("scikit-learn version:", sklearn_version)

    parser = argparse.ArgumentParser()
    parser.add_argument(
ch07/02_dataset-utilities/instruction-examples-modified.json (new file, 1202 lines)

File diff suppressed because it is too large.
ch07/02_dataset-utilities/requirements-extra.txt

@@ -1,2 +1,3 @@
-openai
-scikit-learn
+openai>=1.30.3
+scikit-learn>=1.3.1
+tqdm>=4.65.0
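The pins match the versions printed earlier in the PR (openai 1.30.3 and tqdm 4.65.0 in the notebook's version check, scikit-learn 1.3.1 in the README example output). A quick way to confirm the pinned packages are available locally, mirroring the notebook's version-check cell, might look like this:

```python
# Mirrors the version-check cell in create-passive-voice-entries.ipynb;
# the package list is taken from requirements-extra.txt above.
from importlib.metadata import version

for pkg in ["openai", "scikit-learn", "tqdm"]:
    print(f"{pkg} version: {version(pkg)}")
```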