mirror of
https://github.com/rasbt/LLMs-from-scratch.git
synced 2025-08-30 19:41:51 +00:00
87 lines
2.4 KiB
Markdown
87 lines
2.4 KiB
Markdown
# Chapter 7: Instruction and Preference Finetuning
|
|
|
|
This folder contains utility code that can be used for preparing an instruction dataset.
|
|
|
|
Install the additional package requirements via:
|
|
|
|
```bash
|
|
pip install -r requirements-extra.txt
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
## Finding Near-duplicates
|
|
|
|
The `find-near-duplicates.py` function can be used to identify duplicates and near-duplicates in an instruction dataset. For example,
|
|
|
|
|
|
|
|
```python
|
|
python find-near-duplicates.py --json_file instruction-examples.json
|
|
```
|
|
|
|
```
|
|
scikit-learn version: 1.3.1
|
|
|
|
|
|
==================================================
|
|
Searching 'instruction' for duplicates ...
|
|
==================================================
|
|
Duplicate pair found with similarity 0.85:
|
|
1. Determine the state of matter for helium at room temperature.
|
|
2. Determine the state of matter for nitrogen at room temperature.
|
|
|
|
Duplicate pair found with similarity 0.98:
|
|
1. Edit the following sentence to make it more formal.
|
|
2. Edit the sentence to make it more formal.
|
|
|
|
Duplicate pair found with similarity 1.00:
|
|
1. Name a dwarf planet in our solar system.
|
|
2. Name a dwarf planet in our solar system.
|
|
|
|
Duplicate pair found with similarity 0.88:
|
|
1. Change the sentences from active voice to passive voice.
|
|
2. Change the sentence from passive to active voice.
|
|
|
|
|
|
|
|
==================================================
|
|
Searching 'input' for duplicates ...
|
|
==================================================
|
|
Duplicate pair found with similarity 0.88:
|
|
1.
|
|
2. She said, "I am tired."
|
|
|
|
|
|
|
|
==================================================
|
|
Searching 'output' for duplicates ...
|
|
==================================================
|
|
Duplicate pair found with similarity 0.82:
|
|
1. Helium is in a gaseous state at room temperature.
|
|
2. Nitrogen is in a gaseous state at room temperature.
|
|
|
|
Duplicate pair found with similarity 1.00:
|
|
1. One dwarf planet in our solar system is Pluto.
|
|
2. One dwarf planet in our solar system is Pluto.
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
## Creating Passive Voice Entries
|
|
|
|
- The [create-passive-voice-entries.ipynb](create-passive-voice-entries.ipynb) notebook uses OpenAI's GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below
|
|
|
|
```python
|
|
{
|
|
'instruction': 'Identify the verb in the following sentence',
|
|
'input': 'The cat sleeps on the couch.',
|
|
'output': 'The verb in the sentence is "sleeps."',
|
|
'output_2': 'The sentence is "sleeps."' # <---- Newly created entry
|
|
}
|
|
```
|