LLMs-from-scratch/ch07/02_dataset-utilities

Chapter 7: Instruction and Preference Finetuning

This folder contains utility code that can be used for preparing an instruction dataset.

Install the additional package requirements via:

pip install -r requirements-extra.txt

 

Finding Near-duplicates

The find-near-duplicates.py function can be used to identify duplicates and near-duplicates in an instruction dataset. For example,

python find-near-duplicates.py --json_file instruction-examples.json
scikit-learn version: 1.3.1


==================================================
 Searching 'instruction' for duplicates ...
==================================================
Duplicate pair found with similarity 0.85:
1. Determine the state of matter for helium at room temperature.
2. Determine the state of matter for nitrogen at room temperature.

Duplicate pair found with similarity 0.98:
1. Edit the following sentence to make it more formal.
2. Edit the sentence to make it more formal.

Duplicate pair found with similarity 1.00:
1. Name a dwarf planet in our solar system.
2. Name a dwarf planet in our solar system.

Duplicate pair found with similarity 0.88:
1. Change the sentences from active voice to passive voice.
2. Change the sentence from passive to active voice.



==================================================
 Searching 'input' for duplicates ...
==================================================
Duplicate pair found with similarity 0.88:
1. 
2. She said, "I am tired."



==================================================
 Searching 'output' for duplicates ...
==================================================
Duplicate pair found with similarity 0.82:
1. Helium is in a gaseous state at room temperature.
2. Nitrogen is in a gaseous state at room temperature.

Duplicate pair found with similarity 1.00:
1. One dwarf planet in our solar system is Pluto.
2. One dwarf planet in our solar system is Pluto.


 

Creating Passive Voice Entries

{  
   'instruction': 'Identify the verb in the following sentence',
   'input': 'The cat sleeps on the couch.',
   'output': 'The verb in the sentence is "sleeps."',
   'output_2': 'The sentence is "sleeps."'   #  <---- Newly created entry
}