# Chapter 7: Instruction and Preference Finetuning

This folder contains utility code that can be used for preparing an instruction dataset.

Install the additional package requirements via:

```bash
pip install -r requirements-extra.txt
```

### Finding near duplicates

The `find-near-duplicates.py` script can be used to identify duplicates and near-duplicates in an instruction dataset. For example:

```bash
python find-near-duplicates.py --json_file instruction-examples.json
```

```
==================================================
Searching 'instruction' for duplicates ...
==================================================
Duplicate pair found with similarity 0.94:
1. Edit the following sentence to make it more formal.
2. Edit the sentence to make it more formal.

Duplicate pair found with similarity 1.00:
1. Name a dwarf planet in our solar system.
2. Name a dwarf planet in our solar system.

Duplicate pair found with similarity 0.91:
1. Change the sentences from active voice to passive voice.
2. Change the sentence from passive to active voice.

==================================================
Searching 'input' for duplicates ...
==================================================
No duplicates found

==================================================
Searching 'output' for duplicates ...
==================================================
Duplicate pair found with similarity 1.00:
1. One dwarf planet in our solar system is Pluto.
2. One dwarf planet in our solar system is Pluto.
```

You can use the `--threshold` option with a value between 0 and 1 to adjust the sensitivity: a lower threshold reports more, looser matches, while a higher threshold reports only closer matches. The default threshold is 0.9.
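
To make the role of the threshold concrete, here is a minimal, standard-library-only sketch of threshold-based near-duplicate detection. It is an illustration, not the code in `find-near-duplicates.py`. The word-count cosine similarity, the helper names (`tokenize`, `cosine_similarity`, `find_near_duplicates`), and the dataset layout (a list of dictionaries with `instruction`, `input`, and `output` keys) are all assumptions made for this example, so its scores may differ from the script's.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(entries, key, threshold=0.9):
    """Return (text_a, text_b, similarity) for pairs scoring above threshold."""
    # Skip entries where the field is missing or empty.
    texts = [entry[key] for entry in entries if entry.get(key)]
    pairs = []
    for a, b in combinations(texts, 2):
        sim = cosine_similarity(a, b)
        if sim > threshold:
            pairs.append((a, b, sim))
    return pairs


# Made-up sample data; the field names are assumed for this sketch.
entries = [
    {"instruction": "Edit the following sentence to make it more formal.",
     "input": "Hey, what's up?",
     "output": "Hello, how are you?"},
    {"instruction": "Edit the sentence to make it more formal.",
     "input": "Gonna be late.",
     "output": "I will be arriving late."},
    {"instruction": "Name a dwarf planet in our solar system.",
     "input": "",
     "output": "One dwarf planet in our solar system is Pluto."},
]

for key in ("instruction", "input", "output"):
    print(f"Searching '{key}' for duplicates ...")
    found = find_near_duplicates(entries, key, threshold=0.9)
    if not found:
        print("No duplicates found")
    for a, b, sim in found:
        print(f"Duplicate pair found with similarity {sim:.2f}:")
        print(f"1. {a}")
        print(f"2. {b}")
```

In this sketch, raising the threshold to 0.95 no longer flags the two "Edit the sentence" instructions, while lowering it would admit looser matches. The pairwise comparison is quadratic in the number of entries, which is acceptable for illustration but may be slow on large datasets.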