Unified Finetune
In this example, we show how to perform unified fine-tuning based on BAAI/bge-m3 with your data.
1. Installation
- with pip
pip install -U FlagEmbedding
- from source
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
2. Data format
Training data should be a jsonl file, where each line is a dict like this:
{"query": str, "pos": List[str], "neg":List[str]}
query is the query text, pos is a list of positive texts, and neg is a list of negative texts.
If you want to use knowledge distillation, each line of your jsonl file should be like this:
{"query": str, "pos": List[str], "neg":List[str], "pos_scores": List[float], "neg_scores": List[float]}
pos_scores is a list of positive scores, where pos_scores[i] is the score between query and pos[i] from the teacher model. neg_scores is a list of negative scores, where neg_scores[i] is the score between query and neg[i] from the teacher model.
See toy_train_data for an example of training data.
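For illustration, a minimal training directory could be created as follows. The directory name and the example contents below are made up for this sketch, and the scores in the second file merely stand in for scores produced by a teacher model; in practice you would generate such files from your own data:
mkdir -p ./my_train_data

# toy example in the plain format (no teacher scores)
cat > ./my_train_data/plain.jsonl << 'EOF'
{"query": "what is the capital of France?", "pos": ["Paris is the capital and largest city of France."], "neg": ["Berlin is the capital of Germany.", "The Eiffel Tower is a landmark in Paris."]}
EOF

# toy example in the knowledge-distillation format (with teacher scores)
cat > ./my_train_data/distill.jsonl << 'EOF'
{"query": "what is the capital of France?", "pos": ["Paris is the capital and largest city of France."], "neg": ["Berlin is the capital of Germany.", "The Eiffel Tower is a landmark in Paris."], "pos_scores": [0.95], "neg_scores": [0.10, 0.35]}
EOF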
3. Train
Note: If you only want to fine-tune the dense embedding of BAAI/bge-m3, you can refer to here.
Here is a simple example of how to perform unified fine-tuning (dense embedding, sparse embedding, and ColBERT vectors) based on BAAI/bge-m3:
torchrun --nproc_per_node {number of gpus} \
-m FlagEmbedding.BGE_M3.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-m3 \
--train_data ./toy_train_data \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size {large batch size; set 1 for toy data} \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--negatives_cross_device \
--logging_steps 10 \
--same_task_within_batch True \
--unified_finetuning True \
--use_self_distill True
You can also refer to this script for more details. In this script, we use deepspeed to perform distributed training. Learn more about deepspeed at https://www.deepspeed.ai/getting-started/. Note that there are some important parameters to be modified in this script:
- HOST_FILE_CONTENT: Machines and GPUs for training. If you want to use multiple machines for training, please refer to https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node (note that you should configure pdsh and ssh properly).
- DS_CONFIG_FILE: Path of the deepspeed config file. Here is an example of ds_config.json.
- DATA_PATH: One or more paths of training data. Each path must be a directory containing one or more jsonl files.
- DEFAULT_BATCH_SIZE: Default batch size for training. If you use the efficient batching strategy, which means you have split your data into different parts by sequence length, then the batch size for each part is decided by the get_file_batch_size() function in BGE_M3/data.py. Before starting training, you should set the batch size for each part in this function according to the GPU memory of your machines. DEFAULT_BATCH_SIZE is used for any part whose sequence length is not covered by get_file_batch_size().
- EPOCHS: Number of training epochs.
- LEARNING_RATE: The initial learning rate.
- SAVE_PATH: Path for saving the fine-tuned model.
You should set these parameters appropriately.
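As a rough illustration only, a configuration of these variables might look like the sketch below. All values are placeholders, and the exact format that HOST_FILE_CONTENT expects (here assumed to be deepspeed hostfile lines, per the resource-configuration docs linked above) should be checked against the script itself:
# illustrative settings only; adjust to your machines and data
HOST_FILE_CONTENT="worker-1 slots=8
worker-2 slots=8"                      # one hostfile line per machine (multi-node only)
DS_CONFIG_FILE=./ds_config.json        # deepspeed config file
DATA_PATH="./my_train_data"            # directory (or directories) of jsonl files
DEFAULT_BATCH_SIZE=32                  # used for lengths not covered by get_file_batch_size()
EPOCHS=5
LEARNING_RATE=1e-5
SAVE_PATH=./output/bge-m3-finetuned    # where the fine-tuned model will be saved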
For more detailed argument settings, please refer to BGE_M3/arguments.py.