`query` is the query text, `pos` is a list of positive texts, and `neg` is a list of negative texts. `pos_scores` is a list of scores corresponding to `query` and each text in `pos`, and `neg_scores` is a list of scores corresponding to `query` and each text in `neg`; both can be omitted if you do not use knowledge distillation. `prompt` is the prompt used for the input, which has the following format: `query [sep] passage [sep] prompt`. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.
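For illustration, one training example following this format could look like the line below (the query, passages, and scores are made up; in a `.jsonl` file each example sits on a single line):

```json
{"query": "what is the capital of France?", "pos": ["Paris is the capital and largest city of France."], "neg": ["Lyon is a major city in France.", "Berlin is the capital of Germany."], "pos_scores": [0.98], "neg_scores": [0.32, 0.05], "prompt": "Given a query and a passage, predict whether the passage is relevant to the query."}
```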
- **`input_file`**: json data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents). An example invocation follows this list.
- **`output_file`**: path to save JSON data with mined hard negatives for finetuning
- **`negative_number`**: the number of sampled negatives
- **`range_for_sampling`**: where to sample negatives from. For example, `2-100` means sampling `negative_number` negatives from the top2-top100 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-300 passages)**
- **`candidate_pool`**: The pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate pool is provided, this script retrieves negatives from that file instead.
- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
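A minimal example of mining hard negatives with the arguments above. The script path `scripts/hn_mine.py` and the file names are assumptions for illustration only; refer to the bash files in the corresponding folders for the actual entry point:

```bash
# Mine hard negatives for each query from the top2-top200 retrieved documents
# (file names below are hypothetical placeholders).
python scripts/hn_mine.py \
    --input_file toy_finetune_data.jsonl \
    --output_file toy_finetune_data_minedHN.jsonl \
    --negative_number 15 \
    --range_for_sampling 2-200
```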
### Teacher Scores
Teacher scores can be used for model distillation. You can obtain the scores with the scoring script; its main arguments are listed below, and a hedged example command follows the list:
- **`reranker_name_or_path`**: The reranker name or path. Default: None
- **`reranker_model_class`**: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
- **`reranker_peft_path`**: The reranker peft path. Default: None
- **`use_bf16`**: Whether to use bf16 for inference. Default: False
- **`query_instruction_for_rerank`**: Instruction for query. Default: None
- **`query_instruction_format_for_rerank`**: Format for query instruction. Default: "{}{}"
- **`passage_instruction_for_rerank`**: Instruction for passage. Default: None
- **`passage_instruction_format_for_rerank`**: Format for passage instruction. Default: "{}{}"
- **`cache_dir`**: Cache directory for models. Default: None
- **`reranker_batch_size`**: Batch size for inference. Default: 3000
- **`reranker_query_max_length`**: Max length for reranking queries. Default: None
- **`reranker_max_length`**: Max length for reranking. Default: 512
- **`normalize`**: Whether to normalize the reranking scores. Default: False
- **`prompt`**: The prompt for the reranker. Default: None
- **`cutoff_layers`**: The output layers of layerwise/lightweight reranker. Default: None
- **`compress_ratio`**: The compress ratio of lightweight reranker. Default: 1
- **`compress_layers`**: The compress layers of lightweight reranker. Default: None, multiple values allowed
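A sketch of how the scoring step might be invoked, under the assumption that the scoring script is `scripts/add_reranker_score.py` and that it accepts `--input_file`/`--output_file` like the mining script (the file names are hypothetical; the remaining arguments are those documented above):

```bash
# Add teacher scores from a reranker to the mined training data
# (script path, input/output flags, and file names are assumptions).
python scripts/add_reranker_score.py \
    --input_file toy_finetune_data_minedHN.jsonl \
    --output_file toy_finetune_data_score.jsonl \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --reranker_model_class auto \
    --reranker_batch_size 32 \
    --reranker_max_length 512 \
    --normalize True
```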
Detailed examples of the various fine-tuning setups can be found in the bash files located in the corresponding folders. Here, we simply provide the training methods for the `standard model`, `bge-reranker-v2-gemma`, and `bge-reranker-v2-layerwise-minicpm`.
Here are some important arguments (a hedged example launch command follows the list):
- **`model_name_or_path`**: The model checkpoint for initialization.
- **`config_name`**: Pretrained config name or path if not the same as model_name. Default: None
- **`tokenizer_name`**: Pretrained tokenizer name or path if not the same as model_name. Default: None
- **`cache_dir`**: Where do you want to store the pre-trained models downloaded from s3. Default: None
- **`model_type`**: Type of model to finetune, one of ['encoder', 'decoder']. Default: 'encoder'
- **`token`**: The token to use when accessing the model. Default: Value from environment variable HF_TOKEN or None if not set
- **`train_data`**: One or more paths to training data. `query: str`, `pos: List[str]`, `neg: List[str]` are required in the training data. Default: None
- **`cache_path`**: Where do you want to store the cached data. Default: None
- **`train_group_size`**: The number of passages used per query during training (one positive plus the sampled negatives). Default: 8
- **`query_max_len`**: The maximum total input sequence length after tokenization for the query. Sequences longer than this will be truncated. Default: 32
- **`passage_max_len`**: The maximum total input sequence length after tokenization for passage. Sequences longer than this will be truncated. Default: 128
- **`max_len`**: The maximum total input sequence length after tokenization. Sequences longer than this will be truncated. Default: 512
- **`pad_to_multiple_of`**: If set, will pad the sequence to be a multiple of the provided value. Default: None
- **`max_example_num_per_dataset`**: The max number of examples for each dataset. Default: 100000000
- **`query_instruction_for_rerank`**: Instruction for query. Default: None
- **`query_instruction_format`**: Format for query instruction. Default: "{}{}"
- **`knowledge_distillation`**: Use knowledge distillation when `pos_scores: List[float]` and `neg_scores: List[float]` are in features of training data. Default: False
- **`passage_instruction_for_rerank`**: Instruction for passage. Default: None
- **`passage_instruction_format`**: Format for passage instruction. Default: "{}{}"
- **`shuffle_ratio`**: The ratio of shuffling the text. Default: 0.0
- **`sep_token`**: The separator token for LLM reranker to discriminate between query and passage. Default: '\n'
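As a rough sketch, a fine-tuning launch for the standard (encoder-only) model combining the arguments above might look like the following. The entry-point module `FlagEmbedding.finetune.reranker.encoder_only.base`, the file names, and the extra HuggingFace `TrainingArguments` (`output_dir`, `learning_rate`, `num_train_epochs`, `per_device_train_batch_size`) are assumptions for illustration; the bash files in the corresponding folders are the authoritative reference:

```bash
# Hypothetical launch of standard-model fine-tuning with knowledge distillation
# on data that already contains pos_scores/neg_scores.
torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.reranker.encoder_only.base \
    --model_name_or_path BAAI/bge-reranker-base \
    --train_data ./toy_finetune_data_score.jsonl \
    --cache_path ./cache/data \
    --train_group_size 8 \
    --query_max_len 32 \
    --passage_max_len 128 \
    --knowledge_distillation True \
    --output_dir ./finetuned_reranker \
    --learning_rate 6e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 16
```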