update readme

cfli 2024-10-30 11:37:19 +08:00
parent 7ac70bcd37
commit f8549e65c9
4 changed files with 83 additions and 102 deletions


@@ -36,74 +36,74 @@ Train data should be a json file, where each line is a dict like this:
See [example_data](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder/example_data) for more detailed files.
### Hard Negatives
Mining hard negatives is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives with the following command:
```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/scripts
```
```shell
python hn_mine.py \
--model_name_or_path BAAI/bge-base-en-v1.5 \
--input_file toy_finetune_data.jsonl \
--output_file toy_finetune_data_minedHN.jsonl \
--range_for_sampling 2-200 \
--negative_number 15 \
--use_gpu_for_searching
```
- **`input_file`**: JSON data for fine-tuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents); an illustrative data example follows this list.
- **`output_file`**: path to save the JSON data with mined hard negatives for fine-tuning
- **`negative_number`**: the number of sampled negatives
- **`range_for_sampling`**: the range from which to sample negatives. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-300 passages)**
- **`candidate_pool`**: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a `candidate_pool` is provided, this script retrieves negatives from this file.
- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
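For illustration, a minimal (and entirely hypothetical) line of `toy_finetune_data.jsonl` might look like the sketch below; the mined output file keeps the same schema, with `neg` filled by the sampled hard negatives.

```python
import json

# Hypothetical toy record following the query / pos / neg schema used for fine-tuning;
# hn_mine.py writes the same schema to the output file, with `neg` filled by
# `negative_number` passages sampled from `range_for_sampling`.
record = {
    "query": "what is the capital of France",
    "pos": ["Paris is the capital and largest city of France."],
    "neg": ["Berlin is the capital of Germany."],
}

with open("toy_finetune_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```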
### Teacher Scores
Teacher scores can be used for model distillation. You can obtain the scores using the following command:
```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/scripts
```
```shell
python add_reranker_score.py \
--input_file toy_finetune_data_minedHN.jsonl \
--output_file toy_finetune_data_score.jsonl \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
```
- **`input_file`**: path to the JSON data with mined hard negatives for fine-tuning
- **`output_file`**: path to save the JSON data with teacher scores for fine-tuning (a quick sanity-check snippet follows this list)
- **`use_fp16`**: Whether to use fp16 for inference. Default: True
- **`devices`**: Devices to use for inference. Default: None; multiple values allowed
- **`trust_remote_code`**: Trust remote code. Default: False
- **`reranker_name_or_path`**: The reranker name or path. Default: None
- **`reranker_model_class`**: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
- **`reranker_peft_path`**: The reranker PEFT path. Default: None
- **`use_bf16`**: Whether to use bf16 for inference. Default: False
- **`query_instruction_for_rerank`**: Instruction for the query. Default: None
- **`query_instruction_format_for_rerank`**: Format for the query instruction. Default: `{}{}`
- **`passage_instruction_for_rerank`**: Instruction for the passage. Default: None
- **`passage_instruction_format_for_rerank`**: Format for the passage instruction. Default: `{}{}`
- **`cache_dir`**: Cache directory for models. Default: None
- **`reranker_batch_size`**: Batch size for inference. Default: 3000
- **`reranker_query_max_length`**: Max length for reranking queries. Default: None
- **`reranker_max_length`**: Max length for reranking. Default: 512
- **`normalize`**: Whether to normalize the reranking scores. Default: False
- **`prompt`**: The prompt for the reranker. Default: None
- **`cutoff_layers`**: The output layers of the layerwise/lightweight reranker. Default: None
- **`compress_ratio`**: The compress ratio of the lightweight reranker. Default: 1
- **`compress_layers`**: The compress layers of the lightweight reranker. Default: None; multiple values allowed
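As a quick sanity check after scoring, you can inspect one line of the output file. The field names `pos_scores` and `neg_scores` below are assumptions; confirm them against the actual output of `add_reranker_score.py`.

```python
import json

# Read the first scored record; `pos_scores` / `neg_scores` are assumed field names,
# so verify them against the real output of add_reranker_score.py before relying on them.
with open("toy_finetune_data_score.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(record["query"])
print(list(zip(record["pos"], record.get("pos_scores", []))))
print(list(zip(record["neg"], record.get("neg_scores", []))))
```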
## 3. Train


@@ -2,24 +2,10 @@
In this example, we show how to finetune the baai-general-embedding with your data.
## 1. Installation
* **with pip**
```
pip install -U FlagEmbedding
```
* **from source**
```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .
cd research/baai_general_embedding
```
For development, install as editable:
```
pip install -e .
```
## 2. Data format
Train data should be a json file, where each line is a dict like this:
@@ -36,6 +22,12 @@ See [toy_finetune_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/blob/mas
Mining hard negatives is a widely used method to improve the quality of sentence embeddings.
You can mine hard negatives with the following command:
```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/scripts
```
```bash
python -m FlagEmbedding.baai_general_embedding.finetune.hn_mine \
--model_name_or_path BAAI/bge-base-en-v1.5 \
@@ -59,7 +51,7 @@ The format of this file is the same as [pretrain data](https://github.com/FlagOp
## 3. Train
```
torchrun --nproc_per_node {number of gpus} \
-m finetune.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-large-zh-v1.5 \
--train_data ./toy_finetune_data.jsonl \
@@ -97,9 +89,9 @@ Besides the negatives in this group, the in-batch negatives also will be used in
For more training arguments please refer to [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)
### 4. Model merging via [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail) [optional]
For more details please refer to [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail).
Fine-tuning the base bge model can improve its performance on the target task,
but may lead to severe degeneration of the model's general capabilities
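As a rough sketch (the model paths, weights, and output path are placeholders to adapt; see the LM-Cocktail repository for the authoritative API), merging your fine-tuned checkpoint with the original base model might look like this:

```python
from LM_Cocktail import mix_models

# Merge the fine-tuned checkpoint with the original base model to retain general
# capabilities; the paths and the 50/50 weights are placeholders to tune for your task.
model = mix_models(
    model_names_or_paths=["BAAI/bge-large-zh-v1.5", "path_to_your_fine-tuned_model"],
    model_type="encoder",
    weights=[0.5, 0.5],
    output_path="./mixed_model",
)
```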
@@ -144,14 +136,15 @@ You can fine-tune the base model on more tasks and merge them to achieve better
### 5. Load your model
After fine-tuning the BGE model, you can load it easily in the same way as [here](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding#usage)
Please replace the `query_instruction_for_retrieval` with your instruction if you set a different value for the hyper-parameter `--query_instruction_for_retrieval` when fine-tuning.
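For example (the checkpoint path and instruction string below are placeholders), loading and using the fine-tuned model might look like this:

```python
from FlagEmbedding import FlagModel

# Load the fine-tuned checkpoint; replace the path and the instruction with the
# values you used during fine-tuning.
model = FlagModel(
    "path_to_your_fine-tuned_model",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    use_fp16=True,
)

queries = ["how to fine-tune the bge embedding model"]
passages = ["FlagEmbedding provides scripts to fine-tune BGE models on your own data."]
scores = model.encode_queries(queries) @ model.encode(passages).T
print(scores)
```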
### 6. Evaluate model
We provide [a simple script](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/research/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance.
A brief summary of how the script works:
1. Load the model on all available GPUs through [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html).
2. Encode the corpus and offload the embeddings into a `faiss` Flat index. By default, `faiss` also places the index on all available GPUs (see the conceptual sketch after this list).
3. Encode the queries and search `100` nearest neighbors for each query.
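Conceptually, steps 2-3 reduce to something like the sketch below (random arrays stand in for the real embeddings; this is not the script itself):

```python
import faiss
import numpy as np

# Stand-ins for the encoder outputs (faiss expects float32).
corpus_embeddings = np.random.rand(10000, 768).astype("float32")
query_embeddings = np.random.rand(16, 768).astype("float32")

index = faiss.IndexFlatIP(corpus_embeddings.shape[1])  # flat inner-product index
# index = faiss.index_cpu_to_all_gpus(index)           # optionally place the index on all GPUs
index.add(corpus_embeddings)

scores, ids = index.search(query_embeddings, 100)       # top-100 neighbors per query
print(ids.shape)  # (num_queries, 100)
```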
@@ -170,7 +163,7 @@ You can check the data formats for the [msmarco corpus](https://huggingface.co/d
Run the following command:
```bash
python -m finetune.eval_msmarco \
--encoder BAAI/bge-base-en-v1.5 \
--fp16 \
--add_instruction \
@@ -223,4 +216,3 @@ python -m FlagEmbedding.baai_general_embedding.finetune.eval_msmarco \
--query_data ./toy_evaluation_data/toy_query.json
```


@@ -2,34 +2,23 @@
In this example, we show how to finetune the cross-encoder reranker with your data.
## 1. Installation
* **with pip**
```
pip install -U FlagEmbedding
```
* **from source**
```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
cd research/reranker
pip install .
```
For development, install as editable:
```
pip install -e .
```
## 2. Data format
The data format for the reranker is the same as [embedding fine-tune](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#2-data-format).
Besides, we strongly suggest [mining hard negatives](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/reranker#hard-negatives) to fine-tune the reranker.
## 3. Train
```
torchrun --nproc_per_node {number of gpus} \
-m run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-reranker-base \
--train_data ./toy_finetune_data.jsonl \
@@ -55,9 +44,9 @@ Besides the negatives in this group, the in-batch negatives also will be used in
For more training arguments, please refer to [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)
### 4. Model merging via [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail) [optional]
For more details please refer to [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail).
Fine-tuning the base bge model can improve its performance on the target task,
but may lead to severe degeneration of the model's general capabilities


@@ -33,7 +33,7 @@ See [toy_train_data](./toy_train_data) for an example of training data.
## 3. Train
> **Note**: If you only want to fine-tune the dense embedding of `BAAI/bge-m3`, you can refer to [here](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#1-standard-model).
Here is a simple example of how to perform unified fine-tuning (dense embedding, sparse embedding and colbert) based on `BAAI/bge-m3`:
@@ -63,9 +63,9 @@ torchrun --nproc_per_node {number of gpus} \
You can also refer to [this script](./unified_finetune_bge-m3_exmaple.sh) for more details. In this script, we use `deepspeed` to perform distributed training. Learn more about `deepspeed` at https://www.deepspeed.ai/getting-started/. Note that there are some important parameters to be modified in this script:
- `HOST_FILE_CONTENT`: Machines and GPUs for training. If you want to use multiple machines for training, please refer to https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node (note that you should configure `pdsh` and `ssh` properly).
- `DS_CONFIG_FILE`: Path of deepspeed config file. [Here](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/finetune/ds_stage0.json) is an example of `ds_config.json`.
- `DATA_PATH`: One or more paths of training data. **Each path must be a directory containing one or more jsonl files**.
- `DEFAULT_BATCH_SIZE`: Default batch size for training. If you use the efficient batching strategy, i.e., you have split your data into different parts by sequence length, then the batch size for each part is decided by the `get_file_batch_size()` function in [`BGE_M3/data.py`](../../BGE_M3/data.py). Before starting training, you should set the corresponding batch size for each part in this function according to the GPU memory of your machines (see the illustrative sketch after this list). `DEFAULT_BATCH_SIZE` is used for any part whose sequence length is not covered by `get_file_batch_size()`.
- `EPOCHS`: Number of training epochs.
- `LEARNING_RATE`: The initial learning rate.
- `SAVE_PATH`: Path of saving finetuned model.
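For illustration only, a length-bucketed batch-size rule like the one `get_file_batch_size()` implements might look roughly like the sketch below; the real function and its exact signature live in `BGE_M3/data.py`, and the bucket names here are made up:

```python
# Hypothetical sketch of a length-bucketed batch-size rule; the actual implementation
# (and its signature) is defined in BGE_M3/data.py. Adjust the values to your GPU memory.
def get_file_batch_size(file_name: str, default_batch_size: int) -> int:
    if "len-0-500" in file_name:        # short sequences: larger batches fit in memory
        return 48
    if "len-500-1000" in file_name:     # medium-length sequences
        return 32
    if "len-1000-2000" in file_name:    # long sequences: smaller batches
        return 12
    return default_batch_size           # parts not listed above fall back to the default
```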
@@ -73,4 +73,4 @@ You can also refer to [this script](./unified_finetune_bge-m3_exmaple.sh) for mo
You should set these parameters appropriately.
For more detailed argument settings, please refer to [`BGE_M3/arguments.py`](../../BGE_M3/arguments.py).