# Examples
- [1. Introduction](#1-introduction)
- [2. Installation](#2-installation)
- [3. Inference](#3-inference)
- [4. Finetune](#4-finetune)
- [5. Evaluation](#5-evaluation)
## 1. Introduction
These examples show how to run **inference**, **finetuning**, and **evaluation** with the baai-general-embedding models.
## 2. Installation
* **with pip**
```shell
pip install -U FlagEmbedding
```
* **from source**
```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .
```
For development, install as editable:
```shell
pip install -e .
```
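
A quick way to confirm the installation succeeded is to import the package and print the installed version (a sanity check, not part of the official instructions):

```python
# Check that FlagEmbedding is importable and report the installed version.
from importlib.metadata import version

import FlagEmbedding  # raises ImportError if the install failed

print(version("FlagEmbedding"))
```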
## 3. Inference
We have provided the inference code for two types of models: the **embedder** and the **reranker**. These can be loaded using `FlagAutoModel` and `FlagAutoReranker`, respectively. For more detailed instructions on their use, please refer to the documentation for the [embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/inference/embedder) and [reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/inference/reranker).
### 1. Embedder
```python
from FlagEmbedding import FlagAutoModel
sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagAutoModel.from_finetuned('BAAI/bge-large-zh-v1.5',
query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
use_fp16=True,
devices=['cuda:0']) # Setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode_corpus(sentences_1)
embeddings_2 = model.encode_corpus(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# For s2p (short query to long passage) retrieval tasks, use encode_queries(), which automatically adds the instruction to each query.
# The corpus can still be encoded with encode_corpus(), since passages do not need the instruction.
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```
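
BGE embedding models normalize their output vectors by default, so the matrix products above are cosine similarities between the two sets of sentences.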
### 2. Reranker
```python
from FlagEmbedding import FlagAutoReranker
pairs = [("样例数据-1", "样例数据-3"), ("样例数据-2", "样例数据-4")]
model = FlagAutoReranker.from_finetuned('BAAI/bge-reranker-large',
use_fp16=True,
devices=['cuda:0']) # Setting use_fp16 to True speeds up computation with a slight performance degradation
similarity = model.compute_score(pairs, normalize=True)
print(similarity)
pairs = [("query_1", "样例文档-1"), ("query_2", "样例文档-2")]
scores = model.compute_score(pairs)
print(scores)
```
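
With `normalize=True`, `compute_score` maps the raw relevance scores through a sigmoid, so the first call returns values in [0, 1]; the second call returns the unnormalized scores.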
## 4. Finetune
We support fine-tuning a variety of BGE series models, including `bge-large-en-v1.5`, `bge-m3`, `bge-en-icl`, `bge-multilingual-gemma2`, `bge-reranker-v2-m3`, `bge-reranker-v2-gemma`, and `bge-reranker-v2-minicpm-layerwise`, among others. As examples, we use the basic models `bge-large-en-v1.5` and `bge-reranker-large`. For more details, please refer to the [embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune/embedder) and [reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune/reranker) sections.
If you do not have the `deepspeed` and `flash-attn` packages installed, you can install them with the following commands:
```shell
pip install deepspeed
pip install flash-attn --no-build-isolation
```
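
Both fine-tuning scripts consume JSONL training data in which each line pairs a query with positive and negative passages; optional teacher scores are only needed when knowledge distillation is enabled. Below is a minimal sketch of one record, modeled on the linked example data (check those files for the authoritative schema):

```python
import json

# One training record: a query, its relevant passages, and hard negatives.
# Optional "pos_scores"/"neg_scores" lists hold teacher scores for knowledge distillation.
record = {
    "query": "what is the capital of France?",
    "pos": ["Paris is the capital and largest city of France."],
    "neg": [
        "Berlin is the capital of Germany.",
        "Lyon is a large city in France.",
    ],
}

# Write it out in the JSONL layout expected by --train_data.
with open("toy_train_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```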
### 1. Embedder
```shell
torchrun --nproc_per_node 2 \
-m FlagEmbedding.finetune.embedder.encoder_only.base \
--model_name_or_path BAAI/bge-large-en-v1.5 \
--cache_dir ./cache/model \
--train_data ./finetune/embedder/example_data/retrieval \
--cache_path ./cache/data \
--train_group_size 8 \
--query_max_len 512 \
--passage_max_len 512 \
--pad_to_multiple_of 8 \
--query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
--query_instruction_format '{}{}' \
--knowledge_distillation False \
--output_dir ./test_encoder_only_base_bge-large-en-v1.5 \
--overwrite_output_dir \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--gradient_checkpointing \
--deepspeed ./finetune/ds_stage0.json \
--logging_steps 1 \
--save_steps 1000 \
--negatives_cross_device \
--temperature 0.02 \
--sentence_pooling_method cls \
--normalize_embeddings True \
--kd_loss_type kl_div
```
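
Once training finishes, the checkpoint written to `--output_dir` can be used with the inference API from section 3. A minimal sketch, assuming `from_finetuned` accepts a local checkpoint directory (the path below mirrors the `--output_dir` used above):

```python
from FlagEmbedding import FlagAutoModel

# Load the fine-tuned checkpoint from the local output directory.
model = FlagAutoModel.from_finetuned(
    './test_encoder_only_base_bge-large-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    use_fp16=True,
    devices=['cuda:0'],
)

q_embeddings = model.encode_queries(["how do I fine-tune a bge embedder?"])
```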
### 2. Reranker
```shell
torchrun --nproc_per_node 2 \
-m FlagEmbedding.finetune.reranker.encoder_only.base \
--model_name_or_path BAAI/bge-reranker-large \
--cache_dir ./cache/model \
--train_data ./finetune/reranker/example_data/normal/examples.jsonl \
--cache_path ./cache/data \
--train_group_size 8 \
--query_max_len 256 \
--passage_max_len 256 \
--pad_to_multiple_of 8 \
--knowledge_distillation False \
--output_dir ./test_encoder_only_base_bge-reranker-large \
--overwrite_output_dir \
--learning_rate 6e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--gradient_checkpointing \
--weight_decay 0.01 \
--deepspeed ./finetune/ds_stage0.json \
--logging_steps 1 \
--save_steps 1000
```
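
The fine-tuned reranker saved to `--output_dir` can likewise be loaded for scoring by pointing `FlagAutoReranker.from_finetuned` at that directory, as in the inference example above.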
## 5. Evaluation
We support evaluations on [MTEB](https://github.com/embeddings-benchmark/mteb), [BEIR](https://github.com/beir-cellar/beir), [MSMARCO](https://microsoft.github.io/msmarco/), [MIRACL](https://github.com/project-miracl/miracl), [MLDR](https://huggingface.co/datasets/Shitao/MLDR), [MKQA](https://github.com/apple/ml-mkqa), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [BRIGHT](https://brightbenchmark.github.io/), and custom datasets. Below is an example of evaluating MSMARCO passages. For more details, please refer to the [evaluation examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/evaluation).
```shell
pip install pytrec_eval
# If installing pytrec_eval fails, try the following command instead
# pip install pytrec-eval-terrier
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
python -m FlagEmbedding.evaluation.msmarco \
--eval_name msmarco \
--dataset_dir ./data/msmarco \
--dataset_names passage \
--splits dev dl19 dl20 \
--corpus_embd_save_dir ./data/msmarco/corpus_embd \
--output_dir ./data/msmarco/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path ./cache/data \
--overwrite True \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./data/msmarco/msmarco_eval_results.md \
--eval_metrics ndcg_at_10 mrr_at_10 recall_at_100 \
--embedder_name_or_path BAAI/bge-large-en-v1.5 \
--embedder_batch_size 512 \
--embedder_query_max_length 512 \
--embedder_passage_max_length 512 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--reranker_batch_size 512 \
--reranker_query_max_length 512 \
--reranker_max_length 1024 \
--devices cuda:0 cuda:1 cuda:2 cuda:3 cuda:4 cuda:5 cuda:6 cuda:7 \
--cache_dir ./cache/model
```
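
Retrieval results are written to `--output_dir`, cached corpus embeddings to `--corpus_embd_save_dir`, and the final metrics to the markdown file given by `--eval_output_path`. Adjust `--devices` to match the GPUs available on your machine.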