# 1. Introduction

In this example, we show how to run **inference**, **finetuning**, and **evaluation** with the baai-general-embedding models.
# 2. Installation

* **with pip**

```shell
pip install -U FlagEmbedding
```
* **from source**

```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .
```

For development, install in editable mode:

```shell
pip install -e .
```
# 3. Inference

We provide inference code for two kinds of models, the **embedder** and the **reranker**, which can be loaded with `FlagAutoModel` and `FlagAutoReranker`, respectively. For more detailed usage instructions, please refer to the documentation for the [embedder](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/inference/embedder) and the [reranker](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/inference/reranker).
## 1. Embedder

```python
from FlagEmbedding import FlagAutoModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagAutoModel.from_finetuned('BAAI/bge-large-zh-v1.5',
                                     query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                                     use_fp16=True,  # setting use_fp16 to True speeds up computation with a slight performance degradation
                                     devices=['cuda:1'])
embeddings_1 = model.encode_corpus(sentences_1)
embeddings_2 = model.encode_corpus(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# For an s2p (short query to long passage) retrieval task, use encode_queries(),
# which automatically prepends the instruction to each query.
# The corpus can still be encoded with encode_corpus(), since passages need no instruction.
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```
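Because BGE embeddings are L2-normalized, the matrix product above is just pairwise cosine similarity. A minimal NumPy sketch of the same computation on made-up 2-d vectors, independent of any model:

```python
import numpy as np

# Two toy "embeddings". After L2 normalization, dot product == cosine similarity.
a = np.array([[3.0, 4.0]])  # shape (1, 2)
b = np.array([[4.0, 3.0]])  # shape (1, 2)
a = a / np.linalg.norm(a, axis=1, keepdims=True)
b = b / np.linalg.norm(b, axis=1, keepdims=True)

similarity = a @ b.T  # same pattern as embeddings_1 @ embeddings_2.T
print(similarity)     # → [[0.96]]
```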

## 2. Reranker

```python
from FlagEmbedding import FlagAutoReranker

pairs = [("样例数据-1", "样例数据-3"), ("样例数据-2", "样例数据-4")]
model = FlagAutoReranker.from_finetuned('BAAI/bge-reranker-large',
                                        use_fp16=True,  # setting use_fp16 to True speeds up computation with a slight performance degradation
                                        devices=['cuda:1'])
similarity = model.compute_score(pairs, normalize=True)
print(similarity)

pairs = [("query_1", "样例文档-1"), ("query_2", "样例文档-2")]
scores = model.compute_score(pairs)
print(scores)
```
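With `normalize=True`, the raw relevance logits from `compute_score` are mapped into (0, 1) via a sigmoid; without it, you get the raw logits. A minimal sketch of that mapping (the raw scores below are made up, not real model outputs):

```python
import math

def sigmoid_normalize(scores):
    """Map raw reranker logits into (0, 1) with a sigmoid; order is preserved."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

raw_scores = [-4.2, 0.0, 5.1]  # hypothetical compute_score() logits
print(sigmoid_normalize(raw_scores))
```

Since the sigmoid is monotonic, rankings produced from normalized and raw scores are identical; normalization only helps when you need score thresholds comparable across queries.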

# 4. Finetune

We support finetuning for various BGE-series models, including bge-large-en-v1.5, bge-m3, bge-en-icl, bge-reranker-v2-m3, bge-reranker-v2-gemma, bge-reranker-v2-minicpm-layerwise, and more. Here, we take the basic models bge-large-en-v1.5 and bge-reranker-large as examples. For more details, please see the [embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder) and [reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/reranker) sections.
## 1. Embedder

```shell
torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.embedder.encoder_only.base \
    --model_name_or_path BAAI/bge-large-en-v1.5 \
    --cache_dir ./cache/model \
    --train_data ./finetune/embedder/example_data/retrieval \
    --cache_path ./cache/data \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
    --query_instruction_format '{}{}' \
    --knowledge_distillation False \
    --output_dir ./test_encoder_only_base_bge-large-en-v1.5 \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs $num_train_epochs \
    --per_device_train_batch_size $per_device_train_batch_size \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed ./finetune/ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type kl_div
```
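The `--train_data` directory is expected to contain JSONL files where each line holds a query, its positive passages, and hard-negative passages. A minimal sketch of writing one such file; the `query`/`pos`/`neg` field names follow the repo's embedder example data, and the texts here are made up:

```python
import json

# One training example per line: query, positive passages, hard negatives.
examples = [
    {
        "query": "what is a bge embedding?",
        "pos": ["BGE is a family of general-purpose text embedding models."],
        "neg": ["The weather in Beijing is sunny today."],
    }
]

with open("toy_retrieval_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```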

## 2. Reranker

```shell
torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.reranker.encoder_only.base \
    --model_name_or_path BAAI/bge-reranker-large \
    --cache_dir ./cache/model \
    --train_data ./finetune/reranker/example_data/normal/examples.jsonl \
    --cache_path ~/.cache \
    --train_group_size 8 \
    --query_max_len 256 \
    --passage_max_len 256 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation True \
    --output_dir ./test_encoder_only_base_bge-reranker-large \
    --overwrite_output_dir \
    --learning_rate 6e-5 \
    --fp16 \
    --num_train_epochs $num_train_epochs \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --weight_decay 0.01 \
    --deepspeed ./finetune/ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000
```
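Because `--knowledge_distillation True` is set here, each JSONL training line also needs teacher scores aligned with the positive and negative passages. The `pos_scores`/`neg_scores` field names below follow the repo's finetune example data, but treat the exact schema as an assumption and check `example_data` in your checkout; all texts and scores are made up:

```python
import json

# A single distillation example: teacher scores align index-by-index
# with the "pos" and "neg" lists.
example = {
    "query": "what does a reranker do?",
    "pos": ["A reranker scores query-passage pairs directly."],
    "neg": ["Bananas are rich in potassium."],
    "pos_scores": [0.98],  # hypothetical teacher score for each positive
    "neg_scores": [0.03],  # hypothetical teacher score for each negative
}
assert len(example["pos"]) == len(example["pos_scores"])
assert len(example["neg"]) == len(example["neg_scores"])
print(json.dumps(example, ensure_ascii=False))
```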

# 5. Evaluation

We support evaluation on MTEB, BEIR, MSMARCO, MIRACL, MLDR, MKQA, and AIR-Bench. Here, we provide an example of evaluating on the MSMARCO passage dataset. For more details, please refer to the [evaluation examples](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation).
```shell
export HF_HUB_CACHE="$HOME/.cache/huggingface/hub"

python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./data/msmarco \
    --dataset_names passage \
    --splits dev dl19 dl20 \
    --corpus_embd_save_dir ./data/msmarco/corpus_embd \
    --output_dir ./data/msmarco/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite True \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./data/msmarco/msmarco_eval_results.md \
    --eval_metrics ndcg_at_10 mrr_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --embedder_batch_size 512 \
    --embedder_query_max_length 512 \
    --embedder_passage_max_length 512 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --reranker_batch_size 512 \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024 \
    --devices cuda:0 cuda:1 cuda:2 cuda:3 cuda:4 cuda:5 cuda:6 cuda:7 \
    --cache_dir ./cache/model
```
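Of the metrics requested above, `mrr_at_10` is the mean reciprocal rank of the first relevant passage within the top 10 results per query. A minimal sketch of that computation on made-up rankings (not tied to the evaluation code's internals):

```python
def mrr_at_k(retrieved, relevant, k=10):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit
    within the top k results, or 0 when none appears there."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Query 1: first relevant doc at rank 2 -> 1/2; query 2: at rank 1 -> 1.
retrieved = [["d5", "d1", "d9"], ["d2", "d7"]]
relevant = [{"d1"}, {"d2"}]
print(mrr_at_k(retrieved, relevant))  # → 0.75
```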