
# FlagEmbedding

FlagEmbedding maps any text to a low-dimensional dense vector, which can be used for tasks such as retrieval, classification, clustering, and semantic search. It can also be used in vector databases for LLMs.

## Updates

- 08/02/2023: Released the baai-general-embedding-* models, which rank 1st on the MTEB and C-MTEB leaderboards respectively!
- 08/01/2023: Released the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test datasets.

## Model List

| Model | Language | Description | Query instruction for retrieval |
|---|---|---|---|
| BAAI/baai-general-embedding-large-en-instruction | English | ranks 1st on the MTEB leaderboard | `Represent this sentence for searching relevant passages: ` |
| BAAI/baai-general-embedding-large-zh-instruction | Chinese | ranks 1st on the C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/baai-general-embedding-large-zh | Chinese | ranks 2nd on the C-MTEB benchmark | -- |

## Usage

### Sentence-Transformers

Using these models is easy if you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["样例数据-1", "样例数据-2"]  # "sample data 1" / "sample data 2"
model = SentenceTransformer('BAAI/baai-general-embedding-large-zh-instruction')
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings)

# For retrieval tasks, when you use a model whose name ends with `-instruction`,
# each query should start with the instruction.
queries = ["手机开不了机怎么办?"]  # "What should I do if my phone won't turn on?"
passages = ["样例段落-1", "样例段落-2"]  # "sample passage 1" / "sample passage 2"
instruction = "为这个句子生成表示以用于检索相关文章:"
model = SentenceTransformer('BAAI/baai-general-embedding-large-zh-instruction')
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T  # cosine similarities, since embeddings are normalized
```
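Because the embeddings are L2-normalized, the inner products in `scores` are cosine similarities. A minimal sketch (illustrative, reusing the variables above) of ranking the passages for each query:

```python
import numpy as np

# Rank passages for each query by descending similarity score.
ranking = np.argsort(-scores, axis=1)
for qi, query in enumerate(queries):
    print(query)
    for pi in ranking[qi]:
        print(f"  {scores[qi, pi]:.4f}\t{passages[pi]}")
```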

### HuggingFace Transformers

With the transformers package, you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized token embeddings (here, CLS pooling).

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/baai-general-embedding-large-zh-instruction')
model = AutoModel.from_pretrained('BAAI/baai-general-embedding-large-zh-instruction')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, CLS pooling: take the hidden state of the first token.
    sentence_embeddings = model_output[0][:, 0]

# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)
```
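For retrieval with the `-instruction` models, the same convention as in the Sentence-Transformers example applies: prepend the instruction to each query, but not to the passages. A minimal sketch reusing the `tokenizer` and `model` loaded above; the `encode` helper is illustrative, not part of the library:

```python
instruction = "为这个句子生成表示以用于检索相关文章:"
queries = ["手机开不了机怎么办?"]
passages = ["样例段落-1", "样例段落-2"]

def encode(texts):
    # CLS pooling + L2 normalization, as in the example above.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = model(**inputs)
    embeddings = output[0][:, 0]
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

q_embeddings = encode([instruction + q for q in queries])
p_embeddings = encode(passages)
scores = q_embeddings @ p_embeddings.T  # cosine similarities
```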

## Evaluation Results

- **MTEB**:

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Retrieval (15) | Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| baai-general-embedding-large-en-instruction | 0.67 | 1024 | 512 | 63.34 | 53.23 | 48.47 | 86.34 | 59.87 | 81.89 | 30.55 | 72.28 |
| gte-large | 0.67 | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 |
| gte-base | 0.22 | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1.34 | 1024 | 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 |
| instructor-xl | 4.96 | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 |
| e5-base-v2 | 0.44 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 |
| gte-small | 0.07 | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | - | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 0.13 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 9.73 | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 0.44 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 28.27 | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 |
| all-MiniLM-L12-v2 | 0.13 | 384 | 512 | 56.53 | 42.69 | 41.81 | 82.41 | 58.44 | 79.8 | 27.9 | 63.21 |
| all-MiniLM-L6-v2 | 0.09 | 384 | 512 | 56.26 | 41.95 | 42.35 | 82.37 | 58.04 | 78.9 | 30.81 | 63.05 |
| contriever-base-msmarco | 0.44 | 768 | 512 | 56.00 | 41.88 | 41.1 | 82.54 | 53.14 | 76.51 | 30.36 | 66.68 |
| sentence-t5-base | 0.22 | 768 | 512 | 55.27 | 33.63 | 40.21 | 85.18 | 53.09 | 81.14 | 31.39 | 69.81 |
- **C-MTEB**: We created the C-MTEB benchmark for Chinese text embedding, which consists of 31 datasets across 6 tasks. For more details and evaluation scripts, see evaluation.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|---|---|---|---|---|---|---|---|---|
| baai-general-embedding-large-zh-instruction | 1024 | 64.20 | 71.53 | 53.23 | 78.94 | 72.26 | 65.11 | 48.39 |
| baai-general-embedding-large-zh | 1024 | 63.53 | 70.55 | 50.98 | 76.77 | 72.49 | 64.91 | 50.01 |
| m3e-base | 768 | 57.10 | 56.91 | 48.15 | 63.99 | 70.28 | 59.34 | 47.68 |
| m3e-large | 1024 | 57.05 | 54.75 | 48.64 | 64.3 | 71.22 | 59.66 | 48.88 |
| text-embedding-ada-002 (OpenAI) | 1536 | 53.02 | 52.0 | 40.61 | 69.56 | 67.38 | 54.28 | 45.68 |
| luotuo | 1024 | 49.37 | 44.4 | 39.41 | 66.62 | 65.29 | 49.25 | 44.39 |
| text2vec | 768 | 47.63 | 38.79 | 41.71 | 67.41 | 65.18 | 49.45 | 37.66 |
| text2vec-large | 1024 | 47.36 | 41.94 | 41.98 | 70.86 | 63.42 | 49.16 | 30.02 |

## Train

This section introduces the method we used to train the general embedding models. The training scripts are in universal_embedding, and we provide examples for pre-training and fine-tuning.

1. **RetroMAE Pre-training**
We pre-train the model following the RetroMAE method, which shows promising improvements on retrieval tasks (see https://aclanthology.org/2022.emnlp-main.35.pdf). The pre-training was conducted on 24 A100 (40G) GPUs with a batch size of 720. In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively. We used the AdamW optimizer with a learning rate of 2e-5.
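For reference, the stated pre-training hyperparameters can be summarized as a configuration sketch; the keys below are illustrative only, not the actual arguments of the training scripts:

```python
# Hypothetical summary of the RetroMAE pre-training setup described above;
# key names are illustrative, not real script flags.
retromae_pretrain_config = {
    "num_gpus": 24,               # A100 (40G)
    "total_batch_size": 720,
    "encoder_mask_ratio": 0.3,
    "decoder_mask_ratio": 0.5,
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
}
```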

Pre-training data:

2. **Fine-tuning**
We fine-tune the model with a contrastive objective. The input data format is a triple (query, positive, negative). Besides the negative in each triple, we also adopt an in-batch negatives strategy, as sketched below.

We trained our model on 48 A100 (40G) GPUs with a large batch size of 32,768. We used the AdamW optimizer with a learning rate of 1e-5. The sequence length was limited to 128 tokens, and the temperature for the contrastive loss is 0.01.
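To make the objective concrete, here is a minimal sketch of an InfoNCE-style contrastive loss over (query, positive, negative) triples with in-batch negatives and temperature 0.01. This is an illustration of the technique, not the repository's actual training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.01):
    """InfoNCE-style loss over (query, positive, negative) triples.

    q_emb, pos_emb, neg_emb: [batch, dim] L2-normalized embeddings.
    For each query, the candidate set is all in-batch positives plus
    all hard negatives; only its own positive counts as correct.
    """
    candidates = torch.cat([pos_emb, neg_emb], dim=0)    # [2*batch, dim]
    logits = q_emb @ candidates.T / temperature          # [batch, 2*batch]
    # The i-th query's positive sits at column i; every other column
    # (other positives and all hard negatives) acts as a negative.
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)
```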

For the versions with the *-instruction suffix, we add the instruction to the query for retrieval tasks during training. For English, the instruction is Represent this sentence for searching relevant passages: ; for Chinese, the instruction is 为这个句子生成表示以用于检索相关文章:. In evaluation, the instruction should be added for sentence-to-passage retrieval tasks, and should not be added for other tasks.
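A small sketch of this convention; the helper and task names are hypothetical:

```python
# Illustrative helper: prepend the instruction only for
# sentence-to-passage retrieval queries.
INSTRUCTION_ZH = "为这个句子生成表示以用于检索相关文章:"

def preprocess_query(query: str, task: str) -> str:
    if task == "retrieval":       # sentence-to-passage retrieval
        return INSTRUCTION_ZH + query
    return query                  # STS, classification, clustering, ...
```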

The fine-tuning script is available in this repository: universal_embedding. You can easily fine-tune your own model with it.

Training data:

- For English, we collected 230M text pairs from Wikipedia, CC-Net, and other sources.

- For Chinese, we collected 120M text pairs from Wudao, Zhihu, news websites, and other sources.

The collected data will be released in the future.