update readme
This commit is contained in: parent e07e69599b, commit 9876bb2b84

README.md
@@ -51,7 +51,7 @@ And it also can be used in vector database for LLMs.
 | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
 | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
 
-\*: If you need to search the answer to a short query, you need to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly.
+\*: If you need to search for passages relevant to a short query, add the instruction to the query; in other cases no instruction is needed, just use the original query directly. **In all cases, no instruction needs to be added to passages.**
 
 ## Usage
@@ -97,6 +97,7 @@ print(embeddings)
 ```
 For retrieval tasks,
 each query should start with an instruction (see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
+But the instruction is not needed for passages.
 ```python
 from sentence_transformers import SentenceTransformer
 queries = ['query_1', 'query_2']
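The hunk above ends mid-snippet; a minimal end-to-end sketch of the pattern it describes, assuming the `passages` list, the `BAAI/bge-large-en` model name, and the score computation (none of which appear in this hunk), might look like this:

```python
from sentence_transformers import SentenceTransformer

queries = ['query_1', 'query_2']
passages = ["passage_1", "passage_2"]  # assumed example data, not from this hunk
instruction = "Represent this sentence for searching relevant passages: "

model = SentenceTransformer('BAAI/bge-large-en')
# Prepend the instruction to queries only; passages are encoded as-is.
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
# With normalized embeddings, the inner product equals cosine similarity.
scores = q_embeddings @ p_embeddings.T
print(scores)
```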
@@ -139,7 +140,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')
 
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# for retrieval task, add an instruction to the query
+# for retrieval task, add an instruction to the query (do not add an instruction to passages)
 # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
 
 # Compute token embeddings
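The hunk stops at the `# Compute token embeddings` comment; the remainder of the snippet, sketched below with CLS pooling and L2 normalization (the usual BGE recipe, assumed rather than quoted from the file), would complete it:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')

sentences = ["sample sentence 1", "sample sentence 2"]  # assumed example data
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Use the [CLS] token's last hidden state as the sentence embedding.
    sentence_embeddings = model_output[0][:, 0]
# Normalize so dot products equal cosine similarities.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```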
@@ -237,8 +238,8 @@ The temperature for contrastive loss is 0.01.
 Besides, we add an instruction to the query for the retrieval task during training.
 For English, the instruction is `Represent this sentence for searching relevant passages: `;
 for Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
-In the evaluation, the instruction should be added for sentence to passages retrieval task, not be added for other tasks.
+In the evaluation, the instruction should be added to queries for the retrieval task, but not for other tasks.
+Note that the instruction is not needed for passages.
 
 The finetune script is accessible in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
 You can easily finetune your model with it.
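The hunk header notes that the temperature for the contrastive loss is 0.01; a minimal sketch of the corresponding in-batch-negative InfoNCE objective, assuming L2-normalized embeddings and one positive passage per query (the function itself is illustrative, not taken from the repository):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """In-batch-negative contrastive (InfoNCE) loss.

    q_emb: (B, D) L2-normalized query embeddings.
    p_emb: (B, D) L2-normalized passage embeddings; p_emb[i] is the positive
           for q_emb[i], and every other row serves as an in-batch negative.
    """
    logits = q_emb @ p_emb.T / temperature  # (B, B) cosine similarities scaled by 1/T
    labels = torch.arange(q_emb.size(0), device=q_emb.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```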
README_zh.md
@@ -49,7 +49,7 @@
 | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with performance similar to `bge-large`, but faster inference and a smaller embedding dimension | `为这个句子生成表示以用于检索相关文章:` |
 | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with faster inference than the base model | `为这个句子生成表示以用于检索相关文章:` |
 
-\*: If you need to search for passages relevant to a short query, add the instruction to the query; in other cases no instruction is needed, just use the original query directly.
+\*: If you need to search for passages relevant to a short query, add the instruction to the query; in other cases no instruction is needed, just use the original query directly. **In all cases, no instruction needs to be added to candidate passages.**
 
 
@@ -100,9 +100,10 @@ model = SentenceTransformer('BAAI/bge-large-zh')
 embeddings = model.encode(sentences, normalize_embeddings=True)
 print(embeddings)
 ```
-For retrieval tasks, when you use a model whose name ends with `-instruction`,
+For retrieval tasks,
 each query should start with an instruction (see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
+But no instruction needs to be added to passages.
 ```python
 queries = ["What should I do if my phone won't turn on?"]
 passages = ["sample passage-1", "sample passage-2"]
 instruction = "为这个句子生成表示以用于检索相关文章:"
@@ -139,7 +140,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')
 
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# for retrieval task, add an instruction to the query
+# for retrieval task, add an instruction to the query (do not add an instruction to passages)
 # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
 
 # Compute token embeddings
@@ -238,7 +239,7 @@ print("Sentence embeddings:", sentence_embeddings)
 Besides, we add an instruction to the query for the retrieval task during training.
 For English, the instruction is `Represent this sentence for searching relevant passages: `;
 for Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
-In the evaluation, the instruction needs to be added to queries for the passage retrieval task.
+In the evaluation, the instruction needs to be added to queries for the passage retrieval task, but no instruction needs to be added to the passages.
 
 
 The finetune script is available in this repository: [FlagEmbedding](./FlagEmbedding/baai_general_embedding); you can easily finetune your model with it.
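If you use the linked finetune script, training data is typically prepared as JSON lines; the `query`/`pos`/`neg` field names in the sketch below are an assumption based on the FlagEmbedding finetune documentation, so verify them against the linked README before relying on them:

```python
import json

# Hypothetical training record; the query/pos/neg schema is an assumption
# based on the FlagEmbedding finetune docs, not taken from this diff.
record = {
    "query": "how do I fine-tune an embedding model",
    "pos": ["a passage that answers the query"],
    "neg": ["an unrelated passage", "another unrelated passage"],
}
with open("toy_finetune_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```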