update readme

shitao 2023-08-08 15:32:12 +08:00
parent e07e69599b
commit 9876bb2b84
2 changed files with 11 additions and 9 deletions

View File

@ -51,7 +51,7 @@ And it also can be used in vector database for LLMs.
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
\*: If you need to search for the answer to a short query, you need to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly.
\*: If you need to search for passages relevant to a short query, you need to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. **In all cases, no instruction needs to be added to passages**.
## Usage
@ -97,6 +97,7 @@ print(embeddings)
```
For a retrieval task,
each query should start with an instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
The instruction is not needed for passages.
```python
from sentence_transformers import SentenceTransformer
queries = ['query_1', 'query_2']
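# The diff hunk is truncated here; the lines below are a minimal hedged sketch of how
# this example typically continues (the passages, model name, and variable names are
# assumptions; the instruction text is the English one from the Model List above):
passages = ['passage_1', 'passage_2']
instruction = 'Represent this sentence for searching relevant passages: '
model = SentenceTransformer('BAAI/bge-large-en')
# prepend the instruction to queries only; passages are encoded without it
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```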
@ -139,7 +140,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# for retrieval task, add an instruction to the query
# for retrieval task, add an instruction to the query (do not add the instruction to passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
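This hunk stops at the embedding computation; the following is a minimal hedged sketch of the step that typically comes next (CLS-token pooling followed by L2 normalization, producing the `sentence_embeddings` printed later in this diff). It assumes the `model` and `encoded_input` defined above:
```python
import torch

with torch.no_grad():
    model_output = model(**encoded_input)
    # use the [CLS] token embedding as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]
# L2-normalize so that the inner product equals cosine similarity
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```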
@ -237,8 +238,8 @@ The temperature for contrastive loss is 0.01.
Besides, we add an instruction to the query for the retrieval task during training.
For English, the instruction is `Represent this sentence for searching relevant passages: `;
For Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
In the evaluation, the instruction should be added for the sentence-to-passages retrieval task, but not for other tasks.
In the evaluation, the instruction should be added to queries in the retrieval task, but not for other tasks.
Note that the instruction is not needed for passages.
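For illustration, here is a minimal hedged sketch of an in-batch-negatives contrastive (InfoNCE) loss with temperature 0.01; the in-batch-negatives setup and function shape are assumptions about the form of the objective, not the exact training code:
```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    # q_emb[i] and p_emb[i] are a matching query/passage pair; the other passages
    # in the batch act as negatives (an assumption about the training setup)
    q = F.normalize(q_emb, p=2, dim=1)
    p = F.normalize(p_emb, p=2, dim=1)
    scores = q @ p.T / temperature                     # cosine similarity / temperature
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)
```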
The finetune script is accessible in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
You can easily finetune your model with it.

View File

@ -49,7 +49,7 @@
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with performance similar to `bge-large`, but faster inference and a smaller embedding dimension | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with faster inference than the base model | `为这个句子生成表示以用于检索相关文章:` |
\*: If you need to search for documents relevant to a short query, you need to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly.
\*: If you need to search for documents relevant to a short query, you need to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. **In all cases, you do not need to add the instruction to candidate documents**.
@ -100,9 +100,10 @@ model = SentenceTransformer('BAAI/bge-large-zh')
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings)
```
For retrieval tasks, when you use a model whose name ends with `-instruction`,
For retrieval tasks,
each query should start with an instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
No instruction needs to be added for passages.
```python
queries = ["手机开不了机怎么办?"]
passages = ["样例段落-1", "样例段落-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"
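# The diff hunk is truncated here; the lines below are a minimal hedged sketch of how
# this example typically continues (the model name is an assumption based on the rest
# of this README):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-zh')
# prepend the instruction to queries only; passages are encoded without it
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```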
@ -139,7 +140,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# for retrieval task, add an instruction to the query
# for retrieval task, add an instruction to the query (do not add the instruction to passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
@ -238,7 +239,7 @@ print("Sentence embeddings:", sentence_embeddings)
We also add the instruction to queries for the retrieval task during training.
For English, the instruction is `Represent this sentence for searching relevant passages: `;
for Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
In the evaluation, the instruction needs to be added to queries for the passage retrieval task.
In the evaluation, the instruction needs to be added to queries for the passage retrieval task, but no instruction needs to be added to the passages.
The fine-tuning script is available in this repository: [FlagEmbedding](./FlagEmbedding/baai_general_embedding); you can easily fine-tune your model with it.