diff --git a/FlagEmbedding/.DS_Store b/FlagEmbedding/.DS_Store
index 70918f2..32609fe 100644
Binary files a/FlagEmbedding/.DS_Store and b/FlagEmbedding/.DS_Store differ
diff --git a/FlagEmbedding/baai_general_embedding/README.md b/FlagEmbedding/baai_general_embedding/README.md
index 2de92bc..4ccb9d3 100644
--- a/FlagEmbedding/baai_general_embedding/README.md
+++ b/FlagEmbedding/baai_general_embedding/README.md
@@ -240,7 +240,8 @@ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C
 | [text2vec-base](https://huggingface.co/shibing624/text2vec-base-chinese) | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
 | [text2vec-large](https://huggingface.co/GanymedeNil/text2vec-large-chinese) | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |
 
-
+- **Your data**
+If you want to evaluate the model on your data, you can refer to this [command]()
 
 ## Acknowledgement
 
diff --git a/FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py b/FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py
index 2b6d36f..1c0d03a 100644
--- a/FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py
+++ b/FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py
@@ -26,6 +26,15 @@ class Args:
         default=False,
         metadata={'help': 'Add query-side instruction?'}
     )
+
+    corpus_data: str = field(
+        default="namespace-Pt/msmarco-corpus",
+        metadata={'help': 'candidate passages'}
+    )
+    query_data: str = field(
+        default="namespace-Pt/msmarco",
+        metadata={'help': 'queries and their positive passages for evaluation'}
+    )
 
     max_query_length: int = field(
         default=32,
@@ -183,9 +192,14 @@ def evaluate(preds, labels, cutoffs=[1,10,100]):
 def main():
     parser = HfArgumentParser([Args])
     args: Args = parser.parse_args_into_dataclasses()[0]
-    
-    eval_data = datasets.load_dataset("namespace-Pt/msmarco", split="dev")
-    corpus = datasets.load_dataset("namespace-Pt/msmarco-corpus", split="train")
+    
+    if args.query_data == 'namespace-Pt/msmarco':
+        assert args.corpus_data == 'namespace-Pt/msmarco-corpus'
+        eval_data = datasets.load_dataset("namespace-Pt/msmarco", split="dev")
+        corpus = datasets.load_dataset("namespace-Pt/msmarco-corpus", split="train")
+    else:
+        eval_data = datasets.load_dataset('json', data_files=args.query_data, split='train')
+        corpus = datasets.load_dataset('json', data_files=args.corpus_data, split='train')
 
     model = FlagModel(
         args.encoder,
diff --git a/examples/finetune/README.md b/examples/finetune/README.md
index 2d3c311..c949516 100644
--- a/examples/finetune/README.md
+++ b/examples/finetune/README.md
@@ -87,7 +87,7 @@ Noted that the number of negatives should not be larger than the numbers of nega
 Besides the negatives in this group, the in-batch negatives also will be used in fine-tuning.
 - `negatives_cross_device`: share the negatives across all GPUs. This argument will extend the number of negatives.
 - `learning_rate`: select a appropriate for your model. Recommend 1e-5/2e-5/3e-5 for large/base/small-scale.
-- `temperature`: It will influence the distribution of similarity scores. **Recommend set it 0.01-0.1.**
+- `temperature`: It will influence the distribution of similarity scores. **Recommended value: 0.01-0.1.**
 - `query_max_len`: max length for query. Please set it according the average length of queries in your data.
 - `passage_max_len`: max length for passage. Please set it according the average length of passages in your data.
 - `query_instruction_for_retrieval`: instruction for query, which will be added to each query. You also can set it `""` to add nothing to query.
@@ -150,16 +150,24 @@ Please replace the `query_instruction_for_retrieval` with your instruction if yo
 
 ### 6. Evaluate model
 
-We provide [a simple script](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance on MSMARCO, a widely used retrieval benchmark.
+We provide [a simple script](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance.
+A brief summary of how the script works:
+1. Load the model on all available GPUs through [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html).
+2. Encode the corpus and store the embeddings in a `faiss` Flat index. By default, `faiss` also places the index on all available GPUs.
+3. Encode the queries and search the `100` nearest neighbors for each query.
+4. Compute Recall and MRR metrics.
 
 First, install `faiss`, a popular approximate nearest neighbor search library:
 ```bash
 conda install -c conda-forge faiss-gpu
 ```
 
-Next, you can check the data formats for the [msmarco corpus](https://huggingface.co/datasets/namespace-Pt/msmarco-corpus) and [evaluation queries](https://huggingface.co/datasets/namespace-Pt/msmarco).
+#### 6.1 MSMARCO dataset
+The default evaluation data is MSMARCO, a widely used retrieval benchmark.
 
-Finally, run the following command:
+You can check the data formats for the [msmarco corpus](https://huggingface.co/datasets/namespace-Pt/msmarco-corpus) and [evaluation queries](https://huggingface.co/datasets/namespace-Pt/msmarco).
+
+Run the following command:
 
 ```bash
 python -m FlagEmbedding.baai_general_embedding.finetune.eval_msmarco \
@@ -186,8 +194,33 @@ The results should be similar to
 }
 ```
 
-A brief summary of how the script works:
-1. Load the model on all available GPUs through [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html).
-2. Encode the corpus and offload the embeddings in `faiss` Flat index. By default, `faiss` also dumps the index on all available GPUs.
-3. Encode the queries and search `100` nearest neighbors for each query.
-4. Compute Recall and MRR metrics.
+#### 6.2 Your dataset
+
+You should prepare two files in jsonl format:
+- One is corpus_data, which contains the text you want to search. A toy example: [toy_corpus.json](./toy_evaluation_data/toy_corpus.json)
+```
+{"content": "A is ..."}
+{"content": "B is ..."}
+{"content": "C is ..."}
+{"content": "Panda is ..."}
+{"content": "... is A"}
+```
+- The other is query_data, which contains the queries and the ground truth. A toy example: [toy_query.json](./toy_evaluation_data/toy_query.json)
+```
+{"query": "What is A?", "positive": ["A is ...", "... is A"]}
is A"]} +{"query": "What is B?", "positive": ["B is ..."]} +{"query": "What is C?", "positive": ["C is ..."]} +``` + +Then, pass the data path to evaluation script: +```bash +python -m FlagEmbedding.baai_general_embedding.finetune.eval_msmarco \ +--encoder BAAI/bge-base-en-v1.5 \ +--fp16 \ +--add_instruction \ +--k 100 \ +--corpus_data ./toy_evaluation_data/toy_corpus.json \ +--query_data ./toy_evaluation_data/toy_query.json +``` + + diff --git a/examples/finetune/toy_evaluation_data/toy_corpus.json b/examples/finetune/toy_evaluation_data/toy_corpus.json new file mode 100644 index 0000000..4e7d44a --- /dev/null +++ b/examples/finetune/toy_evaluation_data/toy_corpus.json @@ -0,0 +1,5 @@ +{"content": "A is ..."} +{"content": "B is ..."} +{"content": "C is ..."} +{"content": "Panda is ..."} +{"content": "... is A"} \ No newline at end of file diff --git a/examples/finetune/toy_evaluation_data/toy_query.json b/examples/finetune/toy_evaluation_data/toy_query.json new file mode 100644 index 0000000..f3b88a9 --- /dev/null +++ b/examples/finetune/toy_evaluation_data/toy_query.json @@ -0,0 +1,3 @@ +{"query": "What is A?", "positive": ["A is ...", "... is A"]} +{"query": "What is B?", "positive": ["B is ..."]} +{"query": "What is C?", "positive": ["C is ..."]} \ No newline at end of file