evaluation local data

shitao 2024-05-12 20:02:12 +08:00
parent 34d24e85e0
commit 4a7412d33f
6 changed files with 69 additions and 13 deletions

Binary file not shown.

View File

@@ -240,7 +240,8 @@ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C
| [text2vec-base](https://huggingface.co/shibing624/text2vec-base-chinese) | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| [text2vec-large](https://huggingface.co/GanymedeNil/text2vec-large-chinese) | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |
- **Your data**
If you want to evaluate the model on your own data, you can refer to this [command]().
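For reference, the evaluation command introduced in this commit's finetune README can be pointed at local files like this (shown with the toy files added in this commit):
```bash
python -m FlagEmbedding.baai_general_embedding.finetune.eval_msmarco \
--encoder BAAI/bge-base-en-v1.5 \
--fp16 \
--add_instruction \
--k 100 \
--corpus_data ./toy_evaluation_data/toy_corpus.json \
--query_data ./toy_evaluation_data/toy_query.json
```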
## Acknowledgement

View File

@@ -26,6 +26,15 @@ class Args:
        default=False,
        metadata={'help': 'Add query-side instruction?'}
    )
    corpus_data: str = field(
        default="namespace-Pt/msmarco-corpus",
        metadata={'help': 'candidate passages'}
    )
    query_data: str = field(
        default="namespace-Pt/msmarco",
        metadata={'help': 'queries and their positive passages for evaluation'}
    )
    max_query_length: int = field(
        default=32,
@@ -183,9 +192,14 @@ def evaluate(preds, labels, cutoffs=[1,10,100]):
def main():
    parser = HfArgumentParser([Args])
    args: Args = parser.parse_args_into_dataclasses()[0]
    if args.query_data == 'namespace-Pt/msmarco':
        # Default: evaluate on the MSMARCO dev queries against the MSMARCO corpus.
        assert args.corpus_data == 'namespace-Pt/msmarco-corpus'
        eval_data = datasets.load_dataset("namespace-Pt/msmarco", split="dev")
        corpus = datasets.load_dataset("namespace-Pt/msmarco-corpus", split="train")
    else:
        # Local data: both files are jsonl and load as a single 'train' split.
        eval_data = datasets.load_dataset('json', data_files=args.query_data, split='train')
        corpus = datasets.load_dataset('json', data_files=args.corpus_data, split='train')
    model = FlagModel(
        args.encoder,

View File

@@ -87,7 +87,7 @@ Note that the number of negatives should not be larger than the number of negatives
Besides the negatives in this group, the in-batch negatives will also be used in fine-tuning.
- `negatives_cross_device`: share the negatives across all GPUs. This argument will extend the number of negatives.
- `learning_rate`: select an appropriate value for your model. We recommend 1e-5/2e-5/3e-5 for large/base/small-scale models.
- `temperature`: It will influence the distribution of similarity scores. **Recommended value: 0.01-0.1.**
- `query_max_len`: max length for queries. Please set it according to the average length of queries in your data.
- `passage_max_len`: max length for passages. Please set it according to the average length of passages in your data.
- `query_instruction_for_retrieval`: instruction for queries, which will be added to each query. You can also set it to `""` to add nothing to the query. A command sketch combining these arguments appears below.
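To make these knobs concrete, here is a sketch of a fine-tuning command that sets them explicitly, modeled on the training command shown earlier in this README. The GPU count, output path, and data file are placeholders; `--normlized` is the flag's actual spelling in this toolkit:
```bash
torchrun --nproc_per_node 8 \
-m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir ./bge_finetuned \
--model_name_or_path BAAI/bge-base-en-v1.5 \
--train_data ./toy_finetune_data.jsonl \
--learning_rate 2e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 8 \
--negatives_cross_device \
--query_instruction_for_retrieval ""
```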
@@ -150,16 +150,24 @@ Please replace the `query_instruction_for_retrieval` with your instruction if you
### 6. Evaluate model
We provide [a simple script](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance.
A brief summary of how the script works:
1. Load the model on all available GPUs through [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html).
2. Encode the corpus and offload the embeddings into a `faiss` Flat index. By default, `faiss` places the index on all available GPUs.
3. Encode the queries and search the `100` nearest neighbors for each query.
4. Compute Recall and MRR metrics.
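For readers who want the mechanics, here is a minimal standalone sketch of steps 2-4. The random embeddings and the `positives` ground truth below are stand-ins for illustration, not the script's actual variables:
```python
import faiss
import numpy as np

# Stand-ins for the encoded corpus and queries (the real script obtains these
# from the model's encode calls). Vectors are L2-normalized so that inner
# product equals cosine similarity.
corpus_emb = np.random.randn(1000, 768).astype(np.float32)
query_emb = np.random.randn(8, 768).astype(np.float32)
faiss.normalize_L2(corpus_emb)
faiss.normalize_L2(query_emb)

# Step 2: an exact (Flat) inner-product index over the corpus.
index = faiss.IndexFlatIP(corpus_emb.shape[1])
index.add(corpus_emb)

# Step 3: retrieve the 100 nearest neighbors of each query.
scores, ids = index.search(query_emb, 100)

# Step 4: Recall@k and MRR@k against hypothetical gold corpus ids.
positives = [{0, 17}, {5}]  # gold ids for the first two queries (toy data)
for k in (1, 10, 100):
    recall, mrr = 0.0, 0.0
    for gold, retrieved in zip(positives, ids):
        top_k = retrieved[:k].tolist()
        recall += len(gold.intersection(top_k)) / len(gold)
        rank = next((r for r, doc in enumerate(top_k) if doc in gold), None)
        if rank is not None:
            mrr += 1.0 / (rank + 1)
    n = len(positives)
    print(f"Recall@{k}: {recall / n:.3f}  MRR@{k}: {mrr / n:.3f}")
```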
First, install `faiss`, a popular approximate nearest neighbor search library:
```bash
conda install -c conda-forge faiss-gpu
```
#### 6.1 MSMARCO dataset
The default evaluation data is MSMARCO, a widely used retrieval benchmark.
You can check the data formats for the [msmarco corpus](https://huggingface.co/datasets/namespace-Pt/msmarco-corpus) and [evaluation queries](https://huggingface.co/datasets/namespace-Pt/msmarco).
Run the following command:
```bash
python -m FlagEmbedding.baai_general_embedding.finetune.eval_msmarco \
@@ -186,8 +194,33 @@ The results should be similar to
}
```
#### 6.2 Your dataset
You should prepare two files in jsonl format:
- One is `corpus_data`, which contains the texts you want to search over. A toy example: [toy_corpus.json](./toy_evaluation_data/toy_corpus.json)
```
{"content": "A is ..."}
{"content": "B is ..."}
{"content": "C is ..."}
{"content": "Panda is ..."}
{"content": "... is A"}
```
- The other is `query_data`, which contains the queries and their ground-truth passages. A toy example: [toy_query.json](./toy_evaluation_data/toy_query.json)
```
{"query": "What is A?", "positive": ["A is ...", "... is A"]}
{"query": "What is B?", "positive": ["B is ..."]}
{"query": "What is C?", "positive": ["C is ..."]}
```
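If you prefer to generate these two files programmatically, here is a minimal sketch that writes the toy records above in jsonl form (one JSON object per line):
```python
import json

corpus = [
    {"content": "A is ..."},
    {"content": "B is ..."},
    {"content": "C is ..."},
    {"content": "Panda is ..."},
    {"content": "... is A"},
]
queries = [
    {"query": "What is A?", "positive": ["A is ...", "... is A"]},
    {"query": "What is B?", "positive": ["B is ..."]},
    {"query": "What is C?", "positive": ["C is ..."]},
]

# jsonl: one object per line, no enclosing list or trailing commas.
with open("toy_evaluation_data/toy_corpus.json", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in corpus)
with open("toy_evaluation_data/toy_query.json", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in queries)
```
As in the toy examples, each string in `positive` should appear verbatim as a `content` entry in the corpus file, so that retrieved passages can be matched against the ground truth.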
Then, pass the data paths to the evaluation script:
```bash
python -m FlagEmbedding.baai_general_embedding.finetune.eval_msmarco \
--encoder BAAI/bge-base-en-v1.5 \
--fp16 \
--add_instruction \
--k 100 \
--corpus_data ./toy_evaluation_data/toy_corpus.json \
--query_data ./toy_evaluation_data/toy_query.json
```

View File

@@ -0,0 +1,5 @@
{"content": "A is ..."}
{"content": "B is ..."}
{"content": "C is ..."}
{"content": "Panda is ..."}
{"content": "... is A"}

View File

@@ -0,0 +1,3 @@
{"query": "What is A?", "positive": ["A is ...", "... is A"]}
{"query": "What is B?", "positive": ["B is ..."]}
{"query": "What is C?", "positive": ["C is ..."]}