{
  "chart_type": "BarChart",
  "title": "Reader Performance",
  "subtitle": "Time and Accuracy Benchmarks",
  "description": "Performance benchmarks of different Readers that can be used off-the-shelf in Haystack. Some models are geared towards speed, while others are more accuracy-focused. Accuracy is measured as F1 score and speed as passages/sec (with passages of 384 tokens). Each Reader is benchmarked on the SQuAD v2.0 development set, which contains 11,866 question-answer pairs. When tokenized with the BERT tokenizer and split using a sliding-window approach, these become 12,350 passages that are passed into the model. We set <i>max_seq_len=384</i> and <i>doc_stride=128</i>. These benchmarks are run on an AWS p3.2xlarge instance with an Nvidia V100 GPU using this <a href='https://github.com/deepset-ai/haystack/blob/master/test/benchmarks/reader.py'>script</a>. Note that we use the FARMReader class rather than the TransformersReader class, and that the F1 score reported here is calculated at token level rather than at word level, as is done in the official SQuAD evaluation script.",
  "bars": "horizontal",
  "columns": [
    "Model",
    "F1",
    "Speed (passages/sec)"
  ],
  "data": [
    {"F1": 80.67985794671885, "Model": "RoBERTa", "Speed": 92.3039712094936},
    {"F1": 78.23306265318686, "Model": "MiniLM", "Speed": 98.62387044489223},
    {"F1": 74.90271600053505, "Model": "BERT base", "Speed": 99.92750782409666},
    {"F1": 82.64545708097472, "Model": "BERT large", "Speed": 39.529824033964466},
    {"F1": 85.26275190954586, "Model": "XLM-RoBERTa", "Speed": 39.29142006004379}
  ]
}
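
For context, the F1 values in the "data" array come from evaluating each model on the SQuAD v2.0 dev set with the settings named in the description. Below is a minimal sketch of that accuracy loop, assuming the Haystack v0.x API (a FARMReader with an eval_on_file() method); the checkpoint names and file paths are illustrative assumptions, not listed in this file. The linked reader.py script is the authoritative version and also measures the passages/sec figures.

```python
import json

from haystack.reader.farm import FARMReader  # Haystack v0.x import path

# Illustrative paths (assumptions): point these at a local copy
# of the SQuAD v2.0 dev set.
DATA_DIR = "data/squad20"
DEV_FILE = "dev-v2.0.json"

# Display names from the "data" array above, mapped to plausible
# Hugging Face checkpoints (assumptions; the file itself lists none).
MODELS = {
    "RoBERTa": "deepset/roberta-base-squad2",
    "MiniLM": "deepset/minilm-uncased-squad2",
    "BERT base": "deepset/bert-base-cased-squad2",
    "BERT large": "deepset/bert-large-uncased-whole-word-masking-squad2",
    "XLM-RoBERTa": "deepset/xlm-roberta-large-squad2",
}

results = []
for name, checkpoint in MODELS.items():
    # Same settings as stated in the description above.
    reader = FARMReader(
        model_name_or_path=checkpoint,
        max_seq_len=384,
        doc_stride=128,
    )
    # eval_on_file() runs SQuAD-style evaluation; the returned "f1"
    # is the token-level F1 mentioned in the description.
    metrics = reader.eval_on_file(
        data_dir=DATA_DIR,
        test_filename=DEV_FILE,
        device="cuda",
    )
    results.append({"Model": name, "F1": metrics["f1"]})

print(json.dumps(results, indent=2))
```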