haystack/test/benchmarks/data_scripts/embeddings_slice.py
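"""
Slice the precomputed Wikipedia passage embedding shards down to the first
n_passages passages used for retriever benchmarking: all gold passages from the
NQ-to-SQuAD dev set (nq2squad-dev.json) plus enough negative passages from
psgs_w100_minus_gold.tsv to reach the target count. The matching
(passage_id, vector) pairs are gathered from the 50 embedding shards and
written to a single pickle file.
"""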
import json
import pickle
from pathlib import Path

from tqdm import tqdm

n_passages = 1_000_000
embeddings_dir = Path("embeddings")
embeddings_filenames = [f"wikipedia_passages_{i}.pkl" for i in range(50)]
neg_passages_filename = "psgs_w100_minus_gold.tsv"
gold_passages_filename = "nq2squad-dev.json"

# Extract gold passage ids
passage_ids = []
with open(gold_passages_filename) as f:
    gold_data = json.load(f)["data"]
for d in gold_data:
    for p in d["paragraphs"]:
        passage_ids.append(str(p["passage_id"]))

print("gold_ids")
print(len(passage_ids))
print()

# Extract neg passage ids
with open(neg_passages_filename) as f:
    f.readline()  # Ignore column headers
    for _ in range(n_passages - len(passage_ids)):
        line = f.readline()
        passage_ids.append(str(line.split()[0]))

assert len(passage_ids) == len(set(passage_ids))
assert set(type(x) for x in passage_ids) == {str}
passage_ids = set(passage_ids)

print("all_ids")
print(len(passage_ids))
print()

# Gather vectors for passages
ret = []
for ef in tqdm(embeddings_filenames):
    with open(embeddings_dir / ef, "rb") as f:
        curr = pickle.load(f)
    for i, vec in curr:
        if i in passage_ids:
            ret.append((i, vec))

print("n_vectors")
print(len(ret))
print()

# Write vectors to file
with open(f"wikipedia_passages_{n_passages}.pkl", "wb") as f:
    pickle.dump(ret, f)
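
# Illustrative sanity check (an addition, not part of the original script):
# reload the slice just written and confirm the (passage_id, vector) pairs
# round-trip intact.
with open(f"wikipedia_passages_{n_passages}.pkl", "rb") as f:
    sliced = pickle.load(f)
assert len(sliced) == len(ret)
print(f"Wrote {len(sliced)} passage embeddings to wikipedia_passages_{n_passages}.pkl")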