haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-12-29 16:08:38 +00:00

Author	SHA1	Message	Date
Branden Chan	1cebcb7dda	Create time and performance benchmarks for all readers and retrievers (#339) * add time and perf benchmark for es * Add retriever benchmarking * Add Reader benchmarking * add nq to squad conversion * add conversion stats * clean benchmarks * Add link to dataset * Update imports * add first support for neg psgs * Refactor test * set max_seq_len * cleanup benchmark * begin retriever speed benchmarking * Add support for retriever query index benchmarking * improve reader eval, retriever speed benchmarking * improve retriever speed benchmarking * Add retriever accuracy benchmark * Add neg doc shuffling * Add top_n * 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging * Add models to sweep * add option for faiss index type * remove unneeded line * change faiss to faiss_flat * begin automatic benchmark script * remove existing postgres docker for benchmarking * Add data processing scripts * Remove shuffle in script bc data already shuffled * switch hnsw setup from 256 to 128 * change es similarity to dot product by default * Error includes stack trace * Change ES default timeout * remove delete_docs() from timing for indexing * Add support for website export * update website on push to benchmarks * add complete benchmarks results * new json format * removed NaN as is not a valid json token * fix benchmarking for faiss hnsw queries. do sql calls in update_embeddings() as batches * update benchmarks for hnsw 128,20,80 * don't delete full index in delete_all_documents() * update texts for charts * update recall column for retriever * change scale and add units to desc * add units to legend * add axis titles. update desc * add html tags Co-authored-by: deepset <deepset@Crenolape.localdomain> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>	2020-10-12 13:34:42 +02:00
Malte Pietsch	8edeb844f7	Remove phi normalization from FAISS, support more index types, 3x speedup (#467) * remove phi normalization * add special case for hnsw * rename vector_size to vector_dim * fix loading. fix extra dim in tests * switch to new ES syntax for vector similarity * 3x sql speed up. cascade deletes. add train_index() * add docstrings. remove vector_dim from load() * delete docs from faiss and sql * fix delete of docs in test * relax type hint for faiss index * rename metric to metric_type Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>	2020-10-06 16:09:56 +02:00
Lalit Pagaria	465ccbc12e	Allow multiple write calls to existing FAISS index. (#422) - Fixing issue when update_embeddings always create new FAISS index instead of clearing existing one. New index creation may not free existing used memory and cause memory leak. Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-10-05 12:01:20 +02:00
Tanay Soni	669c72d538	Enable bulk operations on vector IDs for FAISSDocumentStore (#460)	2020-10-02 14:43:25 +02:00
Lalit Pagaria	9b58374b7c	Skip file conversion if file type is not supported (#456) * Skip file converter if file type is not supported. Refer https://github.com/deepset-ai/haystack/issues/453 * Fixing issue reported by mypy * Addressing review comments	2020-10-01 14:47:45 +02:00
Malte Pietsch	271ff30262	fix type casting of embeddings for tutorial 4 (#402)	2020-09-18 18:10:50 +02:00
Malte Pietsch	db6864d159	Fix type casting for vectors in FAISS (#399) * Fix type casting for vectors in FAISS Co-authored-by: philipp-bode <philipp.bode@student.hpi.de> * add type casts for elastic. refactor embedding retriever tests * fix case: empty embedding field * fix faiss tolerance * add assert in test_faiss_retrieving Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>	2020-09-18 17:08:13 +02:00
Malte Pietsch	d69133966d	Fix faiss test tolerance	2020-09-18 13:57:29 +02:00
Malte Pietsch	4c503158a7	Fix duplicate vector ids in FAISS (#395) * fix duplicate vector ids in faiss * Add test Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com> * revert score change * switch to faiss_index.ntotal for ids. add tests Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>	2020-09-18 12:52:22 +02:00
Tanay Soni	0859da8f74	Fix document filtering in SQLDocumentStore (#396)	2020-09-18 12:22:52 +02:00
Tanay Soni	3399fc784d	Refactor file converter interface (#393)	2020-09-18 10:42:13 +02:00
Malte Pietsch	9727829cc6	Rename and restructure modules (database, indexing, schemas) (#379) * rename database to documentstore * move document, label, multilabel to haystack/schema.py * rename documentstore -> document_store * split indexing modules -> file_converter + preprocessor * fix order of imports * Update tutorial notebooks * fix torch version in tutorial 4	2020-09-16 18:33:23 +02:00
Lalit P	de5ad42e46	Adjust tests for MacOS (#374)	2020-09-15 15:04:46 +02:00
brandenchan	cca8676f90	More robust eval	2020-08-26 12:01:59 +02:00
kolk	f2b6cc761b	Refactor DPR from FB to Transformers codebase (#308) * change_HFBertEncoder to transformers DPREncoder * Removed BertTensorizer * model download relative path * Refactor model load * Tutorial5 DPR updated * fix print_eval_results typo * copy transformers DPR modules in dpr_utils and test * transformer v3.0.2 import errors fixed * remove dependency of DPRConfig on attribute use_return_tuple * Adjust transformers 302 locally to work with dpr * projection layer removed from DPR encoders * fixed mypy errors * transformers DPR compatible code added * transformers DPR compatibility added * bug fix in tutorial 6 notebook * Docstring update and variable naming issues fix * tutorial modified to reflect DPR variable naming change * title addition to passage use-cases handled * modified handling untitled batch * resolved mypy errors * typos in docstrings and comments fixed * cleaned DPR code and added new test cases * warnings added for non-bert model [SEP] token removal * changed warning to logger warning * title mask creation refactored * bug fix on cuda issues * tutorial 6 instantiates modified DPR * tutorial 5 modified * tutorial 5 ipython notebook modified: DPR instantiation * batch_size added to DPR instantiation * tutorial 5 jupyter notebook typos fixed * improved docstrings, fixed typos * Update docstring Co-authored-by: Timo Moeller <timo.moeller@deepset.ai> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-08-25 20:16:00 +05:30
Tanay Soni	3a42eb663e	Include InMemoryDocumentStore for DPR test	2020-08-24 14:44:12 +02:00
bogdankostic	f388ca025c	Aggregate multiple no answers in MultiLabel (#324) * Aggregate multiple no answers * Add test for multiple no answers	2020-08-18 18:25:01 +02:00
bogdankostic	72b1013560	Restructure update embeddings (#304) * Restructure update embeddings * Adapt FAISSDocStore * Adapt test and tutorial Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>	2020-08-18 14:04:31 +02:00
bogdankostic	b30963d0cd	Add Tests for MultiLabel (#318) * Add tests for MultiLabel * Add test for no_answer and is_correct_answer=False + fix bug in MultiLabel aggregation * Fix bug in MultiLabel aggregation	2020-08-17 20:14:31 +02:00
Tanay Soni	01ff66dfd6	Remove redundant test fixture	2020-08-17 14:19:38 +02:00
Dany	403318b1f5	Add Tika Converter (#314)	2020-08-17 11:21:09 +02:00
Tanay Soni	1637ce1184	Revert "Add Tika Converter (#314)" This reverts commit 5ef59b1901da6d51bfa085683321a243228d4fc9.	2020-08-17 11:13:52 +02:00
Tanay Soni	5ef59b1901	Add Tika Converter (#314)	2020-08-14 14:13:59 +02:00
Tanay Soni	089fecf99e	Fix indexing of metadata for FAISS/SQL Document Store (#310)	2020-08-13 12:25:32 +02:00
bogdankostic	5186d2d235	Batch prediction in evaluation (#137) * Add Batch evaluation * Separate evaluation methods * Clean calculation of eval metrics * Adapt eval to Label objects * Fix format of no_answer * Adapt to MultiLabel * Add tests	2020-08-10 19:30:31 +02:00
Karim Jana	c7078a36c0	Custom fields for indexing in ElasticsearchDocumentStore (#297)	2020-08-10 11:34:39 +02:00
Tanay Soni	9d0df60aad	Add FAISS Document Store (#253)	2020-08-07 14:25:08 +02:00
Timo Moeller	d9e8b522a1	Add "no answer" aggregation to Transformersreader (#259) * Add no answer aggregation * Change to covariant type annotation * Remove n_best_per_passage from transformersreader	2020-08-06 17:32:55 +02:00
Tanay Soni	5937f9cf16	Deprecate Tags for Document Stores (#286)	2020-08-04 14:24:12 +02:00
Tanay Soni	723921475f	Make document ids of str type (#284)	2020-08-03 16:20:17 +02:00
Tanay Soni	d90435efd6	Add wait for Elasticsearch update call	2020-07-31 12:06:27 +02:00
Malte Pietsch	29a15c0d59	Add eval for Dense Passage Retriever & Refactor handling of labels/feedback (#243)	2020-07-31 11:34:06 +02:00
Tanay Soni	5210c8c2ab	Add method to update meta fields for documents in Elasticsearch (#242)	2020-07-16 15:34:55 +02:00
Malte Pietsch	6bed2f509f	Refactor DPR for latest transformers version & change init arg `gpu` -> `use_gpu` for DPR and EmbeddingRetriever (#239) * fix tokenizer warning in latest transformers * change dpr arg from gpu to use_gpu * change gpu arg for EmbeddingRetriever	2020-07-16 10:45:01 +02:00
Tanay Soni	5c1a5fe61d	Add dummy retriever for benchmarking / reader-only settings (#235)	2020-07-15 17:22:17 +02:00
Tanay Soni	912e98cd40	Fix id for documents returned by the TfidfRetriever (#232)	2020-07-15 14:55:07 +02:00
Malte Pietsch	99a6a34047	Upgrade to new FARM / Transformers / PyTorch versions (#212)	2020-07-14 18:53:15 +02:00
Anirban Saha	6b217732f5	Add basic support for Docx Files (#225)	2020-07-14 12:28:19 +02:00
Tanay Soni	b886e054a3	Move document_name attribute to meta (#217)	2020-07-14 09:53:31 +02:00
Malte Pietsch	d2b26a99ff	Add more tests (#213)	2020-07-10 10:54:56 +02:00
Malte Pietsch	07ecfb60b9	Dense Passage Retriever (Inference) (#167)	2020-06-30 19:05:45 +02:00
Tanay Soni	ec433a5ed6	Move out REST API from PyPI package (#160)	2020-06-22 12:07:12 +02:00
Tanay Soni	a349eef0db	Add API endpoint to upload files (#154)	2020-06-17 16:28:26 +02:00
Tanay Soni	180dc8cbd6	Start Elasticsearch with a Github Action (#142)	2020-06-09 12:46:15 +02:00
Tanay Soni	160345f3d5	Update build workflow	2020-06-09 11:45:25 +02:00
Tanay Soni	ef9e4f4467	Add PDF text extraction (#109)	2020-06-08 11:07:19 +02:00
Stan Kirdey	ca6778d934	Add metadata for TF-IDF Retriever (#122)	2020-05-28 10:55:28 +02:00
Stan Kirdey	bf8e506c45	Add embedding query for InMemoryDocumentStore	2020-05-18 14:47:41 +02:00
Stan Kirdey	72a3b70d7a	Add filtering by tags for InMemoryDocumentStore (#108)	2020-05-14 22:12:25 +02:00
Tanay Soni	37e0ff70f7	Add test for Elasticsearch document store (#88)	2020-05-04 18:00:07 +02:00

59 Commits