haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-08 09:31:31 +00:00

Author	SHA1	Message	Date
Tanay Soni	db4151bbc0	Fix scoring in Elasticsearch for dot product (#517)	2020-10-23 17:50:49 +02:00
Branden Chan	fbacdfd263	Add logging of error, add n_docs assert	2020-10-22 15:45:46 +02:00
Branden Chan	b0483cfd99	add readme	2020-10-22 15:32:56 +02:00
Tanay Soni	3bec264d76	Add filters for document count (#512)	2020-10-22 12:42:13 +02:00
brandenchan	87e5f06fa8	add automatic json update	2020-10-21 17:59:44 +02:00
brandenchan	d3743d00e9	Merge branch 'master' into automate_benchmarks	2020-10-21 17:48:10 +02:00
Lalit Pagaria	63c12371b9	Change arg "model" to "model_name_or_path" in TransformersReader (#510) * Consistent parameter naming for TransformersReader along with removing unused imports as well. * Addressing review comments	2020-10-21 17:15:35 +02:00
Malte Pietsch	956543e239	Restructure checks in PreProcessor (#504) * restructure checks * fix variable name * Fix test	2020-10-20 06:43:59 +02:00
Malte Pietsch	11a3976945	update deletes. fix arg in run.py	2020-10-19 14:40:26 +02:00
Markus Paff	2531c8e061	Add versioning docs (#495) * add time and perf benchmark for es * Add retriever benchmarking * Add Reader benchmarking * add nq to squad conversion * add conversion stats * clean benchmarks * Add link to dataset * Update imports * add first support for neg psgs * Refactor test * set max_seq_len * cleanup benchmark * begin retriever speed benchmarking * Add support for retriever query index benchmarking * improve reader eval, retriever speed benchmarking * improve retriever speed benchmarking * Add retriever accuracy benchmark * Add neg doc shuffling * Add top_n * 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging * Add models to sweep * add option for faiss index type * remove unneeded line * change faiss to faiss_flat * begin automatic benchmark script * remove existing postgres docker for benchmarking * Add data processing scripts * Remove shuffle in script bc data already shuffled * switch hnsw setup from 256 to 128 * change es similarity to dot product by default * Error includes stack trace * Change ES default timeout * remove delete_docs() from timing for indexing * Add support for website export * update website on push to benchmarks * add complete benchmarks results * new json format * removed NaN as is not a valid json token * versioning for docs * unsaved changes * cleaning * cleaning * Edit format of benchmarks data * update also jsons in v0.4.0 Co-authored-by: brandenchan <brandenchan@icloud.com> Co-authored-by: deepset <deepset@Crenolape.localdomain> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-10-19 11:46:51 +02:00
Lalit Pagaria	b9da789475	Add Elasticsearch Query DSL compliant Query API (#471)	2020-10-16 13:25:31 +02:00
brandenchan	b9bb8d6cc1	Fix try except	2020-10-16 12:16:32 +02:00
brandenchan	6d60cc9451	add automation pipeline	2020-10-15 18:12:17 +02:00
Tanay Soni	974b37eded	Add PreProcessor to simplify splitting and cleaning of docs (#473) * Add PreProcessing * Adjust PDF conversion tests * Add tests for Preprocessing * Add requirement * Fix tests * Ignore decoding errors for TextConverter * Rename split_size to split_length * Adjust tests Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-10-15 10:42:08 +02:00
Lalit Pagaria	2e9f3c1512	Fix update_embeddings function in FAISSDocumentStore and add retriever fixture in tests (#481) * 1. Prevent update_embeddings function in FAISSDocumentStore to set faiss_index as None when document store does not have any docs. 2. cleaning up tests by adding fixture for retriever. * TfidfRetriever need document store with documents during initialization as it call fit() function in constructor so fixing it by checking self.paragraphs of None * Fix naming of retriever's fixture (embedded to embedding and tfid to tfidf)	2020-10-14 16:15:04 +02:00
Branden Chan	1cebcb7dda	Create time and performance benchmarks for all readers and retrievers (#339) * add time and perf benchmark for es * Add retriever benchmarking * Add Reader benchmarking * add nq to squad conversion * add conversion stats * clean benchmarks * Add link to dataset * Update imports * add first support for neg psgs * Refactor test * set max_seq_len * cleanup benchmark * begin retriever speed benchmarking * Add support for retriever query index benchmarking * improve reader eval, retriever speed benchmarking * improve retriever speed benchmarking * Add retriever accuracy benchmark * Add neg doc shuffling * Add top_n * 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging * Add models to sweep * add option for faiss index type * remove unneeded line * change faiss to faiss_flat * begin automatic benchmark script * remove existing postgres docker for benchmarking * Add data processing scripts * Remove shuffle in script bc data already shuffled * switch hnsw setup from 256 to 128 * change es similarity to dot product by default * Error includes stack trace * Change ES default timeout * remove delete_docs() from timing for indexing * Add support for website export * update website on push to benchmarks * add complete benchmarks results * new json format * removed NaN as is not a valid json token * fix benchmarking for faiss hnsw queries. do sql calls in update_embeddings() as batches * update benchmarks for hnsw 128,20,80 * don't delete full index in delete_all_documents() * update texts for charts * update recall column for retriever * change scale and add units to desc * add units to legend * add axis titles. update desc * add html tags Co-authored-by: deepset <deepset@Crenolape.localdomain> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>	2020-10-12 13:34:42 +02:00
Malte Pietsch	8edeb844f7	Remove phi normalization from FAISS, support more index types, 3x speedup (#467) * remove phi normalization * add special case for hnsw * rename vector_size to vector_dim * fix loading. fix extra dim in tests * switch to new ES syntax for vector similarity * 3x sql speed up. cascade deletes. add train_index() * add docstrings. remove vector_dim from load() * delete docs from faiss and sql * fix delete of docs in test * relax type hint for faiss index * rename metric to metric_type Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>	2020-10-06 16:09:56 +02:00
Lalit Pagaria	465ccbc12e	Allow multiple write calls to existing FAISS index. (#422) - Fixing issue when update_embeddings always create new FAISS index instead of clearing existing one. New index creation may not free existing used memory and cause memory leak. Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-10-05 12:01:20 +02:00
Tanay Soni	669c72d538	Enable bulk operations on vector IDs for FAISSDocumentStore (#460)	2020-10-02 14:43:25 +02:00
Lalit Pagaria	9b58374b7c	Skip file conversion if file type is not supported (#456) * Skip file converter if file type is not supported. Refer https://github.com/deepset-ai/haystack/issues/453 * Fixing issue reported by mypy * Addressing review comments	2020-10-01 14:47:45 +02:00
Malte Pietsch	271ff30262	fix type casting of embeddings for tutorial 4 (#402)	2020-09-18 18:10:50 +02:00
Malte Pietsch	db6864d159	Fix type casting for vectors in FAISS (#399) * Fix type casting for vectors in FAISS Co-authored-by: philipp-bode <philipp.bode@student.hpi.de> * add type casts for elastic. refactor embedding retriever tests * fix case: empty embedding field * fix faiss tolerance * add assert in test_faiss_retrieving Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>	2020-09-18 17:08:13 +02:00
Malte Pietsch	d69133966d	Fix faiss test tolerance	2020-09-18 13:57:29 +02:00
Malte Pietsch	4c503158a7	Fix duplicate vector ids in FAISS (#395) * fix duplicate vector ids in faiss * Add test Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com> * revert score change * switch to faiss_index.ntotal for ids. add tests Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>	2020-09-18 12:52:22 +02:00
Tanay Soni	0859da8f74	Fix document filtering in SQLDocumentStore (#396)	2020-09-18 12:22:52 +02:00
Tanay Soni	3399fc784d	Refactor file converter interface (#393)	2020-09-18 10:42:13 +02:00
Malte Pietsch	9727829cc6	Rename and restructure modules (database, indexing, schemas) (#379) * rename database to documentstore * move document, label, multilabel to haystack/schema.py * rename documentstore -> document_store * split indexing modules -> file_converter + preprocessor * fix order of imports * Update tutorial notebooks * fix torch version in tutorial 4	2020-09-16 18:33:23 +02:00
Lalit P	de5ad42e46	Adjust tests for MacOS (#374)	2020-09-15 15:04:46 +02:00
brandenchan	cca8676f90	More robust eval	2020-08-26 12:01:59 +02:00
kolk	f2b6cc761b	Refactor DPR from FB to Transformers codebase (#308) * change_HFBertEncoder to transformers DPREncoder * Removed BertTensorizer * model download relative path * Refactor model load * Tutorial5 DPR updated * fix print_eval_results typo * copy transformers DPR modules in dpr_utils and test * transformer v3.0.2 import errors fixed * remove dependency of DPRConfig on attribute use_return_tuple * Adjust transformers 302 locally to work with dpr * projection layer removed from DPR encoders * fixed mypy errors * transformers DPR compatible code added * transformers DPR compatibility added * bug fix in tutorial 6 notebook * Docstring update and variable naming issues fix * tutorial modified to reflect DPR variable naming change * title addition to passage use-cases handled * modified handling untitled batch * resolved mypy errors * typos in docstrings and comments fixed * cleaned DPR code and added new test cases * warnings added for non-bert model [SEP] token removal * changed warning to logger warning * title mask creation refactored * bug fix on cuda issues * tutorial 6 instantiates modified DPR * tutorial 5 modified * tutorial 5 ipython notebook modified: DPR instantiation * batch_size added to DPR instantiation * tutorial 5 jupyter notebook typos fixed * improved docstrings, fixed typos * Update docstring Co-authored-by: Timo Moeller <timo.moeller@deepset.ai> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2020-08-25 20:16:00 +05:30
Tanay Soni	3a42eb663e	Include InMemoryDocumetStore for DPR test	2020-08-24 14:44:12 +02:00
bogdankostic	f388ca025c	Aggregate multiple no answers in MultiLabel (#324) * Aggregate multiple no answers * Add test for multiple no answers	2020-08-18 18:25:01 +02:00
bogdankostic	72b1013560	Restructure update embeddings (#304) * Restructure update embeddings * Adapt FAISSDocStore * Adapt test and tutorial Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>	2020-08-18 14:04:31 +02:00
bogdankostic	b30963d0cd	Add Tests for MultiLabel (#318) * Add tests for MultiLabel * Add test for no_answer and is_correct_answer=False + fix bug in MultiLabel aggregation * Fix bug in MultiLabel aggregation	2020-08-17 20:14:31 +02:00
Tanay Soni	01ff66dfd6	Remove redundant test fixture	2020-08-17 14:19:38 +02:00
Dany	403318b1f5	Add Tika Converter (#314)	2020-08-17 11:21:09 +02:00
Tanay Soni	1637ce1184	Revert "Add Tika Converter (#314)" This reverts commit 5ef59b1901da6d51bfa085683321a243228d4fc9.	2020-08-17 11:13:52 +02:00
Tanay Soni	5ef59b1901	Add Tika Converter (#314)	2020-08-14 14:13:59 +02:00
Tanay Soni	089fecf99e	Fix indexing of metadata for FAISS/SQL Document Store (#310)	2020-08-13 12:25:32 +02:00
bogdankostic	5186d2d235	Batch prediction in evaluation (#137) * Add Batch evaluation * Separate evaluation methods * Clean calculation of eval metrics * Adapt eval to Label objects * Fix format of no_answer * Adapt to MultiLabel * Add tests	2020-08-10 19:30:31 +02:00
Karim Jana	c7078a36c0	Custom fields for indexing in ElasticsearchDocumentStore (#297)	2020-08-10 11:34:39 +02:00
Tanay Soni	9d0df60aad	Add FAISS Document Store (#253)	2020-08-07 14:25:08 +02:00
Timo Moeller	d9e8b522a1	Add "no answer" aggregation to Transformersreader (#259) * Add no answer aggregation * Change to covariant type annotation * Remove n_best_per_passage from transformersreader	2020-08-06 17:32:55 +02:00
Tanay Soni	5937f9cf16	Deprecate Tags for Document Stores (#286)	2020-08-04 14:24:12 +02:00
Tanay Soni	723921475f	Make document ids of str type (#284)	2020-08-03 16:20:17 +02:00
Tanay Soni	d90435efd6	Add wait for Elasticsearch update call	2020-07-31 12:06:27 +02:00
Malte Pietsch	29a15c0d59	Add eval for Dense Passage Retriever & Refactor handling of labels/feedback (#243)	2020-07-31 11:34:06 +02:00
Tanay Soni	5210c8c2ab	Add method to update meta fields for documents in Elasticsearch (#242)	2020-07-16 15:34:55 +02:00
Malte Pietsch	6bed2f509f	Refactor DPR for latest transformers version & change init arg `gpu` -> `use_gpu` for DPR and EmbeddingRetriever (#239) * fix tokenizer warning in latest transformers * change dpr arg from gpu to use_gpu * change gpu arg for EmbeddingRetriever	2020-07-16 10:45:01 +02:00
Tanay Soni	5c1a5fe61d	Add dummy retriever for benchmarking / reader-only settings (#235)	2020-07-15 17:22:17 +02:00

1524 Commits