1524 Commits

Author SHA1 Message Date
Tanay Soni
db4151bbc0
Fix scoring in Elasticsearch for dot product (#517) 2020-10-23 17:50:49 +02:00
Branden Chan
fbacdfd263 Add logging of error, add n_docs assert 2020-10-22 15:45:46 +02:00
Branden Chan
b0483cfd99 add readme 2020-10-22 15:32:56 +02:00
Tanay Soni
3bec264d76
Add filters for document count (#512) 2020-10-22 12:42:13 +02:00
brandenchan
87e5f06fa8 add automatic json update 2020-10-21 17:59:44 +02:00
brandenchan
d3743d00e9 Merge branch 'master' into automate_benchmarks 2020-10-21 17:48:10 +02:00
Lalit Pagaria
63c12371b9
Change arg "model" to "model_name_or_path" in TransformersReader (#510)
* Consistent parameter naming for TransformersReader along with removing unused imports as well.

* Addressing review comments
2020-10-21 17:15:35 +02:00
Malte Pietsch
956543e239
Restructure checks in PreProcessor (#504)
* restructure checks

* fix variable name

* Fix test
2020-10-20 06:43:59 +02:00
Malte Pietsch
11a3976945 update deletes. fix arg in run.py 2020-10-19 14:40:26 +02:00
Markus Paff
2531c8e061
Add versioning docs (#495)
* add time and perf benchmark for es

* Add retriever benchmarking

* Add Reader benchmarking

* add nq to squad conversion

* add conversion stats

* clean benchmarks

* Add link to dataset

* Update imports

* add first support for neg psgs

* Refactor test

* set max_seq_len

* cleanup benchmark

* begin retriever speed benchmarking

* Add support for retriever query index benchmarking

* improve reader eval, retriever speed benchmarking

* improve retriever speed benchmarking

* Add retriever accuracy benchmark

* Add neg doc shuffling

* Add top_n

* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging

* Add models to sweep

* add option for faiss index type

* remove unneeded line

* change faiss to faiss_flat

* begin automatic benchmark script

* remove existing postgres docker for benchmarking

* Add data processing scripts

* Remove shuffle in script bc data already shuffled

* switch hnsw setup from 256 to 128

* change es similarity to dot product by default

* Error includes stack trace

* Change ES default timeout

* remove delete_docs() from timing for indexing

* Add support for website export

* update website on push to benchmarks

* add complete benchmarks results

* new json format

* removed NaN as is not a valid json token

* versioning for docs

* unsaved changes

* cleaning

* cleaning

* Edit format of benchmarks data

* update also jsons in v0.4.0

Co-authored-by: brandenchan <brandenchan@icloud.com>
Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-19 11:46:51 +02:00
Lalit Pagaria
b9da789475
Add Elasticsearch Query DSL compliant Query API (#471) 2020-10-16 13:25:31 +02:00
brandenchan
b9bb8d6cc1 Fix try except 2020-10-16 12:16:32 +02:00
brandenchan
6d60cc9451 add automation pipeline 2020-10-15 18:12:17 +02:00
Tanay Soni
974b37eded
Add PreProcessor to simplify splitting and cleaning of docs (#473)
* Add PreProcessing

* Adjust PDF conversion tests

* Add tests for Preprocessing

* Add requirement

* Fix tests

* Ignore decoding errors for TextConverter

* Rename split_size to split_length

* Adjust tests

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-15 10:42:08 +02:00
Lalit Pagaria
2e9f3c1512
Fix update_embeddings function in FAISSDocumentStore and add retriever fixture in tests (#481)
* 1. Prevent the update_embeddings function in FAISSDocumentStore from setting faiss_index to None when the document store does not have any docs.

2. Clean up tests by adding a fixture for the retriever.

* TfidfRetriever needs a document store with documents during initialization because it calls fit() in its constructor; fix this by checking self.paragraphs for None

* Fix naming of retriever's fixture (embedded to embedding and tfid to tfidf)
2020-10-14 16:15:04 +02:00
Branden Chan
1cebcb7dda
Create time and performance benchmarks for all readers and retrievers (#339)
* add time and perf benchmark for es

* Add retriever benchmarking

* Add Reader benchmarking

* add nq to squad conversion

* add conversion stats

* clean benchmarks

* Add link to dataset

* Update imports

* add first support for neg psgs

* Refactor test

* set max_seq_len

* cleanup benchmark

* begin retriever speed benchmarking

* Add support for retriever query index benchmarking

* improve reader eval, retriever speed benchmarking

* improve retriever speed benchmarking

* Add retriever accuracy benchmark

* Add neg doc shuffling

* Add top_n

* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging

* Add models to sweep

* add option for faiss index type

* remove unneeded line

* change faiss to faiss_flat

* begin automatic benchmark script

* remove existing postgres docker for benchmarking

* Add data processing scripts

* Remove shuffle in script bc data already shuffled

* switch hnsw setup from 256 to 128

* change es similarity to dot product by default

* Error includes stack trace

* Change ES default timeout

* remove delete_docs() from timing for indexing

* Add support for website export

* update website on push to benchmarks

* add complete benchmarks results

* new json format

* removed NaN as is not a valid json token

* fix benchmarking for faiss hnsw queries. do sql calls in update_embeddings() as batches

* update benchmarks for hnsw 128,20,80

* don't delete full index in delete_all_documents()

* update texts for charts

* update recall column for retriever

* change scale and add units to desc

* add units to legend

* add axis titles. update desc

* add html tags

Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2020-10-12 13:34:42 +02:00
Malte Pietsch
8edeb844f7
Remove phi normalization from FAISS, support more index types, 3x speedup (#467)
* remove phi normalization

* add special case for hnsw

* rename vector_size to vector_dim

* fix loading. fix extra dim in tests

* switch to new ES syntax for vector similarity

* 3x sql speed up. cascade deletes. add train_index()

* add docstrings. remove vector_dim from load()

* delete docs from faiss and sql

* fix delete of docs in test

* relax type hint for faiss index

* rename metric to metric_type

Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
2020-10-06 16:09:56 +02:00
Lalit Pagaria
465ccbc12e
Allow multiple write calls to existing FAISS index. (#422)
- Fix an issue where update_embeddings always created a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory already in use and can cause a memory leak.

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-05 12:01:20 +02:00
Tanay Soni
669c72d538
Enable bulk operations on vector IDs for FAISSDocumentStore (#460) 2020-10-02 14:43:25 +02:00
Lalit Pagaria
9b58374b7c
Skip file conversion if file type is not supported (#456)
* Skip file converter if file type is not supported. Refer to https://github.com/deepset-ai/haystack/issues/453

* Fixing issue reported by mypy

* Addressing review comments
2020-10-01 14:47:45 +02:00
Malte Pietsch
271ff30262
fix type casting of embeddings for tutorial 4 (#402) 2020-09-18 18:10:50 +02:00
Malte Pietsch
db6864d159
Fix type casting for vectors in FAISS (#399)
* Fix type casting for vectors in FAISS

Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>

* add type casts for elastic. refactor embedding retriever tests

* fix case: empty embedding field

* fix faiss tolerance

* add assert in test_faiss_retrieving

Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>
2020-09-18 17:08:13 +02:00
Malte Pietsch
d69133966d Fix faiss test tolerance 2020-09-18 13:57:29 +02:00
Malte Pietsch
4c503158a7
Fix duplicate vector ids in FAISS (#395)
* fix duplicate vector ids in faiss

* Add test

Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>

* revert score change

* switch to faiss_index.ntotal for ids. add tests

Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
2020-09-18 12:52:22 +02:00
Tanay Soni
0859da8f74
Fix document filtering in SQLDocumentStore (#396) 2020-09-18 12:22:52 +02:00
Tanay Soni
3399fc784d
Refactor file converter interface (#393) 2020-09-18 10:42:13 +02:00
Malte Pietsch
9727829cc6
Rename and restructure modules (database, indexing, schemas) (#379)
* rename database to documentstore

* move document, label, multilabel to haystack/schema.py

* rename documentstore -> document_store

* split indexing modules -> file_converter + preprocessor

* fix order of imports

* Update tutorial notebooks

* fix torch version in tutorial 4
2020-09-16 18:33:23 +02:00
Lalit P
de5ad42e46
Adjust tests for MacOS (#374) 2020-09-15 15:04:46 +02:00
brandenchan
cca8676f90 More robust eval 2020-08-26 12:01:59 +02:00
kolk
f2b6cc761b
Refactor DPR from FB to Transformers codebase (#308)
* change_HFBertEncoder to transformers DPREncoder

* Removed BertTensorizer

* model download relative path

* Refactor model load

* Tutorial5 DPR updated

* fix print_eval_results typo

* copy transformers DPR modules in dpr_utils and test

* transformer v3.0.2 import errors fixed

* remove dependency of DPRConfig on attribute use_return_tuple

* Adjust transformers 302 locally to work with dpr

* projection layer removed from DPR encoders

* fixed mypy errors

* transformers DPR compatible code added

* transformers DPR compatibility added

* bug fix in tutorial 6 notebook

* Docstring update and variable naming issues fix

* tutorial modified to reflect DPR variable naming change

* title addition to passage use-cases handled

* modified handling untitled batch

* resolved mypy errors

* typos in docstrings and comments fixed

* cleaned DPR code and added new test cases

* warnings added for non-bert model [SEP] token removal

* changed warning to logger warning

* title mask creation refactored

* bug fix on cuda issues

* tutorial 6 instantiates modified DPR

* tutorial 5 modified

* tutorial 5 ipython notebook modified: DPR instantiation

* batch_size added to DPR instantiation

* tutorial 5 jupyter notebook typos fixed

* improved docstrings, fixed typos

* Update docstring

Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-08-25 20:16:00 +05:30
Tanay Soni
3a42eb663e
Include InMemoryDocumentStore for DPR test 2020-08-24 14:44:12 +02:00
bogdankostic
f388ca025c
Aggregate multiple no answers in MultiLabel (#324)
* Aggregate multiple no answers

* Add test for multiple no answers
2020-08-18 18:25:01 +02:00
bogdankostic
72b1013560
Restructure update embeddings (#304)
* Restructure update embeddings

* Adapt FAISSDocStore

* Adapt test and tutorial

Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
2020-08-18 14:04:31 +02:00
bogdankostic
b30963d0cd
Add Tests for MultiLabel (#318)
* Add tests for MultiLabel

* Add test for no_answer and is_correct_answer=False + fix bug in MultiLabel aggregation

* Fix bug in MultiLabel aggregation
2020-08-17 20:14:31 +02:00
Tanay Soni
01ff66dfd6 Remove redundant test fixture 2020-08-17 14:19:38 +02:00
Dany
403318b1f5 Add Tika Converter (#314) 2020-08-17 11:21:09 +02:00
Tanay Soni
1637ce1184 Revert "Add Tika Converter (#314)"
This reverts commit 5ef59b1901da6d51bfa085683321a243228d4fc9.
2020-08-17 11:13:52 +02:00
Tanay Soni
5ef59b1901
Add Tika Converter (#314) 2020-08-14 14:13:59 +02:00
Tanay Soni
089fecf99e
Fix indexing of metadata for FAISS/SQL Document Store (#310) 2020-08-13 12:25:32 +02:00
bogdankostic
5186d2d235
Batch prediction in evaluation (#137)
* Add Batch evaluation

* Separate evaluation methods

* Clean calculation of eval metrics

* Adapt eval to Label objects

* Fix format of no_answer

* Adapt to MultiLabel

* Add tests
2020-08-10 19:30:31 +02:00
Karim Jana
c7078a36c0
Custom fields for indexing in ElasticsearchDocumentStore (#297) 2020-08-10 11:34:39 +02:00
Tanay Soni
9d0df60aad
Add FAISS Document Store (#253) 2020-08-07 14:25:08 +02:00
Timo Moeller
d9e8b522a1
Add "no answer" aggregation to Transformersreader (#259)
* Add no answer aggregation

* Change to covariant type annotation

* Remove n_best_per_passage from transformersreader
2020-08-06 17:32:55 +02:00
Tanay Soni
5937f9cf16
Deprecate Tags for Document Stores (#286) 2020-08-04 14:24:12 +02:00
Tanay Soni
723921475f
Make document ids of str type (#284) 2020-08-03 16:20:17 +02:00
Tanay Soni
d90435efd6 Add wait for Elasticsearch update call 2020-07-31 12:06:27 +02:00
Malte Pietsch
29a15c0d59
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback (#243) 2020-07-31 11:34:06 +02:00
Tanay Soni
5210c8c2ab
Add method to update meta fields for documents in Elasticsearch (#242) 2020-07-16 15:34:55 +02:00
Malte Pietsch
6bed2f509f
Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever (#239)
* fix tokenizer warning in latest transformers

* change dpr arg from gpu to use_gpu

* change gpu arg for EmbeddingRetriever
2020-07-16 10:45:01 +02:00
Tanay Soni
5c1a5fe61d
Add dummy retriever for benchmarking / reader-only settings (#235) 2020-07-15 17:22:17 +02:00