Branden Chan
1cebcb7dda
Create time and performance benchmarks for all readers and retrievers ( #339 )
...
* add time and perf benchmark for es
* Add retriever benchmarking
* Add Reader benchmarking
* add nq to squad conversion
* add conversion stats
* clean benchmarks
* Add link to dataset
* Update imports
* add first support for negative passages
* Refactor test
* set max_seq_len
* cleanup benchmark
* begin retriever speed benchmarking
* Add support for retriever query index benchmarking
* improve reader eval, retriever speed benchmarking
* improve retriever speed benchmarking
* Add retriever accuracy benchmark
* Add neg doc shuffling
* Add top_n
* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging
* Add models to sweep
* add option for faiss index type
* remove unneeded line
* change faiss to faiss_flat
* begin automatic benchmark script
* remove existing postgres docker for benchmarking
* Add data processing scripts
* Remove shuffle in script because data is already shuffled
* switch hnsw setup from 256 to 128
* change es similarity to dot product by default
* Error includes stack trace
* Change ES default timeout
* remove delete_docs() from timing for indexing
* Add support for website export
* update website on push to benchmarks
* add complete benchmarks results
* new json format
* removed NaN as it is not a valid JSON token
* fix benchmarking for faiss hnsw queries. do sql calls in update_embeddings() as batches
* update benchmarks for hnsw 128,20,80
* don't delete full index in delete_all_documents()
* update texts for charts
* update recall column for retriever
* change scale and add units to desc
* add units to legend
* add axis titles. update desc
* add html tags
Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2020-10-12 13:34:42 +02:00
Malte Pietsch
8edeb844f7
Remove phi normalization from FAISS, support more index types, 3x speedup ( #467 )
...
* remove phi normalization
* add special case for hnsw
* rename vector_size to vector_dim
* fix loading. fix extra dim in tests
* switch to new ES syntax for vector similarity
* 3x sql speed up. cascade deletes. add train_index()
* add docstrings. remove vector_dim from load()
* delete docs from faiss and sql
* fix delete of docs in test
* relax type hint for faiss index
* rename metric to metric_type
Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
2020-10-06 16:09:56 +02:00
Lalit Pagaria
465ccbc12e
Allow multiple write calls to existing FAISS index. ( #422 )
...
- Fix issue where update_embeddings always creates a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory used by the existing one and can cause a memory leak.
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-05 12:01:20 +02:00
Tanay Soni
669c72d538
Enable bulk operations on vector IDs for FAISSDocumentStore ( #460 )
2020-10-02 14:43:25 +02:00
Lalit Pagaria
9b58374b7c
Skip file conversion if file type is not supported ( #456 )
...
* Skip file converter if file type is not supported. Refer to https://github.com/deepset-ai/haystack/issues/453
* Fixing issue reported by mypy
* Addressing review comments
2020-10-01 14:47:45 +02:00
Malte Pietsch
271ff30262
fix type casting of embeddings for tutorial 4 ( #402 )
2020-09-18 18:10:50 +02:00
Malte Pietsch
db6864d159
Fix type casting for vectors in FAISS ( #399 )
...
* Fix type casting for vectors in FAISS
Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>
* add type casts for elastic. refactor embedding retriever tests
* fix case: empty embedding field
* fix faiss tolerance
* add assert in test_faiss_retrieving
Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>
2020-09-18 17:08:13 +02:00
Malte Pietsch
d69133966d
Fix faiss test tolerance
2020-09-18 13:57:29 +02:00
Malte Pietsch
4c503158a7
Fix duplicate vector ids in FAISS ( #395 )
...
* fix duplicate vector ids in faiss
* Add test
Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
* revert score change
* switch to faiss_index.ntotal for ids. add tests
Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
2020-09-18 12:52:22 +02:00
Tanay Soni
0859da8f74
Fix document filtering in SQLDocumentStore ( #396 )
2020-09-18 12:22:52 +02:00
Tanay Soni
3399fc784d
Refactor file converter interface ( #393 )
2020-09-18 10:42:13 +02:00
Malte Pietsch
9727829cc6
Rename and restructure modules (database, indexing, schemas) ( #379 )
...
* rename database to documentstore
* move document, label, multilabel to haystack/schema.py
* rename documentstore -> document_store
* split indexing modules -> file_converter + preprocessor
* fix order of imports
* Update tutorial notebooks
* fix torch version in tutorial 4
2020-09-16 18:33:23 +02:00
Lalit P
de5ad42e46
Adjust tests for MacOS ( #374 )
2020-09-15 15:04:46 +02:00
brandenchan
cca8676f90
More robust eval
2020-08-26 12:01:59 +02:00
kolk
f2b6cc761b
Refactor DPR from FB to Transformers codebase ( #308 )
...
* change HFBertEncoder to transformers DPREncoder
* Removed BertTensorizer
* model download relative path
* Refactor model load
* Tutorial5 DPR updated
* fix print_eval_results typo
* copy transformers DPR modules in dpr_utils and test
* transformer v3.0.2 import errors fixed
* remove dependency of DPRConfig on attribute use_return_tuple
* Adjust transformers 3.0.2 locally to work with dpr
* projection layer removed from DPR encoders
* fixed mypy errors
* transformers DPR compatible code added
* transformers DPR compatibility added
* bug fix in tutorial 6 notebook
* Docstring update and variable naming issues fix
* tutorial modified to reflect DPR variable naming change
* title addition to passage use-cases handled
* modified handling untitled batch
* resolved mypy errors
* typos in docstrings and comments fixed
* cleaned DPR code and added new test cases
* warnings added for non-bert model [SEP] token removal
* changed warning to logger warning
* title mask creation refactored
* bug fix on cuda issues
* tutorial 6 instantiates modified DPR
* tutorial 5 modified
* tutorial 5 ipython notebook modified: DPR instantiation
* batch_size added to DPR instantiation
* tutorial 5 jupyter notebook typos fixed
* improved docstrings, fixed typos
* Update docstring
Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-08-25 20:16:00 +05:30
Tanay Soni
3a42eb663e
Include InMemoryDocumentStore for DPR test
2020-08-24 14:44:12 +02:00
bogdankostic
f388ca025c
Aggregate multiple no answers in MultiLabel ( #324 )
...
* Aggregate multiple no answers
* Add test for multiple no answers
2020-08-18 18:25:01 +02:00
bogdankostic
72b1013560
Restructure update embeddings ( #304 )
...
* Restructure update embeddings
* Adapt FAISSDocStore
* Adapt test and tutorial
Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
2020-08-18 14:04:31 +02:00
bogdankostic
b30963d0cd
Add Tests for MultiLabel ( #318 )
...
* Add tests for MultiLabel
* Add test for no_answer and is_correct_answer=False + fix bug in MultiLabel aggregation
* Fix bug in MultiLabel aggregation
2020-08-17 20:14:31 +02:00
Tanay Soni
01ff66dfd6
Remove redundant test fixture
2020-08-17 14:19:38 +02:00
Dany
403318b1f5
Add Tika Converter ( #314 )
2020-08-17 11:21:09 +02:00
Tanay Soni
1637ce1184
Revert "Add Tika Converter ( #314 )"
...
This reverts commit 5ef59b1901da6d51bfa085683321a243228d4fc9.
2020-08-17 11:13:52 +02:00
Tanay Soni
5ef59b1901
Add Tika Converter ( #314 )
2020-08-14 14:13:59 +02:00
Tanay Soni
089fecf99e
Fix indexing of metadata for FAISS/SQL Document Store ( #310 )
2020-08-13 12:25:32 +02:00
bogdankostic
5186d2d235
Batch prediction in evaluation ( #137 )
...
* Add Batch evaluation
* Separate evaluation methods
* Clean calculation of eval metrics
* Adapt eval to Label objects
* Fix format of no_answer
* Adapt to MultiLabel
* Add tests
2020-08-10 19:30:31 +02:00
Karim Jana
c7078a36c0
Custom fields for indexing in ElasticsearchDocumentStore ( #297 )
2020-08-10 11:34:39 +02:00
Tanay Soni
9d0df60aad
Add FAISS Document Store ( #253 )
2020-08-07 14:25:08 +02:00
Timo Moeller
d9e8b522a1
Add "no answer" aggregation to TransformersReader ( #259 )
...
* Add no answer aggregation
* Change to covariant type annotation
* Remove n_best_per_passage from TransformersReader
2020-08-06 17:32:55 +02:00
Tanay Soni
5937f9cf16
Deprecate Tags for Document Stores ( #286 )
2020-08-04 14:24:12 +02:00
Tanay Soni
723921475f
Make document ids of str type ( #284 )
2020-08-03 16:20:17 +02:00
Tanay Soni
d90435efd6
Add wait for Elasticsearch update call
2020-07-31 12:06:27 +02:00
Malte Pietsch
29a15c0d59
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback ( #243 )
2020-07-31 11:34:06 +02:00
Tanay Soni
5210c8c2ab
Add method to update meta fields for documents in Elasticsearch ( #242 )
2020-07-16 15:34:55 +02:00
Malte Pietsch
6bed2f509f
Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever ( #239 )
...
* fix tokenizer warning in latest transformers
* change dpr arg from gpu to use_gpu
* change gpu arg for EmbeddingRetriever
2020-07-16 10:45:01 +02:00
Tanay Soni
5c1a5fe61d
Add dummy retriever for benchmarking / reader-only settings ( #235 )
2020-07-15 17:22:17 +02:00
Tanay Soni
912e98cd40
Fix id for documents returned by the TfidfRetriever ( #232 )
2020-07-15 14:55:07 +02:00
Malte Pietsch
99a6a34047
Upgrade to new FARM / Transformers / PyTorch versions ( #212 )
2020-07-14 18:53:15 +02:00
Anirban Saha
6b217732f5
Add basic support for Docx Files ( #225 )
2020-07-14 12:28:19 +02:00
Tanay Soni
b886e054a3
Move document_name attribute to meta ( #217 )
2020-07-14 09:53:31 +02:00
Malte Pietsch
d2b26a99ff
Add more tests ( #213 )
2020-07-10 10:54:56 +02:00
Malte Pietsch
07ecfb60b9
Dense Passage Retriever (Inference) ( #167 )
2020-06-30 19:05:45 +02:00
Tanay Soni
ec433a5ed6
Move out REST API from PyPI package ( #160 )
2020-06-22 12:07:12 +02:00
Tanay Soni
a349eef0db
Add API endpoint to upload files ( #154 )
2020-06-17 16:28:26 +02:00
Tanay Soni
180dc8cbd6
Start Elasticsearch with a Github Action ( #142 )
2020-06-09 12:46:15 +02:00
Tanay Soni
160345f3d5
Update build workflow
2020-06-09 11:45:25 +02:00
Tanay Soni
ef9e4f4467
Add PDF text extraction ( #109 )
2020-06-08 11:07:19 +02:00
Stan Kirdey
ca6778d934
Add metadata for TF-IDF Retriever ( #122 )
2020-05-28 10:55:28 +02:00
Stan Kirdey
bf8e506c45
Add embedding query for InMemoryDocumentStore
2020-05-18 14:47:41 +02:00
Stan Kirdey
72a3b70d7a
Add filtering by tags for InMemoryDocumentStore ( #108 )
2020-05-14 22:12:25 +02:00
Tanay Soni
37e0ff70f7
Add test for Elasticsearch document store ( #88 )
2020-05-04 18:00:07 +02:00