2539 Commits

Author SHA1 Message Date
bogdankostic
18d315d61a
Make returning predictions in evaluation possible (#524)
* Make returning preds in evaluation possible

* Make returning preds in evaluation possible

* Add automated check if eval dict contains predictions
2020-10-28 09:55:31 +01:00
Branden Chan
4fa5d9c3eb
Merge pull request #522 from deepset-ai/automate_benchmarks
Add --ci and --update-json to CLI for benchmarks
2020-10-27 12:56:47 +01:00
Branden Chan
8c4865ee5f Rename n_docs variable to max_docs 2020-10-27 12:45:15 +01:00
Branden Chan
7c81dfdc3a Address reviewer comments 2020-10-27 12:41:11 +01:00
Branden Chan
d5cb227909 Merge branch 'master' into automate_benchmarks 2020-10-27 11:50:49 +01:00
Lalit Pagaria
9521e180b3
Standardize behavior of DocumentStores to return embeddings (#514)
* Adding support to return embeddings along with other results via the query_by_embedding function

* Adding test case to check return embedding

* By default for all tests but DPR tests: disable return_embedding flag

* Reducing None test case and fixing query_by_embedding of ElasticsearchDocumentStore, which was updating self.excluded_meta_data directly

* Fixing mypy reported issue
2020-10-27 08:33:39 +01:00
Lalit Pagaria
abda994116
Pytest fix memory leak and put pytest marker on slow tests (#520)
* Clear faiss_index during teardown

* Marking slow tests with pytest markers so that they can be optimized in the future. A command line option can also be added to skip them; see https://pytest.org/en/stable/example/simple.html#control-skipping-of-tests-according-to-command-line-option

* Fixing test
2020-10-26 19:19:10 +01:00
Tanay Soni
db4151bbc0
Fix scoring in Elasticsearch for dot product (#517) 2020-10-23 17:50:49 +02:00
Timo Moeller
def8fd617a
Make title info optional when evaluating on QA data (#494)
* Add check for title present in QA file and make title extraction optional

* Make missing title None
2020-10-23 11:06:56 +02:00
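A minimal sketch of the idea behind the change above, assuming SQuAD-style input in which each entry under `data` may or may not carry a `title` key. The function name `extract_titles` and the sample data are hypothetical and are not taken from Haystack.

```python
from typing import Any, Dict, List, Optional


def extract_titles(squad_dict: Dict[str, Any]) -> List[Optional[str]]:
    """Return one title per article, using None where the title is missing."""
    # dict.get returns None instead of raising KeyError when "title" is absent
    return [article.get("title") for article in squad_dict["data"]]


# Hypothetical sample: the second article has no title
sample = {"data": [{"title": "Berlin", "paragraphs": []}, {"paragraphs": []}]}
titles = extract_titles(sample)
```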
bogdankostic
f62117c232
Add urllib version requirement to colab notebooks (#509) 2020-10-23 10:43:58 +02:00
Branden Chan
fbacdfd263 Add logging of error, add n_docs assert 2020-10-22 15:45:46 +02:00
Branden Chan
b0483cfd99 add readme 2020-10-22 15:32:56 +02:00
Tanay Soni
3bec264d76
Add filters for document count (#512) 2020-10-22 12:42:13 +02:00
brandenchan
87e5f06fa8 add automatic json update 2020-10-21 17:59:44 +02:00
brandenchan
d3743d00e9 Merge branch 'master' into automate_benchmarks 2020-10-21 17:48:10 +02:00
Lalit Pagaria
63c12371b9
Change arg "model" to "model_name_or_path" in TransformersReader (#510)
* Consistent parameter naming for TransformersReader and removal of unused imports.

* Addressing review comments
2020-10-21 17:15:35 +02:00
Malte Pietsch
956543e239
Restructure checks in PreProcessor (#504)
* restructure checks

* fix variable name

* Fix test
2020-10-20 06:43:59 +02:00
Malte Pietsch
c13abba6d6
Better defaults for PreProcessor & update docstring 2020-10-19 17:37:58 +02:00
Sanjay Kamath
dc16258dab
Updated the example code in readme for Indexing PDF / Docx files (#502)
* Updated the example code to Indexing PDF / Docx files

The example code was referencing a structure, haystack.indexing, which no longer exists. Replaced it and changed the function "extract_pages" to "convert".

* Update converter example in readme

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-19 15:04:33 +02:00
Malte Pietsch
11a3976945 update deletes. fix arg in run.py 2020-10-19 14:40:26 +02:00
Malte Pietsch
3434d5205d
Update doc string for ElasticsearchDocumentStore.write_documents() & sync markdown files (#501)
* update doc string for ElasticsearchDocumentStore.write_documents()

* update all markdowns with latest docstrings
2020-10-19 13:56:38 +02:00
Markus Paff
2531c8e061
Add versioning docs (#495)
* add time and perf benchmark for es

* Add retriever benchmarking

* Add Reader benchmarking

* add nq to squad conversion

* add conversion stats

* clean benchmarks

* Add link to dataset

* Update imports

* add first support for neg psgs

* Refactor test

* set max_seq_len

* cleanup benchmark

* begin retriever speed benchmarking

* Add support for retriever query index benchmarking

* improve reader eval, retriever speed benchmarking

* improve retriever speed benchmarking

* Add retriever accuracy benchmark

* Add neg doc shuffling

* Add top_n

* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging

* Add models to sweep

* add option for faiss index type

* remove unneeded line

* change faiss to faiss_flat

* begin automatic benchmark script

* remove existing postgres docker for benchmarking

* Add data processing scripts

* Remove shuffle in script bc data already shuffled

* switch hnsw setup from 256 to 128

* change es similarity to dot product by default

* Error includes stack trace

* Change ES default timeout

* remove delete_docs() from timing for indexing

* Add support for website export

* update website on push to benchmarks

* add complete benchmarks results

* new json format

* removed NaN as it is not a valid JSON token

* versioning for docs

* unsaved changes

* cleaning

* cleaning

* Edit format of benchmarks data

* update also jsons in v0.4.0

Co-authored-by: brandenchan <brandenchan@icloud.com>
Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-19 11:46:51 +02:00
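The "removed NaN" item above reflects that `NaN` is not part of the JSON grammar, even though Python's `json.dumps` emits it by default. A hedged sketch of one way to guard against this, using only the standard library; the helper name `nan_to_none` and the sample record are illustrative and not taken from the benchmark scripts.

```python
import json
import math
from typing import Any


def nan_to_none(value: Any) -> Any:
    """Recursively replace float NaN values with None (serialized as null)."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, list):
        return [nan_to_none(item) for item in value]
    return value


# Hypothetical benchmark record containing NaN values
record = {"model": "example", "f1": float("nan"), "scores": [0.5, float("nan")]}

# allow_nan=False makes json.dumps raise ValueError if any NaN slipped through
encoded = json.dumps(nan_to_none(record), allow_nan=False)
```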
Malte Pietsch
4a77dc7a02
Allow null filter value in api (#497) 2020-10-16 18:44:15 +02:00
Malte Pietsch
5a885fc2d1
Fix meta data = None in PreProcessor (#496) 2020-10-16 17:17:26 +02:00
Lalit Pagaria
b9da789475
Add Elasticsearch Query DSL compliant Query API (#471) 2020-10-16 13:25:31 +02:00
brandenchan
b9bb8d6cc1 Fix try except 2020-10-16 12:16:32 +02:00
Malte Pietsch
5555274170 Make creation of label index optional in feedback and file_upload api 2020-10-15 19:03:58 +02:00
Malte Pietsch
bdbd1b323b
Add create_index and similarity metric to api config (#493)
* make creation of label index optional

* add params for rest api

* reset tutorial flag
2020-10-15 18:41:36 +02:00
brandenchan
6d60cc9451 add automation pipeline 2020-10-15 18:12:17 +02:00
Malte Pietsch
ceb5c87da0
Make creation of label index optional (#490) 2020-10-15 14:40:59 +02:00
Tanay Soni
974b37eded
Add PreProcessor to simplify splitting and cleaning of docs (#473)
* Add PreProcessing

* Adjust PDF conversion tests

* Add tests for Preprocessing

* Add requirement

* Fix tests

* Ignore decoding errors for TextConverter

* Rename split_size to split_length

* Adjust tests

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-15 10:42:08 +02:00
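As a rough illustration of what splitting documents by length can look like, here is a sketch that chunks text by word count. Only the parameter name `split_length` comes from the commit message above; the function `split_by_words` and its behavior are assumptions and do not reproduce the PreProcessor implementation.

```python
from typing import List


def split_by_words(text: str, split_length: int) -> List[str]:
    """Split text into chunks of at most split_length whitespace-separated words."""
    if split_length < 1:
        raise ValueError("split_length must be a positive integer")
    words = text.split()
    # Step through the word list in strides of split_length
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), split_length)
    ]


chunks = split_by_words("one two three four five", split_length=2)
```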
Lalit Pagaria
2e9f3c1512
Fix update_embeddings function in FAISSDocumentStore and add retriever fixture in tests (#481)
* 1. Prevent update_embeddings function in FAISSDocumentStore from setting faiss_index to None when the document store does not have any docs.

2. Clean up tests by adding a fixture for retriever.

* TfidfRetriever needs a document store with documents during initialization, as it calls fit() in its constructor; fixed by checking whether self.paragraphs is None

* Fix naming of retriever's fixture (embedded to embedding and tfid to tfidf)
2020-10-14 16:15:04 +02:00
Tanay Soni
ecaf7b8f0b Add psycopg2 requirement 2020-10-14 12:28:33 +02:00
Tanay Soni
3c6a125380
Add deepcopy for meta dicts in answers (#485) 2020-10-14 12:15:18 +02:00
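A short sketch of why a deep copy matters here, assuming several answers are built from the same document's meta dict. The variable names and dict layout are illustrative, not Haystack's actual data structures.

```python
import copy

# Without a copy, both answers reference the same dict object
doc_meta = {"name": "doc1.txt", "tags": ["a"]}
shared = [{"answer": "x", "meta": doc_meta}, {"answer": "y", "meta": doc_meta}]
shared[0]["meta"]["tags"].append("edited")
# The edit made through the first answer is visible through the second one
shared_leaked = "edited" in shared[1]["meta"]["tags"]

# With deepcopy, each answer owns an independent meta dict (nested lists included)
fresh_meta = {"name": "doc1.txt", "tags": ["a"]}
isolated = [
    {"answer": text, "meta": copy.deepcopy(fresh_meta)} for text in ("x", "y")
]
isolated[0]["meta"]["tags"].append("edited")
```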
Lalit Pagaria
12c4dd7b4b
Adjust requirements for Windows (#480) 2020-10-13 17:12:24 +02:00
Antonio Lanza
3caaf99dcb
Add automatic mixed precision (AMP) support for reader training (#463)
* Added automatic mixed precision (AMP) support for reader training

* Added clearer comments on docstring
2020-10-12 21:53:05 +02:00
Zenahr Barzani
955e6f7b3a
Add explicit encoding mode to file_converter/txt.py (#478)
* add explicit encoding mode

* parameterize encoding

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-12 17:32:04 +02:00
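A sketch of reading a text file with a parameterized encoding, along the lines of the change above. The helper `read_text_file` and its defaults are assumptions for illustration and are not the code in file_converter/txt.py; the `errors` parameter shows how decoding errors can optionally be ignored.

```python
import tempfile
from pathlib import Path


def read_text_file(path: Path, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(path, encoding=encoding, errors=errors) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    # Write bytes in Latin-1 so that decoding them as UTF-8 would fail
    sample_path.write_bytes("Grüße".encode("latin-1"))

    # Passing the correct encoding recovers the original text
    decoded = read_text_file(sample_path, encoding="latin-1")
    # errors="ignore" drops undecodable bytes instead of raising UnicodeDecodeError
    lenient = read_text_file(sample_path, encoding="utf-8", errors="ignore")
```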
Branden Chan
1cebcb7dda
Create time and performance benchmarks for all readers and retrievers (#339)
* add time and perf benchmark for es

* Add retriever benchmarking

* Add Reader benchmarking

* add nq to squad conversion

* add conversion stats

* clean benchmarks

* Add link to dataset

* Update imports

* add first support for neg psgs

* Refactor test

* set max_seq_len

* cleanup benchmark

* begin retriever speed benchmarking

* Add support for retriever query index benchmarking

* improve reader eval, retriever speed benchmarking

* improve retriever speed benchmarking

* Add retriever accuracy benchmark

* Add neg doc shuffling

* Add top_n

* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging

* Add models to sweep

* add option for faiss index type

* remove unneeded line

* change faiss to faiss_flat

* begin automatic benchmark script

* remove existing postgres docker for benchmarking

* Add data processing scripts

* Remove shuffle in script bc data already shuffled

* switch hnsw setup from 256 to 128

* change es similarity to dot product by default

* Error includes stack trace

* Change ES default timeout

* remove delete_docs() from timing for indexing

* Add support for website export

* update website on push to benchmarks

* add complete benchmarks results

* new json format

* removed NaN as it is not a valid JSON token

* fix benchmarking for faiss hnsw queries. do sql calls in update_embeddings() as batches

* update benchmarks for hnsw 128,20,80

* don't delete full index in delete_all_documents()

* update texts for charts

* update recall column for retriever

* change scale and add units to desc

* add units to legend

* add axis titles. update desc

* add html tags

Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2020-10-12 13:34:42 +02:00
Malte Pietsch
8edeb844f7
Remove phi normalization from FAISS, support more index types, 3x speedup (#467)
* remove phi normalization

* add special case for hnsw

* rename vector_size to vector_dim

* fix loading. fix extra dim in tests

* switch to new ES syntax for vector similarity

* 3x sql speed up. cascade deletes. add train_index()

* add docstrings. remove vector_dim from load()

* delete docs from faiss and sql

* fix delete of docs in test

* relax type hint for faiss index

* rename metric to metric_type

Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
2020-10-06 16:09:56 +02:00
Markus Paff
56852f820b
README for Docstring Generation and remove unneeded files (#468) 2020-10-06 15:16:56 +02:00
Markus Paff
25f34babce
Separate data and view for benchmarks (#451)
* separate data and view for benchmarks

* fixed typo
2020-10-06 10:30:19 +02:00
Lalit Pagaria
465ccbc12e
Allow multiple write calls to existing FAISS index. (#422)
- Fixing issue where update_embeddings always creates a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory already in use and can cause a memory leak.

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-05 12:01:20 +02:00
Futurne
072e32b38a
Fix filters in query_embedding for ElasticsearchDocumentStore (#464)
Co-authored-by: Pierre Pereira <pierre.pereira@lexistems.com>
2020-10-05 11:25:07 +02:00
Tanay Soni
669c72d538
Enable bulk operations on vector IDs for FAISSDocumentStore (#460) 2020-10-02 14:43:25 +02:00
Malte Pietsch
029d1b75f2
Update docstring in DPR for embed_title (#459) 2020-10-02 13:41:33 +02:00
Lalit Pagaria
9b58374b7c
Skip file conversion if file type is not supported (#456)
* Skip file converter if file type is not supported. Refer to https://github.com/deepset-ai/haystack/issues/453

* Fixing issue reported by mypy

* Addressing review comments
2020-10-01 14:47:45 +02:00
Malte Pietsch
a92ca04648
Update GPU docker & fix race condition with multiple workers (#436)
* fix gpu CMD and set tag to latest

* update dockerfiles. resolve race condition of index creation with multiple workers

* update dockerfiles for preload. remove try catch for elastic index creation

* add back try/catch. disable multiproc in default config to comply with --preload of gunicorn

* change to pip3 for GPU dockerfile

* remove --preload for gpu
2020-09-29 21:12:44 +02:00
Markus Paff
5d1e208186
Create deploy_website.yml (#450)
Creates a dispatch event on push to master so that we can trigger a build in haystack-website. The website should always have the latest docs version
2020-09-29 19:49:04 +02:00
Tanay Soni
52000ff678
Add Docker setup for the annotation tool (#444) 2020-09-29 14:09:45 +02:00
Tanay Soni
93fd4aa72f
Update ONNX conversion for FARMReader (#438) 2020-09-28 16:10:32 +02:00