3803 Commits

Author SHA1 Message Date
Malte Pietsch
df13a6830d
Update annotation docs for website (#505)
* update annotation docs for website

* add md file for docs

* add user manual
2020-11-03 11:24:06 +01:00
Guillim
7a43d1a72d
Update readme path in Dockerfile (#537)
* Update Dockerfile

Forgot to change the extension, I believe

* Update Dockerfile

* Update Dockerfile-GPU
2020-11-03 10:19:18 +01:00
Malte Pietsch
f0969d8310
Update setup.py 2020-11-02 20:15:10 +01:00
Malte Pietsch
c363fefc6e
New readme (#534)
* WIP readme to md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* delete rst readme
2020-11-02 20:03:22 +01:00
Malte Pietsch
50709a3f9d
Fix retriever mAP benchmarks 2020-11-02 19:55:58 +01:00
Lalit Pagaria
5d45992c84
Removing (deprecation) warnings (#530)
1. A few warnings need to be fixed in FARM.
2. Can't remove the warning from the docx library.
2020-11-02 15:18:43 +01:00
Yaser Martinez Palenzuela
f5419163e7
Add annotation tool manual to readme (#523)
* Update README.md

* Update README.md

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-11-02 10:51:50 +01:00
Branden Chan
eb9e9ceca2
Fix FARMReader.eval() handling of no_answers (#531)
* Fix handling of no_answers

* Remove commented out code

* Remove extra spaces
2020-10-30 19:22:55 +01:00
kolk
72b637ae6d
DensePassageRetriever: Add Training, Refactor Inference to FARM modules (#527)
* dpr training and inference code refactored with FARM modules

* dpr test cases modified

* docstring and default arguments updated

* dpr training docstring updated

* bugfix in dense retriever inference, DPR tutorials modified

* Bump FARM to 0.5.0

* update README for DPR

* mypy errors fix

* DPR instantiation bugfix

* Fix DPR init in RAG Tutorial

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-30 19:22:06 +01:00
Lalit Pagaria
f13443054a
[RAG] Integrate "Retrieval-Augmented Generation" with Haystack (#484)
* Adding dummy generator implementation

* Adding tutorial to try the model

* Committing current non-working code

* Committing current update, where we need to call the generate function directly and convert the embedding to a tensor

* Addressing review comments.

* Refactoring finder, and implementing rag_generator class.

* Refined the implementation of RAGGenerator and now it is in clean shape

* Renaming RAGGenerator to RAGenerator

* Reverting change from finder.py and addressing review comments

* Remove support for RagSequenceForGeneration

* Utilizing embed_passage function from DensePassageRetriever

* Adding sample test data to verify generator output

* Updating testing script

* Updating testing script

* Fixing bug related to top_k

* Updating latest farm dependency

* Comment out farm dependency

* Reverting changes from TransformersReader

* Adding transformers dataset to compare transformers and haystack generator implementation

* Using generator_encoder instead of question_encoder to generate context_input_ids

* Adding workaround to install FARM dependency from master branch

* Removing unnecessary changes

* Fixing generator test

* Removing transformers datasets

* Fixing generator test

* Some cleanup and updating TODO comments

* Adding tutorial notebook

* Updating tutorials with comments

* Explicitly passing token model in RAG test

* Addressing review comments

* Fixing notebook

* Refactoring tests to reduce memory footprint

* Split generator tests into a separate CI step and reclaim memory by terminating containers before running it

* Moving tika dependent test to separate dir

* Remove unwanted code

* Bringing reader under session scope

* FARM is now a session object, hence restoring changes from default value

* Updating assert for pdf converter

* Dummy commit to trigger CI flow

* Reducing memory footprint required for generator tests

* Fixing mypy issues

* Marking test with tika and elasticsearch markers. Reverting changes in CI and pytest splits

* reducing changes

* Fixing CI

* changing elastic search ci

* Fixing test error

* Disabling return of embedding

* Marking generator test as well

* Refactoring tutorials

* Increasing ES memory to 750M

* Trying another fix for ES CI

* Reverting CI changes

* Splitting tests in CI

* Generator and non-generator markers split

* Adding pytest.ini to add markers and enable strict-markers option

* Reducing elastic search container memory

* Simplifying generator test by using documents with embedding directly

* Bump up farm to 0.5.0
2020-10-30 18:06:02 +01:00
Branden Chan
fbf41e53ff
Merge pull request #529 from deepset-ai/fix_website
Change metric to queries per second on benchmarks webpage
2020-10-29 10:40:04 +01:00
Branden Chan
7a9f32f264 Fix template 2020-10-29 10:30:03 +01:00
Branden Chan
3793205aa3 Merge branch 'master' into fix_website 2020-10-29 10:29:25 +01:00
Branden Chan
2ba5417f8e Fix metric for benchmarks website page 2020-10-29 10:26:48 +01:00
bogdankostic
18d315d61a
Make returning predictions in evaluation possible (#524)
* Make returning preds in evaluation possible

* Make returning preds in evaluation possible

* Add automated check if eval dict contains predictions
2020-10-28 09:55:31 +01:00
Branden Chan
4fa5d9c3eb
Merge pull request #522 from deepset-ai/automate_benchmarks
Add --ci and --update-json to CLI for benchmarks
2020-10-27 12:56:47 +01:00
Branden Chan
8c4865ee5f Rename n_docs variable to max_docs 2020-10-27 12:45:15 +01:00
Branden Chan
7c81dfdc3a Address reviewer comments 2020-10-27 12:41:11 +01:00
Branden Chan
d5cb227909 Merge branch 'master' into automate_benchmarks 2020-10-27 11:50:49 +01:00
Lalit Pagaria
9521e180b3
Standardize behavior of DocumentStores to return embeddings (#514)
* Adding support to return embedding along with other result via query_by_embedding function

* Adding test case to check return embedding

* By default for all tests but DPR tests: disable return_embedding flag

* Reducing None test case and fixing query_by_embedding of ElasticsearchDocumentStore, which was updating self.excluded_meta_data directly

* Fixing mypy reported issue
2020-10-27 08:33:39 +01:00
Lalit Pagaria
abda994116
Pytest fix memory leak and put pytest marker on slow tests (#520)
* Clear faiss_index during teardown

* Marking slow tests with pytest markers so that these tests can be optimized in the future. A command line option can also be added to skip them; refer to https://pytest.org/en/stable/example/simple.html#control-skipping-of-tests-according-to-command-line-option

* Fixing test
2020-10-26 19:19:10 +01:00
Tanay Soni
db4151bbc0
Fix scoring in Elasticsearch for dot product (#517) 2020-10-23 17:50:49 +02:00
Timo Moeller
def8fd617a
Make title info optional when evaluating on QA data (#494)
* Add check for title present in QA file and make title extraction optional

* Make missing title None
2020-10-23 11:06:56 +02:00
bogdankostic
f62117c232
Add urllib version requirement to colab notebooks (#509) 2020-10-23 10:43:58 +02:00
Branden Chan
fbacdfd263 Add logging of error, add n_docs assert 2020-10-22 15:45:46 +02:00
Branden Chan
b0483cfd99 add readme 2020-10-22 15:32:56 +02:00
Tanay Soni
3bec264d76
Add filters for document count (#512) 2020-10-22 12:42:13 +02:00
brandenchan
87e5f06fa8 add automatic json update 2020-10-21 17:59:44 +02:00
brandenchan
d3743d00e9 Merge branch 'master' into automate_benchmarks 2020-10-21 17:48:10 +02:00
Lalit Pagaria
63c12371b9
Change arg "model" to "model_name_or_path" in TransformersReader (#510)
* Consistent parameter naming for TransformersReader, along with removing unused imports.

* Addressing review comments
2020-10-21 17:15:35 +02:00
Malte Pietsch
956543e239
Restructure checks in PreProcessor (#504)
* restructure checks

* fix variable name

* Fix test
2020-10-20 06:43:59 +02:00
Malte Pietsch
c13abba6d6
Better defaults for PreProcessor & update docstring 2020-10-19 17:37:58 +02:00
Sanjay Kamath
dc16258dab
Updated the example code in readme for Indexing PDF / Docx files (#502)
* Updated the example code to Indexing PDF / Docx files

The example code was referencing a structure, haystack.indexing, which does not exist anymore. Modified this and replaced the function "extract_pages" with "convert".

* Update converter example in readme

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-19 15:04:33 +02:00
Malte Pietsch
11a3976945 update deletes. fix arg in run.py 2020-10-19 14:40:26 +02:00
Malte Pietsch
3434d5205d
Update doc string for ElasticsearchDocumentStore.write_documents() & sync markdown files (#501)
* update doc string for ElasticsearchDocumentStore.write_documents()

* update all markdowns with latest docstrings
2020-10-19 13:56:38 +02:00
Markus Paff
2531c8e061
Add versioning docs (#495)
* add time and perf benchmark for es

* Add retriever benchmarking

* Add Reader benchmarking

* add nq to squad conversion

* add conversion stats

* clean benchmarks

* Add link to dataset

* Update imports

* add first support for neg psgs

* Refactor test

* set max_seq_len

* cleanup benchmark

* begin retriever speed benchmarking

* Add support for retriever query index benchmarking

* improve reader eval, retriever speed benchmarking

* improve retriever speed benchmarking

* Add retriever accuracy benchmark

* Add neg doc shuffling

* Add top_n

* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging

* Add models to sweep

* add option for faiss index type

* remove unneeded line

* change faiss to faiss_flat

* begin automatic benchmark script

* remove existing postgres docker for benchmarking

* Add data processing scripts

* Remove shuffle in script bc data already shuffled

* switch hnsw setup from 256 to 128

* change es similarity to dot product by default

* Error includes stack trace

* Change ES default timeout

* remove delete_docs() from timing for indexing

* Add support for website export

* update website on push to benchmarks

* add complete benchmarks results

* new json format

* removed NaN as it is not a valid JSON token

* versioning for docs

* unsaved changes

* cleaning

* cleaning

* Edit format of benchmarks data

* update also jsons in v0.4.0

Co-authored-by: brandenchan <brandenchan@icloud.com>
Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-19 11:46:51 +02:00
Malte Pietsch
4a77dc7a02
Allow null filter value in api (#497) 2020-10-16 18:44:15 +02:00
Malte Pietsch
5a885fc2d1
Fix meta data = None in PreProcessor (#496) 2020-10-16 17:17:26 +02:00
Lalit Pagaria
b9da789475
Add Elasticsearch Query DSL compliant Query API (#471) 2020-10-16 13:25:31 +02:00
brandenchan
b9bb8d6cc1 Fix try except 2020-10-16 12:16:32 +02:00
Malte Pietsch
5555274170 Make creation of label index optional in feedback and file_upload api 2020-10-15 19:03:58 +02:00
Malte Pietsch
bdbd1b323b
Add create_index and similarity metric to api config (#493)
* make creation of label index optional

* add params for rest api

* reset tutorial flag
2020-10-15 18:41:36 +02:00
brandenchan
6d60cc9451 add automation pipeline 2020-10-15 18:12:17 +02:00
Malte Pietsch
ceb5c87da0
Make creation of label index optional (#490) 2020-10-15 14:40:59 +02:00
Tanay Soni
974b37eded
Add PreProcessor to simplify splitting and cleaning of docs (#473)
* Add PreProcessing

* Adjust PDF conversion tests

* Add tests for Preprocessing

* Add requirement

* Fix tests

* Ignore decoding errors for TextConverter

* Rename split_size to split_length

* Adjust tests

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-15 10:42:08 +02:00
Lalit Pagaria
2e9f3c1512
Fix update_embeddings function in FAISSDocumentStore and add retriever fixture in tests (#481)
* 1. Prevent the update_embeddings function in FAISSDocumentStore from setting faiss_index to None when the document store does not have any docs.

2. Cleaning up tests by adding a fixture for the retriever.

* TfidfRetriever needs a document store with documents during initialization, as it calls fit() in its constructor; fixing this by checking whether self.paragraphs is None

* Fix naming of retriever's fixture (embedded to embedding and tfid to tfidf)
2020-10-14 16:15:04 +02:00
Tanay Soni
ecaf7b8f0b Add psycopg2 requirement 2020-10-14 12:28:33 +02:00
Tanay Soni
3c6a125380
Add deepcopy for meta dicts in answers (#485) 2020-10-14 12:15:18 +02:00
Lalit Pagaria
12c4dd7b4b
Adjust requirements for Windows (#480) 2020-10-13 17:12:24 +02:00
Antonio Lanza
3caaf99dcb
Add automatic mixed precision (AMP) support for reader training (#463)
* Added automatic mixed precision (AMP) support for reader training

* Added clearer comments on docstring
2020-10-12 21:53:05 +02:00