19 Commits

Author SHA1 Message Date
Tanay Soni
337376c81d Add batch_size and generators to document stores. (#733)
* Add batch update of embeddings in document stores

* Resolve merge conflict

* Remove document ordering dependency in tests

* Adjust index buffer size for tests

* Adjust ES Scroll Slice

* Use generator for document store pagination

* Add pagination for InMemoryDocumentStore

* Fix missing index parameter in FAISS update_embeddings()

* Fix FAISS update_embeddings()

* Update FAISS tests

* Update eval tests

* Revert code formatting change

* Fix document count in FAISS update embeddings

* Fix vector_ids reset in SQLDocumentStore

* Update doctrings

* Update docstring
2021-01-21 16:00:08 +01:00
Timo Moeller
7522d2d1b0
Increase FARM to Version 0.6.2 (#755)
* Increase farm version

* Fix test
2021-01-21 10:15:41 +01:00
Tanay Soni
aa8a3666c3
Support filters for DensePassageRetriever + InMemoryDocumentStore (#754) 2021-01-20 12:52:52 +01:00
Tanay Soni
ea976ba5b5
Add return_embedding parameter for get_all_documents() (#615) 2020-11-26 10:32:30 +01:00
bogdankostic
53be92c155
Add save and load method for DPR (#550)
* Add save and load method for DPR

* lower memory footprint for test. change names to load() and save()

* add test cases

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-11-05 13:29:23 +01:00
kolk
72b637ae6d
DensePassageRetriever: Add Training, Refactor Inference to FARM modules (#527)
* dpr training and inference code refactored with FARM modules

* dpr test cases modified

* docstring and default arguments updated

* dpr training docstring updated

* bugfix in dense retriever inference, DPR tutorials modified

* Bump FARM to 0.5.0

* update README for DPR

* dpr training and inference code refactored with FARM modules

* dpr test cases modified

* docstring and default arguments updated

* dpr training docstring updated

* bugfix in dense retriever inference, DPR tutorials modified

* Bump FARM to 0.5.0

* update README for DPR

* mypy errors fix

* DPR instantiation bugfix

* Fix DPR init in RAG Tutorial

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-30 19:22:06 +01:00
Lalit Pagaria
f13443054a
[RAG] Integrate "Retrieval-Augmented Generation" with Haystack (#484)
* Adding dummy generator implementation

* Adding tutorial to try the model

* Committing current non working code

* Committing current update where we need to call generate function directly and need to convert embedding to tensor way

* Addressing review comments.

* Refactoring finder, and implementing rag_generator class.

* Refined the implementation of RAGGenerator and now it is in clean shape

* Renaming RAGGenerator to RAGenerator

* Reverting change from finder.py and addressing review comments

* Remove support for RagSequenceForGeneration

* Utilizing embed_passage function from DensePassageRetriever

* Adding sample test data to verify generator output

* Updating testing script

* Updating testing script

* Fixing bug related to top_k

* Updating latest farm dependency

* Comment out farm dependency

* Reverting changes from TransformersReader

* Adding transformers dataset to compare transformers and haystack generator implementation

* Using generator_encoder instead of question_encoder to generate context_input_ids

* Adding workaround to install FARM dependency from master branch

* Removing unnecessary changes

* Fixing generator test

* Removing transformers datasets

* Fixing generator test

* Some cleanup and updating TODO comments

* Adding tutorial notebook

* Updating tutorials with comments

* Explicitly passing token model in RAG test

* Addressing review comments

* Fixing notebook

* Refactoring tests to reduce memory footprint

* Split generator tests in separate ci step and before running it reclaim memory by terminating containers

* Moving tika dependent test to separate dir

* Remove unwanted code

* Brining reader under session scope

* Farm is now session object hence restoring changes from default value

* Updating assert for pdf converter

* Dummy commit to trigger CI flow

* REducing memory footprint required for generator tests

* Fixing mypy issues

* Marking test with tika and elasticsearch markers. Reverting changes in CI and pytest splits

* reducing changes

* Fixing CI

* changing elastic search ci

* Fixing test error

* Disabling return of embedding

* Marking generator test as well

* Refactoring tutorials

* Increasing ES memory to 750M

* Trying another fix for ES CI

* Reverting CI changes

* Splitting tests in CI

* Generator and non-generator markers split

* Adding pytest.ini to add markers and enable strict-markers option

* Reducing elastic search container memory

* Simplifying generator test by using documents with embedding directly

* Bump up farm to 0.5.0
2020-10-30 18:06:02 +01:00
Lalit Pagaria
9521e180b3
Standardize behavior of DocumentStores to return embeddings (#514)
* Adding support to return embedding along with other result via query_by_embedding function

* Adding test case to check return embedding

* By default for all tests but DPR tests: disable return_embedding flag

* Reducing None test case and fixing query_by_embedding of ElasticsearchDocumentStore when it updating self.excluded_meta_data directly

* Fixing mypy reported issue
2020-10-27 08:33:39 +01:00
Lalit Pagaria
abda994116
Pytest fix memory leak and put pytest marker on slow tests (#520)
* Clear faiss_index during teardown

* Marking slow test with pytest markers. So In future these test can be optimized. Also command line option can be added to skip them refer https://pytest.org/en/stable/example/simple.html#control-skipping-of-tests-according-to-command-line-option

* Fixing test
2020-10-26 19:19:10 +01:00
Lalit Pagaria
2e9f3c1512
Fix update_embeddings function in FAISSDocumentStore and add retriever fixture in tests (#481)
* 1. Prevent update_embeddings function in FAISSDocumentStore to set faiss_index as None when document store does not have any docs.

2. cleaning up tests by adding fixture for retriever.

* TfidfRetriever need document store with documents during initialization as it call fit() function in constructor so fixing it by checking self.paragraphs of None

* Fix naming of retriever's fixture (embedded to embedding and tfid to tfidf)
2020-10-14 16:15:04 +02:00
Malte Pietsch
9727829cc6
Rename and restructure modules (database, indexing, schemas) (#379)
* rename database to documentstore

* move document, label, multilabel to haystack/schema.py

* rename documentstore -> document_store

* split indexing modules -> file_converter + preprocessor

* fix order of imports

* Update tutorial notebooks

* fix torch version in tutorial 4
2020-09-16 18:33:23 +02:00
kolk
f2b6cc761b
Refactor DPR from FB to Transformers codebase (#308)
* change_HFBertEncoder to transformers DPREncoder

* Removed BertTensorizer

* model download relative path

* Refactor model load

* Tutorial5 DPR updated

* fix print_eval_results typo

* copy transformers DPR modules in dpr_utils and test

* transformer v3.0.2 import errors fixed

* remove dependency of DPRConfig on attribute use_return_tuple

* Adjust transformers 302 locally to work with dpr

* projection layer removed from DPR encoders

* fixed mypy errors

* transformers DPR compatible code added

* transformers DPR compatibility added

* bug fix in tutorial 6 notebook

* Docstring update and variable naming issues fix

* tutorial modified to reflect DPR variable naming change

* title addition to passage use-cases handled

* modified handling untitled batch

* resolved mypy errors

* typos in docstrings and comments fixed

* cleaned DPR code and added new test cases

* warnings added for non-bert model [SEP] token removal

* changed warning to logger warning

* title mask creation refactored

* bug fix on cuda issues

* tutorial 6 instantiates modified DPR

* tutorial 5 modified

* tutorial 5 ipython notebook modified: DPR instantiation

* batch_size added to DPR instantiation

* tutorial 5 jupyter notebook typos fixed

* improved docstrings, fixed typos

* Update docstring

Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-08-25 20:16:00 +05:30
Tanay Soni
3a42eb663e
Include InMemoryDocumetStore for DPR test 2020-08-24 14:44:12 +02:00
bogdankostic
72b1013560
Restructure update embeddings (#304)
* Restructure update embeddings

* Adapt FAISSDocStore

* Adapt test and tutorial

Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
2020-08-18 14:04:31 +02:00
Tanay Soni
9d0df60aad
Add FAISS Document Store (#253) 2020-08-07 14:25:08 +02:00
Malte Pietsch
29a15c0d59
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback (#243) 2020-07-31 11:34:06 +02:00
Malte Pietsch
6bed2f509f
Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever (#239)
* fix tokenizer warning in latest transformers

* change dpr arg from gpu to use_gpu

* change gpu arg for EmbeddingRetriever
2020-07-16 10:45:01 +02:00
Tanay Soni
b886e054a3
Move document_name attribute to meta (#217) 2020-07-14 09:53:31 +02:00
Malte Pietsch
07ecfb60b9
Dense Passage Retriever (Inference) (#167) 2020-06-30 19:05:45 +02:00