1524 Commits

Author SHA1 Message Date
oryx1729
c4607cbd98
Revamp CI (#825) 2021-02-12 13:38:54 +01:00
Tanay Soni
fd5c5dd23c
Introduce incremental updates for embeddings in document stores (#812) 2021-02-09 21:25:01 +01:00
Malte Pietsch
ac9f92466f
Allow custom encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions (#813)
* fix encoding of pdftotext. fix version in download instructions

* fix test

* Add latest docstring and tutorial changes

* make latin-1 default encoding again

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-09 13:42:43 +01:00
Tanay Soni
f95b70df38
Fix file upload API (#808) 2021-02-05 12:17:38 +01:00
Branden Chan
f3a3b73d9b
Choose correct similarity fns during benchmark runs & re-run benchmarks (#773)
* Adapt to new dataset_from_dicts return signature

* rename fn

* Align similarity fn in benchmark doc store

* Better choice of similarity fn

* Increase postgres wait time

* Add more expected returned variables

* update benchmark results

* Fix typo

* update all benchmark runs

* multiply stats by 100

* Specify similarity fns for website

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-03 11:45:18 +01:00
Tanay Soni
8a5dc8f826
Load Pipeline with YAML config file (#785) 2021-02-02 17:32:17 +01:00
Timo Moeller
f3ccd59045
Improve preprocessing and adding of eval data (#780)
* Remove empty document when splitting text

* Move error message of problematic ids to a highler level
2021-02-01 17:08:27 +01:00
Tanay Soni
b87dd244c1
Get metadata values for a key from Elasticsearch (#776) 2021-02-01 16:13:26 +01:00
brandenchan
5665d55ab4 Remove duplicate file 2021-02-01 15:43:53 +01:00
Pavel Soriano
16b8291091
SQuAD to DPR dataset converter (#765)
* Create squad_to_dpr.py

First commit of the squad2dpr script.

* adding review corrections/improvements

* Merge master 5bf351e

* Move script, add docstring

* Add type hints

Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-02-01 15:40:43 +01:00
Lalit Pagaria
9f7f95221f
Milvus integration (#771)
* Initial commit for Milvus integration

* Add latest docstring and tutorial changes

* Updating implementation of Milvus document store

* Add latest docstring and tutorial changes

* Adding tests and updating doc string

* Add latest docstring and tutorial changes

* Fixing issue caught by tests

* Addressing review comments

* Fixing mypy detected issue

* Fixing issue caught in test about sorting of vector ids

* fixing test

* Fixing generator test failure

* update docstrings

* Addressing review comments about multiple network call while fetching embedding from milvus server

* Add latest docstring and tutorial changes

* Ignoring mypy issue while converting vector_id to int

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-29 13:29:12 +01:00
Tanay Soni
d9f011da9a
Add flag for use of window queries in SQLDocumentStore (#768) 2021-01-25 12:54:34 +01:00
Tanay Soni
46307d1571
Remove quotes around placeholders in Elasticsearch custom query (#762) 2021-01-25 12:46:43 +01:00
Tanay Soni
f0aa879a1c
Fix delete_all_documents for the SQLDocumentStore (#761) 2021-01-22 14:39:24 +01:00
Tanay Soni
337376c81d Add batch_size and generators to document stores. (#733)
* Add batch update of embeddings in document stores

* Resolve merge conflict

* Remove document ordering dependency in tests

* Adjust index buffer size for tests

* Adjust ES Scroll Slice

* Use generator for document store pagination

* Add pagination for InMemoryDocumentStore

* Fix missing index parameter in FAISS update_embeddings()

* Fix FAISS update_embeddings()

* Update FAISS tests

* Update eval tests

* Revert code formatting change

* Fix document count in FAISS update embeddings

* Fix vector_ids reset in SQLDocumentStore

* Update doctrings

* Update docstring
2021-01-21 16:00:08 +01:00
Timo Moeller
7522d2d1b0
Increase FARM to Version 0.6.2 (#755)
* Increase farm version

* Fix test
2021-01-21 10:15:41 +01:00
Timo Moeller
4803da009a
Using PreProcessor functions on eval data (#751)
* Add eval data splitting

* Adjust for split by passage, add test and test data, adjust docstrings, add max_docs to highler level fct
2021-01-20 14:40:10 +01:00
Tanay Soni
aa8a3666c3
Support filters for DensePassageRetriever + InMemoryDocumentStore (#754) 2021-01-20 12:52:52 +01:00
bogdankostic
7709b6cee0
Make batchwise adding of evaluation data possible (#717)
* Make batchwise adding of evaluation data possible

* Fix typos in docstrings

* Merge add_eval_data and add_eval_data_batchwise

* Improve import statements

* Move add_eval_data to BaseDocumentStore

* Add batch_size param to write_documents and write_labels in EsDocStore

* Adjust docstring

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-12 17:54:43 +01:00
Tanay Soni
281f9ff970
Fix SQLite errors in tests (#723) 2021-01-11 13:24:38 +01:00
Lalit Pagaria
75d0ebd076
Add Summarizer (standalone + node in custom pipelines + SearchSummarizationPipeline) (#698)
* Integration of SummarizationQAPipeline with Haystack.

* Moving summarizer tests because of OOM issue

* Fixing typo

* Splitting summarizer test in separate ci step

* Removing sysctl configuration as we already running elastic search in docker container

* fixing mypy issue

* update parameter names and docstrings

* update parameter names in BaseSummarizer

* rename pipeline

* change return type of summarizer from answer to document

* change scope of doc store fixture

* revert scope

* temp. disable test_faiss_index_save_and_load()

* fix mypy. change order for mypy in CI

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-08 14:29:46 +01:00
Lalit Pagaria
3a9a756810
Using Columns names instead of ORM to get all documents (#620)
* Using Columns name instead of ORM object for get all documents call

* Separating meta search from documents. This way it will optimize the memory not duplicating document.text

* Fixing mypy issue

* SQLite have limit on number of host variable hence using batching to fetch meta information

* Query meta only if meta field is not Null in DocOrm

* Add batch_size to other functions except label

* meta can be none so fix that issue

* Dummy commit to trigger CI

* Using chunked dictionary

* Upgrading faiss

* reverting change related to  faiss upgrade

* Changing DB name in test_faiss_retrieving test as it might interfere with exiting files by corrupting DB file

* Updating doc string related to batch_size

* Update docstring for batch_size

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-06 15:56:19 +01:00
Branden Chan
bb8aba18e0
Create Preprocessing Tutorial (#706)
* WIP: First version of preprocessing tutorial

* stride renamed overlap, ipynb and py files created

* rename split_stride in test

* Update preprocessor api documentation

* define order for markdown files

* define order of modules in api docs

* Add colab links

* Incorporate review feedback

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2021-01-06 15:54:05 +01:00
Tanay Soni
0e4eec9499
Add tests for custom embedding field (#640) 2020-12-17 09:18:57 +01:00
Tanay Soni
4c2804e38e
Add support for aggregating scores in JoinDocuments node (#683) 2020-12-16 15:54:58 +01:00
Tanay Soni
33fe597949
Cleanup Pytest Fixtures (#639) 2020-12-14 18:15:44 +01:00
Malte Pietsch
149d98a0fd
Add latest benchmark run (#652)
* add latest benchmark run

* update templates and fix small json errors

* Change scale

Co-authored-by: brandenchan <brandenchan@icloud.com>
2020-12-10 16:25:51 +01:00
Timo Moeller
efc754b166
Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries (#641)
* Update preprocessor.py

Concatenation of sentences done correctly. Stride functionality enabled for splitting by words while respecting sentence boundaries.

* Simplify code, add test

Co-authored-by: Krak91 <45461739+Krak91@users.noreply.github.com>
2020-12-09 16:12:36 +01:00
Tanay Soni
4152ad8426
Enable dynamic parameter updates for the FARMReader (#650) 2020-12-07 14:07:20 +01:00
Tanay Soni
8e52b48e1d
Add pipelines for GenerativeQA & FAQs (#645) 2020-12-03 10:27:06 +01:00
Malte Pietsch
216787ed34
Fix benchmarks (#648)
* disable fasttokenizer, increase ES timeout for delete requests

* add session.close()

* fix deletion of docs
2020-12-02 16:59:42 +01:00
Tanay Soni
5e62e54875
Rename question parameter to query (#614) 2020-11-30 17:50:04 +01:00
Tanay Soni
ea976ba5b5
Add return_embedding parameter for get_all_documents() (#615) 2020-11-26 10:32:30 +01:00
Tanay Soni
e3a68aedaf
Add support for building custom Search Pipelines (#596) 2020-11-20 17:41:08 +01:00
Malte Pietsch
0acafc403a
Automate benchmarks via CML (#518)
* initial test cml

* Update cml.yaml

* WIP test workflow

* switch to general ubuntu ami

* switch to general ubuntu ami

* disable gpu for tests

* rm gpu infos

* rm gpu infos

* update token env

* switch github token

* add postgres

* test db connection

* fix typo

* remove tty

* add sleep for db

* debug runner

* debug removal postgres

* debug: reset to working commit

* debug: change github token

* switch to new bot token

* debug token

* add back postgres

* adjust network runner docker

* add elastic

* fix typo

* adjust working dir

* fix benchmark execution

* enable s3 downloads

* add query benchmark. fix path

* add saving of markdown files

* cat md files. add faiss+dpr. increase n_queries

* switch to GPU instance

* switch availability zone

* switch to public aws DL ami

* increase volume size

* rm faiss. fix error logging

* save markdown files

* add reader benchmarks

* add download of squad data

* correct reader metric normalization

* fix newlines between reports

* fix max_docs for reader eval data. remove max_docs from ci run config

* fix mypy. switch workflow trigger

* try trigger for label

* try trigger for label

* change trigger syntax

* debug machine shutdown with test workflow

* add es and postgres to test workflow

* Revert "add es and postgres to test workflow"

This reverts commit 6f038d3d7f12eea924b54529e61b192858eaa9d5.

* Revert "debug machine shutdown with test workflow"

This reverts commit db70eabae8850b88e1d61fd79b04d4f49d54990a.

* fix typo in action. set benchmark config back to original
2020-11-18 18:28:17 +01:00
Lalit Pagaria
3f81c93f36
Add document update for SQL and FAISS Document Store (#584) 2020-11-16 16:08:13 +01:00
Tanay Soni
3e095ddd7d
Add filters for delete_all_documents() (#591) 2020-11-16 14:15:32 +01:00
Timo Moeller
f118e4b738
Add needed whitespace before sentence start (#582) 2020-11-13 14:14:24 +01:00
Branden Chan
44230fca45
Fix CI bug due to new Elasticsearch release and new model release (#579)
* Cast generator to list

* Restrict ES version range

* Loosen ES requirement

* Change no_answer_test value
2020-11-13 10:35:53 +01:00
Tanay Soni
acd088808b
Allow list of filter values in REST API (#568) 2020-11-09 20:41:53 +01:00
Branden Chan
99e924aede
Update Documentation for Haystack 0.5.0 (#557)
* Add languages and preprocessing pages

* add content

* address review comments

* make link relative

* update api ref with latest docstrings

* move doc readme and update

* add generator API docs

* fix example code

* design and link fix

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2020-11-06 10:53:22 +01:00
bogdankostic
ffaa0249f7
Fix retriever evaluation metrics (#547)
* Add mean reciprocal rank and fix mean average precision

* Add mrr metric to docstring

* Fix mypy error
2020-11-05 13:34:47 +01:00
bogdankostic
53be92c155
Add save and load method for DPR (#550)
* Add save and load method for DPR

* lower memory footprint for test. change names to load() and save()

* add test cases

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-11-05 13:29:23 +01:00
kolk
72b637ae6d
DensePassageRetriever: Add Training, Refactor Inference to FARM modules (#527)
* dpr training and inference code refactored with FARM modules

* dpr test cases modified

* docstring and default arguments updated

* dpr training docstring updated

* bugfix in dense retriever inference, DPR tutorials modified

* Bump FARM to 0.5.0

* update README for DPR

* dpr training and inference code refactored with FARM modules

* dpr test cases modified

* docstring and default arguments updated

* dpr training docstring updated

* bugfix in dense retriever inference, DPR tutorials modified

* Bump FARM to 0.5.0

* update README for DPR

* mypy errors fix

* DPR instantiation bugfix

* Fix DPR init in RAG Tutorial

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-30 19:22:06 +01:00
Lalit Pagaria
f13443054a
[RAG] Integrate "Retrieval-Augmented Generation" with Haystack (#484)
* Adding dummy generator implementation

* Adding tutorial to try the model

* Committing current non working code

* Committing current update where we need to call generate function directly and need to convert embedding to tensor way

* Addressing review comments.

* Refactoring finder, and implementing rag_generator class.

* Refined the implementation of RAGGenerator and now it is in clean shape

* Renaming RAGGenerator to RAGenerator

* Reverting change from finder.py and addressing review comments

* Remove support for RagSequenceForGeneration

* Utilizing embed_passage function from DensePassageRetriever

* Adding sample test data to verify generator output

* Updating testing script

* Updating testing script

* Fixing bug related to top_k

* Updating latest farm dependency

* Comment out farm dependency

* Reverting changes from TransformersReader

* Adding transformers dataset to compare transformers and haystack generator implementation

* Using generator_encoder instead of question_encoder to generate context_input_ids

* Adding workaround to install FARM dependency from master branch

* Removing unnecessary changes

* Fixing generator test

* Removing transformers datasets

* Fixing generator test

* Some cleanup and updating TODO comments

* Adding tutorial notebook

* Updating tutorials with comments

* Explicitly passing token model in RAG test

* Addressing review comments

* Fixing notebook

* Refactoring tests to reduce memory footprint

* Split generator tests in separate ci step and before running it reclaim memory by terminating containers

* Moving tika dependent test to separate dir

* Remove unwanted code

* Brining reader under session scope

* Farm is now session object hence restoring changes from default value

* Updating assert for pdf converter

* Dummy commit to trigger CI flow

* REducing memory footprint required for generator tests

* Fixing mypy issues

* Marking test with tika and elasticsearch markers. Reverting changes in CI and pytest splits

* reducing changes

* Fixing CI

* changing elastic search ci

* Fixing test error

* Disabling return of embedding

* Marking generator test as well

* Refactoring tutorials

* Increasing ES memory to 750M

* Trying another fix for ES CI

* Reverting CI changes

* Splitting tests in CI

* Generator and non-generator markers split

* Adding pytest.ini to add markers and enable strict-markers option

* Reducing elastic search container memory

* Simplifying generator test by using documents with embedding directly

* Bump up farm to 0.5.0
2020-10-30 18:06:02 +01:00
Branden Chan
7a9f32f264 Fix template 2020-10-29 10:30:03 +01:00
Branden Chan
7c81dfdc3a Address reviewer comments 2020-10-27 12:41:11 +01:00
Branden Chan
d5cb227909 Merge branch 'master' into automate_benchmarks 2020-10-27 11:50:49 +01:00
Lalit Pagaria
9521e180b3
Standardize behavior of DocumentStores to return embeddings (#514)
* Adding support to return embedding along with other result via query_by_embedding function

* Adding test case to check return embedding

* By default for all tests but DPR tests: disable return_embedding flag

* Reducing None test case and fixing query_by_embedding of ElasticsearchDocumentStore when it updating self.excluded_meta_data directly

* Fixing mypy reported issue
2020-10-27 08:33:39 +01:00
Lalit Pagaria
abda994116
Pytest fix memory leak and put pytest marker on slow tests (#520)
* Clear faiss_index during teardown

* Marking slow test with pytest markers. So In future these test can be optimized. Also command line option can be added to skip them refer https://pytest.org/en/stable/example/simple.html#control-skipping-of-tests-according-to-command-line-option

* Fixing test
2020-10-26 19:19:10 +01:00