240 Commits

Author SHA1 Message Date
Branden Chan
77d4c2ca1c
Benchmark milvus (#850)
* Add milvus benchmarking support

* Add latest docstring and tutorial changes

* Edit config

* Disable docker interactive mode

* Add milvus index type support

* Adjust FAISS and Milvus node branching

* Remove duplicate in config

* Revert method for speedup

* Add latest docstring and tutorial changes

* Add latest benchmark run

* Add latest docstring and tutorial changes

* Add json files

* Revert "Add latest docstring and tutorial changes"

This reverts commit e2efa5f08aa4fb55bbeeed42aa76817d63fc8923.

* Add latest docstring and tutorial changes

* Revert "Add latest docstring and tutorial changes"

This reverts commit b085a679b9d5f175e91c2c59565e73c5dec1374b.

* Fix typo

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-13 14:54:15 +02:00
Markus Paff
b87daed62b
fixed link to dpr (#962) 2021-04-13 09:45:04 +02:00
Timo Moeller
837dea4e6d
Integrate sentence transformers into benchmarks (#843)
* Integrate sentence transformers into benchmarks

* Add doc store asserts

* switch data downloads from s3 client to https. add license info

* Fix mypy, revert config

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-09 17:24:16 +02:00
Julian Risch
d38c07e0ee
knowledge graph example (#934)
* Add knowledge graph module

* Fix type hint

* Add graph retriever module

* Change type annotations, change return format

* Add graph retriever that executes questions as sparql queries

* Linking only those entities that are in the knowledge graph

* Added logging and using relations extracted from Knowledge graph for linking

* Preventing entity linking from linking the same token to multiple entities

* Pruning triples that have no variables for select and count queries

* Support knowledge graphs with Pipelines

* Add text2sparql

* Entity linking and relation linking consider more special cases now based on evaluation on labelled data

* Separating example code from KGQA implementation

* Add eval on combined extractive and kg questions

* Remove references to hp-test

* Add fields sparql_query and long_answer_list to metadata

* Removing modular Question2SPARQL approach

* Removing additional classes used for modular kgqa approach

* preparing lcquad data

* change graph db

* Translating namespaces in knowledge graph queries

* Creating graphdb index and loading triples from .ttl file

* Fetching graph config files, triples and model from S3

* Fix incompatibility issues with BaseGraphRetriever and BaseComponent

* Removing unused utility functions

* Adding doc strings and tutorial header

* Adding sparqlwrapper dependency

* Moving tutorial header

* Sorting tutorials by number within name of notebook

* Add latest docstring and tutorial changes

* Creating test cases for knowledge graph

* Changing knowledge graph example to harry potter

* Add latest docstring and tutorial changes

* Adapting the tutorial notebook to harry potter example

* Add GraphDB fixture for tests

* Add latest docstring and tutorial changes

* Added GraphDB docker launch to CI

* Use correct GraphDB fixture

* Check if GraphDB instance is already running

* Renaming question/query and incorporating other feedback from Timo and Tanay

* Removed type annotation

* Add latest docstring and tutorial changes

Co-authored-by: oryx1729 <oryx1729@protonmail.com>
Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-08 14:05:33 +02:00
oryx1729
8c68699e1c
Refactor REST APIs to use Pipelines (#922) 2021-04-07 17:53:32 +02:00
Timo Moeller
5d2b16f3cc
Update farm version (#936)
* Update farm version

* Add new DPR loading, fix dpr param name

* Add QA model confidence as answer probability, fix params in test
2021-04-01 18:23:05 +02:00
Branden Chan
d77152c469
WIP: Add evaluation nodes for Pipelines (#904)
* Add main eval fns

* WIP: make pipeline_eval.py run

* Fix typo

* Add support for no_answers

* Add latest docstring and tutorial changes

* Working pipeline eval

* Add timing of nodes

* Add latest docstring and tutorial changes

* Refactor and clean

* Update tutorial script

* Set default params

* Update tutorials

* Fix indent

* Add latest docstring and tutorial changes

* Address mypy issues

* Add test

* Fix mypy error

* Clear outputs

* Add doc strings

* Incorporate reviewer feedback

* Add latest docstring and tutorial changes

* Revert query counting

* Fix typo

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-01 17:35:18 +02:00
Lalit Pagaria
e904deefa7
Add Markdown file convertor (#875) 2021-03-23 16:31:26 +01:00
oryx1729
e9f0076dbd
Fix execution of Pipelines with parallel nodes (#901) 2021-03-18 12:41:30 +01:00
oryx1729
e0a118fd9a
Add support for parallel paths in Pipeline (#884) 2021-03-10 18:17:23 +01:00
oryx1729
f3fb9aacce
Fix validation for split_respect_sentence_boundary in Preprocessor (#869) 2021-03-04 15:09:08 +01:00
Malte Pietsch
e641bff7a6
Allow more options for elasticsearch client (auth, multiple hosts) (#845)
* allow more options for elasticsearch client (auth, multiple hosts)

* Add latest docstring and tutorial changes

* fix mypy

* Add latest docstring and tutorial changes

* test client connection via ping()

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-19 14:29:59 +01:00
Tanay Soni
07907f9eac
Add support for indexing pipelines (#816) 2021-02-16 16:24:28 +01:00
Malte Pietsch
47aae14efa
relax assert precision of arrays 2021-02-15 14:52:13 +01:00
Lalit Pagaria
5bd94ac5f7
Adding Translator (standalone component & wrapper for pipelines) (#782)
* Adding translator with many generic input parameter support

* Making dict_key as generic

* Fixing mypy issue

* Adding pipeline and using opus models

* Add latest docstring and tutorial changes

* Adding test cases for end-to-end translation for generator, summarizer etc

* raise error join and merge nodes

* Fix test failure

* add docstrings. add usage documentation. rm skip_special_tokens param

* Add latest docstring and tutorial changes

* fix code snippets in md

* Adding few extra configuration parameters and fixing tests

* Fixing mypy issue and updating usage document

* fix for mypy issue in pipeline.py

* reverting renaming of pytest_collection_modifyitems method

* Addressing review comments

* setting skip_special_tokens to True

* removing model_max_length argument as None type is not supported by many models

* Removing padding parameter. Better to leave it as default, otherwise it causes a tensor size mismatch error. If this option is required by users, it can be added later.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-12 15:58:26 +01:00
oryx1729
4059805d89
Fix ElasticsearchDocumentStore.query_by_embedding() (#823) 2021-02-12 14:57:06 +01:00
oryx1729
c4607cbd98
Revamp CI (#825) 2021-02-12 13:38:54 +01:00
Tanay Soni
fd5c5dd23c
Introduce incremental updates for embeddings in document stores (#812) 2021-02-09 21:25:01 +01:00
Malte Pietsch
ac9f92466f
Allow custom encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions (#813)
* fix encoding of pdftotext. fix version in download instructions

* fix test

* Add latest docstring and tutorial changes

* make latin-1 default encoding again

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-09 13:42:43 +01:00
Tanay Soni
f95b70df38
Fix file upload API (#808) 2021-02-05 12:17:38 +01:00
Branden Chan
f3a3b73d9b
Choose correct similarity fns during benchmark runs & re-run benchmarks (#773)
* Adapt to new dataset_from_dicts return signature

* rename fn

* Align similarity fn in benchmark doc store

* Better choice of similarity fn

* Increase postgres wait time

* Add more expected returned variables

* update benchmark results

* Fix typo

* update all benchmark runs

* multiply stats by 100

* Specify similarity fns for website

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-03 11:45:18 +01:00
Tanay Soni
8a5dc8f826
Load Pipeline with YAML config file (#785) 2021-02-02 17:32:17 +01:00
Timo Moeller
f3ccd59045
Improve preprocessing and adding of eval data (#780)
* Remove empty document when splitting text

* Move error message of problematic ids to a higher level
2021-02-01 17:08:27 +01:00
Tanay Soni
b87dd244c1
Get metadata values for a key from Elasticsearch (#776) 2021-02-01 16:13:26 +01:00
brandenchan
5665d55ab4
Remove duplicate file 2021-02-01 15:43:53 +01:00
Pavel Soriano
16b8291091
SQuAD to DPR dataset converter (#765)
* Create squad_to_dpr.py

First commit of the squad2dpr script.

* adding review corrections/improvements

* Merge master 5bf351e

* Move script, add docstring

* Add type hints

Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-02-01 15:40:43 +01:00
Lalit Pagaria
9f7f95221f
Milvus integration (#771)
* Initial commit for Milvus integration

* Add latest docstring and tutorial changes

* Updating implementation of Milvus document store

* Add latest docstring and tutorial changes

* Adding tests and updating doc string

* Add latest docstring and tutorial changes

* Fixing issue caught by tests

* Addressing review comments

* Fixing mypy detected issue

* Fixing issue caught in test about sorting of vector ids

* fixing test

* Fixing generator test failure

* update docstrings

* Addressing review comments about multiple network calls while fetching embeddings from milvus server

* Add latest docstring and tutorial changes

* Ignoring mypy issue while converting vector_id to int

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-29 13:29:12 +01:00
Tanay Soni
d9f011da9a
Add flag for use of window queries in SQLDocumentStore (#768) 2021-01-25 12:54:34 +01:00
Tanay Soni
46307d1571
Remove quotes around placeholders in Elasticsearch custom query (#762) 2021-01-25 12:46:43 +01:00
Tanay Soni
f0aa879a1c
Fix delete_all_documents for the SQLDocumentStore (#761) 2021-01-22 14:39:24 +01:00
Tanay Soni
337376c81d
Add batch_size and generators to document stores. (#733)
* Add batch update of embeddings in document stores

* Resolve merge conflict

* Remove document ordering dependency in tests

* Adjust index buffer size for tests

* Adjust ES Scroll Slice

* Use generator for document store pagination

* Add pagination for InMemoryDocumentStore

* Fix missing index parameter in FAISS update_embeddings()

* Fix FAISS update_embeddings()

* Update FAISS tests

* Update eval tests

* Revert code formatting change

* Fix document count in FAISS update embeddings

* Fix vector_ids reset in SQLDocumentStore

* Update docstrings

* Update docstring
2021-01-21 16:00:08 +01:00
Timo Moeller
7522d2d1b0
Increase FARM to Version 0.6.2 (#755)
* Increase farm version

* Fix test
2021-01-21 10:15:41 +01:00
Timo Moeller
4803da009a
Using PreProcessor functions on eval data (#751)
* Add eval data splitting

* Adjust for split by passage, add test and test data, adjust docstrings, add max_docs to higher level fct
2021-01-20 14:40:10 +01:00
Tanay Soni
aa8a3666c3
Support filters for DensePassageRetriever + InMemoryDocumentStore (#754) 2021-01-20 12:52:52 +01:00
bogdankostic
7709b6cee0
Make batchwise adding of evaluation data possible (#717)
* Make batchwise adding of evaluation data possible

* Fix typos in docstrings

* Merge add_eval_data and add_eval_data_batchwise

* Improve import statements

* Move add_eval_data to BaseDocumentStore

* Add batch_size param to write_documents and write_labels in EsDocStore

* Adjust docstring

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-12 17:54:43 +01:00
Tanay Soni
281f9ff970
Fix SQLite errors in tests (#723) 2021-01-11 13:24:38 +01:00
Lalit Pagaria
75d0ebd076
Add Summarizer (standalone + node in custom pipelines + SearchSummarizationPipeline) (#698)
* Integration of SummarizationQAPipeline with Haystack.

* Moving summarizer tests because of OOM issue

* Fixing typo

* Splitting summarizer test in separate ci step

* Removing sysctl configuration as we are already running Elasticsearch in a docker container

* fixing mypy issue

* update parameter names and docstrings

* update parameter names in BaseSummarizer

* rename pipeline

* change return type of summarizer from answer to document

* change scope of doc store fixture

* revert scope

* temp. disable test_faiss_index_save_and_load()

* fix mypy. change order for mypy in CI

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-08 14:29:46 +01:00
Lalit Pagaria
3a9a756810
Using Columns names instead of ORM to get all documents (#620)
* Using Columns name instead of ORM object for get all documents call

* Separating meta search from documents. This optimizes memory by not duplicating document.text

* Fixing mypy issue

* SQLite has a limit on the number of host variables, hence using batching to fetch meta information

* Query meta only if meta field is not Null in DocOrm

* Add batch_size to other functions except label

* meta can be none so fix that issue

* Dummy commit to trigger CI

* Using chunked dictionary

* Upgrading faiss

* reverting change related to faiss upgrade

* Changing DB name in test_faiss_retrieving test as it might interfere with existing files by corrupting DB file

* Updating doc string related to batch_size

* Update docstring for batch_size

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-06 15:56:19 +01:00
Branden Chan
bb8aba18e0
Create Preprocessing Tutorial (#706)
* WIP: First version of preprocessing tutorial

* stride renamed overlap, ipynb and py files created

* rename split_stride in test

* Update preprocessor api documentation

* define order for markdown files

* define order of modules in api docs

* Add colab links

* Incorporate review feedback

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2021-01-06 15:54:05 +01:00
Tanay Soni
0e4eec9499
Add tests for custom embedding field (#640) 2020-12-17 09:18:57 +01:00
Tanay Soni
4c2804e38e
Add support for aggregating scores in JoinDocuments node (#683) 2020-12-16 15:54:58 +01:00
Tanay Soni
33fe597949
Cleanup Pytest Fixtures (#639) 2020-12-14 18:15:44 +01:00
Malte Pietsch
149d98a0fd
Add latest benchmark run (#652)
* add latest benchmark run

* update templates and fix small json errors

* Change scale

Co-authored-by: brandenchan <brandenchan@icloud.com>
2020-12-10 16:25:51 +01:00
Timo Moeller
efc754b166
Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries (#641)
* Update preprocessor.py

Concatenation of sentences done correctly. Stride functionality enabled for splitting by words while respecting sentence boundaries.

* Simplify code, add test

Co-authored-by: Krak91 <45461739+Krak91@users.noreply.github.com>
2020-12-09 16:12:36 +01:00
Tanay Soni
4152ad8426
Enable dynamic parameter updates for the FARMReader (#650) 2020-12-07 14:07:20 +01:00
Tanay Soni
8e52b48e1d
Add pipelines for GenerativeQA & FAQs (#645) 2020-12-03 10:27:06 +01:00
Malte Pietsch
216787ed34
Fix benchmarks (#648)
* disable fasttokenizer, increase ES timeout for delete requests

* add session.close()

* fix deletion of docs
2020-12-02 16:59:42 +01:00
Tanay Soni
5e62e54875
Rename question parameter to query (#614) 2020-11-30 17:50:04 +01:00
Tanay Soni
ea976ba5b5
Add return_embedding parameter for get_all_documents() (#615) 2020-11-26 10:32:30 +01:00
Tanay Soni
e3a68aedaf
Add support for building custom Search Pipelines (#596) 2020-11-20 17:41:08 +01:00