3174 Commits

Author SHA1 Message Date
Malte Pietsch
ac9f92466f
Allow custom encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions (#813)
* fix encoding of pdftotext. fix version in download instructions

* fix test

* Add latest docstring and tutorial changes

* make latin-1 default encoding again

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-09 13:42:43 +01:00
Tanay Soni
f95b70df38
Fix file upload API (#808) 2021-02-05 12:17:38 +01:00
Tanay Soni
7b18e324f2
Fix building Pipeline with YAML (#800) 2021-02-04 11:53:51 +01:00
Branden Chan
f3a3b73d9b
Choose correct similarity fns during benchmark runs & re-run benchmarks (#773)
* Adapt to new dataset_from_dicts return signature

* rename fn

* Align similarity fn in benchmark doc store

* Better choice of similarity fn

* Increase postgres wait time

* Add more expected returned variables

* update benchmark results

* Fix typo

* update all benchmark runs

* multiply stats by 100

* Specify similarity fns for website

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-03 11:45:18 +01:00
Tanay Soni
8a5dc8f826
Load Pipeline with YAML config file (#785) 2021-02-02 17:32:17 +01:00
Malte Pietsch
1318b55eec
Make tqdm progress bars optional (less verbose prod logs) (#796)
* make dpr queries less verbose

* add progress bar flag to more components

* Add latest docstring and tutorial changes

* add type

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-01 20:51:55 +01:00
Timo Moeller
f3ccd59045
Improve preprocessing and adding of eval data (#780)
* Remove empty document when splitting text

* Move error message of problematic ids to a higher level
2021-02-01 17:08:27 +01:00
Tanay Soni
b87dd244c1
Get metadata values for a key from Elasticsearch (#776) 2021-02-01 16:13:26 +01:00
brandenchan
5665d55ab4 Remove duplicate file 2021-02-01 15:43:53 +01:00
Pavel Soriano
16b8291091
SQuAD to DPR dataset converter (#765)
* Create squad_to_dpr.py

First commit of the squad2dpr script.

* adding review corrections/improvements

* Merge master 5bf351e

* Move script, add docstring

* Add type hints

Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-02-01 15:40:43 +01:00
Tanay Soni
5bf351ea7b
Fix refresh behaviour for Elasticsearch delete (#794) 2021-02-01 14:07:55 +01:00
Tanay Soni
d62355ca88
Fix mypy typing (#792) 2021-02-01 12:15:36 +01:00
Branden Chan
1dc74c7067
Add model versioning support (#784)
* Add model versioning support

* Add latest docstring and tutorial changes

* Support DPR versioning

* Add RAG versioning support

* Add latest docstring and tutorial changes

* Add summarizer support

* Add Embedding Retriever support

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-01 11:42:36 +01:00
Malte Pietsch
2b05e801c3
Fix pdftotext dependency in CI (#788)
* Fix pdftotext dependency in CI

* update xpdf version

* Fix version
2021-01-29 16:07:37 +01:00
Lalit Pagaria
9f7f95221f
Milvus integration (#771)
* Initial commit for Milvus integration

* Add latest docstring and tutorial changes

* Updating implementation of Milvus document store

* Add latest docstring and tutorial changes

* Adding tests and updating doc string

* Add latest docstring and tutorial changes

* Fixing issue caught by tests

* Addressing review comments

* Fixing mypy detected issue

* Fixing issue caught in test about sorting of vector ids

* fixing test

* Fixing generator test failure

* update docstrings

* Addressing review comments about multiple network calls while fetching embeddings from Milvus server

* Add latest docstring and tutorial changes

* Ignoring mypy issue while converting vector_id to int

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-29 13:29:12 +01:00
brandenchan
6efa4f06c1 Add Streamlit UI Image 2021-01-27 17:01:29 +01:00
Timo Moeller
f94bd96ddf
Remove RAG todos after transformers update (#781) 2021-01-27 16:50:02 +01:00
Tanay Soni
d9f011da9a
Add flag for use of window queries in SQLDocumentStore (#768) 2021-01-25 12:54:34 +01:00
Tanay Soni
46307d1571
Remove quotes around placeholders in Elasticsearch custom query (#762) 2021-01-25 12:46:43 +01:00
Tanay Soni
f0aa879a1c
Fix delete_all_documents for the SQLDocumentStore (#761) 2021-01-22 14:39:24 +01:00
Markus Paff
aee90c5df9
Docs v0.7.0 (#757)
* new docs version

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-01-22 10:28:33 +01:00
Malte Pietsch
50815421b0 bump haystack version v0.7.0 2021-01-21 16:02:33 +01:00
Tanay Soni
337376c81d Add batch_size and generators to document stores. (#733)
* Add batch update of embeddings in document stores

* Resolve merge conflict

* Remove document ordering dependency in tests

* Adjust index buffer size for tests

* Adjust ES Scroll Slice

* Use generator for document store pagination

* Add pagination for InMemoryDocumentStore

* Fix missing index parameter in FAISS update_embeddings()

* Fix FAISS update_embeddings()

* Update FAISS tests

* Update eval tests

* Revert code formatting change

* Fix document count in FAISS update embeddings

* Fix vector_ids reset in SQLDocumentStore

* Update docstrings

* Update docstring
2021-01-21 16:00:08 +01:00
Markus Paff
0b583b8972
Generate docstrings and deploy to branches to Staging (Website) (#731)
* test pre commit hook

* test status

* test on this branch

* push generated docstrings and tutorials to branch

* fixed syntax error

* Add latest docstring and tutorial changes

* add files before commit

* catch commit error

* separate generation from deployment

* add deployment process for staging

* add current branch to payload

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-01-21 11:01:09 +01:00
Markus Paff
0f62e0b2ee
Script for releasing docs (#736)
* script for releasing docs

* fix formatting
2021-01-21 10:58:54 +01:00
Timo Moeller
7522d2d1b0
Increase FARM to Version 0.6.2 (#755)
* Increase farm version

* Fix test
2021-01-21 10:15:41 +01:00
Branden Chan
725c03220f
Reduce memory consumption of fetch_archive_from_http (#737) 2021-01-21 09:57:55 +01:00
Timo Moeller
4803da009a
Using PreProcessor functions on eval data (#751)
* Add eval data splitting

* Adjust for split by passage, add test and test data, adjust docstrings, add max_docs to higher level fct
2021-01-20 14:40:10 +01:00
Tanay Soni
aa8a3666c3
Support filters for DensePassageRetriever + InMemoryDocumentStore (#754) 2021-01-20 12:52:52 +01:00
Rob192
35dcf23a4b
Use Path class in add_eval_data of haystack.document_store.base.py (#745)
* use Path class in method add_eval_data of haystack.document_store.base.py

* change type of jsonl_filename as squad_json_to_jsonl and add_eval_data are expecting string type
2021-01-19 12:08:49 +01:00
Andrey A
7a0b65a079
Add links to slack, twitter etc (#746)
* Update README.md
2021-01-19 11:30:26 +01:00
Branden Chan
8d47a71b00
Fix Tutorial 9 (#734)
* Add package download

* Change dev to train file
2021-01-14 10:56:58 +01:00
Julian Risch
3331608e03
Adding a guard that prevents the tutorial code from being executed in every subprocess when using multiprocessing on windows (#729) 2021-01-13 18:17:54 +01:00
Branden Chan
a3a12bc95b Remove broken link 2021-01-13 17:32:10 +01:00
brandenchan
01fd9940d8 Fix tutorial link 2021-01-13 15:29:25 +01:00
Branden Chan
7376185b65
Create DPR training tutorial (#708)
* WIP: Start DPR training tutorial

* Create basics of DPR Train tutorial

* Update documentation

* Allow DPR to be initialized without document store

* WIP: Add param descriptions to DPR notebook

* Clean tutorial

* Improve loading

* Make doc store optional when loading DPR

* Satisfy mypy type check

* Add links

* Add tutorial header

* Add colab badge

* Clear outputs

* Incorporate reviewer feedback

* Add readme links

* Regenerate tutorials

* Add excitement

* Fix typo

* Fix hard negatives comment

* Wrap tutorial for windows users

* Fix mypy issue
2021-01-13 10:33:55 +01:00
bogdankostic
7709b6cee0
Make batchwise adding of evaluation data possible (#717)
* Make batchwise adding of evaluation data possible

* Fix typos in docstrings

* Merge add_eval_data and add_eval_data_batchwise

* Improve import statements

* Move add_eval_data to BaseDocumentStore

* Add batch_size param to write_documents and write_labels in EsDocStore

* Adjust docstring

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-12 17:54:43 +01:00
Antonio Lanza
1f00599d2e
Change signature and docstring for ca_certs parameter for SSL connection (#730) 2021-01-12 17:30:09 +01:00
Malte Pietsch
e9b5439b00
Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config (#728)
* rename label id field for elastic

* add UPDATE_EXISTING_DOCUMENTS param to API config
2021-01-12 13:00:56 +01:00
Malte Pietsch
b6e64ca42d
Add ID to label schema (#727) 2021-01-12 10:02:40 +01:00
Markus Paff
3af3ee1a12
Automate docstring and tutorial generation with every push to master (#718)
* automate docstring and tutorial generation with every push to master

* test CI for current branch

* fixed yaml syntax

* add setuptools to install process

* checkout repo

* fixed command for shell script

* install wheel as it is needed for CI

* install mkdocs

* test without shell script

* use package from github actions

* test other configuration

* back to right config

* cleaning script
2021-01-11 16:25:43 +01:00
Tanay Soni
281f9ff970
Fix SQLite errors in tests (#723) 2021-01-11 13:24:38 +01:00
Malte Pietsch
fcc052b554
Pass custom label index name in api config (#724) 2021-01-11 12:24:09 +01:00
Lalit Pagaria
88b5cbe736
Correcting pypi download badge (#722) 2021-01-10 06:26:17 +01:00
Lalit Pagaria
75d0ebd076
Add Summarizer (standalone + node in custom pipelines + SearchSummarizationPipeline) (#698)
* Integration of SummarizationQAPipeline with Haystack.

* Moving summarizer tests because of OOM issue

* Fixing typo

* Splitting summarizer test in separate ci step

* Removing sysctl configuration as we are already running Elasticsearch in a Docker container

* fixing mypy issue

* update parameter names and docstrings

* update parameter names in BaseSummarizer

* rename pipeline

* change return type of summarizer from answer to document

* change scope of doc store fixture

* revert scope

* temp. disable test_faiss_index_save_and_load()

* fix mypy. change order for mypy in CI

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-08 14:29:46 +01:00
Lalit Pagaria
3a9a756810
Using Columns names instead of ORM to get all documents (#620)
* Using Columns name instead of ORM object for get all documents call

* Separating meta search from documents. This saves memory by not duplicating document.text

* Fixing mypy issue

* SQLite has a limit on the number of host variables, hence using batching to fetch meta information

* Query meta only if meta field is not Null in DocOrm

* Add batch_size to other functions except label

* meta can be none so fix that issue

* Dummy commit to trigger CI

* Using chunked dictionary

* Upgrading faiss

* reverting change related to faiss upgrade

* Changing DB name in test_faiss_retrieving test as it might interfere with existing files by corrupting DB file

* Updating doc string related to batch_size

* Update docstring for batch_size

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-06 15:56:19 +01:00
Branden Chan
bb8aba18e0
Create Preprocessing Tutorial (#706)
* WIP: First version of preprocessing tutorial

* stride renamed overlap, ipynb and py files created

* rename split_stride in test

* Update preprocessor api documentation

* define order for markdown files

* define order of modules in api docs

* Add colab links

* Incorporate review feedback

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2021-01-06 15:54:05 +01:00
Malte Pietsch
5db73d4107
Update stale bot 2021-01-05 08:29:24 +01:00
Malte Pietsch
74b0868d28
Fix GPU docker build (#703) 2020-12-31 15:04:13 +01:00
Malte Pietsch
a284af3ae5
Remove sourcerer.io widget (#702)
Fix #699
2020-12-30 09:57:02 +01:00