3803 Commits

Author SHA1 Message Date
Adrien Wald
c401e86099
Use ElasticsearchDocumentStore.get_all_documents in ElasticsearchFilterOnlyRetriever.retrieve (#2151)
* use get_all_documents in ElasticsearchFilterOnlyRetriever.retrieve

* Update Documentation & Code Style

* add test case for es_filter_only retriever

* Update Documentation & Code Style

* fix test by adding empty string for query

* Update Documentation & Code Style

* add explicit name of argument "query"

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2022-04-25 09:53:48 +02:00
tstadel
25475a68c7
Match answer sorting in QuestionAnsweringHead with FARMReader (#2414)
* match no_answer confidence

* Update Documentation & Code Style

* test added

* Update Documentation & Code Style

* fix tests

* Update Documentation & Code Style

* apply penalties of scores to confidences too

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-21 11:24:39 +02:00
Malte Pietsch
4bf470286b
Upgrade xpdf to 4.0.4 (#2443)
* Update minimal gpu docker image to xpdf 4.0.4

* Update Dockerfile-GPU

* Update Dockerfile

* Update Dockerfile-GPU

* Update Dockerfile-GPU-minimal
2022-04-21 10:27:56 +02:00
Malte Pietsch
133a76229b
Add info about execution env to minimal GPU image (#2441)
2022-04-21 08:30:42 +02:00
Sara Zan
07d7ecbff1
Make python-magic fully optional (#2412)
* Add windows specific package for python-magic

* Disable some tests on Windows and add explanatory warning in case of issues with libmagic

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-20 09:18:02 +02:00
tstadel
e862400256
Prevent Stackoverflow on Windows CI (#2426)
* prevent stackoverflow on windows ci

* Update Documentation & Code Style

* fix is_windows condition

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-04-19 16:10:39 +02:00
Sara Zan
4eec2dc45e
Change YAML version exception into a warning (#2385)
* Change exception into warning, add strict_version param, and remove compatibility between schemas

* Simplify update_json_schema

* Rename unstable into master

* Prevent validate_config from changing the config to validate

* Fix version validation and add tests

* Rename master into ignore

* Complete parameter rename

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-19 16:08:08 +02:00
Sara Zan
8abf11fbd3
Update pdftotext also on pinecone and milvus1 CI jobs (#2433)
* Upgrade pdftotext also on pinecone and milvus1 jobs

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-19 16:06:27 +02:00
Sara Zan
ba9c976bfe
Update pdftotext link (#2432)
* Update pdftotext link

* Update Documentation & Code Style

* Update Tutorial8_Preprocessing.ipynb

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-19 14:30:18 +02:00
Sara Zan
929c685cda
Forbid usage of *args and **kwargs in any node's __init__ (#2362)
* Add failing test

* Remove `**kwargs` from docstores' `__init__` functions (#2407)

* Remove kwargs from ESDocStore subclasses

* Remove kwargs from subclasses of SQLDocumentStore

* Remove kwargs from Weaviate

* Revert change in pinecone

* Fix tests

* Fix retriever test with weaviate

* Change Exception into DocumentStoreError

* Update Documentation & Code Style

* Remove `**kwargs` from `FARMReader` (#2413)

* Remove FARMReader kwargs without trying to replace them functionally

* Update Documentation & Code Style

* enforce same index values before and after saving/loading eval dataframes (#2398)

* Add tests for missing `__init__` and `super().__init__()` in custom nodes (#2350)

* Add tests for missing init and super

* Update Documentation & Code Style

* change in with endswith

* Move test in pipeline.py and change test in pipeline_yaml.py

* Update Documentation & Code Style

* Use caplog to test the warning

* Update Documentation & Code Style

* move tests into test_pipeline and use get_config

* Update Documentation & Code Style

* Unmock version name

* Improve variadic args test

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-14 16:42:02 +02:00
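The rule introduced by this commit (no `*args` or `**kwargs` in a node's `__init__`) can be illustrated with a minimal sketch. It assumes a check built on the standard-library `inspect` module; the helper name `has_variadic_init` and both example classes are hypothetical and are not taken from the Haystack codebase.

```python
import inspect


def has_variadic_init(cls: type) -> bool:
    """Return True if cls.__init__ declares *args or **kwargs (hypothetical helper)."""
    variadic_kinds = (
        inspect.Parameter.VAR_POSITIONAL,  # *args
        inspect.Parameter.VAR_KEYWORD,     # **kwargs
    )
    params = inspect.signature(cls.__init__).parameters.values()
    return any(p.kind in variadic_kinds for p in params)


class ExplicitNode:  # hypothetical node with explicit, introspectable parameters
    def __init__(self, top_k: int = 10):
        self.top_k = top_k


class VariadicNode:  # hypothetical node that such a check would reject
    def __init__(self, **kwargs):
        self.options = kwargs
```

With these definitions, `has_variadic_init(VariadicNode)` is true and `has_variadic_init(ExplicitNode)` is false. A class that defines no `__init__` of its own inherits `object.__init__`, which `inspect` reports as variadic, so a real check would need to treat that case separately.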
tstadel
46a50fb979
Fix python-magic making Windows CI stuck (#2425)
* revert python-magic PR #2330

* Revert "revert python-magic PR #2330"

This reverts commit 23fa2cc836e36daecd9e77d340dde6e32e25c82b.

* remove python-magic dep

* use python-magic-bin only

* add comment about python-magic-bin
2022-04-14 16:08:55 +02:00
Sebastian
3d42b70fbb
Added macos version of xpdf in tutorial 8 (#2424)
* Added macos version of xpdf in tutorial 8

* mini-error
2022-04-14 15:31:40 +02:00
Sara Zan
60428020ff
Exclude beir from windows install (#2419)
2022-04-13 19:06:04 +02:00
Sara Zan
1a81080e8a
Add apt update in Linux CI (#2415)
* Update linux_ci.yml
2022-04-13 15:35:56 +02:00
Sara Zan
d98883b79d
Add tests for missing __init__ and super().__init__() in custom nodes (#2350)
* Add tests for missing init and super

* Update Documentation & Code Style

* change in with endswith

* Move test in pipeline.py and change test in pipeline_yaml.py

* Update Documentation & Code Style

* Use caplog to test the warning

* Update Documentation & Code Style

* move tests into test_pipeline and use get_config

* Update Documentation & Code Style

* Unmock version name

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-13 14:29:05 +02:00
tstadel
73f9ab0f57
enforce same index values before and after saving/loading eval dataframes (#2398)
2022-04-13 13:35:36 +02:00
Sara Zan
96a538b182
Pylint (import related warnings) and REST API improvements (#2326)
* remove duplicate imports

* fix ungrouped-imports

* Fix wrong-import-position

* Fix unused-import

* pyproject.toml

* Working on wrong-import-order

* Solve wrong-import-order

* fix Pool import

* Move open_search_index_to_document_store and elasticsearch_index_to_document_store in elasticsearch.py

* remove Converter from modeling

* Fix mypy issues on adaptive_model.py

* create es_converter.py

* remove converter import

* change import path in tests

* Restructure REST API to not rely on global vars from search.py and improve tests

* Fix openapi generator

* Move variable initialization

* Change type of FilterRequest.filters

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-12 16:41:05 +02:00
Branden Chan
75dcfd3fab
Delete files in docs/_src (#2322)
* Delete files in _src

* Filter unused images and re-add images that were in use in docs/img

* Remove all usages of user-images.githubusercontent.com

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-04-12 16:19:03 +02:00
Sara Zan
4862bbcd73
Add devices alongside use_gpu in FARMReader (#2294)
* Make initialize_device_settings take a devices list, and change signature of FARMReader

* reintroduce use_gpu and propagate devices to other methods

* fix typing for initialize_device_settings

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-12 14:21:25 +02:00
Michele Pangrazzi
dd4361c129
Print warning in EmbeddingRetriever if sentence-transformers model used with different model format (#2377)
* ensure correct embedding_encoder is loaded when embedding_model is a sentence-transformers model but model_format is missing or wrong

* minor refactoring

* do not update model_format and ensure a warning is logged when it could be wrong

* Apply black

* Apply black

Co-authored-by: Michele Pangrazzi <michele@wonderflow.ai>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-04-12 11:52:27 +02:00
tstadel
8342a6c1d6
Fix eval discrepancies (#2381)
* fix eval discrepancies

* Update Documentation & Code Style

* fix reader eval comparison

* Update Documentation & Code Style

* slightly improve messed up top_n_f1 func

* add no_answer hint to reader.eval metrics

* fix tut5

* Update Documentation & Code Style

* correct doc_relevance_col in tests

* Update Documentation & Code Style

* redefine recall metrics for no_answers

* fix bugs in EvalAnswers

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-12 09:24:22 +02:00
MichelBartels
a6927be132
Pass use_auth_token to sentence transformers EmbeddingRetriever (#2284)
* enable auth token for sentence transformers

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-11 19:07:32 +02:00
mathislucka
5ac5b4e241
Fix: Auth token not passed for EmbeddingRetriever (#2404)
* passing an auth token allows access to private models

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-11 17:28:14 +02:00
tstadel
ab8ba75664
Set ci job timeout to 45 minutes (#2401)
2022-04-11 16:28:26 +02:00
Branden Chan
4ef099d211
Reduce num REST API workers to accommodate smaller machines (#2400)
* Reduce num REST API workers from 8 to 2

* Incorporate reviewer feedback
2022-04-11 13:26:27 +02:00
Giannis Kitsos Kalyvianakis
b94d9effaf
extract extension based on file's content (#2330)
* extract extension based on file's content

* Add python-magic dependency

* fix the _estimate_extension function and lowercase the file extensions

* check if the FileTypeClassifier can be imported

* add test and new file types

* fix typing

* import Optional

* revert Optional and make sure a string is always returned

* fix test so that it skips markdown files

* Emulate Code & Docs action

* Generate schemas

* Tidy up test code & extensionless files

* Improve error messages

* Revert schema changes

* Emulate black and docs CI again
2022-04-11 09:16:30 +02:00
Sara Zan
ae712fe6bf
Upgrade weaviate-client to 3.3.3 and fix get_all_documents (#1895)
* Fix 'bug' on Weaviate only returning max. 100 docs on get_all_documents

* Add type

* Update Weaviate version on the CI

* Fix bug on get_document_count where there are no documents

* Add more info in the docstrings of get_all_documents and get_all_documents_generator

* Add latest docstring and tutorial changes

* Apply Black

* Update Documentation & Code Style

* Trigger pipeline

* Update Documentation & Code Style

* Include StefanBogdan feedback

* Fix mypy issues and LogicalFilterClause

* Add more types

* Update Documentation & Code Style

* update setup.cfg

* Upgrade weaviate containers too

* Allow to filter for content field in Weaviate

* Use convert_to_weaviate instead of convert_to_pinecone

* Fix _get_all_documents_in_index

* Update docstrings and docs

* Catching an exception in get_document(s)_by_id

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-04-01 15:37:34 +03:00
Timo Moeller
3459020600
Add confidence filtering to FARMReader (#2376)
2022-03-31 15:18:05 +02:00
tstadel
3561037e82
Use cache for hf requests during CI (#2379)
* increase all_close tolerance for milvus2, improve assertion infos

* use request-cache for huggingface
2022-03-31 12:36:45 +02:00
Sara Zan
57bb8c4131
Update launch script for Milvus from 1.x to 2.x (#2378)
2022-03-31 12:03:18 +02:00
MichelBartels
fc1cb63bcc
Fix RouteDocuments documentation (#2380)
* fix RouteDocuments documentation

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-31 11:45:02 +02:00
tstadel
5b52690c5c
Increase all_close tolerance for milvus2, improve assertion infos (#2375)
2022-03-31 11:41:13 +02:00
Florian Hardow
a273c3a51d
EvaluationSetClient for deepset cloud to fetch evaluation sets and la… (#2345)
* EvaluationSetClient for deepset cloud to fetch evaluation sets and labels for one specific evaluation set

* make DeepsetCloudDocumentStore able to fetch uploaded evaluation set names

* fix missing renaming of get_evaluation_set_names in DeepsetCloudDocumentStore

* update documentation for evaluation set functionality in deepset cloud document store

* DeepsetCloudDocumentStore tests for evaluation set functionality

* rename index to evaluation_set_name for DeepsetCloudDocumentStore evaluation set functionality

* raise DeepsetCloudError when no labels were found for evaluation set

* make use of .get_with_auto_paging in EvaluationSetClient

* Return result of get_with_auto_paging() as it parses the response already

* Make schema import source more specific

* fetch all evaluation sets for a workspace in deepset Cloud

* Rename evaluation_set_name to label_index

* make use of generator functionality for fetching labels

* Update Documentation & Code Style

* Adjust function input for DeepsetCloudDocumentStore.get_all_labels, adjust tests for it, fix typos, make linter happy

* Match error message with pytest.raises

* Update Documentation & Code Style

* DeepsetCloudDocumentStore.get_labels_count raises DeepsetCloudError when no evaluation set was found to count labels on

* remove unneeded import in tests

* DeepsetCloudDocumentStore tests, make response bodies a string through json.dumps

* DeepsetcloudDocumentStore.get_label_count - move raise to return

* stringify uuid before json.dump as uuid is not serializable

* DeepsetcloudDocumentStore - adjust response mocking in tests

* DeepsetcloudDocumentStore - json dump response body in test

* DeepsetCloudDocumentStore introduce label_index, EvaluationSetClient rename label_index to evaluation_set

* Update Documentation & Code Style

* DeepsetCloudDocumentStore rename evaluation_set to evaluation_set_response as there is a name clash with the input variable

* DeepsetCloudDocumentStore - rename missed variable in test

* DeepsetCloudDocumentStore - rename missed label_index to index in doc string, rename label_index to evaluation_set in EvaluationSetClient

* Update Documentation & Code Style

* DeepsetCloudDocumentStore - update docstrings for EvaluationSetClient

* DeepsetCloudDocumentStore - fix typo in doc string

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-31 08:59:58 +02:00
bogdankostic
ca988917c9
Fix TableReader for tables without rows (#2369)
* Skip tables without rows

* Update Documentation & Code Style

* Add tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-30 17:02:39 +02:00
MichelBartels
eb514a6167
Add evaluation and document conversion to tutorial 15 (#2325)
* update tutorial 15 with newer features

* Update Documentation & Code Style

* fix tutorial 15

* update telemetry with tutorial changes

* Update Documentation & Code Style

* remove error output

* add output

* update non-notebook tutorial 15

* Update Documentation & Code Style

* delete distracting output from tutorial 15 notebook

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-29 17:09:05 +02:00
bogdankostic
834f8c4902
Change return types of indexing pipeline nodes (#2342)
* Change return types of file converters

* Change return types of preprocessor

* Change return types of crawler

* Adapt utils functions to new return types

* Adapt __init__.py to new method names

* Prevent circular imports

* Update Documentation & Code Style

* Let DocStores' run method accept Documents

* Adapt tests to new return types

* Update Documentation & Code Style

* Put "# type: ignore" to right place

* Remove id_hash_keys property from Document primitive

* Update Documentation & Code Style

* Adapt tests to new return types and missing id_hash_keys property

* Fix mypy

* Fix mypy

* Adapt PDFToTextOCRConverter

* Remove id_hash_keys from RestAPI tests

* Update Documentation & Code Style

* Rename tests

* Remove redundant setting of content_type="text"

* Add DeprecationWarning

* Add id_hash_keys to elasticsearch_index_to_document_store

* Change document type from dict to Document in PreProcessor test

* Fix file path in Tutorial 5

* Remove added output in Tutorial 5

* Update Documentation & Code Style

* Fix file_paths in Tutorial 9 + fix gz files in fetch_archive_from_http

* Adapt tutorials to new return types

* Adapt tutorial 14 to new return types

* Update Documentation & Code Style

* Change assertions to HaystackErrors

* Import HaystackError correctly

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-29 13:53:35 +02:00
tstadel
a73717b2ea
Support conjunctive queries in sparse retrieval (#2361)
* support conjunctive queries in sparse retrieval

* fix typo

* test added

* Update Documentation & Code Style

* fix test_DeepsetCloudDocumentStore_query

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-28 22:10:50 +02:00
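A conjunctive query requires every query term to match, not just any one of them. The following is a minimal sketch of how such a request body could be expressed with the `operator` option of the Elasticsearch `match` query. The function name, the `all_terms` flag, and the `content` field are assumptions made for illustration and do not reproduce the change in this commit.

```python
from typing import Any, Dict


def build_match_query(query: str, all_terms: bool = False, field: str = "content") -> Dict[str, Any]:
    """Build a match query body; with all_terms=True every term must occur (hypothetical helper)."""
    operator = "and" if all_terms else "or"  # "or" is the Elasticsearch default
    return {"query": {"match": {field: {"query": query, "operator": operator}}}}
```

Calling `build_match_query("climate policy", all_terms=True)` yields a body whose `operator` is `"and"`, so only documents containing both terms would match.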
mkkuemmel
04b56f0b1c
Replace dpr with embeddingretriever tut14 (#2336)
* add updated graph images for tutorial14

* ipynb: replaced DPR with EmbeddingRetriever, added TODO for further inspection of failing code

* Revert "ipynb: replaced DPR with EmbeddingRetriever, added TODO for further inspection of failing code"

This reverts commit f4b6f3e1dbbedfd1bbe5e0e33645899dbea5d924.

* ipynb: replaced DPR with EmbeddingRetriever, added TODO for further inspection of failing code

* ipynb: quick fix to avoid failure in print_answers

* py: quick fix to avoid failure in print_answers

* Update Documentation & Code Style

* ipynb: remove DPR, remove images

* Revert "ipynb: remove DPR, remove images"

This reverts commit dfa1e7585da6743fcf97488405c356bf935a976d.

* ipynb: remove DPR, remove images

* py: replace DPR with EmbeddingRetriever

* Update Documentation & Code Style

* correcting a typo

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: TuanaCelik <tuana.celik@deepset.ai>
2022-03-28 16:54:49 +02:00
tstadel
b20a1f874b
Fix sparse retrieval with filters returns results without any text-match (#2359)
* use "must" instead of "should" for query-matching

* Update Documentation & Code Style

* fix mypy issue

* fix finding of new pylint version

* add test

* fix test_retrieval

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-25 17:53:42 +01:00
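The switch from "should" to "must" matters because, in an Elasticsearch `bool` query that also contains a `filter` clause, `should` clauses are optional by default: a document that passes the filter is returned even if it matches none of the query text. The sketch below contrasts the two shapes. The function name, the `require_text_match` flag, and the `content` field are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict


def build_filtered_query(
    query: str, filters: Dict[str, Any], require_text_match: bool = True
) -> Dict[str, Any]:
    """Combine a text match with term filters in one bool query (hypothetical helper)."""
    text_clause = {"match": {"content": {"query": query}}}
    # "must": the text clause is mandatory.
    # "should": the text clause only affects scoring when a filter clause is present.
    occurrence = "must" if require_text_match else "should"
    return {
        "query": {
            "bool": {
                occurrence: [text_clause],
                "filter": [{"term": {key: value}} for key, value in filters.items()],
            }
        }
    }
```

With `require_text_match=True` the text clause sits under `must`, so filtered documents without a text match are excluded from the results.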
Julian Risch
a398094243
update version to next release candidate (#2355)
* update version to next release candidate

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-25 12:06:35 +01:00
Raphaël Merx
4ebb71d42d
Fix link to squad_to_dpr.py in DPR train tutorial (#2334)
* Fix link to squad_to_dpr.py in DPR train tutorial

* update tutorial 9
2022-03-25 12:05:12 +01:00
Julian Risch
70bbb649a7
change docu text about how to opt-out (#2358)
* change docu text about how to opt-out

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-25 11:59:39 +01:00
Julian Risch
bf71f03ff2
release v1.3.0 and re-add Makefile (#2354)
* release v1.3.0 and re-add Makefile

* Update Documentation & Code Style

* make BaseKnowledgeGraph abstract to remove it from the JSON schema

* Logging paths for JSON schema generation

* Add debug command in autoformat.yml

* Typo

* Update Documentation & Code Style

* Fix schema path in CI

* Update Documentation & Code Style

* Remove debug statement from autoformat.yml

* Reintroduce compatibility between 1.3.0 and 1.2.1rc0 schema

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
v1.3.0
2022-03-23 17:22:06 +01:00
Julian Risch
cec0137693
Change document attribute from text to content (#2352)
* Change document attribute from text to content

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-23 16:55:01 +03:00
Chris Byrd
3b2001e66f
Set provider parameter when instantiating onnxruntime.InferenceSession (#1976)
* Set provider parameter when instantiating onnxruntime.InferenceSession
fixes #1973

* Change device type to torch.device

* set type annotation of device to torch.device everywhere

* Apply Black

* Change types of device and devices params across the codebase

* Update Documentation & Code Style

* Add type: ignore in the right location

* Update Documentation & Code Style

* Add type: ignore

* feedback

* Update Documentation & Code Style

* feedback 2

* Fix convert_to_transformers

* Fix syntax error

* Update Documentation & Code Style

* Consider augment and load_glove user-facing as well

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-03-23 12:08:56 +01:00
tstadel
851fe1cf07
Fix normalize_embedding using numba (#2347)
* fix normalize_embedding using numba

* Update Documentation & Code Style

* fix too-many-public-methods pylint msg

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-22 23:04:55 +01:00
bogdankostic
7e6ff8a205
Run Pinecone tests only if files related to Pinecone changed (#2343)
* Run Pinecone tests only if files related to Pinecone changed

* Change in pinecone.py that will be reverted

* Revert change in pinecone.py

* Test Pinecone also when filter_utils.py changes
2022-03-22 15:58:12 +01:00
tstadel
d438011432
fix launch scripts (#2341)
2022-03-22 10:48:29 +01:00
Branden Chan
6233dfce2f
Let SquadData support data from Annotation Tool (#2329)
* Support data from Annotation Tool

* Update Documentation & Code Style

* Incorporate reviewer feedback

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-03-22 10:17:25 +01:00
Julian Risch
7ffeccece6
Fix tutorial dataset paths (#2340)
* fix tutorial 4 dataset path

* fix tutorial 8 dataset path

* fix tutorial 10 event

* Update Documentation & Code Style

* fix send event for tutorial 15

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-22 09:19:50 +01:00