108 Commits

Author SHA1 Message Date
Stefano Fiorucci
b579b9d54a
bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id (#3166)
* use batch_size

* try to fix git mess

* improve docstrings

* fix
2022-09-26 13:21:59 +02:00
tstadel
05a86b9d3d
feat: FAISS in OpenSearch: Support HNSW for cosine (#3217)
* support cosine similarity with faiss

* update docs

* update api docs

* fix tests

* Revert "update api docs"

This reverts commit 6138fdfefb3beaee2d55c5729cd4a2745ea6b143.

* fix api docs

* collapse test

* rename similarity to space_type mappings

* only normalize for faiss

* fix merge

* fix docs normalization

* get rid of List[np.array]

* update docs

* fix tests and tutorials

* fix mypy

* fix mypy

* fix mypy again

* again mypy

* blacken

* update tutorial 4 docs

* fix embeddingretriever

* fix faiss

* move dense specific logic to DenseRetriever

* fix mypy

* cosine tests for all documents stores

* fix pinecone

* add docstring

* docstring corrections

* update docs

* add integration test marker

* docstrings update

* update docs

* fix typo

* update docs

* fix MockDenseRetriever

* run integration tests for all documentstores

* fix test_update_embeddings_cosine_similarity

* fix faiss tests not running

* blacken

* make test_cosine_sanity_check integration test

* split PR

* update docs

* manually revert tutorial doc change

* Fix embedding type

* set integration marker correctly

* make BaseDocumentStore.normalize_embedding static

* format

* fix handling of opensearch_faiss param

* fix merge

* add DenseRetriever typing

* organize imports in conftest.py

* organize imports in conftest.py (2)

* fix DenseRetriever import

* add opensearch-tests-linux
2022-09-23 13:26:49 +02:00
tstadel
b10e2c392e
chore: add DenseRetriever abstraction (#3252)
* support cosine similarity with faiss

* update docs

* update api docs

* fix tests

* Revert "update api docs"

This reverts commit 6138fdfefb3beaee2d55c5729cd4a2745ea6b143.

* fix api docs

* collapse test

* rename similarity to space_type mappings

* only normalize for faiss

* fix merge

* fix docs normalization

* get rid of List[np.array]

* update docs

* fix tests and tutorials

* fix mypy

* fix mypy

* fix mypy again

* again mypy

* blacken

* update tutorial 4 docs

* fix embeddingretriever

* fix faiss

* move dense specific logic to DenseRetriever

* fix mypy

* cosine tests for all documents stores

* fix pinecone

* add docstring

* docstring corrections

* update docs

* add integration test marker

* docstrings update

* update docs

* fix typo

* update docs

* fix MockDenseRetriever

* run integration tests for all documentstores

* fix test_update_embeddings_cosine_similarity

* fix faiss tests not running

* blacken

* make test_cosine_sanity_check integration test

* update docs

* fix imports

* import DenseRetriever normally

* update docs

* fix deepcopy of documents

* update schema

* Revert "update schema"

This reverts commit 83cf8f323648468e1c322d54852bec084d637e3f.

* fix schema for ci manually
2022-09-21 19:08:54 +02:00
Daniel Bichuetti
e1f399284f
refactor: update dependencies and remove pins (#3147)
* refactor: remove azure-core, pydoc and hf-hub pins

* fix: remove extra-comma

* fix: force minimum version of azure forms recognizer

* refactor: allow newer ocr libs

* refactor: update more dependencies and container versions

* refactor: remove extra comment

* docs: pre-commit manual run

* refactor: remove unnecessary dependency

* tests: update weaviate container image version
2022-09-05 14:30:35 +02:00
Vladimir Blagojevic
356537c883
Standardize devices parameter and device initialization (#3062)
* Use devices parameter and initialize devices consistently
2022-08-31 15:30:31 +02:00
tstadel
92046ce5b5
feat: FAISS in OpenSearch: Support HNSW for dot product and l2 (#3029)
* support faiss hnsw

* blacken

* update docs

* improve similarity check

* add tests

* update schema

* set ef_search param correctly

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* regenerate docs

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-24 16:43:48 +02:00
James Briggs
9b1b03002f
update to PineconeDocumentStore to remove dependency on SQL db (#2749)
* update to PineconeDocumentStore to remove dependency on SQL db

* Update Documentation & Code Style

* typing fixes

* Update Documentation & Code Style

* fixed embedding generator to yield Documents

* Update Documentation & Code Style

* fixes for final typing issues

* fixes for pylint

* Update Documentation & Code Style

* uncomment pinecone tests

* added new params to docstrings

* Update Documentation & Code Style

* Update Documentation & Code Style

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update Documentation & Code Style

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update haystack/document_stores/pinecone.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* changes based on comments, updated errors and install

* Update Documentation & Code Style

* mypy

* implement simple filtering in pinecone mock

* typo

* typo in reverse

* account for missing meta key in filtering

* typo

* added metadata filtering to describe index

* added handling for users switching indexes in same doc store, and handling duplicate docs in write

* syntax tweaks

* added index option to document/embedding count calls

* labels implementation in progress

* added metadata fields to be indexed for pinecone tests

* further changes to mock

* WIP implementation of labels+multilabels

* switched to rely on labels namespace rather than filter

* simpler delete_labels

* label fixes, remove debug code

* Apply docstring fixes

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* mypy

* pylint

* docs

* temporarily un-mock Pinecone

* Small Pinecone test suite

* pylint

* Add fake test key to pass the None check

* Add again fake test key to pass the None check

* Add Pinecone to default docstores and fix filters

* Fix field name

* Change field name

* Change field value

* Remove comments

* forgot to upgrade pyproject.toml

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-24 13:27:15 +02:00
Dmitry Goryunov
da7836a931
feat: Support embedding dimensions on DeepsetCloudDocumentStore (#2995)
* Add embedding_dim to dc store

* Remove similarity from query params, it is not used

* Remove unused `return_embedding` parameter

* Remove unused param

* Update the documentation

* Update schemas

* Revert openapi changes

* Revert openapi changes

* Fix openapi

* Fix json schema

* Improve docstrings

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Improve logs

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update the docs

* Fix similarity

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-08-12 11:46:52 +02:00
James Briggs
26c938a8e6
test: add meta fields for meta_config to be used during testing (#3021)
* added meta fields for meta_config to be used during realtime testing of PineconeDocumentStore

* Add documentation on metadata filtering in docstring

* docs

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-08-12 10:27:56 +02:00
Zoltan Fedor
f4128d3581
Adding support for additional distance/similarity metrics for Weaviate (#3001)
* Adding support for additional distance metrics for Weaviate

Fixes #3000

* Updating the docs

* Fixing error texts

* Fixing issues raised by the review

* Addressing the last issue from the reviews - removing test `test_weaviate.py::test_similarity`

* [EMPTY] Re-trigger CI

* Fixing things based on review

* [EMPTY] Re-trigger CI
2022-08-11 09:48:21 +02:00
Massimiliano Pippi
e7627c3f8b
Use opensearch-py in OpenSearchDocumentStore (#2691)
* add Opensearch extras

* let OpenSearchDocumentStore use opensearch-py

* Update Documentation & Code Style

* fix a bug found after adding tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-07-28 10:04:49 +02:00
Zoltan Fedor
adb2b2c312
Add support for BM25 with the Weaviate document store (#2860)
* Upgrading Weaviate used for testing to 1.14.1 from 1.11.0

This has also brought up an issue with one of the tests filtering for value "a". This test has started to fail, as "a" is a default stopword in Weaviate, so I have changed this test to look for value "c" instead of value "a" to get around the stopword issue.

* Weaviate client upgrade

From v3.3.3 to v3.6.0

* Adding BM25 Retrieval to Weaviate

Weaviate now supports BM25 retrieval in experimental mode and with some limitations (for example, it cannot be combined with filters).
This commit adds support for inverted index (BM25) querying against Weaviate.

* Running Black on the recent code changes

* Update Documentation & Code Style

* Fixing linting issues after code changes by black

* The BM25 query needs to be in all lowercase for now

The BM25 query needs to be provided all lowercase while the functionality is in experimental mode in Weaviate.
See https://app.slack.com/client/T0181DYT9KN/C017EG2SL3H/thread/C017EG2SL3H-1658790227.208119

* Fixing method parameter docstring to highlight that they are not supported in Weaviate

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-27 10:07:13 +02:00
Julian Risch
f599ce9458
Change "text" to "content" as dict key (#2800)
* change "text" to "content" as dict key

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-13 16:36:06 +02:00
bogdankostic
195aed942f
Add update_document_meta to InMemoryDocumentStore (#2689)
* Add update_document_meta to InMemoryDocumentStore

* Fix typo

* Update Documentation & Code Style

* Add update_document_meta to BaseDocumentStore

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

* Add update_document_meta to MockDocumentStore

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-07 15:44:07 +02:00
Julian Risch
1c1faa4742
Make check of document & embedding count optional in FAISS and Pinecone (#2677)
* make validation optional & add method call in pinecone init

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 10:12:31 +02:00
Massimiliano Pippi
79b287b568
Extract common code for ES and OS into a base class (#2664)
* extract common code for ES and OS into a base class

* Update Documentation & Code Style

* give the base class a more obvious name

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-20 09:47:44 +02:00
James Briggs
2688135481
Pinecone unary queries upgrade (#2657)
* update query and response process for unary query update

* added metadata_config parameter

* Update Documentation & Code Style

Co-authored-by: James Briggs <jamesbriggs@Jamess-MacBook-Pro-2.local>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 09:45:39 +02:00
Massimiliano Pippi
374155fd5c
Move Opensearch document store in its own module (#2603)
* move OpenSearchDocumentStore into its own Python module

* Update Documentation & Code Style

* mark test with (sigh) elasticsearch

* skip opensearch tests on windows

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-08 16:37:23 +02:00
bogdankostic
867695ad0c
Change signature of queries param in batch methods (#2575)
* Change signature of queries param in batch methods

* Update Documentation & Code Style

* Fix mypy

* Remove unused import

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-24 12:33:45 +02:00
tstadel
110b9c2b0a
Warnings for write operations of DeepsetCloudDocumentStore (#2565)
* log inputs to write operations

* Update Documentation & Code Style

* adjust tests

* simplify by using decorator for write operation functions

* Update Documentation & Code Style

* fix comma

* fix comma in test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-17 17:53:55 +02:00
bogdankostic
738e008020
Add run_batch method to all nodes and Pipeline to allow batch querying (#2481)
* Add run_batch methods for batch querying

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

* Fix mypy

* Fix linter

* Fix tests

* Update Documentation & Code Style

* Fix tests

* Update Documentation & Code Style

* Fix mypy

* Fix rest api test

* Update Documentation & Code Style

* Add Doc strings

* Update Documentation & Code Style

* Add batch_size as attribute to nodes supporting batching

* Adapt error messages

* Adapt type of filters in retrievers

* Revert change about truncation_warning in summarizer

* Unify multiple_doc_lists tests

* Use smaller models in extractor tests

* Add return types to JoinAnswers and RouteDocuments

* Adapt return statements in reader's run_batch method

* Allow list of filters

* Adapt error messages

* Update Documentation & Code Style

* Fix tests

* Fix mypy

* Adapt print_questions

* Remove disabling warning about too many public methods

* Add flag for pylint to disable warning about too many public methods in pipelines/base.py and document_stores/base.py

* Add type check

* Update Documentation & Code Style

* Adapt tutorial 11

* Update Documentation & Code Style

* Add query_batch method for DCDocStore

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-11 11:11:00 +02:00
bogdankostic
4581b91e83
Make DeepsetCloudDocumentStore work with non-existing index (#2513)
* Make DeepsetCloudDocumentStore work with non-existing index

* Update Documentation & Code Style

* Add tests

* Update Documentation & Code Style

* Fix tests, adapt warning messages + lowercase deepset

* Update Documentation & Code Style

* Fix typo in test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-10 15:21:35 +02:00
Branden Chan
caf1336424
Adjust pydoc markdown config so methods shown with classes (#2511)
* add_member_class_prefix: true

* Update Documentation & Code Style

* Trigger redeploy

* Trigger redeploy

* Fix pydoc param

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-06 16:00:08 +02:00
Julian Risch
1418f0c603
change milvus links from 2.0.0 to 2.0.x (#2496)
* change milvus links from 2.0.0 to 2.0.x

* Update Documentation & Code Style

* fix two broken links

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-04 18:30:50 +02:00
bogdankostic
a4e603ce87
Deprecate Milvus1DocumentStore (#2495)
* Add warning message

* Update doc string

* Update Documentation & Code Style

* Change DeprecationWarning to FutureWarning

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-04 15:09:57 +02:00
tstadel
7d6b3fe954
Add flag to disable scaling scores to probabilities (#2454)
* add scale_scores_to_probabilities flag

* Update Documentation & Code Style

* fix tests

* fix sql mypy

* Update Documentation & Code Style

* fix responses

* Update Documentation & Code Style

* rename to scale_score_to_probability + docstrings

* use BaseDocumentStore.score_to_probability in elasticsearch and milvus2

* Update Documentation & Code Style

* fix tests

* Update Documentation & Code Style

* add tests

* improve naming

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-02 13:35:07 +02:00
Malte Pietsch
766e75370c
Update docs of DeepsetCloudDocumentStore (#2460)
* Update docs of DeepsetCloudDocumentStore

* Update Documentation & Code Style

* Update docstring

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

* Update Documentation & Code Style

* move DEFAULT_API_ENDPOINT

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
2022-04-27 19:40:39 +02:00
tstadel
7498c7c6fb
Fix and use delete_index instead of delete_documents in tests (#2453)
* use delete_index instead of delete_documents in tests

* fix delete_index

* fix delete_index() in memory and milvus

* fix imports

* fix memory keyerrors

* Update Documentation & Code Style

* increase timeout for pinecone tests to 60 minutes

* clean get_document_store()

* use recreate_index in tests

* Update Documentation & Code Style

* fix tests

* fix remaining tests

* log index deleted

* fix test_eval_pipeline

* simplify existing index detection in weaviate

* delete label_index on recreate_index for pinecone and milvus

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-26 19:06:30 +02:00
Tuana Celik
d49e92e21c
ElasticsearchRetriever to BM25Retriever (#2423)
* change class names to bm25

* Update Documentation & Code Style

* Update Documentation & Code Style

* Update Documentation & Code Style

* Add back all_terms_must_match

* fix syntax

* Update Documentation & Code Style

* Update Documentation & Code Style

* Creating a wrapper for old ES retriever with deprecated wrapper

* Update Documentation & Code Style

* New method for deprecating old ESRetriever

* New attempt for deprecating the ESRetriever

* Reverting to the simplest solution - warning logged

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-04-26 16:09:39 +02:00
Sara Zan
929c685cda
Forbid usage of *args and **kwargs in any node's __init__ (#2362)
* Add failing test

* Remove `**kwargs` from docstores' `__init__` functions (#2407)

* Remove kwargs from ESDocStore subclasses

* Remove kwargs from subclasses of SQLDocumentStore

* Remove kwargs from Weaviate

* Revert change in pinecone

* Fix tests

* Fix retriever test with weaviate

* Change Exception into DocumentStoreError

* Update Documentation & Code Style

* Remove `**kwargs` from `FARMReader` (#2413)

* Remove FARMReader kwargs without trying to replace them functionally

* Update Documentation & Code Style

* enforce same index values before and after saving/loading eval dataframes (#2398)

* Add tests for missing `__init__` and `super().__init__()` in custom nodes (#2350)

* Add tests for missing init and super

* Update Documentation & Code Style

* change in with endswith

* Move test in pipeline.py and change test in pipeline_yaml.py

* Update Documentation & Code Style

* Use caplog to test the warning

* Update Documentation & Code Style

* move tests into test_pipeline and use get_config

* Update Documentation & Code Style

* Unmock version name

* Improve variadic args test

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-14 16:42:02 +02:00
Sara Zan
96a538b182
Pylint (import related warnings) and REST API improvements (#2326)
* remove duplicate imports

* fix ungrouped-imports

* Fix wrong-import-position

* Fix unused-import

* pyproject.toml

* Working on wrong-import-order

* Solve wrong-import-order

* fix Pool import

* Move open_search_index_to_document_store and elasticsearch_index_to_document_store in elasticsearch.py

* remove Converter from modeling

* Fix mypy issues on adaptive_model.py

* create es_converter.py

* remove converter import

* change import path in tests

* Restructure REST API to not rely on global vars from search.py and improve tests

* Fix openapi generator

* Move variable initialization

* Change type of FilterRequest.filters

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-04-12 16:41:05 +02:00
Sara Zan
ae712fe6bf
Upgrade weaviate-client to 3.3.3 and fix get_all_documents (#1895)
* Fix 'bug' on Weaviate only returning max. 100 docs on get_all_documents

* Add type

* Update Weaviate version on the CI

* Fix bug on get_document_count where there are no documents

* Add more info in the docstrings of get_all_documents and get_all_documents_generator

* Add latest docstring and tutorial changes

* Apply Black

* Update Documentation & Code Style

* Trigger pipeline

* Update Documentation & Code Style

* Include StefanBogdan feedback

* Fix mypy issues and LogicalFilterClause

* Add more types

* Update Documentation & Code Style

* update setup.cfg

* Upgrade weaviate containers too

* Allow to filter for content field in Weaviate

* Use convert_to_weaviate instead of convert_to_pinecone

* Fix _get_all_documents_in_index

* Update docstrings and docs

* Catching an exception in get_document(s)_by_id

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-04-01 15:37:34 +03:00
Florian Hardow
a273c3a51d
EvaluationSetClient for deepset cloud to fetch evaluation sets and la… (#2345)
* EvaluationSetClient for deepset cloud to fetch evaluation sets and labels for one specific evaluation set

* make DeepsetCloudDocumentStore able to fetch uploaded evaluation set names

* fix missing renaming of get_evaluation_set_names in DeepsetCloudDocumentStore

* update documentation for evaluation set functionality in deepset cloud document store

* DeepsetCloudDocumentStore tests for evaluation set functionality

* rename index to evaluation_set_name for DeepsetCloudDocumentStore evaluation set functionality

* raise DeepsetCloudError when no labels were found for evaluation set

* make use of .get_with_auto_paging in EvaluationSetClient

* Return result of get_with_auto_paging() as it parses the response already

* Make schema import source more specific

* fetch all evaluation sets for a workspace in deepset Cloud

* Rename evaluation_set_name to label_index

* make use of generator functionality for fetching labels

* Update Documentation & Code Style

* Adjust function input for DeepsetCloudDocumentStore.get_all_labels, adjust tests for it, fix typos, make linter happy

* Match error message with pytest.raises

* Update Documentation & Code Style

* DeepsetCloudDocumentStore.get_labels_count raises DeepsetCloudError when no evaluation set was found to count labels on

* remove unneeded import in tests

* DeepsetCloudDocumentStore tests, make response bodies a string through json.dumps

* DeepsetCloudDocumentStore.get_label_count - move raise to return

* stringify uuid before json.dump as uuid is not serializable

* DeepsetCloudDocumentStore - adjust response mocking in tests

* DeepsetCloudDocumentStore - json dump response body in test

* DeepsetCloudDocumentStore introduce label_index, EvaluationSetClient rename label_index to evaluation_set

* Update Documentation & Code Style

* DeepsetCloudDocumentStore rename evaluation_set to evaluation_set_response as there is a name clash with the input variable

* DeepsetCloudDocumentStore - rename missed variable in test

* DeepsetCloudDocumentStore - rename missed label_index to index in doc string, rename label_index to evaluation_set in EvaluationSetClient

* Update Documentation & Code Style

* DeepsetCloudDocumentStore - update docstrings for EvaluationSetClient

* DeepsetCloudDocumentStore - fix typo in doc string

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-31 08:59:58 +02:00
bogdankostic
834f8c4902
Change return types of indexing pipeline nodes (#2342)
* Change return types of file converters

* Change return types of preprocessor

* Change return types of crawler

* Adapt utils functions to new return types

* Adapt __init__.py to new method names

* Prevent circular imports

* Update Documentation & Code Style

* Let DocStores' run method accept Documents

* Adapt tests to new return types

* Update Documentation & Code Style

* Put "# type: ignore" to right place

* Remove id_hash_keys property from Document primitive

* Update Documentation & Code Style

* Adapt tests to new return types and missing id_hash_keys property

* Fix mypy

* Fix mypy

* Adapt PDFToTextOCRConverter

* Remove id_hash_keys from RestAPI tests

* Update Documentation & Code Style

* Rename tests

* Remove redundant setting of content_type="text"

* Add DeprecationWarning

* Add id_hash_keys to elasticsearch_index_to_document_store

* Change document type from dict to Document in PreProcessor test

* Fix file path in Tutorial 5

* Remove added output in Tutorial 5

* Update Documentation & Code Style

* Fix file_paths in Tutorial 9 + fix gz files in fetch_archive_from_http

* Adapt tutorials to new return types

* Adapt tutorial 14 to new return types

* Update Documentation & Code Style

* Change assertions to HaystackErrors

* Import HaystackError correctly

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-29 13:53:35 +02:00
tstadel
a73717b2ea
Support conjunctive queries in sparse retrieval (#2361)
* support conjunctive queries in sparse retrieval

* fix typo

* test added

* Update Documentation & Code Style

* fix test_DeepsetCloudDocumentStore_query

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-28 22:10:50 +02:00
tstadel
851fe1cf07
Fix normalize_embedding using numba (#2347)
* fix normalize_embedding using numba

* Update Documentation & Code Style

* fix too-many-public-methods pylint msg

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-22 23:04:55 +01:00
tstadel
ca86cc834d
Integrate BEIR (#2333)
* introduce eval_beir() to Pipeline

* add beir dependency

* Update Documentation & Code Style

* top_k_values added + refactoring

* Update Documentation & Code Style

* enable titles during beir eval

* Update Documentation & Code Style

* raise HaystackError instead of PipelineError

* get rid of forced dedicated index

* minor docstring and comment fixes

* show warning on default index deletion

* Update Documentation & Code Style

* add delete_index to MockDocumentStore

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-21 19:04:28 +01:00
James Briggs
8cd73a9d20
Add PineconeDocumentStore (#2254)
* added core install and functionality of pinecone doc store (init, upsert, query, delete)

* implemented core functionality of Pinecone doc store

* Update Documentation & Code Style

* updated filtering to use Haystack filtering and reduced default batch_size

* Update Documentation & Code Style

* removed debugging code

* updated Pinecone filtering to use filter_utils

* removed unneeded methods and minor tweaks to current methods

* fixed typing issues

* Update Documentation & Code Style

* Allow filters in all methods except get_embedding_count

* Fix skipping document store tests

* Update Documentation & Code Style

* Fix handling of Milvus1 and Milvus2 in tests

* Update Documentation & Code Style

* Fix handling of Milvus1 and Milvus2 in tests

* Update Documentation & Code Style

* Remove SQL from tests requiring embeddings

* Update Documentation & Code Style

* Fix get_embedding_count of Milvus2

* Make sure to start Milvus2 tests with a new collection

* Add pinecone to test suite

* Update Documentation & Code Style

* Fix typing

* Update Documentation & Code Style

* Add pinecone to docstores dependency

* Add PineconeDocStore to API Documentation

* Add missing comma

* Update Documentation & Code Style

* Adapt format of doc strings

* Update Documentation & Code Style

* Set API key as environment variable

* Skip Pinecone tests in forks

* Add sleep after deleting index

* Add sleep after deleting index

* Add sleep after creating index

* Add check if index ready

* Remove printing of index stats

* Create new index for each pinecone test

* Use RestAPI instead of Python API for describe_index_stats

* Fix accessing describe_index_stats

* Remove usages of describe_index_stats

* Run pinecone tests separately

* Update Documentation & Code Style

* Add pdftotext to pinecone tests

* Remove sleep from doc store fixture

* Add describe_index_stats

* Remove unused imports

* Use pull_request_target trigger

* Revert use pull_request_target trigger

* Remove set_config

* Add os to conftest

* Integrate review comments

* Set include_values to False

* Remove quotation marks from pinecone.Index type

* Update Documentation & Code Style

* Update Documentation & Code Style

* Fix number of args in error messages

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-03-21 16:24:09 +01:00
Tuana Celik
6fb58d09a9
'os' wrapper to function for brownfield support (#2282)
* 'os' wrapper to function for brownfield support

* Changing function names and fixing default parameter values

* Including parameter keys

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-16 11:53:55 +01:00
Sara Zan
11cf94a965
Pipeline's YAML: syntax validation (#2226)
* Add BasePipeline.validate_config, BasePipeline.validate_yaml, and some new custom exception classes

* Make error composition work properly

* Clarify typing

* Help mypy a bit more

* Update Documentation & Code Style

* Enable autogenerated docs for Milvus1 and 2 separately

* Revert "Enable autogenerated docs for Milvus1 and 2 separately"

This reverts commit 282be4a78a6e95862a9b4c924fc3dea5ca71e28d.

* Update Documentation & Code Style

* Re-enable 'additionalProperties: False'

* Add pipeline.type to JSON Schema, was somehow forgotten

* Disable additionalProperties on the pipeline properties too

* Fix json-schemas for 1.1.0 and 1.2.0 (should not do it again in the future)

* Call super in PipelineValidationError

* Improve _read_pipeline_config_from_yaml's error handling

* Fix generate_json_schema.py to include document stores

* Fix json schemas (retro-fix 1.1.0 again)

* Improve custom errors printing, add link to docs

* Add function in BaseComponent to list its subclasses in a module

* Make some document stores base classes abstract

* Add marker 'integration' in pytest flags

* Slightly improve validation of pipelines at load

* Adding tests for YAML loading and validation

* Make custom_query Optional for validation issues

* Fix bug in _read_pipeline_config_from_yaml

* Improve error handling in BasePipeline and Pipeline and add DAG check

* Move json schema generation into haystack/nodes/_json_schema.py (useful for tests)

* Simplify errors slightly

* Add some YAML validation tests

* Remove load_from_config from BasePipeline, it was never used anyway

* Improve tests

* Include json-schemas in package

* Fix conftest imports

* Make BasePipeline abstract

* Improve mocking by making the test independent from the YAML version

* Add exportable_to_yaml decorator to forget about set_config on mock nodes

* Fix mypy errors

* Comment out one monkeypatch

* Fix typing again

* Improve error message for validation

* Add required properties to pipelines

* Fix YAML version for REST API YAMLs to 1.2.0

* Fix load_from_yaml call in load_from_deepset_cloud

* fix HaystackError.__getattr__

* Add super().__init__() in most nodes and docstore, comment set_config

* Remove type from REST API pipelines

* Remove useless init from doc2answers

* Call super in Seq2SeqGenerator

* Typo in deepsetcloud.py

* Fix rest api indexing error mismatch and mock version of JSON schema in all tests

* Working on pipeline tests

* Improve errors printing slightly

* Add back test_pipeline.yaml

* _json_schema.py supports different versions with identical schemas

* Add type to 0.7 schema for backwards compatibility

* Fix small bug in _json_schema.py

* Try alternative to generate json schemas on the CI

* Update Documentation & Code Style

* Make linux CI match autoformat CI

* Fix super-init-not-called

* Accidentally committed file

* Update Documentation & Code Style

* fix test_summarizer_translation.py's import

* Mock YAML in a few suites, split and simplify test_pipeline_debug_and_validation.py::test_invalid_run_args

* Fix json schema for ray tests too

* Update Documentation & Code Style

* Reintroduce validation

* Use unstable version in tests and rest api

* Make unstable support the latest versions

* Update Documentation & Code Style

* Remove needless fixture

* Make type in pipeline optional in the strings validation

* Fix schemas

* Fix string validation for pipeline type

* Improve validate_config_strings

* Remove type from test pipelines

* Update Documentation & Code Style

* Fix test_pipeline

* Removing more type from pipelines

* Temporary CI patch

* Fix issue with exportable_to_yaml never invoking the wrapped init

* rm stray file

* pipeline tests are green again

* Linux CI now needs .[all] to generate the schema

* Bugfixes, pipeline tests seem to be green

* Typo in version after merge

* Implement missing methods in Weaviate

* Trying to avoid FAISS tests from running in the Milvus1 test suite

* Fix some stray test paths and faiss index dumping

* Fix pytest markers list

* Temporarily disable cache to be able to see tests failures

* Fix pyproject.toml syntax

* Use only tmp_path

* Fix preprocessor signature after merge

* Fix faiss bug

* Fix Ray test

* Fix documentation issue by removing quotes from faiss type

* Update Documentation & Code Style

* use document properly in preprocessor tests

* Update Documentation & Code Style

* make preprocessor capable of handling documents

* import document

* Revert support for documents in preprocessor, do later

* Fix bug in _json_schema.py that was breaking validation

* re-enable cache

* Update Documentation & Code Style

* Simplify calling _json_schema.py from the CI

* Remove redundant ABC inheritance

* Ensure exportable_to_yaml works only on implementations

* Rename subclass to class_ in Meta

* Make run() and get_config() abstract in BasePipeline

* Revert unintended change in preprocessor

* Move outgoing_edges_input_node check inside try block

* Rename VALID_CODE_GEN_INPUT_REGEX into VALID_INPUT_REGEX

* Add check for a RecursionError on validate_config_strings

* Address usages of _pipeline_config in data silo and elasticsearch

* Rename _pipeline_config into _init_parameters

* Fix pytest marker and remove unused imports

* Remove most redundant ABCs

* Rename _init_parameters into _component_configuration

* Remove set_config and type from _component_configuration's dict

* Remove last instances of set_config and replace with super().__init__()

* Implement __init_subclass__ approach

* Simplify checks on the existence of _component_configuration

* Fix faiss issue

* Dynamic generation of node schemas & weed out old schemas

* Add debatable test

* Add docstring to debatable test

* Positive diff between schemas implemented

* Improve diff printing

* Rename REST API YAML files to trigger IDE validation

* Fix typing issues

* Fix more typing

* Typo in YAML filename

* Remove needless type:ignore

* Add tests

* Fix tests & validation feedback for accessory classes in custom nodes

* Refactor RAGeneratorType out

* Fix broken import in conftest

* Improve source error handling

* Remove unused import in test_eval.py breaking tests

* Fix changed error message in tests matches too

* Normalize generate_openapi_specs.py and generate_json_schema.py in the actions

* Fix path to generate_openapi_specs.py in autoformat.yml

* Update Documentation & Code Style

* Add test for FAISSDocumentStore-like situations (superclass with init params)

* Update Documentation & Code Style

* Fix indentation

* Remove commented set_config

* Store model_name_or_path in FARMReader to use in DistillationDataSilo

* Rename _component_configuration into _component_config

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-15 11:17:26 +01:00
Branden Chan
e4c979990c
Bring back init defs to api in v1.2 and latest (#2296)
* Bring back init defs to api in v1.2

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-10 15:01:05 +01:00
OmniScience
004e7f33af
Bulk insert in sql document stores (#2264)
* bulk_save_objects

* use bulk_insert_mappings in sql document store

* bug fix : write_document overwrite

* update documentation write_documents

* Update Documentation & Code Style

* small changes for readability

* bug fix: missing quotation marks

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-03-09 12:11:26 +01:00
Branden Chan
b563b6622c
Api pages (#2248)
* Update Readme WIP

* Update Documentation & Code Style

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-02-25 13:53:46 +01:00
MichelBartels
2c423ba063
Introduce support for pymilvus>=2.0.0 (#2126)
* update remaining occurrences of get_connection

* fix milvus2 import and fix wrong extra references

* change MilvusDocumentStore to Milvus1DocumentStore

* update milvus docstrings to reflect updated dependency management

* enable milvus 2 tests

* fix milvus2 env variable processing

* fix dropping collections for each milvus 2 test

* make Milvus 2 doc store tests work

* allow user to specify consistency level

* First attempt at running Milvus2 in the CI

* Install the correct pymilvus

* add batch deletion for milvus2

* change default from milvus 1 to milvus 2

* make milvus2 the default in the docstores extra

* Switch milvus1 and milvus2 in base test run on CI

* Rename docstore flags for pytest: 'milvus'->'milvus1', 'milvus2'->'milvus'

* Rename milvus.py->milvus1.py and milvus2x.py->milvus2.py

* Enable autogenerated docs for Milvus1 and 2 separately

* Partial fix to docstring of Milvus2DocumentStore

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michel Bartels <kontakt@michelbartels.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-02-24 17:43:38 +01:00
MichelBartels
f4efc008f4
Adding extended meta data filtering support for InMemoryDocumentStore (#2120)
* add filter classes

* update filter comments

* Add util classes for converting filters (#2123)

* Apply Black

* reintroduce eval functions to filter ops

* Update documentation

* update to latest pymilvus version

* Apply Black

* fixing type hints

* Apply Black

* update write_documents method of milvus2 doc store

* remove unnecessary method

* update init

* remove changes to milvus 2 as they are part of other PR

* remove changes to milvus 2 as they are part of other PR

* updating doc strings to match elastic search filter doc

* Update Documentation & Code Style

* add support for case where there is no meta data defined for key

* update behaviour in case of field not existing in entry

* Update Documentation & Code Style

* add test for InMemoryDocumentStore extended meta data filtering

* make type hint more precise

Co-authored-by: bogdankostic <bogdankostic@web.de>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-02-22 17:44:58 +01:00
bogdankostic
2a674eaff7
Support more data types and extended filters in WeaviateDocStore (#2143)
* Support more data types and extended filters in WeaviateDocStore

* Adapt types to extended filters

* Update Documentation & Code Style

* Fix mypy

* Fix type of filters

* Update Documentation & Code Style

* Add Docstrings for BaseDocStore

* Update Documentation & Code Style

* Add + prettify DocStrings

* Update Documentation & Code Style

* Fix types

* Update Documentation & Code Style

* Remove import of TypedDict

* Fix tests

* Update Documentation & Code Style

* Fix circular import

* Fix inversion of not operation + add test case

* Fix mypy

* Update Documentation & Code Style

* Apply black

* Use convert_date_to_rfc3339 instead of datetime.fromisoformat

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-02-18 08:55:17 +01:00
Sara Zan
ffbba90323
Move pytest configuration into pyproject.toml (#2141)
* Move pytest configuration into pyproject.toml

* Fix markers format

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-02-08 17:23:59 +01:00
tstadel
518a439482
OpenSearchDocumentStore: Extend similarity support (#2070)
* get rid of global space_type setting

* full_similarity_support

* fallback to exact vector similarity

* clone_embedding_field() instead of full_similarity_support

* multiple embedding fields handling

* update documentation and messages

* revert unnecessary changes

* Add latest docstring and tutorial changes

* typo

* Add latest docstring and tutorial changes

* update docs

* Add latest docstring and tutorial changes

* improve messages

* further improve messages

* support l2 in ElasticsearchDocumentStore

* Apply Black

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-02-04 16:37:08 +01:00
Sara Zan
957e78ed9e
Upgrade pydoc-markdown & refactor GitHub Actions (#2117)
* Upgrade pydoc-markdown and fix the YAMLs to work with it

* Pin pydoc-markdown to major version

* Generalize pydoc-markdown workflow

* Make a single Action to perform all tasks that require committing into the local branch

* Merge the code updates and the docs in the Linux CI to prevent the bot from always showing the pipeline as green

* Installing Jupyter deps for Black

* Build cache before running generation tasks

* Add check not to run the code generation on master

* Simplify push action

* Add more test deps in setup.cfg and remove from GH Action workflow

* Remove forced upgrades on pip install

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-02-04 15:45:09 +01:00
bogdankostic
f062911040
Extend metadata filtering support in ElasticsearchDocumentStore (#2108)
* Add extended filtering to ESDocumentStore

* Add Docstrings

* Fix definition of filter queries

* Fix mypy

* Add tests

* Add latest docstring and tutorial changes

* Adapt Docstrings

* Adapt tests to added test_docs

* Adapt tests to added test_docs

* Adapt tests to added test_docs

* Adapt tests to added test_docs

* Add filtering utils for same representation in all doc stores

* Apply Black formatting

* Update documentation

* Fix mypy

* Apply Black

* Fix mypy

* Adopt Doc Strings

* Add more tests

* Apply Black

* Allow filtering in OpenSearchDocStore

* Update documentation

* Adapt Docstrings

* Update documentation

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-02-04 13:43:12 +01:00