1414 Commits

Author SHA1 Message Date
Sebastian
71de0524de
fix: fixed InMemoryDocumentStore.get_embedding_count to return correct number (#3980)
* Fix the embedding count function of InMemoryDocumentStore

* Adding some doc strings explaining how many docs with embeddings to expect.
2023-01-30 12:38:30 +01:00
hsm207
08ec059b14
refactor: use weaviate client to build BM25 query (#3939)
* refactor: use weaviate client to build BM25 query

* refactor: remove manual BM25 query building

* refactor: apply BM25 to the content_field only

* test: update weaviate BM25 retrieval test case

update to account for lack of stemming

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-01-30 10:07:07 +01:00
Tuana Celik
93312138de
fix: removing code block in MarkdownConverter (#3960)
* first attempt to add frontmatter of markdown to the metadata

* remove bug fix

* running black and pre-commit

* moving the import line

* adding a test

* adding pydoc

* fix to removing code blocks in markdown converter

* adding a test

* fixing a test

* improving tests

* adding language to code block
2023-01-27 15:25:54 +01:00
Tuana Celik
790e9acd3e
feat: add frontmatter to meta in MarkdownConverter (#3953)
* first attempt to add frontmatter of markdown to the metadata

* remove bug fix

* running black and pre-commit

* moving the import line

* adding a test

* adding pydoc
2023-01-26 17:15:02 +01:00
Massimiliano Pippi
52b195faf6
increase the timeout for testing (#3957)
2023-01-26 16:04:43 +01:00
Vladimir Blagojevic
ec85207cf7
Remove __eq__ and __hash__ from PromptNode (#3923)
2023-01-26 13:38:35 +01:00
Vladimir Blagojevic
b945eaeabd
PromptNode: expose output_variable, adjust unit tests (#3892)
2023-01-26 11:01:11 +01:00
ZanSara
0e471d5e5a
fix: change model in distillation test (#3944)
* change model

* change layer count

* move promptnode tests in integration

* fix marker
2023-01-25 23:32:11 +05:30
Mayank Jobanputra
5c53b2bd4a
feat: adding secure loading of models by default for haystack (#3901)
* adding secure loading of models by default

* simplified set function

* testing import effect correctly

* added appropriate log line, adapted the test

* change log string formatting

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* remove extra closing bracket )

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-01-24 23:01:20 +05:30
Vladimir Blagojevic
4d8b1d0b22
refactor: Improve stop_words handling, add unit test cases (#3918)
* Improve stop_words handling, add unit test cases

* Update test/nodes/test_prompt_node.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-01-24 12:52:41 +01:00
Fabian
61ebe4b5dc
fix: authenticate with aws4auth if set in OpenSearchDocumentStore (#3741)
* bug(OpenSearchDocumentStore): fix authenticate with aws4auth if set.

Rearrange check to authenticate with aws4auth before username
and password, as the username is set to "admin" by default.

* Make username check less restrictive

* Fix test, do not use mocked _init_client function

* Add warning for aws4auth and username to ElasticSearchDocumentStore

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-01-24 10:01:39 +01:00
Zoltan Fedor
e447bd728a
feat: adding the ability to use Ray Serve async functionality (#3769)
* Adding the ability to call the Ray pipeline from concurrent apps with async

This is to fix #2968

* Fixes: mypy + pylint (`invalid-overridden-method`)

* Simplifying - no real need for an `AsyncRayPipeline` anymore

* Moving the new `run_async` method to the `RayPipeline`

* Cleanup

* [EMPTY] Re-trigger CI
2023-01-23 16:23:09 +01:00
Benjamin BERNARD
eed009eddb
feat: Add CsvTextConverter (#3587)
* feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline

Fixes #3550, allow user to build full FAQ using YAML pipeline description and with CSV import and indexing.

* feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline

Fix linter issues mypy and pylint.

* feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline

Fix linter issues mypy.

* implement proposal's feedback

* tidy up for merge

* use BaseConverter

* use BaseConverter

* pylint

* black

* Revert "black"

This reverts commit e1c45cb1848408bd52a630328750cb67c8eb7110.

* black

* add check for column names

* add check for column names

* add tests

* fix tests

* address lists of paths

* typo

* remove duplicate line

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-01-23 15:56:36 +01:00
ZanSara
94f660c56f
feat: store id_hash_keys in Document objects to make documents clonable (#3697)
* store id_hash_keys in Document objects

* fix id_hash_keys calls throughout codebase

* generate schema

* fix es

* fix weaviate

* backward compatible

* openapi schema

* remove unused deprecation warning

* remove unused imports

* openapi

* unused var

* Apply suggestions from code review

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/schema.py

* Apply suggestions from code review

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/schema.py

* review feedback

* trailing spaces

* pylint

* add deprecation test

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-01-23 15:00:52 +01:00
Stefano Fiorucci
b910df7ec7
feat: ImageToText (caption generator) (#3859)
* first draft

* fix pylint and mypy

* retry w mypy

* mypy :-)

* rem unused import

* incorporate feedback and initial tests

* better tests

* fix import order

* fix docstring

* other fix docstring

* more and better tests

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-01-23 11:59:56 +01:00
ZanSara
90c877a559
bug: mypy should ignore files in test/ (#3894)
* exclude files in test/

* verify that the CI ignores test files

* dont fail in case of no files
2023-01-19 18:12:26 +01:00
Vladimir Blagojevic
4c28253955
feat: PromptNode - implement stop words (#3884)
2023-01-19 12:26:15 +01:00
Vladimir Blagojevic
e2fb82b148
refactor: Move invocation_context from meta to own pipeline variable (#3888)
2023-01-19 11:17:06 +01:00
ZanSara
6f5a2fb1da
fix: remove string validation in YAML (#3854)
* remove string validation in YAML

* unused import

* fix import

* remove tests

* fix tests
2023-01-19 10:06:53 +01:00
Ahmed Nabil
12e057837b
Adding condition to pinecone object. (#3768)
* Adding condition to `pinecone` object.

While you can assign any values to `PineconeDocumentStore`'s parameter `pinecone_index`, it must have another condition to prevent that from happening.

* Added test, and changed the code to make sure the pinecone idx variable has correct instance

* fixed black error

Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
2023-01-19 01:34:44 +05:30
ZanSara
6af4f14fe0
feat: preprocessor raises warning when doc length exceeds threshold (#3837)
* add warning for excessive length

* improve test

* review feedback

* fix test

* move into _process_single
2023-01-17 13:48:28 +01:00
ZanSara
9e457db2e9
test: add version deprecation fixture (#3851)
* add fixture

* Update test/conftest.py

* remove +2 and add tests

* few typos

* more cases

* Update test/conftest.py
2023-01-16 15:36:14 +01:00
ZanSara
3ffdb0a9a3
chore: fix all EOF (#3852)
* fix all eof

* fix test

* fix test

* fix test

* typo

* fix sample

* fix sample

* add logs

* fix page_dynamic_result.txt
2023-01-16 12:34:50 +01:00
Massimiliano Pippi
fa4404baa0
fix: ignore non-serializable params when hashing pipeline objects (#3842)
* ignore non-serializable params when hashing pipeline objects

* make tests more clear
2023-01-11 17:11:41 +01:00
Stefano Fiorucci
be31178892
fix: make the crawler runnable and testable on Windows (#3830)
* fix crawler and try to run CI

* more compact expression

* try to fix

* improve naming regex

* revert regex

* make test_url compatible with Windows

* better conditional expression
2023-01-10 20:27:28 +01:00
Tobias Wochinger
dea10a51d3
fix: gracefully handle FileExistsError during Preprocessor resource download (#3816)
* fix: use temp path for downloading punkt resources

* fix: gracefully handle file exists error during download
2023-01-10 11:22:49 +01:00
Zoltan Fedor
0288e1be76
bug: The PromptNode handles all parameters as lists without checking if they are in fact lists (#3820)
2023-01-10 08:08:17 +01:00
tstadel
6ca88bfd23
fix: Despite return_embedding=False SearchEngineDocumentStore.query retrieves embedding_field (#3662)
* fix: Despite return_embedding=False SearchEngineDocumentStore.query retrieves embedding_field

* fix pylint

* add tests

* fix mypy

* fix merge

* format

* fix pylint

* move tests to SearchEngineDocumentStoreTestAbstract

* move missed constants

* add mocked_document_store fixture to TestElasticsearchDocumentStore

* fix mocked_document_store

* fix get_all_documents tests for elasticsearch>=7.16

* fix tests

* fix tests try 2
2023-01-09 11:58:23 +01:00
Sebastian
5b0b338175
fix: Ensure eval mode for TableReader model for predictions (#3743)
* Adding model.eval() calls to prediction functions in table reader

* Add unit test to check that inference-time prediction still works if the model is set to train mode.
2023-01-09 11:07:06 +01:00
Sebastian
659020fcac
fix: Convert table cells to strings for compatibility with TableReader (#3762)
* Add table = table.astype(str) to make sure cells are converted into strings to be compatible with the TableReader

* Turn more strings into ints

* Make sure answer text is always a string.
2023-01-09 10:42:11 +01:00
tstadel
4a0a054164
fix: linefeeds in custom_query (#3813)
* fix linefeeds in custom_query

* add double quote test case
2023-01-05 17:13:04 +01:00
Julian Risch
0c2d13f1b8
bug: skip validating empty embeddings (#3774)
* skip validating empty embeddings

* skip batches without embeddings to update

* add unit test with mocked retriever
2023-01-05 15:13:57 +01:00
Sebastian
e84fae2894
Migrating to use native Pytorch AMP (#2827)
* Started making changes to use native Pytorch AMP

* Updated compute_loss functions to use torch.cuda.amp.autocast

* Updating docstrings

* Add use_amp to trainer_checkpoint

* Removed mentions of apex and started to add the necessary warnings

* Removing unused instances of use_amp variable

* Added fast training test for FARMReader. Needed to add max_query_length as a parameter in FARMReader.__init__ and FARMReader.train

* Make max_query_length optional in FARMReader.train

* Update lg

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-01-05 09:14:28 +01:00
Julian Risch
a2c160e7d8
bug: skip empty documents in reader (#3773)
* skip empty documents

* test eval_batch and account for tables
2023-01-03 15:50:14 +01:00
Julian Risch
b155297a06
feat: change PipelineConfigError to DocumentStoreError with more details (#3783)
2023-01-02 19:40:45 +01:00
Vladimir Blagojevic
bebd6b26ec
Improve robustness of PromptNode unit tests (#3747)
2023-01-02 16:28:56 +01:00
bogdankostic
594d2a10f8
fix: Fix predict_batch in TransformersReader for single nested Document list (#3748)
* Fix restoring of list structure

* Add tests
2022-12-29 11:48:18 +01:00
Stefano Fiorucci
136928714c
refactor: remove deprecated parameters from Summarizer (#3740)
* remove deprecated parameters

* remove deprecation/removal test
2022-12-29 15:37:47 +05:30
tstadel
6c067b2b4f
feat: make score_script first class citizen via knn_engine param (#3284)
* OpenSearchDocumentStore: make score_script accessible via knn_engine

* blacken

* fix tests

* fix format

* fix naming of 'score_script' consistently

* fix tests

* fix test

* fix ef_search tests

* always validate index

* improve clone_embedding_field

* fix pylint

* reformat

* remove port

* update tests

* set no_implicit_optional = false

* fix mypy

* fix test

* refactorings

* reformat

* fix and refactor tests

* better tests

* create search_field mappings

* remove no_implicit_optional = false

* skip validation for custom mapping

* format

* Apply suggestions from docs code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply tougher suggestions from code review

* fix messages

* fix typos

* update tests

* Update haystack/document_stores/opensearch.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* fix tests

* fix ef_search validation

* add test for ef_search nmslib

* fix assert_not_called

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-12-27 15:24:31 +01:00
Mayank Jobanputra
76a16807d5
fix: Fixed local reader model loading (#3663)
* Fixed local loading issue
2022-12-24 03:46:36 +05:30
Sebastian
756e0114e6
refactor: Remove duplicate code in TableReader (#3708)
* Refactor table reader to use util functions to reduce code duplication.

* Expanding the tests for the table reader

* Adding types

* Updating tests to work for RCIReader

* Fix bug in RCIReader. Saving the wrong queries list.

* Update _flatten_inputs to not change input variable

* Remove duplicate code
2022-12-21 14:33:19 +01:00
Vladimir Blagojevic
9ebf164cfd
feat: Expand LLM support with PromptModel, PromptNode, and PromptTemplate (#3667)
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-12-20 11:21:26 +01:00
Zoltan Fedor
e143f7cc36
Fixing broken BM25 support with Weaviate - fixes #3720 (#3723)
* Fixing broken BM25 support with Weaviate - fixes #3720

Unfortunately the BM25 support with Weaviate got broken with Haystack v1.11.0+, which is getting fixed with this commit.

Please see more under issue #3720.

* Fixing mypy issue - method signature wasn't matching the base class

* Mypy related test fix

Mypy forced me to set the signature of the `query` method of the Weaviate document store to the same as its parent, the `KeywordDocumentStore`, where the `query` param is `Optional`, but has NO default value, so it must be provided (as None) at runtime.
I am not quite sure why the abstract method's `query` param was set without a default value while its type is `Optional`, but I didn't want to change that, so instead I have changed the Weaviate tests.

* Adding a note regarding an upcoming fix in Weaviate v1.17.0

* Apply suggestions from code review

* revert

* [EMPTY] Re-trigger CI
2022-12-19 17:24:46 +01:00
Vladimir Blagojevic
56803e5465
feat: Enable text-embedding-ada-002 for EmbeddingRetriever (#3721)
* Enable text-embedding-ada-002 for EmbeddingRetriever

* Easier to understand code, more unit tests
2022-12-19 17:06:48 +01:00
Stefano Fiorucci
5b9c661155
feat: add index parameter to TfidfRetriever (#3666)
* first draft to add index param to tfidf

* better mypy handling

* Revert "better mypy handling"

This reverts commit 91a22516320f9dcbeae53827ec69f9dc51e1785c.

* new check in auto_fit

* new check also in retrieve

* better dict typings

* new test and improvements to other test

* remove unnecessary lambda

* improve test

* remove newline from openapi json

* fix test

* language fix

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 2

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 3

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 4

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 5

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 6

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* explicit index value handling

* fix test

* better error messages

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-12-19 12:07:49 +01:00
Stefano Fiorucci
e1401f79b6
refactor: improve Multilabel design (#3658)
* first try and new test

* fix test

* fix unused import

* remove comments

* no more dataclass

* add __eq__ and extend test

* better design from review

* Update schema.py

* fix black

* fix openapi

* fix openapi 2

* new try to fix openapi

* remove newline from openapi json
2022-12-13 10:45:56 +01:00
James Briggs
520b23ec1b
fix: pinecone metadata format (#3660)
* fix for multilevel metadata dictionaries

* add metadata dict formatting to update function

* typing

* added check for labels meta

* added more info to input parameters

* added test for multilayer metadata

* removed todo
2022-12-13 10:11:24 +01:00
tstadel
600dc2d611
refactor: filters type (#3682)
* consolidate filters type

* remove unnecessary optionals

* fix mypy

* fix pylint

* fix pylint

* move FilterType to schema

* remove Optional from FilterType

* move to Dict[str, Any]

* Revert "move to Dict[str, Any]"

This reverts commit e8c561bb7885949e19825697fa4c469945f90ce5.

* fix mypy

* fix pylint

* revert isort changes in elasticsearch

* remove todos in milvus.py

* remove todos in sql.py

* add aggregate_labels tests

* consolidate aggregate_labels tests

* remove superfluous type todos

* remove ALL superfluous #todos
2022-12-12 14:04:29 +01:00
Unai Garay Maestre
77cea8b140
feat: Adds all_terms_must_match parameter to BM25Retriever at runtime (#3627)
* Adds all_terms_must_match implementation and tests

* Adds all_terms_must_match as Optional

Signed-off-by: Unai Garay <unaigaraymaestre@gmail.com>

* Avoid mypy error and follow pattern checking var is None

* Mypy works ok on this file now

* added mypy ignores to BaseRetriever

* ignoring all overrides for this file

* Updates sparse retriever `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retriever `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retriever `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retrieve_batch `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retrieve_batch `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retrieve_batch `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* marked elasticsearch

Signed-off-by: Unai Garay <unaigaraymaestre@gmail.com>
Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-12-08 17:18:43 +05:30
tstadel
c1c1c97bb2
feat: add query_by_embedding_batch (#3546)
* add query_by_embedding_batch

* fix mypy

* fix pylint

* add test

* move query_by_embedding_batch to search_engine

* fix and add tests

* fix pylint

* remove Retriever query logs

* add test for multimodal batch retrieval

* allow for np.ndarray
2022-12-08 08:28:43 +01:00