46 Commits

Author SHA1 Message Date
tstadel
d46c84bb61
feat: support dynamic filters in custom_query (#5427)
* support filters in custom_query

* better tests

* Update docstrings

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-08-08 15:48:15 +02:00
Sebastian Husch Lee
b5aef24a7e
feat: Add support for meta fields that are lists when using embed_meta_fields (#5307)
* Add support for meta fields that are lists when using embed_meta_fields

* Make sure unit test doesn't download model

* Adding more unit tests
2023-07-11 17:32:33 +02:00
Sebastian Husch Lee
22750d342c
test: Refactor some retriever tests into unit tests (#5306)
* Modify and reactivate two unit tests

* Refactor openai embedding tests into unit tests

* Update test_retriever.py

* Changing tests
2023-07-11 13:36:23 +02:00
Stefano Fiorucci
90ff3817e7
feat: support OpenAI-Organization for authentication (#5292)
* add openai_organization to invocation layer, generator and retriever

* added tests
2023-07-07 12:02:21 +02:00
Vladimir Blagojevic
0cc9ce7522
fix: WebRetriever top_k is ignored in a pipeline (#5106)
* Initial changes

* Add WebSearch, WebRetriever top_k unit tests

* Add exact integration test that failed Tuana

* PR review
2023-06-09 10:42:37 +02:00
Michael Feil
6ea8ae01a2
feat: Allow setting custom api_base for OpenAI nodes (#5033)
* add changes for api_base

* format retriever

* Update haystack/nodes/retriever/dense.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/nodes/audio/whisper_transcriber.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/preview/components/audio/whisper_remote.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/nodes/answer_generator/openai.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update test_retriever.py

* Update test_whisper_remote.py

* Update test_generator.py

* Update test_retriever.py

* reformat with black

* Update haystack/nodes/prompt/invocation_layer/chatgpt.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Add unit tests

* apply docstring suggestions

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
Co-authored-by: michaelfeil <me@michaelfeil.eu>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
2023-06-05 11:32:06 +02:00
Massimiliano Pippi
c6ea542b57
chore: remove BaseKnowledgeGraph (#4953)
* remove BaseKnowledgeGraph

* fix pylint
2023-05-21 10:42:02 +02:00
Massimiliano Pippi
4974bf7ab3
chore: remove deprecated MilvusDocumentStore (#4951)
* remove deprecated MilvusDocumentStore

* remove leftovers

* fix pylint
2023-05-19 16:37:38 +02:00
bogdankostic
df46e7fadd
fix: Use AutoTokenizer instead of DPR specific tokenizer (#4898)
* Use AutoTokenizer instead of DPR specific tokenizer

* Adapt TableTextRetriever

* Adapt tests

* Adapt tests
2023-05-17 18:54:34 +02:00
ZanSara
1b57b96210
refactor!: extract elasticsearch (#4668)
* extract elasticsearch

* update pyproject.toml

* make more imports optional

* move MockBaseRetriever in conftest

* install es in the es integration tests
2023-04-26 10:14:20 +02:00
Silvano Cerza
5ac3dffbef
test: Rework conftest (#4614)
* Split root conftest into multiple ones and remove unused fixtures

* Remove some constants and make them fixtures

* Remove unnecessary fixture scoping

* Fix failing whisper tests

* Fix image_file_paths fixture
2023-04-11 10:33:43 +02:00
Zoltan Fedor
32091d66cb
Adding filtering support for Weaviate when used for BM25 querying (#4385)
2023-03-29 16:51:22 +02:00
Vladimir Blagojevic
be25655663
feat: Add agent tools (#4437)
* Initial commit, add search_engine

* Add TopPSampler

* Add more TopPSampler unit tests

* Remove SearchEngineSampler (converted to TopPSampler)

* Add some basic WebSearch unit tests

* Rename unit tests

* Add WebRetriever into agent_tools

* Adjust to WebRetriever

* Add WebRetriever mode [snippet|document]

* Minor changes

* SerperDev: add peopleAlsoAsk search results

* First agent for hotpotqa

* Making WebRetriever work on hotpotqa

* refactor: minor WebRetriever improvements (#4377)

* refactor: remove doc ids rebuild + anticipate cache

* refactor: improve caching, fix Document ids

* Minor WebRetriever improvements

* Overlooked minor fixes

* feat: add Bing API as search engine

* refactor: let kwargs pass-through

* feat: increase search context

* check sampler result, improve batch typing

* refactor: increase mypy compliance

* Fix mypy

* Minor example fixes

* Fix the descriptions

* PR feedback updates

* More fixes

* TopPSampler: handle top p None value, add unit test

* Add top_k to WebSearch

* Use boilerpy3 instead of trafilatura

* Remove date finding

* Add more WebRetriever docs

* Refactor long methods

* making the preprocessor optional

* hide WebSearch and make NeuralWebSearch a pipeline

* remove unused imports

* add WebQAPipeline and split example into two

* change example search engine to SerperDev

* Turn off progress bars in WebRetriever's PreProcessor

* Agent tool examples - final updates

* Add webqa test, search results ranking scores

* Better answer box handling for SerperDev and SerpAPI

* Minor fixes

* pylint

* pylint fixes

* extract TopPSampler from WebRetriever

* use sampler only for WebRetriever modes other than snippet

* add web retriever tests

* add web retriever tests

* exclude rdflib@6.3.2 due to license issues

* add test for preprocessed docs and kwargs examples in docstrings

* Move test_webqa_pipeline to test/pipelines

* change docstring for join_documents_and_scores

* Use WebQAPipeline in examples/web_lfqa.py

* Use WebQAPipeline in examples/web_lfqa.py

* Move test_webqa_pipeline to e2e

* Updated lg

* Sampler added automatically in WebQAPipeline, no need to add it

* Updated lg

* Updated lg

* :ignore Update agent tools examples to new templates (#4503)

* Update examples to new templates

* Add print back

* fix linting and black format issues

---------

Co-authored-by: Daniel Bichuetti <daniel.bichuetti@gmail.com>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-03-27 18:14:58 +02:00
Silvano Cerza
5b63c2086e
refactor: Deprecate BaseKnowledgeGraph, GraphDBKnowledgeGraph, InMemoryKnowledgeGraph and Text2SparqlRetriever (#4500)
* Deprecate BaseKnowledgeGraph and InMemoryKnowledgeGraph

* Deprecate GraphDBKnowledgeGraph

* Fix mypy

* Deprecate Text2SparqlRetriever
2023-03-27 15:31:22 +02:00
Silvano Cerza
1b5df55dbb
Skip flaky test (#4444)
2023-03-16 16:32:28 +01:00
Daniel Bichuetti
1548c5ba0f
feat: Add Azure OpenAI embeddings support (#4332)
* feat: add Azure OpenAI as embedding option

* feat: Add Azure OpenAI embeddings support

* refactor: check api key

* refactor: better type checking for Azure

* refactor: enable parallelism + separate and update tests

* refactor: string reformat

* refactor: explicit typing

* refactor: update refs and remove unused code
2023-03-06 13:37:20 +01:00
Jack Butler
e6b6f70ae2
fix: Fix TableTextRetriever for input consisting of tables only (#4048)
* fix: update kwargs for TriAdaptiveModel

* fix: squeeze batch for TTR inference

* test: add test for ttr + dataframe case

* test: update and reorganise ttr tests

* refactor: make triadaptive model handle shapes

* refactor: remove duplicate reshaping

* refactor: rename test with duplicate name

* fix: add device assignment back to TTR

* fix: remove duplicated vars in test

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-02-09 11:38:16 +01:00
Sebastian
1bbf10a376
Remove double batching in retrieve_batch (#4014)
* Removed double batching around embed_queries

* Add back tests for retrieve_batch for dpr and embedding retrievers

* Updated table-text-retriever to not double batch

* Fixing pylint

* Update to test

* Remove code breaking test

* Updating dev comment to be clearer
2023-02-08 14:39:20 +01:00
hsm207
08ec059b14
refactor: use weaviate client to build BM25 query (#3939)
* refactor: use weaviate client to build BM25 query

* refactor: remove manual BM25 query building

* refactor: apply BM25 to the content_field only

* test: update weaviate BM25 retrieval test case

update to account for lack of stemming

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-01-30 10:07:07 +01:00
tstadel
4a0a054164
fix: linefeeds in custom_query (#3813)
* fix linefeeds in custom_query

* add double quote test case
2023-01-05 17:13:04 +01:00
Julian Risch
0c2d13f1b8
bug: skip validating empty embeddings (#3774)
* skip validating empty embeddings

* skip batches without embeddings to update

* add unit test with mocked retriever
2023-01-05 15:13:57 +01:00
Zoltan Fedor
e143f7cc36
Fixing broken BM25 support with Weaviate - fixes #3720 (#3723)
* Fixing broken BM25 support with Weaviate - fixes #3720

Unfortunately the BM25 support with Weaviate got broken with Haystack v1.11.0+, which is getting fixed with this commit.

Please see more under issue #3720.

* Fixing mypy issue - method signature wasn't matching the base class

* Mypy related test fix

Mypy forced me to set the signature of the `query` method of the Weaviate document store to the same as its parent, the `KeywordDocumentStore`, where the `query` param is `Optional`, but has NO default value, so it must be provided (as None) at runtime.
I am not quite sure why the abstract method's `query` param was set without a default value while its type is `Optional`, but I didn't want to change that, so instead I have changed the Weaviate tests.

* Adding a note regarding an upcoming fix in Weaviate v1.17.0

* Apply suggestions from code review

* revert

* [EMPTY] Re-trigger CI
2022-12-19 17:24:46 +01:00
Vladimir Blagojevic
56803e5465
feat: Enable text-embedding-ada-002 for EmbeddingRetriever (#3721)
* Enable text-embedding-ada-002 for EmbeddingRetriever

* Easier to understand code, more unit tests
2022-12-19 17:06:48 +01:00
Stefano Fiorucci
5b9c661155
feat: add index parameter to TfidfRetriever (#3666)
* first draft to add index param to tfidf

* better mypy handling

* Revert "better mypy handling"

This reverts commit 91a22516320f9dcbeae53827ec69f9dc51e1785c.

* new check in auto_fit

* new check also in retrieve

* better dict typings

* new test and improvements to other test

* remove unnecessary lambda

* improve test

* remove newline from openapi json

* fix test

* language fix

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 2

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 3

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 4

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 5

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* language fix 6

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* explicit index value handling

* fix test

* better error messages

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-12-19 12:07:49 +01:00
tstadel
600dc2d611
refactor: filters type (#3682)
* consolidate filters type

* remove unnecessary optionals

* fix mypy

* fix pylint

* fix pylint

* move FilterType to schema

* remove Optional from FilterType

* move to Dict[str, Any]

* Revert "move to Dict[str, Any]"

This reverts commit e8c561bb7885949e19825697fa4c469945f90ce5.

* fix mypy

* fix pylint

* revert isort changes in elasticsearch

* remove todos in milvus.py

* remove todos in sql.py

* add aggregate_labels tests

* consolidate aggregate_labels tests

* remove superfluous type todos

* remove ALL superfluous #todos
2022-12-12 14:04:29 +01:00
Unai Garay Maestre
77cea8b140
feat: Adds all_terms_must_match parameter to BM25Retriever at runtime (#3627)
* Adds all_terms_must_match implementation and tests

* Adds all_terms_must_match as Optional

Signed-off-by: Unai Garay <unaigaraymaestre@gmail.com>

* Avoid mypy error and follow pattern checking var is None

* Mypy works ok on this file now

* added mypy ignores to BaseRetriever

* ignoring all overrides for this file

* Updates sparse retriever `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retriever `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retriever `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retrieve_batch `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retrieve_batch `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updates sparse retrieve_batch `all_terms_must_match` docstring

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* marked elasticsearch

Signed-off-by: Unai Garay <unaigaraymaestre@gmail.com>
Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-12-08 17:18:43 +05:30
tstadel
c1c1c97bb2
feat: add query_by_embedding_batch (#3546)
* add query_by_embedding_batch

* fix mypy

* fix pylint

* add test

* move query_by_embedding_batch to search_engine

* fix and add tests

* fix pylint

* remove Retriever query logs

* add test for multimodal batch retrieval

* allow for np.ndarray
2022-12-08 08:28:43 +01:00
Mayank Jobanputra
95cf666a20
refactor: change MultiModal retriever to be of type DenseRetriever (#3598)
* changed Multimodal retriever to be of type DenseRetriever

* format fix

* Pylint fix

* Added embed_queries and tests
2022-11-28 19:24:22 +01:00
Stefano Fiorucci
3040e59c63
feat: add support for BM25Retriever in InMemoryDocumentStore (#3561)
* very first draft

* implement query and query_batch

* add more bm25 parameters

* add rank_bm25 dependency

* fix mypy

* remove tokenizer callable parameter

* remove unused import

* only json serializable attributes

* try to fix: pylint too-many-public-methods / R0904

* bm25 attribute always present

* convert errors into warnings to make tutorial 1 work

* add docstrings; tests

* try to make tests run

* better docstrings; revert not running tests

* some suggestions from review

* rename elasticsearch retriever as bm25 in tests; try to test memory_bm25

* exclude tests with filters

* change elasticsearch to bm25 retriever in test_summarizer

* add tests

* try to improve tests

* better type hint

* adapt test_table_text_retriever_embedding

* handle non-textual docs

* query only textual documents
2022-11-22 09:24:52 +01:00
Massimiliano Pippi
6a48ace9b9
BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x (#3552)
* remove milvus1

* leftover

* revert deprecation process
2022-11-15 09:54:55 +01:00
Mayank Jobanputra
794fe5ffa4
bug: didn't clean up model files after running pytest for test_table_text_retriever_training (#3534)
* Added tmp path to avoid clean up of model files later
2022-11-07 15:07:04 +05:30
Vladimir Blagojevic
5ca96357ff
feat: Add CohereEmbeddingEncoder to EmbeddingRetriever (#3453)
2022-10-25 17:52:29 +02:00
Unai Garay Maestre
3a2c8ae3c5
bug: Adds better way of checking query in BaseRetriever and Pipeline.run() (#3304)
* changes how query and queries are checked if they have been passed in BaseRetriever

* Fixes checking query properly in Pipeline run

* Fixes checking query properly in Pipeline run

* Adds test for FilterRetriever using run method when query is empty

* Adds mock filter retriever and adapts test

* Removes old test, adds MockRetriever to test file and test uses document_store

* Logs error when query is not of type string with a new test for run batch

* Update test/nodes/test_retriever.py

* schemas
2022-10-17 19:00:13 +02:00
Sara Zan
101d2bc86c
feat: MultiModalRetriever (#2891)
* Adding Data2VecVision and Data2VecText to the supported models and adapt Tokenizers accordingly

* content_types

* Splitting classes into respective folders

* small changes

* Fix EOF

* eof

* black

* API

* EOF

* whitespace

* api

* improve multimodal similarity processor

* tokenizer -> feature extractor

* Making feature vectors come out of the feature extractor in the similarity head

* embed_queries is now self-sufficient

* couple trivial errors

* Implemented separate language model classes for multimodal inference

* Document embedding seems to work

* removing batch_encode_plus, is deprecated anyway

* Realized the base Data2Vec models are not trained on retrieval tasks

* Issue with the generated embeddings

* Add batching

* Try to fit CLIP in

* Stub of CLIP integration

* Retrieval goes through but returns noise only

* Still working on the scores

* Introduce temporary adapter for CLIP models

* Image retrieval now works with sentence-transformers

* Tidying up the code

* Refactoring is now functional

* Add MPNet to the supported sentence transformers models

* Remove unused classes

* pylint

* docs

* docs

* Remove the method renaming

* mypy first pass

* docs

* tutorial

* schema

* mypy

* Move devices setup into get_model

* more mypy

* mypy

* pylint

* Move a few params in HaystackModel's init

* make feature extractor work with squadprocessor

* fix feature_extractor_kwargs forwarding

* Forgotten part of the fix

* Revert unrelated ES change

* Revert unrelated memdocstore changes

* comment

* Small corrections

* mypy and pylint

* mypy

* typo

* mypy

* Refactor the  call

* mypy

* Do not make FARMReader use the new FeatureExtractor

* mypy

* Detach DPR tests from FeatureExtractor too

* Detach processor tests too

* Add end2end marker

* extract end2end feature extractor tests

* temporary disable feature extraction tests

* Introduce end2end tests for tokenizer tests

* pylint

* Fix model loading from folder in FeatureExtractor

* working on end2end

* end2end keeps failing

* Restructuring retriever tests

* Restructuring retriever tests

* remove covert_dataset_to_dataloader

* remove comment

* Better check sentence-transformers models

* Use embed_meta_fields properly

* rename passage into document

* Embedding dims can't be found

* Add check for models that support it

* pylint

* Split all retriever tests into suites, running mostly on InMemory only

* fix mypy

* fix tfidf test

* fix weaviate tests

* Parallelize on every docstore

* Fix schema and specify modality in base retriever suite

* tests

* Add first image tests

* remove comment

* Revert to simpler tests

* Update docs/_src/api/api/primitives.md

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/__init__.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* get_args

* mypy

* Update haystack/modeling/model/multimodal/__init__.py

* Update haystack/modeling/model/multimodal/base.py

* Update haystack/modeling/model/multimodal/base.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/sentence_transformers.py

* Update haystack/modeling/model/multimodal/sentence_transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/retriever/multimodal/retriever.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* mypy

* mypy

* removing more ContentTypes

* more ContentTypes

* pylint

* add to __init__

* revert end2end workflow for now

* missing integration markers

* Update haystack/nodes/retriever/multimodal/embedder.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* review feedback, removing HaystackImageTransformerModel

* review feedback part 2

* mypy & pylint

* mypy

* mypy

* fix multimodal docs also for Pinecone

* add note on internal constants

* Fix pinecone write_documents

* schemas

* keep support for sentence-transformers only

* fix pinecone test

* schemas

* fix pinecone again

* temporarily disable some tests, need to understand if they're still relevant

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-10-17 18:58:35 +02:00
Vladimir Blagojevic
159cd5a666
feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever (#3356)
2022-10-14 15:01:03 +02:00
Vladimir Blagojevic
9582a423a2
fix: ONNX FARMReader model conversion is broken (#3211)
2022-09-26 09:18:12 -04:00
tstadel
4fa9d2d8e7
Fix milvus and faiss tests not running (#3263)
* fix milvus and faiss tests not running

* fix schema manually

* fix test_dpr_embedding test for milvus

* pip freeze on milvus tests

* fix milvus1 tests being executed: fix all_doc_stores order

* Revert "pip freeze on milvus tests"

This reverts commit 75ebb6f7e507bb8477e87d9e63b4a294f7946cab.

* make infer_required_doc_store more robust

* don't skip tests without docstore requirements

* use markers for docstore tests
2022-09-22 17:46:49 +02:00
Sara Zan
4e45062a00
Simplify language_modeling.py and tokenization.py (#2703)
* Simplification of language_model.py and tokenization.py to remove code duplication

Co-authored-by: vblagoje <dovlex@gmail.com>
2022-07-22 16:29:30 +02:00
Patrick Deutschmann
1db3fd0942
Add support for Multi-Hop Dense Retrieval (#2571)
* Implement MDR

* Adapt conftest to new MDR signature

* Update Documentation & Code Style

* Change signature of queries param in batch methods of MDR like in #2575

* Update Documentation & Code Style

* Rename MultihopDenseRetriever to MultihopEmbeddingRetriever

* Fix filters in retrieve_batch

* Add docstring for MultihopEmbeddingRetriever.__init__

* Update Documentation & Code Style

* Revert forward signature of TextSimilarityHead

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-05 11:31:11 +02:00
bogdankostic
dc48c444d4
Fix loading of tokenizers in DPR (#2755)
2022-07-04 18:18:14 +02:00
Aleksander Smywiński-Pohl
642229255f
Use AutoTokenizer by default, to easily adapt to new models and token… (#1902)
* Use AutoTokenizer by default, to easily adapt to new models and tokenizers

* Add missing AutoTokenizer import

* Apply Black

* Missing import

* Fix DPR tests

* Remove tests on max length

* Update Documentation & Code Style

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 13:13:48 +02:00
Sara Zan
54518ac790
[CI Refactoring] Refactor Document fixtures in tests (#2577)
* Refactor document fixtures

* Add embedding files

* Update Documentation & Code Style

* Indentation issue

* Update Documentation & Code Style

* Fix type conversion in conftest.py

* Update Documentation & Code Style

* mypy on sql.py

* mypy on crawler.py

* mypy on pinecone.py

* Adapt retriever tests

* Update Documentation & Code Style

* mypy on crawler.py

* Update Documentation & Code Style

* mypy on crawler.py again

* Update Documentation & Code Style

* mypy fix was too rough

* Fix some more tests

* Update Documentation & Code Style

* Skip meaningless test on FilterRetriever

* Make embedding values less specific

* Update Documentation & Code Style

* Use stable IDs in retriever tests that depend on it

* Remove needless fixtures

* docs_with_ids

* Update Documentation & Code Style

* Typo

* Fix retriever tests

* Fix reader tests

* Update Documentation & Code Style

* Workaround #2626

* Update Documentation & Code Style

* Fix label generator tests

* Reorder vectors

* remove print

* Update Documentation & Code Style

* Update Documentation & Code Style

* git tags leftover

* Update Documentation & Code Style

* fix last failing test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-10 18:22:48 +02:00
Sara Zan
59608ca474
[CI Refactoring] Workflow refactoring (#2576)
* Unify CI tests (from #2466)

* Update Documentation & Code Style

* Change folder names

* Fix markers list

* Remove marker 'slow', replaced with 'integration'

* Soften children check

* Start ES first so it has time to boot while Python is setup

* Run the full workflow

* Try to make pip upgrade on Windows

* Set KG tests as integration

* Update Documentation & Code Style

* typo

* faster pylint

* Make Pylint use the cache

* filter diff files for pylint

* debug pylint statement

* revert pylint changes

* Remove path from asserted log (fails on Windows)

* Skip preprocessor test on Windows

* Tackling Windows specific failures

* Fix pytest command for windows suites

* Remove \ from command

* Move poppler test into integration

* Skip opensearch test on windows

* Add tolerance in reader sas score for Windows

* Another pytorch approx

* Raise time limit for unit tests :(

* Skip poppler test on Windows CI

* Specify to pull with FF only in docs check

* temporarily run the docs check immediately

* Allow merge commit for now

* Try without fetch depth

* Accelerating test

* Accelerating test

* Add repository and ref alongside fetch-depth

* Separate out code&docs check from tests

* Use setup-python cache

* Delete custom action

* Remove the pull step in the docs check, will find a way to run on bot commits

* Add requirements.txt in .github for caching

* Actually install dependencies

* Change deps group for pylint

* Unclear why the requirements.txt is still required :/

* Fix the code check python setup

* Install all deps for pylint

* Make the autoformat check depend on tests and doc updates workflows

* Try installing dependencies in another order

* Try again to install the deps

* quoting the paths

* Add back the requirements

* Try again to install rest_api and ui

* Change deps group

* Duplicate haystack install line

* See if the cache is the problem

* Disable also in mypy, who knows

* split the install step

* Split install step everywhere

* Revert "Separate out code&docs check from tests"

This reverts commit 1cd59b15ffc5b984e1d642dcbf4c8ccc2bb6c9bd.

* Add back the action

* Proactive support for audio (see text2speech branch)

* Fix label generator tests

* Remove install of libsndfile1 on win temporarily

* exclude audio tests on win

* install ffmpeg for integration tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-07 09:23:03 +02:00
bogdankostic
61d9429c25
Simplify loading of EmbeddingRetriever (#2619)
* Infer model format for EmbeddingRetriever automatically

* Update Documentation & Code Style

* Adapt conftest to automatic inference of model_format

* Update Documentation & Code Style

* Fix tests

* Update Documentation & Code Style

* Fix tests

* Adapt tutorials

* Update Documentation & Code Style

* Add test for similarity scores with sentence transformers

* Adapt doc string and warning message

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-02 15:05:29 +02:00
bogdankostic
867695ad0c
Change signature of queries param in batch methods (#2575)
* Change signature of queries param in batch methods

* Update Documentation & Code Style

* Fix mypy

* Remove unused import

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-05-24 12:33:45 +02:00
Sara Zan
ff4303c51b
[CI refactoring] Categorize tests into folders (#2554)
* Categorize tests into folders

* Fix linux_ci.yml and an import

* Wrong path
2022-05-17 09:55:53 +01:00