haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-28 03:12:54 +00:00

Author	SHA1	Message	Date
Daniel Bichuetti	1548c5ba0f	feat: Add Azure OpenAI embeddings support (#4332) * feate: add Azure OpenAI as embedding option * feat: Add Azure OpenAI embeddings support * refactor: check api key * refactor: better type checking for Azure * refactor: enable parallelism + separate and update tests * refactor: string reformat * refactor: explicit typing * refactor: update refs and remove unused code	2023-03-06 13:37:20 +01:00
Jack Butler	e6b6f70ae2	fix: Fix `TableTextRetriever` for input consisting of tables only (#4048) * fix: update kwargs for TriAdaptiveModel * fix: squeeze batch for TTR inference * test: add test for ttr + dataframe case * test: update and reorganise ttr tests * refactor: make triadaptive model handle shapes * refactor: remove duplicate reshaping * refactor: rename test with duplicate name * fix: add device assignment back to TTR * fix: remove duplicated vars in test --------- Co-authored-by: bogdankostic <bogdankostic@web.de>	2023-02-09 11:38:16 +01:00
Sebastian	1bbf10a376	Remove double batching in retrieve_batch (#4014) * Removed double batching around embed_queries * Add back tests for retrieve_batch for dpr and embedding retrievers * Updated table-text-retriever to not double batch * Fixing pylint * Update to test * Remove code breaking test * Updating dev comment to be clearer	2023-02-08 14:39:20 +01:00
hsm207	08ec059b14	refactor: use weaviate client to build BM25 query (#3939) * refactor: use weaviate client to build BM25 query * refactor: remove manual BM25 query building * refactor: apply BM25 to the content_field only * test: update weaviate BM25 retrieval test case update to account for lack of stemming --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-01-30 10:07:07 +01:00
tstadel	4a0a054164	fix: linefeeds in custom_query (#3813) * fix linefeeds in custom_query * add double quote test case	2023-01-05 17:13:04 +01:00
Julian Risch	0c2d13f1b8	bug: skip validating empty embeddings (#3774) * skip validating empty embeddings * skip batches without embeddings to update * add unit test with mocked retriever	2023-01-05 15:13:57 +01:00
Zoltan Fedor	e143f7cc36	Fixing broken BM25 support with Weaviate - fixes #3720 (#3723) * Fixing broken BM25 support with Weaviate - fixes #3720 Unfortunately the BM25 support with Weaviate got broken with Haystack v1.11.0+, which is getting fixed with this commit. Please see more under issue #3720. * Fixing mypy issue - method signature wasn't matching the base class * Mypy related test fix Mypy forced me to set the signature of the `query` method of the Weaviate document store to the same as its parent, the `KeywordDocumentStore`, where the `query` parame is `Optional`, but has NO default value, so it must be provided (as None) at runtime. I am not quite sure why the abstract method's `query` param was set without a default value while its type is `Optional`, but I didn't want to change that, so instead I have changed the Weaviate tests. * Adding a note regarding an upcomming fix in Weaviate v1.17.0 * Apply suggestions from code review * revert * [EMPTY] Re-trigger CI	2022-12-19 17:24:46 +01:00
Vladimir Blagojevic	56803e5465	feat: Enable text-embedding-ada-002 for EmbeddingRetriever (#3721) * Enable text-embedding-ada-002 for EmbeddingRetriever * Easier to understand code, more unit tests	2022-12-19 17:06:48 +01:00
Stefano Fiorucci	5b9c661155	feat: add `index` parameter to `TfidfRetriever` (#3666) * first draft to add index param to tfidf * better mypy handling * Revert "better mypy handling" This reverts commit 91a22516320f9dcbeae53827ec69f9dc51e1785c. * new check in auto_fit * new check also in retrieve * better dict typings * new test and improvements to other test * remove unnecessary lambda * improve test * remove newline from openapi json * fix test * language fix Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * language fix 2 Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * language fix 3 Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * language fix 4 Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * language fix 5 Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * language fix 6 Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * explicit index value handling * fix test * better error messages Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-12-19 12:07:49 +01:00
tstadel	600dc2d611	refactor: filters type (#3682) * consolidate filters type * remove unnecessary optionals * fix mypy * fix pylint * fix pylint * move FilterType to schema * remove Optional from FilterType * move to Dict[str, Any] * Revert "move to Dict[str, Any]" This reverts commit e8c561bb7885949e19825697fa4c469945f90ce5. * fix mypy * fix pylint * revert isort changes in elasticsearch * remove todos in milvus.py * remove todos in sql.py * add aggregate_labels tests * consolidate aggregate_labels tests * remove superfluous type todos * remove ALL superfluous #todos	2022-12-12 14:04:29 +01:00
Unai Garay Maestre	77cea8b140	feat: Adds all_terms_must_match parameter to BM25Retriever at runtime (#3627) * Adds all_terms_must_match implementation and tests * Adds all_terms_must_match as Optional Signed-off-by: Unai Garay <unaigaraymaestre@gmail.com> * Avoid mypy error and follow pattern checking var is None * Mypy works ok on this file now * added mypy ignores to BaseRetriever * ignoring all overrides for this file * Updates sparse retriever `all_terms_must_match` docstring Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Updates sparse retriever `all_terms_must_match` docstring Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Updates sparse retriever `all_terms_must_match` docstring Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Updates sparse retrieve_batch `all_terms_must_match` docstring Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Updates sparse retrieve_batch `all_terms_must_match` docstring Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Updates sparse retrieve_batch `all_terms_must_match` docstring Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * marked elasticsearch Signed-off-by: Unai Garay <unaigaraymaestre@gmail.com> Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-12-08 17:18:43 +05:30
tstadel	c1c1c97bb2	feat: add query_by_embedding_batch (#3546) * add query_by_embedding_batch * fix mypy * fix pylint * add test * move query_by_embedding_batch to search_engine * fix and add tests * fix pylint * remove Retriever query logs * add test for multimodal batch retrieval * allow for np.ndarray	2022-12-08 08:28:43 +01:00
Mayank Jobanputra	95cf666a20	refactor: change MultiModal retriever to be of type DenseRetriever (#3598) * changed Multimodal retriever to be of type DenseRetriever * format fix * Pylint fix * Added embed_queries and tests	2022-11-28 19:24:22 +01:00
Stefano Fiorucci	3040e59c63	feat: add support for `BM25Retriever` in `InMemoryDocumentStore` (#3561) * very first draft * implement query and query_batch * add more bm25 parameters * add rank_bm25 dependency * fix mypy * remove tokenizer callable parameter * remove unused import * only json serializable attributes * try to fix: pylint too-many-public-methods / R0904 * bm25 attribute always present * convert errors into warnings to make the tutorial 1 work * add docstrings; tests * try to make tests run * better docstrings; revert not running tests * some suggestions from review * rename elasticsearch retriever as bm25 in tests; try to test memory_bm25 * exclude tests with filters * change elasticsearch to bm25 retriever in test_summarizer * add tests * try to improve tests * better type hint * adapt test_table_text_retriever_embedding * handle non-textual docs * query only textual documents	2022-11-22 09:24:52 +01:00
Massimiliano Pippi	6a48ace9b9	BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x (#3552) * remove milvus1 * leftover * revert deprecation process	2022-11-15 09:54:55 +01:00
Mayank Jobanputra	794fe5ffa4	bug: didn't clean up model files after running pytest for test_table_text_retriever_training (#3534) * Added tmp path to avoid clean up of model files later	2022-11-07 15:07:04 +05:30
Vladimir Blagojevic	5ca96357ff	feat: Add CohereEmbeddingEncoder to EmbeddingRetriever (#3453)	2022-10-25 17:52:29 +02:00
Unai Garay Maestre	3a2c8ae3c5	bug: Adds better way of checking `query` in BaseRetriever and Pipeline.run() (#3304) * changes how query and queries are checked if they have been passed in BaseRetriever * Fixes checking query properly in Pipeline run * Fixes checking query properly in Pipeline run * Adds test for FilterRetriever using run method when query is empty * Adds mock filter retriever and adapts test * Removes old test, adds MockRetriever to test file and test uses document_store * Logs error when query is not of type string with a new test for run batch * Update test/nodes/test_retriever.py * schemas	2022-10-17 19:00:13 +02:00
Sara Zan	101d2bc86c	feat: `MultiModalRetriever` (#2891) * Adding Data2VecVision and Data2VecText to the supported models and adapt Tokenizers accordingly * content_types * Splitting classes into respective folders * small changes * Fix EOF * eof * black * API * EOF * whitespace * api * improve multimodal similarity processor * tokenizer -> feature extractor * Making feature vectors come out of the feature extractor in the similarity head * embed_queries is now self-sufficient * couple trivial errors * Implemented separate language model classes for multimodal inference * Document embedding seems to work * removing batch_encode_plus, is deprecated anyway * Realized the base Data2Vec models are not trained on retrieval tasks * Issue with the generated embeddings * Add batching * Try to fit CLIP in * Stub of CLIP integration * Retrieval goes through but returns noise only * Still working on the scores * Introduce temporary adapter for CLIP models * Image retrieval now works with sentence-transformers * Tidying up the code * Refactoring is now functional * Add MPNet to the supported sentence transformers models * Remove unused classes * pylint * docs * docs * Remove the method renaming * mpyp first pass * docs * tutorial * schema * mypy * Move devices setup into get_model * more mypy * mypy * pylint * Move a few params in HaystackModel's init * make feature extractor work with squadprocessor * fix feature_extractor_kwargs forwarding * Forgotten part of the fix * Revert unrelated ES change * Revert unrelated memdocstore changes * comment * Small corrections * mypy and pylint * mypy * typo * mypy * Refactor the call * mypy * Do not make FARMReader use the new FeatureExtractor * mypy * Detach DPR tests from FeatureExtractor too * Detach processor tests too * Add end2end marker * extract end2end feature extractor tests * temporary disable feature extraction tests * Introduce end2end tests for tokenizer tests * pylint * Fix model loading from folder in FeatureExtractor * working o n end2end * end2end keeps failing * Restructuring retriever tests * Restructuring retriever tests * remove covert_dataset_to_dataloader * remove comment * Better check sentence-transformers models * Use embed_meta_fields properly * rename passage into document * Embedding dims can't be found * Add check for models that support it * pylint * Split all retriever tests into suites, running mostly on InMemory only * fix mypy * fix tfidf test * fix weaviate tests * Parallelize on every docstore * Fix schema and specify modality in base retriever suite * tests * Add first image tests * remove comment * Revert to simpler tests * Update docs/_src/api/api/primitives.md Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/__init__.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * get_args * mypy * Update haystack/modeling/model/multimodal/__init__.py * Update haystack/modeling/model/multimodal/base.py * Update haystack/modeling/model/multimodal/base.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/sentence_transformers.py * Update haystack/modeling/model/multimodal/sentence_transformers.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/transformers.py * Update haystack/modeling/model/multimodal/transformers.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/modeling/model/multimodal/transformers.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/nodes/retriever/multimodal/retriever.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * mypy * mypy * removing more ContentTypes * more contentypes * pylint * add to __init__ * revert end2end workflow for now * missing integration markers * Update haystack/nodes/retriever/multimodal/embedder.py Co-authored-by: bogdankostic <bogdankostic@web.de> * review feedback, removing HaystackImageTransformerModel * review feedback part 2 * mypy & pylint * mypy * mypy * fix multimodal docs also for Pinecone * add note on internal constants * Fix pinecone write_documents * schemas * keep support for sentence-transformers only * fix pinecone test * schemas * fix pinecone again * temporarily disable some tests, need to understand if they're still relevant Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> Co-authored-by: bogdankostic <bogdankostic@web.de>	2022-10-17 18:58:35 +02:00
Vladimir Blagojevic	159cd5a666	feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever (#3356)	2022-10-14 15:01:03 +02:00
Vladimir Blagojevic	9582a423a2	fix: ONNX FARMReader model conversion is broken (#3211)	2022-09-26 09:18:12 -04:00
tstadel	4fa9d2d8e7	Fix milvus and faiss tests not running (#3263) * fix milvus and faiss tests not running * fix schema manually * fix test_dpr_embedding test for milvus * pip freeze on milvus tests * fix milvus1 tests being executed: fix all_doc_stores order * Revert "pip freeze on milvus tests" This reverts commit 75ebb6f7e507bb8477e87d9e63b4a294f7946cab. * make infer_required_doc_store more robust * don't skip tests without docstore requirements * use markers for docstore tests	2022-09-22 17:46:49 +02:00
Sara Zan	4e45062a00	Simplify `language_modeling.py` and `tokenization.py` (#2703) * Simplification of language_model.py and tokenization.py to remove code duplication Co-authored-by: vblagoje <dovlex@gmail.com>	2022-07-22 16:29:30 +02:00
Patrick Deutschmann	1db3fd0942	Add support for Multi-Hop Dense Retrieval (#2571) * Implement MDR * Adapt conftest to new MDR signature * Update Documentation & Code Style * Change signature of queries param in batch methods of MDR like in #2575 * Update Documentation & Code Style * Rename MultihopDenseRetriever to MultihopEmbeddingRetriever * Fix filters in retrieve_batch * Add docstring for MultihopEmbeddingRetriever.__init__ * Update Documentation & Code Style * Revert forward signature of TextSimilarityHead Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-05 11:31:11 +02:00
bogdankostic	dc48c444d4	Fix loading of tokenizers in DPR (#2755)	2022-07-04 18:18:14 +02:00
Aleksander Smywiński-Pohl	642229255f	Use AutoTokenizer by default, to easily adapt to new models and token… (#1902) * Use AutoTokenizer by default, to easily adapt to new models and tokenizers * Add missing AutoTokenizer import * Apply Black * Missing import * Fix DPR tests * Remove tests on max length * Update Documentation & Code Style Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-15 13:13:48 +02:00
Sara Zan	54518ac790	[CI Refactoring] Refactor `Document` fixtures in tests (#2577) * Refactor document fixtures * Add embedding files * Update Documentation & Code Style * Indentation issue * Update Documentation & Code Style * Fix type conversion in conftest.py * Update Documentation & Code Style * mypy on sql.py * mypy on crawler.py * mypy on pinecone.py * Adapt retriever tests * Update Documentation & Code Style * mypy on crawler.py * Update Documentation & Code Style * mypy on crawler.py again * Update Documentation & Code Style * mypy fix was too rough * Fix some more tests * Update Documentation & Code Style * Skip meaningless test on FilterRetriever * Make embedding values less specific * Update Documentation & Code Style * Use stable IDs in retriever tests that depend on it * Remove needless fixtures * docs_with_ids * Update Documentation & Code Style * Typo * Fix retriever tests * Fix reader tests * Update Documentation & Code Style * Workaround #2626 * Update Documentation & Code Style * Fix label generator tests * Reorder vectors * remove print * Update Documentation & Code Style * Update Documentation & Code Style * git tags leftover * Update Documentation & Code Style * fix last failing test Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-10 18:22:48 +02:00
Sara Zan	59608ca474	[CI Refactoring] Workflow refactoring (#2576) * Unify CI tests (from #2466) * Update Documentation & Code Style * Change folder names * Fix markers list * Remove marker 'slow', replaced with 'integration' * Soften children check * Start ES first so it has time to boot while Python is setup * Run the full workflow * Try to make pip upgrade on Windows * Set KG tests as integration * Update Documentation & Code Style * typo * faster pylint * Make Pylint use the cache * filter diff files for pylint * debug pylint statement * revert pylint changes * Remove path from asserted log (fails on Windows) * Skip preprocessor test on Windows * Tackling Windows specific failures * Fix pytest command for windows suites * Remove \ from command * Move poppler test into integration * Skip opensearch test on windows * Add tolerance in reader sas score for Windows * Another pytorch approx * Raise time limit for unit tests :( * Skip poppler test on Windows CI * Specify to pull with FF only in docs check * temporarily run the docs check immediately * Allow merge commit for now * Try without fetch depth * Accelerating test * Accelerating test * Add repository and ref alongside fetch-depth * Separate out code&docs check from tests * Use setup-python cache * Delete custom action * Remove the pull step in the docs check, will find a way to run on bot commits * Add requirements.txt in .github for caching * Actually install dependencies * Change deps group for pylint * Unclear why the requirements.txt is still required :/ * Fix the code check python setup * Install all deps for pylint * Make the autoformat check depend on tests and doc updates workflows * Try installing dependencies in another order * Try again to install the deps * quoting the paths * Ad back the requirements * Try again to install rest_api and ui * Change deps group * Duplicate haystack install line * See if the cache is the problem * Disable also in mypy, who knows * split the install step * Split install step everywhere * Revert "Separate out code&docs check from tests" This reverts commit 1cd59b15ffc5b984e1d642dcbf4c8ccc2bb6c9bd. * Add back the action * Proactive support for audio (see text2speech branch) * Fix label generator tests * Remove install of libsndfile1 on win temporarily * exclude audio tests on win * install ffmpeg for integration tests Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-07 09:23:03 +02:00
bogdankostic	61d9429c25	Simplify loading of `EmbeddingRetriever` (#2619) * Infer model format for EmbeddingRetriever automatically * Update Documentation & Code Style * Adapt conftest to automatic inference of model_format * Update Documentation & Code Style * Fix tests * Update Documentation & Code Style * Fix tests * Adapt tutorials * Update Documentation & Code Style * Add test for similarity scores with sentence transformers * Adapt doc string and warning message * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-02 15:05:29 +02:00
bogdankostic	867695ad0c	Change signature of queries param in batch methods (#2575) * Change signature of queries param in batch methods * Update Documentation & Code Style * Fix mypy * Remove unused import * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-05-24 12:33:45 +02:00
Sara Zan	ff4303c51b	[CI refactoring] Categorize tests into folders (#2554) * Categorize tests into folders * Fix linux_ci.yml and an import * Wrong path	2022-05-17 09:55:53 +01:00

31 Commits