haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-28 03:12:54 +00:00

Author	SHA1	Message	Date
tstadel	db4d6f43ba	Add tests on MultiLabel's meta and filter aggregation (#2169)	2022-02-11 17:42:47 +01:00
tstadel	1e3edef803	List all pipeline(_configs) on Deepset Cloud (#2102 ) * add list_pipelines_on_deepset_cloud() * Apply Black * refactor auto paging and throw DeepsetCloudErrors * Apply Black * fix mypy findings * Update documentation * Fix merge error on pipelines.md * Update Documentation & Code Style Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-02-08 20:35:25 +01:00
bogdankostic	f062911040	Extend metadata filtering support in `ElasticsearchDocumentStore` (#2108 ) * Add extended filtering to ESDocumentStore * Add Docstrings * Fix definition of filter queries * Fix mypy * Add tests * Add latest docstring and tutorial changes * Adapt Docstrings * Adapt tests to added test_docs * Adapt tests to added test_docs * Adapt tests to added test_docs * Adapt tests to added test_docs * Add filtering utils for same representation in all doc stores * Apply balck formatting * Update documentation * Fix mypy * Apply Black * Fix mypy * Adopt Doc Strings * Add more tests * Apply Black * Allow filtering in OpenSearchDocStore * Update documentation * Adapt Docstrings * Update documentation Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-02-04 13:43:12 +01:00
Sara Zan	a59bca3661	Apply black formatting (#2115 ) * Testing black on ui/ * Applying black on docstores * Add latest docstring and tutorial changes * Create a single GH action for Black and docs to reduce commit noise to the minimum, slightly refactor the OpenAPI action too * Remove comments * Relax constraints on pydoc-markdown * Split temporary black from the docs. Pydoc-markdown was obsolete and needs a separate PR to upgrade * Fix a couple of bugs * Add a type: ignore that was missing somehow * Give path to black * Apply Black * Apply Black * Relocate a couple of type: ignore * Update documentation * Make Linux CI run after applying Black * Triggering Black * Apply Black * Remove dependency, does not work well * Remove manually double trailing commas * Update documentation Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-02-03 13:43:18 +01:00
mathislucka	88771b2bee	Provide option to recreate es doc store on initialization (#2084) * provide option to recreate es doc store on initialization * Add latest docstring and tutorial changes * Label expects more arguments * Label expects also an answer Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>	2022-02-02 11:03:15 +01:00
Sowmiya Jaganathan	7d769d8bf1	Fixed the Search Field mapping in ElasticSearch DocumentStore (#2080) * Review changes * Added the synonym analyser for search fields * Added the review requests. * Added the synonyms the OpenSearchDocumentStore and review requests.	2022-01-31 11:11:20 +01:00
Kristof Herrmann	7764b6992c	DC SDK - load pipeline from deepset cloud (#2013 ) * initial load_from_dc * typo * adjusted api endpoint * removed kwargs * added _load_from_dict * refactor pipeline loading mechanism * renaming load_from_dc api * renaming * fixed errors * fix comments and environment variable overrides * Add latest docstring and tutorial changes * fix outdated YAML examples * Add latest docstring and tutorial changes * Introduce readonly DCDocumentStore (without labels support) (#1991) * minimal DCDocumentStore * support filters * implement get_documents_by_id * handle not existing documents * add docstrings * auth added * add tests * generate docs * Add latest docstring and tutorial changes * add responses to dev dependencies * fix tests * support query() and quey_by_embedding() * Add latest docstring and tutorial changes * query tests added * read api_key and api_endpoint from env * Add latest docstring and tutorial changes * support query() and quey_by_embedding() * query tests added * Add latest docstring and tutorial changes * Add latest docstring and tutorial changes * support dynamic similarity and return_embedding values * Add latest docstring and tutorial changes * adjust KeywordDocumentStore description * refactoring * Add latest docstring and tutorial changes * implement get_document_count and raise on all not implemented methods * Add latest docstring and tutorial changes * don't use abbreviation DC in comments and errors * Add latest docstring and tutorial changes * docstring added to KeywordDocumentStore * Add latest docstring and tutorial changes * enhanced api key set * split tests into two parts * change setup.py in order to work around build cache * added link * Add latest docstring and tutorial changes * rename DCDocumentStore to DeepsetCloudDocumentStore * Add latest docstring and tutorial changes * remove dc.py * reinsert link to docs * fix imports * Add latest docstring and tutorial changes * better test structure Co-authored-by: 
github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: ArzelaAscoIi <kristof.herrmann@rwth-aachen.de> * introduce DeepsetCloudAdapter * Add latest docstring and tutorial changes * introduce DeepsetCloudClient * Add latest docstring and tutorial changes * use json api for pipeline_config * indexing pipeline test added * pseudo change to force cache eviction * revert pseudo change to force cache eviction * remove conftest duplicates * minor formatting and docstring fixes * fix tests when MOCK_DC=False Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>	2022-01-28 17:32:56 +01:00
Sara Zan	d470b9d0bd	Improve dependency management (#1994 ) * Fist attempt at using setup.cfg for dependency management * Trying the new package on the CI and in Docker too * Add composite extras_require * Add the safe_import function for document store imports and add some try-catch statements on rest_api and ui imports * Fix bug on class import and rephrase error message * Introduce typing for optional modules and add type: ignore in sparse.py * Include importlib_metadata backport for py3.7 * Add colab group to extra_requires * Fix pillow version * Fix grpcio * Separate out the crawler as another extra * Make paths relative in rest_api and ui * Update the test matrix in the CI * Add try catch statements around the optional imports too to account for direct imports * Never mix direct deps with self-references and add ES deps to the base install * Refactor several paths in tests to make them insensitive to the execution path * Include tstadel review and re-introduce Milvus1 in the tests suite, to fix * Wrap pdf conversion utils into safe_import * Update some tutorials and rever Milvus1 as default for now, see #2067 * Fix mypy config Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-01-26 18:12:55 +01:00
mathislucka	5b7e906e85	fix: get_documents_by_id should return docs for all passed ids (#2064) * doc store should return all documents matching ids passed to get_documents_by_id * test for get_document_by_id should be named correctly * add test for get_documents_by_id * Add latest docstring and tutorial changes * document es query limit * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-01-26 12:39:04 +01:00
tstadel	8a32d8da92	Introduce readonly DCDocumentStore (without labels support) (#1991 ) * minimal DCDocumentStore * support filters * implement get_documents_by_id * handle not existing documents * add docstrings * auth added * add tests * generate docs * Add latest docstring and tutorial changes * add responses to dev dependencies * fix tests * support query() and quey_by_embedding() * Add latest docstring and tutorial changes * query tests added * read api_key and api_endpoint from env * Add latest docstring and tutorial changes * support query() and quey_by_embedding() * query tests added * Add latest docstring and tutorial changes * Add latest docstring and tutorial changes * support dynamic similarity and return_embedding values * Add latest docstring and tutorial changes * adjust KeywordDocumentStore description * refactoring * Add latest docstring and tutorial changes * implement get_document_count and raise on all not implemented methods * Add latest docstring and tutorial changes * don't use abbreviation DC in comments and errors * Add latest docstring and tutorial changes * docstring added to KeywordDocumentStore * Add latest docstring and tutorial changes * enhanced api key set * split tests into two parts * change setup.py in order to work around build cache * added link * Add latest docstring and tutorial changes * rename DCDocumentStore to DeepsetCloudDocumentStore * Add latest docstring and tutorial changes * remove dc.py * reinsert link to docs * fix imports * Add latest docstring and tutorial changes * better test structure Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: ArzelaAscoIi <kristof.herrmann@rwth-aachen.de>	2022-01-25 20:36:28 +01:00
Sara Zan	e28bf618d7	Implement proper FK in `MetaDocumentORM` and `MetaLabelORM` to work on PostgreSQL (#1990) * Properly fix MetaDocumentORM and MetaLabelORM with composite foreign key constraints * update_document_meta() was not using index properly * Exclude ES and Memory from the cosine_sanity_check test * move ensure_ids_are_correct_uuids in conftest and move one test back to faiss & milvus suite Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-01-14 13:48:58 +01:00
MichelBartels	3e4dbbb32c	Align similarity scores across document stores (#1967 ) * align document store similarity functions * remove unnecessary imports * undone accidental change * stopped weaviate from pretending to support dot product similarity * stopped weaviate from pretending to support dot product similarity * Add latest docstring and tutorial changes * fix fixture params for document stores * use cosine similarity for most tests * fix cosine similarity test * fix faiss test * fix weaviate test * fix accidental deletion * fix document_store fixture * test fix; shouldn't be merged * fix test_normalize_embeddings_diff_shapes * probably a better fix * fix for parameter combinations * revert new pytest_generate_tests functionality * simplify pytest_generate_tests * normalize embeddings for test_dpr_embedding * add to faiss doc that embeddings are normalized * Add latest docstring and tutorial changes * remove unnecessary parameters and add comments * simplify two lines of memory.py into one * test similarity scores with smaller language model * fix test_similarity_score Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-01-12 19:28:20 +01:00
Mathew Kuriakose	a44b6c18c0	Unify vector_dim and embedding_dim parameter in Document Store (#1922) * Refactored code to unify vector_dim and embedding_dim parameter in DocumentStores * Unit test cases updated to use `embedding_dim` instead of `vector_dim` * Unit test case update to use embedding_dim instead of vector_dim * Add latest docstring and tutorial changes * Put usage of `vector_dim` param in same if-block as corresponding warning Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: bogdankostic <bogdankostic@web.de>	2022-01-10 18:10:32 +01:00
Kristof Herrmann	6e8e3c68d9	Custom id hashing on documentstore level (#1910 ) * adding dynamic id hashing * Add latest docstring and tutorial changes * added pr review * Add latest docstring and tutorial changes * fixed tests * fix mypy error * fix mypy issue * ignore typing * fixed correct check * fixed tests * try fixing the tests * set id hash keys only if not none * dont store id_hash_keys * fix tests * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-01-03 16:58:19 +01:00
tstadel	a94c274134	Support custom headers per request in pipeline (#1861 ) * chain headers param down to document_stores * Add latest docstring and tutorial changes * fix InMemoryDocumentStore params * Add latest docstring and tutorial changes * fix TfidfRetriever params * Add latest docstring and tutorial changes * fix missing headers * Add latest docstring and tutorial changes * fix sparql client and update docs * Add latest docstring and tutorial changes * test for documentstores * pipeline tests added * update header param in docstrings * Add latest docstring and tutorial changes * refactoring: headers as implicit param * Add latest docstring and tutorial changes * remove unnecessary imports * propagade batch_size correctly * Add latest docstring and tutorial changes * revert InMemoryDocumentStore.write_documents signature * Add latest docstring and tutorial changes * remove #type: ignore * Add latest docstring and tutorial changes * replace MutableMapping by Dict * Add latest docstring and tutorial changes * improve docstrings * Add latest docstring and tutorial changes * get rid of *kwargs Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-01-03 11:38:02 +01:00
tstadel	c5540d05ed	Calculation of metrics and presentation of eval results (#1760 ) * retriever metrics added * Add latest docstring and tutorial changes * answer and document level matching metrics implemented * Add latest docstring and tutorial changes * answer related metrics for retriever * basic reader metrics implemented * handle no_answers * fix typing * fix tests * fix tests without sas * first draft for simulated top k * rename sas and f1 columns in dataframe * refactoring of EvaluationResult * Add latest docstring and tutorial changes * more eval tests added * fix sas expected value precision * distinction between ir and qa recall * EvaluationResult.worst_queries() implemented * print_evaluation_report() added * eval report for QA Pipeline improved * dynamic metrics for worst queries calc * Add latest docstring and tutorial changes * method names adjusted * simple test for print_eval_report() added * improved documentation * Add latest docstring and tutorial changes * minor formatting * Add latest docstring and tutorial changes * fix no_answer cases * adjust one docstring * Add latest docstring and tutorial changes * fix no_answer cases for sas * batchmode for sas implemented * fix for retriever metrics if there are only no_answers * fix multilabel tests * improve documentation for pipeline.eval() * streamline multilabel aggregates and docs * Add latest docstring and tutorial changes * fix multilabel tests * unify document_id * add dataframe schema description to EvaluationResult * Add latest docstring and tutorial changes * rename worst_queries to wrong_examples * Add latest docstring and tutorial changes * make query digesting standard pipelines work with pipeline.eval() * Add latest docstring and tutorial changes * tests for multi retriever pipelines added * remove unnecessary import * print_eval_report(): support all pipelines without junctions * Add latest docstring and tutorial changes * fix typos * Add latest docstring and tutorial changes * fix 
minor simulated_top_k bug and use memory documentstore throughout tests * sas model param description improved * Add latest docstring and tutorial changes * rename recall metrics * Add latest docstring and tutorial changes * fix mean average precision link * Add latest docstring and tutorial changes * adjust sas description docstring * Add latest docstring and tutorial changes * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-11-30 19:26:34 +01:00
Sowmiya Jaganathan	04d93ec247	Introduced an arg to add synonyms - Elasticsearch (#1625 ) * Introduced an arg add synonyms to Elasticsearch * Added the test code, removed the whitespace formatting changes, and overwrote the relevant parts from the already existing mapping instead of creating new mapping. * Added the test code * Remove whitespace change * Added the doc_string with examples and link * Removed unneccessary spaces * Add latest docstring and tutorial changes * fix text_field -> content_field Co-authored-by: sowmiya-emplay <sowmiya.j@emplay.net> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-11-23 19:10:34 +01:00
C V Goudar	a9a379784a	Facilitate concurrent query / indexing in Elasticsearch with dense retrievers (new `skip_missing_embeddings` param) (#1762 ) * Filtering records not having embeddings * Added support for skip_missing_embeddings Flag. Default behavior is throw error when embeddings are missing. If skip_missing_embeddings=True then documents without embeddings are ignored for vector similarity * Fix for below error: haystack/document_stores/elasticsearch.py:852: error: Need type annotation for "script_score_query" * docstring for skip_missing_embeddings parameter * Raise exception where no documents with embeddings is found for Embedding retriever. * Default skip_missing_embeddings to True * Explicitly check if embeddings are present if no results are returned by EmbeddingRetriever for Elasticsearch * Added test case for based on Julian's input * Added test case for based on Julian's input. Fix pytest error on the testcase * Added test case for based on Julian's input. Fix pytest error on the testcase * Added test case for based on Julian's input. Fix pytest error on the testcase * Simplify code by using get_embed_count * Adjust docstring & error msg slightly * Revert error msg Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-11-19 14:50:23 +01:00
bogdankostic	5e36988b31	Support Tables in all DocumentStores (#1744) * Add support for tables in SQLDocumentStore, FAISSDocumentStore and MilvuDocumentStore * Add support for WeaviateDocumentStore * Make sure that embedded meta fields are strings + add embedding_dim to WeaviateDocStore in test config * Add latest docstring and tutorial changes * Represent tables in WeaviateDocumentStore as nested lists * Fix mypy Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-11-17 16:41:04 +01:00
Julian Risch	892ce4a760	Make weaviate more compliant to other doc stores (UUIDs and dummy embeddings) (#1656) * create uuid and dummy embedding in weaviate doc store * handle and test for duplicate non-uuid-formatted ids in weaviate * add uuid and dummy embedding to doc strings * Add latest docstring and tutorial changes * Upgrade weaviate * Include weaviate in common doc store test cases * Add latest docstring and tutorial changes * Exclude weaviate doc store from eval tests * Incorporate index name in uuid generation * Ignore mypy error * Fix typo * Restore DOCS without uuid and embeddings generated by weaviate * Supply docs for retriever tests as fixture * Limit scope of fixture to function instead of session * Add comments Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-11-04 09:27:12 +01:00
Lalit Pagaria	e5b4b62d75	Add CI for windows runner (#1458 ) * Feat: Removing use of temp file while downloading archive from url along with adding CI for windows and mac platform * Windows CI by default installing pytorch gpu hence updating CI to pick cpu version * fixing mac cache build issue * updating windows pip install command for torch * another attempt * updating ci * Adding sudo * fixing ls failure on windows * another attempt to fix build issue * Saving env variable of test files * Adding debug log * Github action differ on windows * adding debug * anohter attempt * Windows have different ways to receive env * fixing template * minor fx * Adding debug * Removing use of json * Adding back fromJson * addin toJson * removing print * anohter attempt * disabling parallel run at least for testing * installing docker for mac runner * correcting docker install command * Linux dockers are not suported in windows * Removing mac changes * Upgrading pytorch * using lts pytorch * Separating win and ubuntu * Install java 11 * enabling linux container env * docker cli command * docker cli command * start elastic service * List all service * correcting service name * Attempt to fix multiple test run * convert to json * another attempt to check * Updating build cache step * attempt * Add tika * Separating windows CI * Changing CI name * Skipping test which does not work in windows * Skipping tests for windows * create cleanup function in conftest * adding skipif marker on tests * Run windows PR on only push to master * Addressing review comments * Enabling windows ci for this PR * Tika init is being called when importing tika function * handling tika import issue * handling tika import issue in test * Fixing import issue * removing tika fixure * Removing fixture from tests * Disable windows ci on pull request * Add back extra pytorch install step Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-10-29 10:22:28 +02:00
Sara Zan	13510aa753	Refactoring of the `haystack` package (#1624 ) * Files moved, imports all broken * Fix most imports and docstrings into * Fix the paths to the modules in the API docs * Add latest docstring and tutorial changes * Add a few pipelines that were lost in the inports * Fix a bunch of mypy warnings * Add latest docstring and tutorial changes * Create a file_classifier module * Add docs for file_classifier * Fixed most circular imports, now the REST API can start * Add latest docstring and tutorial changes * Tackling more mypy issues * Reintroduce from FARM and fix last mypy issues hopefully * Re-enable old-style imports * Fix some more import from the top-level package in an attempt to sort out circular imports * Fix some imports in tests to new-style to prevent failed class equalities from breaking tests * Change document_store into document_stores * Update imports in tutorials * Add latest docstring and tutorial changes * Probably fixes summarizer tests * Improve the old-style import allowing module imports (should work) * Try to fix the docs * Remove dedicated KnowledgeGraph page from autodocs * Remove dedicated GraphRetriever page from autodocs * Fix generate_docstrings.sh with an updated list of yaml files to look for * Fix some more modules in the docs * Fix the document stores docs too * Fix a small issue on Tutorial14 * Add latest docstring and tutorial changes * Add deprecation warning to old-style imports * Remove stray folder and import Dict into dense.py * Change import path for MLFlowLogger * Add old loggers path to the import path aliases * Fix debug output of convert_ipynb.py * Fix circular import on BaseRetriever * Missed one merge block * re-run tutorial 5 * Fix imports in tutorial 5 * Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base * Add latest docstring and tutorial changes * Fix typo in utils __init__ * Fix a few more imports * Fix benchmarks too * New-style imports in 
test_knowledge_graph * Rollback setup.py * Rollback squad_to_dpr too Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-25 15:50:23 +02:00
bogdankostic	51acf779f2	Add TableTextRetriever (#1529 ) * first draft / notes on new primitives * wip label / feedback refactor * rename doc.text -> doc.content. add doc.content_type * add datatype for content * remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field * update converters for . Add warning for empty * renam label.question -> label.query. Allow sorting of Answers. * WIP primitives * update ui/reader for new Answer format * Improve Label. First refactoring of MultiLabel. Adjust eval code * fixed workflow conflict with introducing new one (#1472) * Add latest docstring and tutorial changes * make add_eval_data() work again * fix reader formats. WIP fix _extract_docs_and_labels_from_dict * fix test reader * Add latest docstring and tutorial changes * fix another test case for reader * fix mypy in farm reader.eval() * fix mypy in farm reader.eval() * WIP ORM refactor * Add latest docstring and tutorial changes * fix mypy weaviate * make label and multilabel dataclasses * bump mypy env in CI to python 3.8 * WIP refactor Label ORM * WIP refactor Label ORM * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * WIP refactor ORMs * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * add dataclasses-json dependency. adjust ORM to just store JSON repr * ignore type as dataclasses_json seems to miss functionality here * update readme and contributing.md * update contributing * adjust example * fix duplicate doc handling for custom index * Add latest docstring and tutorial changes * fix some ORM issues. 
fix get_all_labels_aggregated. * update drop flags where get_all_labels_aggregated() was used before * Add latest docstring and tutorial changes * add to_json(). add + fix tests * fix no_answer handling in label / multilabel * fix duplicate docs in memory doc store. change primary key for sql doc table * fix mypy issues * fix mypy issues * haystack/retriever/base.py * fix test_write_document_meta[elastic] * fix test_elasticsearch_custom_fields * fix test_labels[elastic] * fix crawler * fix converter * fix docx converter * fix preprocessor * fix test_utils * fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations * Add latest docstring and tutorial changes * fix crawler test. fix ocrconverter attribute * fix test_elasticsearch_custom_query * fix generator pipeline * fix ocr converter * fix ragenerator * Add latest docstring and tutorial changes * fix test_load_and_save_yaml for elasticsearch * fixes for pipeline tests * fix faq pipeline * fix pipeline tests * Add latest docstring and tutorial changes * Add MultimodalRetriever * Add latest docstring and tutorial changes * fix weaviate * Add latest docstring and tutorial changes * trigger CI * satisfy mypy * Add latest docstring and tutorial changes * satisfy mypy * Add latest docstring and tutorial changes * trigger CI * fix question generation test * fix ray. fix Q-generation * fix translator test * satisfy mypy * wip refactor feedback rest api * fix rest api feedback endpoint * fix doc classifier * remove relation of Labels -> Docs in SQL ORM * fix faiss/milvus tests * fix doc classifier test * fix eval test * fixing eval issues * Add latest docstring and tutorial changes * fix mypy * WIP replace dataclasses-json with manual serialization * Add methods to MultimodalRetriever * Add latest docstring and tutorial changes * revert to dataclass-json serialization for now. remove debug prints. * update docstrings * fix extractor. 
fix Answer Span init * fix api test * keep meta data of answers in reader.run() * fix meta handling * adress review feedback * Add latest docstring and tutorial changes * make document=None for open domain labels * add import * fix print utils * fix rest api * Add methods and tests * Add latest docstring and tutorial changes * Fix mypy * Add latest docstring and tutorial changes * Add type hints and doc strings * Make use of initialize_device_settings * Move serialization of pd.DataFrame to schema.py * Fix mypy * Adapt Document's from_dict method * Update docstrings * Add latest docstring and tutorial changes * Fix mypy * Fix mypy * Fix Document's from_dict method * Fix Document's to_dict method * Change handling of table metadata * Add latest docstring and tutorial changes * Change naming from Multimodal to TableText * Turn off tokenizers_parallelism in retriever tests * Add latest docstring and tutorial changes * Remove turning off tokenizers_parallelism in retriever tests * Adapt convert_es_hit_to_document * Change embed_surrounding_context to embed_meta_fields * Add latest docstring and tutorial changes * Add check if torch.distributed is available * Set n_gpu to 0 in training test * Set HIP_LAUNCH_BLOCKING to 1 * Set HIP_LAUNCH_BLOCKING to "1" * Set use_gpu to False * Use DataParallel only if more than one device * Remove --find-links=https://download.pytorch.org/whl/torch_stable.html Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: Markus Paff <markuspaff.mp@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-25 12:27:02 +02:00
Julian Risch	4ed2b90bca	Add delete_labels() except for weaviate doc store (#1604) * Add delete_labels() except for weaviate doc store * Add latest docstring and tutorial changes * Add test for delete_labels() * Adapt filter for label deletion to different doc stores in test * Allow delete labels by _id in elasticsearch * Add latest docstring and tutorial changes * Add latest docstring and tutorial changes * re-add bugfix after merge * Add ids as optional parameter * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-19 17:20:28 +02:00
Sara Zan	575e64333c	Delete documents by ID in all document stores (#1606) * Modify BaseDocumentStore.delete_documents() signature, implement ElasticSearch, and add tests * Add implementation for InMemory * Implement for SQL, FAISS and Milvus too * Add tests for faiss and milvus * Fix delete_all_documents * Implement deletion by ID for weaviate Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: sarthakj2109 <54064348+sarthakj2109@users.noreply.github.com> Co-authored-by: prafgup <prafulgupta6@gmail.com> Co-authored-by: ankh6 <andynzemokalumu@live.be>	2021-10-19 12:30:15 +02:00
Malte Pietsch	4a6c9302b3	Redesign primitives - `Document`, `Answer`, `Label` (#1398 ) * first draft / notes on new primitives * wip label / feedback refactor * rename doc.text -> doc.content. add doc.content_type * add datatype for content * remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field * update converters for . Add warning for empty * renam label.question -> label.query. Allow sorting of Answers. * WIP primitives * update ui/reader for new Answer format * Improve Label. First refactoring of MultiLabel. Adjust eval code * fixed workflow conflict with introducing new one (#1472) * Add latest docstring and tutorial changes * make add_eval_data() work again * fix reader formats. WIP fix _extract_docs_and_labels_from_dict * fix test reader * Add latest docstring and tutorial changes * fix another test case for reader * fix mypy in farm reader.eval() * fix mypy in farm reader.eval() * WIP ORM refactor * Add latest docstring and tutorial changes * fix mypy weaviate * make label and multilabel dataclasses * bump mypy env in CI to python 3.8 * WIP refactor Label ORM * WIP refactor Label ORM * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * add dataclasses-json dependency. adjust ORM to just store JSON repr * ignore type as dataclasses_json seems to miss functionality here * update readme and contributing.md * update contributing * adjust example * fix duplicate doc handling for custom index * Add latest docstring and tutorial changes * fix some ORM issues. fix get_all_labels_aggregated. * update drop flags where get_all_labels_aggregated() was used before * Add latest docstring and tutorial changes * add to_json(). add + fix tests * fix no_answer handling in label / multilabel * fix duplicate docs in memory doc store. change primary key for sql doc table * fix mypy issues * fix mypy issues * haystack/retriever/base.py * fix test_write_document_meta[elastic] * fix test_elasticsearch_custom_fields * fix test_labels[elastic] * fix crawler * fix converter * fix docx converter * fix preprocessor * fix test_utils * fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations * Add latest docstring and tutorial changes * fix crawler test. fix ocrconverter attribute * fix test_elasticsearch_custom_query * fix generator pipeline * fix ocr converter * fix ragenerator * Add latest docstring and tutorial changes * fix test_load_and_save_yaml for elasticsearch * fixes for pipeline tests * fix faq pipeline * fix pipeline tests * Add latest docstring and tutorial changes * fix weaviate * Add latest docstring and tutorial changes * trigger CI * satisfy mypy * Add latest docstring and tutorial changes * satisfy mypy * Add latest docstring and tutorial changes * trigger CI * fix question generation test * fix ray. fix Q-generation * fix translator test * satisfy mypy * wip refactor feedback rest api * fix rest api feedback endpoint * fix doc classifier * remove relation of Labels -> Docs in SQL ORM * fix faiss/milvus tests * fix doc classifier test * fix eval test * fixing eval issues * Add latest docstring and tutorial changes * fix mypy * WIP replace dataclasses-json with manual serialization * Add latest docstring and tutorial changes * revert to dataclass-json serialization for now. remove debug prints. * update docstrings * fix extractor. fix Answer Span init * fix api test * keep meta data of answers in reader.run() * fix meta handling * adress review feedback * Add latest docstring and tutorial changes * make document=None for open domain labels * add import * fix print utils * fix rest api * adress review feedback * Add latest docstring and tutorial changes * fix mypy Co-authored-by: Markus Paff <markuspaff.mp@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-13 14:23:23 +02:00
Sara Zan	a30a826c6c	Standardize `delete_documents(filter=...)` across all document stores (#1509 ) * Make InMemoryDocumentStore accept and apply filters in delete_documents() * Modify test_document_store.py to test the filtered deletion in memory, sql and milvus too * Make FAISSDocumentStore accept and properly apply filters in delete_documents() * Add latest docstring and tutorial changes * Remove accidentally duplicated test * Remove unnecessary decorators from test/test_document_store.py::test_delete_documents_with_filters * Add embeddings count test for FAISS and Milvus; Milvus fails it. * Fixed a bug that made Milvus not deleting embeddings * Remove batch size parametrization in tests & update all documentstore's docstrings with a filter example * Add latest docstring and tutorial changes Co-authored-by: prafgup <prafulgupta6@gmail.com>	2021-09-29 09:27:06 +02:00
Malte Pietsch	183fd5ae5a	Simplify tests & allow running on individual doc stores (#1487 ) * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * update readme and contributing.md * update contributing * adjust example Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-09-27 10:52:07 +02:00
ramgarg102	51f0a56e5d	delete_all_documents() replaced by delete_documents() (#1377 ) * [UPDT] delete_all_documents() replaced by delete_documents() * [UPDT] warning logs to be fixed * [UPDT] delete_all_documents() renamed and the same method added Co-authored-by: Ram Garg <ramgarg102@gmai.com>	2021-08-30 15:18:28 +02:00
vblagoje	02fc4c7783	Improve document stores unit test parametrization (#1202 )	2021-06-22 16:08:23 +02:00
Branden Chan	aa6f768efa	Prevent merge of same questions on different documents during evaluation (#1119 ) * Fix duplicate question in Reader.eval() * Add duplicate question support in document store * Support duplicate questions in retriever eval * Update tutorial * Rename key_tuple * Change error message * Add warning when more than 6 labels * Allow for label grouping options * Add support for aggregating by label meta * Satisfy mypy * Fix duplicate question in Reader.eval() * Add duplicate question support in document store * Support duplicate questions in retriever eval * Update tutorial * Rename key_tuple * Change error message * Add warning when more than 6 labels * Allow for label grouping options * Add support for aggregating by label meta * Satisfy mypy * Make label field flexible, add docstrings * Satisfy mypy * Fix failing tests * Adjust docstring * Fix tutorial Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-06-02 12:09:03 +02:00
Ikram Ali	b76ed4c5a4	Add options for handling duplicate documents (skip, fail, overwrite) (#1088 ) * [document_stores] Duplicate document implmentation added for memorystore. * [document_stores]duplicate documents implementation done for faiss store. * [document_store] Duplicate document feature added for elasticsearch document store fixed #1069 * [document_store] Duplicate documents feature added for milvus document store and bug fixed in faiss document store fixed #1069 * [document_store] Code refactored fixed #1069 * [document_store]Test cases refactored. * [document_store] mypy issue fixed. * [test_case] faiss and milvus test case refactored to support duplicate documents implementation. fixed #1069 * [document_store] duplicate_documents_options code refactored. * [document_store] Code refactored.	2021-05-25 13:30:06 +02:00
Ikram Ali	4ab1bc3c3e	Improve the progress bar in update_embeddings() + Fix filters in update_embeddings() (#1063 ) * [document_stores]Add the progressbar in update_embeddings() to track the overall documents progress closed #1037 * change 2nd level loop to docs. switch to tqdm.auto. * [document_stores] Elasticsearch new method get_document_without_embedding_count() added. * [test_case] Elasticsearch documentstore get_document_without_embedding_count() test case added. * [document_stores] Add new bool arg in get_document_count() method and fixed #1082 * [document_stores] typo fixed #1082 Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-05-21 14:18:07 +02:00
Lalit Pagaria	f46b09c756	Using text hash as id to prevent document duplication (#1000 ) * using text hash as id to prevent document duplication. Also providing a way customize it. * Add latest docstring and tutorial changes * Fixing duplicate value test when text is same * Adding test for duplicate ids in document store * Changing exception to generic Exception type * add exception for inmemory. update docstring Document. remove id_hash_keys from object attribute * Add latest docstring and tutorial changes * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-05-17 17:51:52 +02:00
Ikram Ali	a06e4450d1	Rename delete_all_documents() method to delete_documents() (#1047 )	2021-05-10 13:37:08 +02:00
oryx1729	8c1e411380	Fix update_embeddings() for FAISSDocumentStore (#978 )	2021-04-21 09:56:35 +02:00
Malte Pietsch	e641bff7a6	Allow more options for elasticsearch client (auth, multiple hosts) (#845 ) * allow more options for elasticsearch client (auth, multiple hosts) * Add latest docstring and tutorial changes * fix mypy * Add latest docstring and tutorial changes * test client connection via ping() Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-02-19 14:29:59 +01:00
Malte Pietsch	47aae14efa	relax assert precision of arrays	2021-02-15 14:52:13 +01:00
oryx1729	4059805d89	Fix ElasticsearchDocumentStore.query_by_embedding() (#823 )	2021-02-12 14:57:06 +01:00
Tanay Soni	fd5c5dd23c	Introduce incremental updates for embeddings in document stores (#812 )	2021-02-09 21:25:01 +01:00
Tanay Soni	b87dd244c1	Get metadata values for a key from Elasticsearch (#776 )	2021-02-01 16:13:26 +01:00
Lalit Pagaria	9f7f95221f	Milvus integration (#771 ) * Initial commit for Milvus integration * Add latest docstring and tutorial changes * Updating implementation of Milvus document store * Add latest docstring and tutorial changes * Adding tests and updating doc string * Add latest docstring and tutorial changes * Fixing issue caught by tests * Addressing review comments * Fixing mypy detected issue * Fixing issue caught in test about sorting of vector ids * fixing test * Fixing generator test failure * update docstrings * Addressing review comments about multiple network call while fetching embedding from milvus server * Add latest docstring and tutorial changes * Ignoring mypy issue while converting vector_id to int Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-01-29 13:29:12 +01:00
Tanay Soni	d9f011da9a	Add flag for use of window queries in SQLDocumentStore (#768 )	2021-01-25 12:54:34 +01:00
Tanay Soni	f0aa879a1c	Fix delete_all_documents for the SQLDocumentStore (#761 )	2021-01-22 14:39:24 +01:00

44 Commits