haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-08-11 10:07:50 +00:00

Author	SHA1	Message	Date
Julian Risch	33b2663fdc	ensure tf-idf matrix calculation before retrieval (#1665) * ensure tf-idf matrix calculation before retrieval * Run fit() automatically if new documents have been added * Add latest docstring and tutorial changes * Fix type error * Add test case for tfidf retriever yaml pipeline * Use InMemoryDocStore and add 2nd test case Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-28 16:48:06 +02:00
Sara Zan	eab475bb5d	Rename every occurrence of 'embed_passages' with 'embed_documents' (#1667) * Rename every occurrence of 'embed_passages' with 'embed_documents' * Remove aliased method embed_documents Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-28 12:17:56 +02:00
Sara Zan	13510aa753	Refactoring of the `haystack` package (#1624 ) * Files moved, imports all broken * Fix most imports and docstrings into * Fix the paths to the modules in the API docs * Add latest docstring and tutorial changes * Add a few pipelines that were lost in the inports * Fix a bunch of mypy warnings * Add latest docstring and tutorial changes * Create a file_classifier module * Add docs for file_classifier * Fixed most circular imports, now the REST API can start * Add latest docstring and tutorial changes * Tackling more mypy issues * Reintroduce from FARM and fix last mypy issues hopefully * Re-enable old-style imports * Fix some more import from the top-level package in an attempt to sort out circular imports * Fix some imports in tests to new-style to prevent failed class equalities from breaking tests * Change document_store into document_stores * Update imports in tutorials * Add latest docstring and tutorial changes * Probably fixes summarizer tests * Improve the old-style import allowing module imports (should work) * Try to fix the docs * Remove dedicated KnowledgeGraph page from autodocs * Remove dedicated GraphRetriever page from autodocs * Fix generate_docstrings.sh with an updated list of yaml files to look for * Fix some more modules in the docs * Fix the document stores docs too * Fix a small issue on Tutorial14 * Add latest docstring and tutorial changes * Add deprecation warning to old-style imports * Remove stray folder and import Dict into dense.py * Change import path for MLFlowLogger * Add old loggers path to the import path aliases * Fix debug output of convert_ipynb.py * Fix circular import on BaseRetriever * Missed one merge block * re-run tutorial 5 * Fix imports in tutorial 5 * Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base * Add latest docstring and tutorial changes * Fix typo in utils __init__ * Fix a few more imports * Fix benchmarks too * New-style imports in 
test_knowledge_graph * Rollback setup.py * Rollback squad_to_dpr too Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-25 15:50:23 +02:00
bogdankostic	51acf779f2	Add TableTextRetriever (#1529 ) * first draft / notes on new primitives * wip label / feedback refactor * rename doc.text -> doc.content. add doc.content_type * add datatype for content * remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field * update converters for . Add warning for empty * renam label.question -> label.query. Allow sorting of Answers. * WIP primitives * update ui/reader for new Answer format * Improve Label. First refactoring of MultiLabel. Adjust eval code * fixed workflow conflict with introducing new one (#1472) * Add latest docstring and tutorial changes * make add_eval_data() work again * fix reader formats. WIP fix _extract_docs_and_labels_from_dict * fix test reader * Add latest docstring and tutorial changes * fix another test case for reader * fix mypy in farm reader.eval() * fix mypy in farm reader.eval() * WIP ORM refactor * Add latest docstring and tutorial changes * fix mypy weaviate * make label and multilabel dataclasses * bump mypy env in CI to python 3.8 * WIP refactor Label ORM * WIP refactor Label ORM * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * WIP refactor ORMs * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * add dataclasses-json dependency. adjust ORM to just store JSON repr * ignore type as dataclasses_json seems to miss functionality here * update readme and contributing.md * update contributing * adjust example * fix duplicate doc handling for custom index * Add latest docstring and tutorial changes * fix some ORM issues. 
fix get_all_labels_aggregated. * update drop flags where get_all_labels_aggregated() was used before * Add latest docstring and tutorial changes * add to_json(). add + fix tests * fix no_answer handling in label / multilabel * fix duplicate docs in memory doc store. change primary key for sql doc table * fix mypy issues * fix mypy issues * haystack/retriever/base.py * fix test_write_document_meta[elastic] * fix test_elasticsearch_custom_fields * fix test_labels[elastic] * fix crawler * fix converter * fix docx converter * fix preprocessor * fix test_utils * fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations * Add latest docstring and tutorial changes * fix crawler test. fix ocrconverter attribute * fix test_elasticsearch_custom_query * fix generator pipeline * fix ocr converter * fix ragenerator * Add latest docstring and tutorial changes * fix test_load_and_save_yaml for elasticsearch * fixes for pipeline tests * fix faq pipeline * fix pipeline tests * Add latest docstring and tutorial changes * Add MultimodalRetriever * Add latest docstring and tutorial changes * fix weaviate * Add latest docstring and tutorial changes * trigger CI * satisfy mypy * Add latest docstring and tutorial changes * satisfy mypy * Add latest docstring and tutorial changes * trigger CI * fix question generation test * fix ray. fix Q-generation * fix translator test * satisfy mypy * wip refactor feedback rest api * fix rest api feedback endpoint * fix doc classifier * remove relation of Labels -> Docs in SQL ORM * fix faiss/milvus tests * fix doc classifier test * fix eval test * fixing eval issues * Add latest docstring and tutorial changes * fix mypy * WIP replace dataclasses-json with manual serialization * Add methods to MultimodalRetriever * Add latest docstring and tutorial changes * revert to dataclass-json serialization for now. remove debug prints. * update docstrings * fix extractor. 
fix Answer Span init * fix api test * keep meta data of answers in reader.run() * fix meta handling * adress review feedback * Add latest docstring and tutorial changes * make document=None for open domain labels * add import * fix print utils * fix rest api * Add methods and tests * Add latest docstring and tutorial changes * Fix mypy * Add latest docstring and tutorial changes * Add type hints and doc strings * Make use of initialize_device_settings * Move serialization of pd.DataFrame to schema.py * Fix mypy * Adapt Document's from_dict method * Update docstrings * Add latest docstring and tutorial changes * Fix mypy * Fix mypy * Fix Document's from_dict method * Fix Document's to_dict method * Change handling of table metadata * Add latest docstring and tutorial changes * Change naming from Multimodal to TableText * Turn off tokenizers_parallelism in retriever tests * Add latest docstring and tutorial changes * Remove turning off tokenizers_parallelism in retriever tests * Adapt convert_es_hit_to_document * Change embed_surrounding_context to embed_meta_fields * Add latest docstring and tutorial changes * Add check if torch.distributed is available * Set n_gpu to 0 in training test * Set HIP_LAUNCH_BLOCKING to 1 * Set HIP_LAUNCH_BLOCKING to "1" * Set use_gpu to False * Use DataParallel only if more than one device * Remove --find-links=https://download.pytorch.org/whl/torch_stable.html Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: Markus Paff <markuspaff.mp@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-25 12:27:02 +02:00
Lalit Pagaria	5dbd899a93	Experimental changes to support Milvus 2.x (#1473 ) * Experimental changes to support Milvus 2.x * Milvus 2.0 need other containers hence adding them * Add latest docstring and tutorial changes * Fixing tests * Correcting use of list collections * correcting connection close * Removing connection close logic * removing flush * using collection instead of connection * fixing describe collection * Fixing insert, query and search based on new signature * Making mypy happy * Fixing one test case * Fixing search and embedding fetch based on newer api * Implementing delete vector id function * Wrapping up final changes * Add latest docstring and tutorial changes * Correcting requirements.txt * removing empty line in requirements.txt * add docstring and exception for delete * add docstring. condition import on env var. raise exception for deletion * fix typo * change delete signature * ignore typing for import Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-10-25 10:39:48 +02:00
Julian Risch	9de140110f	Use smaller model for one generator test case (#1622) * Use smaller model for one generator test case * Reduce max_length of generated sequences in tests	2021-10-20 17:57:15 +02:00
Julian Risch	4ed2b90bca	Add delete_labels() except for weaviate doc store (#1604 ) * Add delete_labels() except for weaviate doc store * Add latest docstring and tutorial changes * Add test for delete_labels() * Adapt filter for label deletion to different doc stores in test * Allow delete labels by _id in elasticsearch * Add latest docstring and tutorial changes * Add latest docstring and tutorial changes * re-add bugfix after merge * Add ids as optional parameter * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-19 17:20:28 +02:00
Sara Zan	96c05c34e4	Pipeline node names validation (#1601 ) * Add node names validation * Add tests * Improve test and test that params exists before validating * Fix the REST API * Use minilm-uncased-squad2 instead of roberta-base-squad2 * Use roberta model for test_pipeline.yaml * Turn off TOKENIZERS_PARALLELISM in generator tests (#1605) * Account for non-targeted parameters * Restore previous parameters handling in the rest api Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2021-10-19 15:22:44 +02:00
Malte Pietsch	3a7d029fdd	Fix Opensearch field type (flattened -> nested) (#1609) * fix field type flattened -> nested. change default port from 9201 to 9200 * change port in benchmarks	2021-10-19 14:40:53 +02:00
Sara Zan	575e64333c	Delete documents by ID in all document stores (#1606 ) * Modify BaseDocumentStore.delete_documents() signature, implement ElasticSearch, and add tests * Add implementation for InMemory * Implement for SQL, FAISS and Milvus too * Add tests for faiss and milvus * Fix delete_all_documents * Implement deletion by ID for weaviate Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: sarthakj2109 <54064348+sarthakj2109@users.noreply.github.com> Co-authored-by: prafgup <prafulgupta6@gmail.com> Co-authored-by: ankh6 <andynzemokalumu@live.be>	2021-10-19 12:30:15 +02:00
Malte Pietsch	eb95f0e8aa	Add more flexible options for model downloads (Proxies, resume_download, local_files_only...) (#1256 ) * allow passing more options for model/tokenizer download from remote * temporarily change dependency to current farm master * Add latest docstring and tutorial changes * add kwargs * add docstrings * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-18 15:47:36 +02:00
Malte Pietsch	3d58e81b5e	Switch from dataclass to pydantic dataclass & Fix Swagger API Docs (#1598) * test pydantic dataclasses * Add latest docstring and tutorial changes * enable pydantic mypy plugin * switch to pydantic dataclasses and implement custom to_json from_json * clean up Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-18 14:38:14 +02:00
bogdankostic	655d721371	Add Table Reader (#1446 ) * first draft / notes on new primitives * wip label / feedback refactor * rename doc.text -> doc.content. add doc.content_type * add datatype for content * remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field * update converters for . Add warning for empty * Add first draft of TableReader * renam label.question -> label.query. Allow sorting of Answers. * Add calculation of answer scores * WIP primitives * Adapt input and output to new primitives * Add doc strings * Add tests * update ui/reader for new Answer format * Improve Label. First refactoring of MultiLabel. Adjust eval code * fixed workflow conflict with introducing new one (#1472) * Add latest docstring and tutorial changes * make add_eval_data() work again * fix reader formats. WIP fix _extract_docs_and_labels_from_dict * fix test reader * Add latest docstring and tutorial changes * fix another test case for reader * fix mypy in farm reader.eval() * fix mypy in farm reader.eval() * WIP ORM refactor * Add latest docstring and tutorial changes * fix mypy weaviate * make label and multilabel dataclasses * bump mypy env in CI to python 3.8 * WIP refactor Label ORM * WIP refactor Label ORM * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * WIP refactor ORMs * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * add dataclasses-json dependency. 
adjust ORM to just store JSON repr * ignore type as dataclasses_json seems to miss functionality here * update readme and contributing.md * update contributing * adjust example * fix duplicate doc handling for custom index * Add latest docstring and tutorial changes * fix some ORM issues. fix get_all_labels_aggregated. * update drop flags where get_all_labels_aggregated() was used before * Add latest docstring and tutorial changes * add to_json(). add + fix tests * fix no_answer handling in label / multilabel * fix duplicate docs in memory doc store. change primary key for sql doc table * fix mypy issues * fix mypy issues * haystack/retriever/base.py * fix test_write_document_meta[elastic] * fix test_elasticsearch_custom_fields * fix test_labels[elastic] * fix crawler * fix converter * fix docx converter * fix preprocessor * fix test_utils * fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations * Add latest docstring and tutorial changes * fix crawler test. fix ocrconverter attribute * fix test_elasticsearch_custom_query * fix generator pipeline * fix ocr converter * fix ragenerator * Add latest docstring and tutorial changes * fix test_load_and_save_yaml for elasticsearch * fixes for pipeline tests * fix faq pipeline * fix pipeline tests * Add latest docstring and tutorial changes * fix weaviate * Add latest docstring and tutorial changes * trigger CI * satisfy mypy * Add latest docstring and tutorial changes * satisfy mypy * Add latest docstring and tutorial changes * trigger CI * fix question generation test * fix ray. 
fix Q-generation * fix translator test * satisfy mypy * wip refactor feedback rest api * fix rest api feedback endpoint * fix doc classifier * remove relation of Labels -> Docs in SQL ORM * fix faiss/milvus tests * fix doc classifier test * fix eval test * fixing eval issues * Add latest docstring and tutorial changes * fix mypy * WIP replace dataclasses-json with manual serialization * Add latest docstring and tutorial changes * revert to dataclass-json serialization for now. remove debug prints. * update docstrings * fix extractor. fix Answer Span init * fix api test * Adapt answer format * Add latest docstring and tutorial changes * keep meta data of answers in reader.run() * Fix mypy * fix meta handling * adress review feedback * Add latest docstring and tutorial changes * Allow inference on GPU * Remove automatic aggregation * Add automatic aggregation * Add latest docstring and tutorial changes * Add torch-scatter dependency * Add wheel to torch-scatter dependency * Fix requirements * Fix requirements * Fix requirements * Adapt setup.py to allow for wheels * Fix requirements * Fix requirements * Add type hints and code snippet * Add latest docstring and tutorial changes Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: Markus Paff <markuspaff.mp@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-15 16:34:48 +02:00
Julian Risch	5ec29a5283	Limit generator tests to memory doc store; split pipeline tests (#1602) * Limit generator tests to memory doc store; split pipeline tests * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-15 15:37:46 +02:00
Malte Pietsch	4a6c9302b3	Redesign primitives - `Document`, `Answer`, `Label` (#1398 ) * first draft / notes on new primitives * wip label / feedback refactor * rename doc.text -> doc.content. add doc.content_type * add datatype for content * remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field * update converters for . Add warning for empty * renam label.question -> label.query. Allow sorting of Answers. * WIP primitives * update ui/reader for new Answer format * Improve Label. First refactoring of MultiLabel. Adjust eval code * fixed workflow conflict with introducing new one (#1472) * Add latest docstring and tutorial changes * make add_eval_data() work again * fix reader formats. WIP fix _extract_docs_and_labels_from_dict * fix test reader * Add latest docstring and tutorial changes * fix another test case for reader * fix mypy in farm reader.eval() * fix mypy in farm reader.eval() * WIP ORM refactor * Add latest docstring and tutorial changes * fix mypy weaviate * make label and multilabel dataclasses * bump mypy env in CI to python 3.8 * WIP refactor Label ORM * WIP refactor Label ORM * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * WIP refactor ORMs * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * add dataclasses-json dependency. 
adjust ORM to just store JSON repr * ignore type as dataclasses_json seems to miss functionality here * update readme and contributing.md * update contributing * adjust example * fix duplicate doc handling for custom index * Add latest docstring and tutorial changes * fix some ORM issues. fix get_all_labels_aggregated. * update drop flags where get_all_labels_aggregated() was used before * Add latest docstring and tutorial changes * add to_json(). add + fix tests * fix no_answer handling in label / multilabel * fix duplicate docs in memory doc store. change primary key for sql doc table * fix mypy issues * fix mypy issues * haystack/retriever/base.py * fix test_write_document_meta[elastic] * fix test_elasticsearch_custom_fields * fix test_labels[elastic] * fix crawler * fix converter * fix docx converter * fix preprocessor * fix test_utils * fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations * Add latest docstring and tutorial changes * fix crawler test. fix ocrconverter attribute * fix test_elasticsearch_custom_query * fix generator pipeline * fix ocr converter * fix ragenerator * Add latest docstring and tutorial changes * fix test_load_and_save_yaml for elasticsearch * fixes for pipeline tests * fix faq pipeline * fix pipeline tests * Add latest docstring and tutorial changes * fix weaviate * Add latest docstring and tutorial changes * trigger CI * satisfy mypy * Add latest docstring and tutorial changes * satisfy mypy * Add latest docstring and tutorial changes * trigger CI * fix question generation test * fix ray. 
fix Q-generation * fix translator test * satisfy mypy * wip refactor feedback rest api * fix rest api feedback endpoint * fix doc classifier * remove relation of Labels -> Docs in SQL ORM * fix faiss/milvus tests * fix doc classifier test * fix eval test * fixing eval issues * Add latest docstring and tutorial changes * fix mypy * WIP replace dataclasses-json with manual serialization * Add latest docstring and tutorial changes * revert to dataclass-json serialization for now. remove debug prints. * update docstrings * fix extractor. fix Answer Span init * fix api test * keep meta data of answers in reader.run() * fix meta handling * adress review feedback * Add latest docstring and tutorial changes * make document=None for open domain labels * add import * fix print utils * fix rest api * adress review feedback * Add latest docstring and tutorial changes * fix mypy Co-authored-by: Markus Paff <markuspaff.mp@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-13 14:23:23 +02:00
Sara Zan	6354528336	Add `/documents/get_by_filters` endpoint (#1580) * Add endpoint to get documents by filter * Add test for /documents/get_by_filter and extend the delete documents test * Add rest_api/file-upload to .gitignore * Make sure the document store is empty for each test * Improve docstrings of delete_documents_by_filters and get_documents_by_filters Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-12 10:53:54 +02:00
Sara Zan	25d76f508d	Create EntityExtractor (#1573) * Create extractor/entity.py * Aggregate NER words into entities * Support indexing * Add doc strings * Add utility for printing * Update signature of run() to match BaseComponent * Add test * Modify simplify_ner_for_qa to return the dictionary and add its test Co-authored-by: brandenchan <brandenchan@icloud.com>	2021-10-11 11:04:11 +02:00
Sara Zan	54947cb840	Return intermediate nodes output in pipelines (#1558 ) * First rough implementation * Add a flag to dump the debug logs to the console as well * Typing run() and _dispatch_run() * Allow debug and debug_logs to be passed as arguments of run() * Avoid overwriting _debug, later we might want to store other objects in it * Put logs under a separate key of the _debug dictionary and add input and output of the node alongside it * Introduce global arguments for pipeline.run() that get applied to every node when defined * Change default values of debug variables to None, otherwise their default would override the params values * Remove a potential infinite recursion on the overridden __getattr__ * Do not append the output of the last node in the _debug key, it causes infinite recursion * Add tests * Move the input/output collection into _dispatch_run to gather only relevant info * Add partial Pipeline.run() docstring * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-10-07 22:13:25 +02:00
Vladimir Blagojevic	72168eddaf	Add BatchEncoding flatten (#1562) * Add BatchEncoding flatten * Rename BatchEncoding flatten to flatten_rename * Unit test for BatchEncoding flatten_rename	2021-10-07 15:29:57 +02:00
Sara Zan	3539e6b041	Fix circular import in the REST API (#1556) * Fix circular import in the REST API * remove unneeded import in test Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-10-04 21:18:23 +02:00
Sara Zan	af4a44fcbd	WIP Add rest api endpoint to delete documents by filter (#1546 ) * Add rest api endpoint to delete documents by filter. * Remove parametrization of rest api tests * Make the paths in rest_api/config.py absolute * Fix path to pipelines.yaml * Restructuring test_rest_api.py to be able to test only my endpoint (and to make the suite more structured) * Convert DELETE /documents into POST /documents/delete_by_filters Co-authored by: sarthakj2109 <54064348+sarthakj2109@users.noreply.github.com>	2021-10-04 11:21:00 +02:00
Julian Risch	24483d7bad	TransformersDocumentClassifier replacing FARMClassifier (#1540 ) * Initial draft of TransformersClassifier * Add transformers classifier implementation * Add test for SentenceTransformersClassifier * Add truncation and corresponding test case to Classifier * Add zero-shot classification and test * Add document classifier documentation * Add latest docstring and tutorial changes * print meta data with print_documents() * Add latest docstring and tutorial changes * Remove top_k param from Classifier usage example * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-10-01 11:22:56 +02:00
Sara Zan	a30a826c6c	Standardize `delete_documents(filter=...)` across all document stores (#1509 ) * Make InMemoryDocumentStore accept and apply filters in delete_documents() * Modify test_document_store.py to test the filtered deletion in memory, sql and milvus too * Make FAISSDocumentStore accept and properly apply filters in delete_documents() * Add latest docstring and tutorial changes * Remove accidentally duplicated test * Remove unnecessary decorators from test/test_document_store.py::test_delete_documents_with_filters * Add embeddings count test for FAISS and Milvus; Milvus fails it. * Fixed a bug that made Milvus not deleting embeddings * Remove batch size parametrization in tests & update all documentstore's docstrings with a filter example * Add latest docstring and tutorial changes Co-authored-by: prafgup <prafulgupta6@gmail.com>	2021-09-29 09:27:06 +02:00
Malte Pietsch	2df1aa8713	Fix document_store_type flag for tests with multiple fixtures that get parametrized. (#1526)	2021-09-28 16:38:21 +02:00
Julian Risch	f9d2f786ca	Replace FARM import statements; add dependencies (#1492 ) * Replace FARM import statements; add dependencies * Add InferenceProc., TextCl.Proc., TextPairCl.Proc. * Remove FARMRanker, add type annotations, rename max_sample * Add sample_to_features_text for InferenceProc. * Fix type annotations: model_name_or_path is str not Path * Fix mypy errors: implement _create_dataset in TextCl.Proc. * Add task_type "embeddings" in Inferencer * Allow loading AdaptiveModel for embedding task * Add SQuAD eval metrics; enable InferenceProc for embedding task * Add baskets as param to log_samples and handle empty basket list in log_samples * Remove unused dependencies * Remove FARMClassifier (doc classificer) due to ref to TextClassificationHead * Remove FARMRanker and Classifier from doc generation scripts Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-09-28 16:34:24 +02:00
Sara Zan	1cd17022af	Fix bug when loading FAISS from supplied config file path (#1506) * Fix the bug found in issue 135 * Add a test for the custom path	2021-09-27 11:25:05 +02:00
Malte Pietsch	183fd5ae5a	Simplify tests & allow running on individual doc stores (#1487 ) * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * update readme and contributing.md * update contributing * adjust example Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-09-27 10:52:07 +02:00
Julian Risch	60471cecdf	Add inferencer for QA only (#1484 ) * Add inferencer for QA only * Add latest docstring and tutorial changes * Add QA inferencer tests * Add type annotations for inferencer * Fix type annotations, move util functions * Fix type annotations * Move fixtures to the top of the file Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-09-22 16:56:51 +02:00
Sara Zan	21513532e5	Improve save/load of FAISS document store by saving its configuration alongside the index (#1459 ) * Saves the FAISSDocumentStore init params to JSON at save() and loads them at load() if they're found. First draft, to be tested. * Fixing issue with string/Path objects in a few string operations, thanks mypy * Leverage self.set_config instead of saving the parameters in a separate attribute * Modify test_faiss_and_milvus:test_faiss_index_save_and_load to test that init params are preserved * Add assert to verify that the SQL doc count and FAISS vector count is equal. Needs to always specify the name of the SQL db for this to work * Simplified the implementation a bit, add better comments * Forgot a return at the end of the file * Fixing some of the suggestions from the review * Add a try-catch in the load method and fix the tests * Typo	2021-09-20 08:32:14 +02:00
mathislucka	9c4e67d9b6	Enable cosine similarity metric in FAISSDocumentStore (#1352 ) * feat: normalize embeddings for cosine sim * WIP add test case for faiss cosine * input to faiss normalize needs to be an array of vectors * fix: test should compare correct result embedding to original embedding * add sanity check for cosine sim * fix typo * normalize cosine score * Update docstring Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-20 07:54:26 +02:00
Timo Moeller	172de1c05f	Merge pull request #1422 from deepset-ai/farm_merging_base Farm merging base	2021-09-16 11:32:41 +02:00
Timo Moeller	d804861fb2	Fix tests	2021-09-13 20:00:22 +02:00
Timo Moeller	537204e8c9	Fix tests and adjust folder structure * Add type annotations in QuestionAnsweringHead * Fix test by increasing max_seq_len * Add SampleBasket type annotation * Remove prediction head param from adaptive model init * Add type ignore for AdaptiveModel init * Fix and rename tests * Adjust folder structure Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2021-09-13 18:38:14 +02:00
Ikram Ali	f186d6327d	Add MostSimilarDocumentsPipeline (#1413 ) * [pipeline] MostSimilarDocumentsPipeline added * [pipeline] mypy bug fixed. * [pipeline] mypy bug fixed. * [pipeline] test cases added. * [pipeline] test cases added. * [pipeline] set return_embedding back to false. * [pipeline] return a list of Documents * [pipeline] define the ids * [pipeline] code refactor. * [pipeline] code refactor. * [pipeline] test case improved. * Update docstring Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-13 12:43:45 +02:00
MichelBartels	da2e8da561	Adding multi gpu support for DPR inference (#1414 ) * Added support for Multi-GPU inference to DPR including benchmark * fixed multi gpu * added batch size to benchmark to better reflect multi gpu capabilities * remove unnecessary entry in config.json * fixed typos * fixed config name * update benchmark to use DEVICES constant * changed multi gpu parameters and updated docstring * adds silent fallback on cpu * update doc string, warning and config Co-authored-by: Michel Bartels <kontakt@michelbartels.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-10 13:25:02 +02:00
oryx1729	9dd7c74f4f	Refactor communication between Pipeline Components (#1321 )	2021-09-10 11:41:16 +02:00
Julian Risch	4a64c50c7e	Merge branch 'farm_merging_base' of github.com:deepset-ai/haystack into farm_merging_base	2021-09-09 13:03:38 +02:00
Julian Risch	ba1fe0ec61	Add fixture distilbert_squad	2021-09-09 13:02:35 +02:00
bogdankostic	2626388961	Fix DPR tests + add Tokenizer tests (#1429 ) * Fix DPR tests * Add Tokenizer tests	2021-09-09 12:56:44 +02:00
Julian Risch	23338f1b74	Add tests: prediction head, processor load/save, qa from FARM	2021-09-09 11:54:47 +02:00
Timo Moeller	b4fd08a296	Add testdata, add tests for qa processor, add dpr tests (some failing)	2021-09-08 12:02:08 +02:00
Shahrukh Khan	4822536886	Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR (#1349 ) * add image.py converter * add PDFtoImageConverter * add init to PDFtoImageConverter and classes to __init__ * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * revert change in base.py in file_conv * Update base.py * Update pdf.py * add ocr file_converter testcase & update dockerfile * fix tesseract exception message typo * fix _image_to_text doctstring * add tesseract installation to CI * add tesseract installation to CI * add content test for PDF OCR converter * update PDFToTextOCRConverter constructor doctsring * replace image files with tmp paths for image.py convert * replace image files with tmp paths for image.py convert * Update README.md Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-01 16:42:25 +02:00
oryx1729	a71180a2ca	Refactor `replicas` config for Ray Pipelines (#1378 )	2021-08-31 10:14:55 +02:00
ramgarg102	51f0a56e5d	delete_all_documents() replaced by delete_documents() (#1377 ) * [UPDT] delete_all_documents() replaced by delete_documents() * [UPDT] warning logs to be fixed * [UPDT] delete_all_documents() renamed and the same method added Co-authored-by: Ram Garg <ramgarg102@gmai.com>	2021-08-30 15:18:28 +02:00
Markus Paff	be8d305190	Editing docs read.me for new docs website workflow (#1372 ) * editing docs read.me for new docs website workflow * added new links to docs	2021-08-30 14:59:40 +02:00
Ikram Ali	ead96730d3	Add Crawler support for indexing pipeline (#1360 )	2021-08-24 14:25:22 +02:00
Ikram Ali	ef27f0d386	Add tests for Crawler (#1339 )	2021-08-18 14:05:44 +02:00
Julian Risch	eb990c9688	Removing probability field from answers in favor of score field (#1340 ) * Removing probability field from reader and from test cases * Add switch to FARMReader to choose score/probability * Remove probability field from doc returned by doc store * Relax assertion testing joined es and dpr predictions * Use switch for confidence scores also for no_answer * Add test that checks switching to old answer scores > 10 * Normalize score in elastic doc store and reset reader.md * Scale weights of JoinDocuments to sum to 1 and adapt test case	2021-08-17 10:27:11 +02:00
Timo Moeller	07bd3c50ea	Add new QA eval metric: Semantic Answer Similarity (SAS) (#1338 ) * init * Add type annotation * Add test case, fix mypy * Add german model to docstring Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-08-12 14:31:48 +02:00
Malte Pietsch	a0921f0c35	Remove `Finder` (#1326 ) * deprecate finder * remove import * add doc section for moving from finder to pipelines	2021-08-09 13:41:40 +02:00

1414 Commits