haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-08-30 11:26:17 +00:00

Author	SHA1	Message	Date
Julian Risch	f9d2f786ca	Replace FARM import statements; add dependencies (#1492 ) * Replace FARM import statements; add dependencies * Add InferenceProc., TextCl.Proc., TextPairCl.Proc. * Remove FARMRanker, add type annotations, rename max_sample * Add sample_to_features_text for InferenceProc. * Fix type annotations: model_name_or_path is str not Path * Fix mypy errors: implement _create_dataset in TextCl.Proc. * Add task_type "embeddings" in Inferencer * Allow loading AdaptiveModel for embedding task * Add SQuAD eval metrics; enable InferenceProc for embedding task * Add baskets as param to log_samples and handle empty basket list in log_samples * Remove unused dependencies * Remove FARMClassifier (doc classifier) due to ref to TextClassificationHead * Remove FARMRanker and Classifier from doc generation scripts Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-09-28 16:34:24 +02:00
Sara Zan	1cd17022af	Fix bug when loading FAISS from supplied config file path (#1506 ) * Fix the bug found in issue 135 * Add a test for the custom path	2021-09-27 11:25:05 +02:00
Malte Pietsch	183fd5ae5a	Simplify tests & allow running on individual doc stores (#1487 ) * simplify tests for individual doc stores * WIP refactoring markers of tests * test alternative approach for tests with existing parametrization * fix skip logic of already parametrized tests * fix weaviate behaviour in tests - not parametrizing it in our general test cases. * Add latest docstring and tutorial changes * fix some tests * remove sql from document_store_types * fix markers for generator and pipeline test * remove inmemory marker * remove unneeded elasticsearch markers * update readme and contributing.md * update contributing * adjust example Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-09-27 10:52:07 +02:00
Julian Risch	60471cecdf	Add inferencer for QA only (#1484 ) * Add inferencer for QA only * Add latest docstring and tutorial changes * Add QA inferencer tests * Add type annotations for inferencer * Fix type annotations, move util functions * Fix type annotations * Move fixtures to the top of the file Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-09-22 16:56:51 +02:00
Sara Zan	21513532e5	Improve save/load of FAISS document store by saving its configuration alongside the index (#1459 ) * Saves the FAISSDocumentStore init params to JSON at save() and loads them at load() if they're found. First draft, to be tested. * Fixing issue with string/Path objects in a few string operations, thanks mypy * Leverage self.set_config instead of saving the parameters in a separate attribute * Modify test_faiss_and_milvus:test_faiss_index_save_and_load to test that init params are preserved * Add assert to verify that the SQL doc count and FAISS vector count is equal. Needs to always specify the name of the SQL db for this to work * Simplified the implementation a bit, add better comments * Forgot a return at the end of the file * Fixing some of the suggestions from the review * Add a try-catch in the load method and fix the tests * Typo	2021-09-20 08:32:14 +02:00
mathislucka	9c4e67d9b6	Enable cosine similarity metric in FAISSDocumentStore (#1352 ) * feat: normalize embeddings for cosine sim * WIP add test case for faiss cosine * input to faiss normalize needs to be an array of vectors * fix: test should compare correct result embedding to original embedding * add sanity check for cosine sim * fix typo * normalize cosine score * Update docstring Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-20 07:54:26 +02:00
Timo Moeller	172de1c05f	Merge pull request #1422 from deepset-ai/farm_merging_base Farm merging base	2021-09-16 11:32:41 +02:00
Timo Moeller	d804861fb2	Fix tests	2021-09-13 20:00:22 +02:00
Timo Moeller	537204e8c9	Fix tests and adjust folder structure * Add type annotations in QuestionAnsweringHead * Fix test by increasing max_seq_len * Add SampleBasket type annotation * Remove prediction head param from adaptive model init * Add type ignore for AdaptiveModel init * Fix and rename tests * Adjust folder structure Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2021-09-13 18:38:14 +02:00
Ikram Ali	f186d6327d	Add MostSimilarDocumentsPipeline (#1413 ) * [pipeline] MostSimilarDocumentsPipeline added * [pipeline] mypy bug fixed. * [pipeline] mypy bug fixed. * [pipeline] test cases added. * [pipeline] test cases added. * [pipeline] set return_embedding back to false. * [pipeline] return a list of Documents * [pipeline] define the ids * [pipeline] code refactor. * [pipeline] code refactor. * [pipeline] test case improved. * Update docstring Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-13 12:43:45 +02:00
MichelBartels	da2e8da561	Adding multi gpu support for DPR inference (#1414 ) * Added support for Multi-GPU inference to DPR including benchmark * fixed multi gpu * added batch size to benchmark to better reflect multi gpu capabilities * remove unnecessary entry in config.json * fixed typos * fixed config name * update benchmark to use DEVICES constant * changed multi gpu parameters and updated docstring * adds silent fallback on cpu * update doc string, warning and config Co-authored-by: Michel Bartels <kontakt@michelbartels.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-10 13:25:02 +02:00
oryx1729	9dd7c74f4f	Refactor communication between Pipeline Components (#1321 )	2021-09-10 11:41:16 +02:00
Julian Risch	4a64c50c7e	Merge branch 'farm_merging_base' of github.com:deepset-ai/haystack into farm_merging_base	2021-09-09 13:03:38 +02:00
Julian Risch	ba1fe0ec61	Add fixture distilbert_squad	2021-09-09 13:02:35 +02:00
bogdankostic	2626388961	Fix DPR tests + add Tokenizer tests (#1429 ) * Fix DPR tests * Add Tokenizer tests	2021-09-09 12:56:44 +02:00
Julian Risch	23338f1b74	Add tests: prediction head, processor load/save, qa from FARM	2021-09-09 11:54:47 +02:00
Timo Moeller	b4fd08a296	Add testdata, add tests for qa processor, add dpr tests (some failing)	2021-09-08 12:02:08 +02:00
Shahrukh Khan	4822536886	Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR (#1349 ) * add image.py converter * add PDFtoImageConverter * add init to PDFtoImageConverter and classes to __init__ * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * update imagetotext pipeline * revert change in base.py in file_conv * Update base.py * Update pdf.py * add ocr file_converter testcase & update dockerfile * fix tesseract exception message typo * fix _image_to_text docstring * add tesseract installation to CI * add tesseract installation to CI * add content test for PDF OCR converter * update PDFToTextOCRConverter constructor docstring * replace image files with tmp paths for image.py convert * replace image files with tmp paths for image.py convert * Update README.md Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-09-01 16:42:25 +02:00
oryx1729	a71180a2ca	Refactor `replicas` config for Ray Pipelines (#1378 )	2021-08-31 10:14:55 +02:00
ramgarg102	51f0a56e5d	delete_all_documents() replaced by delete_documents() (#1377 ) * [UPDT] delete_all_documents() replaced by delete_documents() * [UPDT] warning logs to be fixed * [UPDT] delete_all_documents() renamed and the same method added Co-authored-by: Ram Garg <ramgarg102@gmai.com>	2021-08-30 15:18:28 +02:00
Markus Paff	be8d305190	Editing docs read.me for new docs website workflow (#1372 ) * editing docs read.me for new docs website workflow * added new links to docs	2021-08-30 14:59:40 +02:00
Ikram Ali	ead96730d3	Add Crawler support for indexing pipeline (#1360 )	2021-08-24 14:25:22 +02:00
Ikram Ali	ef27f0d386	Add tests for Crawler (#1339 )	2021-08-18 14:05:44 +02:00
Julian Risch	eb990c9688	Removing probability field from answers in favor of score field (#1340 ) * Removing probability field from reader and from test cases * Add switch to FARMReader to choose score/probability * Remove probability field from doc returned by doc store * Relax assertion testing joined es and dpr predictions * Use switch for confidence scores also for no_answer * Add test that checks switching to old answer scores > 10 * Normalize score in elastic doc store and reset reader.md * Scale weights of JoinDocuments to sum to 1 and adapt test case	2021-08-17 10:27:11 +02:00
Timo Moeller	07bd3c50ea	Add new QA eval metric: Semantic Answer Similarity (SAS) (#1338 ) * init * Add type annotation * Add test case, fix mypy * Add german model to docstring Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-08-12 14:31:48 +02:00
Malte Pietsch	a0921f0c35	Remove `Finder` (#1326 ) * deprecate finder * remove import * add doc section for moving from finder to pipelines	2021-08-09 13:41:40 +02:00
oryx1729	bafa1b46de	Add Ray integration for Pipelines (#1255 )	2021-08-02 14:51:24 +02:00
Branden Chan	937247d628	Add QuestionGenerator (#1267 ) * Create basic Question Generation * Split texts into 50 word chunks * Allow prompt to be changed * Implement iteration functionality in DS * Add docstrings, create pipelines * Make pipelines work * Add comments * Add tests * Add tutorials and docs * Add doc string	2021-07-26 17:20:43 +02:00
Branden Chan	363be65a78	Implement OpenSearch ANN (#1225 ) * Simplify ODES init * Add arguments to ES init and create script * Rename similarity_fn_name and add util fn * Create OpenSearchDocumentStore * Specify params of Open Search HNSW * Add better argument handling * Update opensearch index mapping * Edit opensearch default port * Fix HNSW mapping * Force small HNSW params * Implement auto start and stopping of document store services * Fix starting and stopping of ds service * Restore HNSW params * Add opensearch query benchmarks * Add write wait time * Revert wait time * Add timeout * Update benchmarks * Update benchmarks * Update benchmarks json * Update documentation * Update documentation * Fix similarity name * Improve argument passing * Improve stopping and starting of service	2021-07-26 10:52:52 +02:00
Julian Risch	4e6f7f349d	Add FARMClassifier node for Document Classification (#1265 ) * Add FARM classification node * Add classification output to meta field of document * Update usage example * Add test case for FARMClassifier * Replace FARMRanker with FARMClassifier in documentation strings * Remove base method not implemented by any child class, etc.	2021-07-13 21:44:26 +02:00
Julian Risch	dbb9efbd39	Add SentenceTransformersRanker with pre-trained Cross-Encoder (#1209 ) * Add SentenceTransformersRanker with pre-trained Cross-Encoder * Add test cases for Ranker nodes and update documentation * update docstring * Update docstring * Update __init__.py * update import for test Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-07-07 17:31:45 +02:00
Ikram Ali	29e140196b	[pipeline] Allow for batch indexing when using Pipelines fix #1168 (#1231 ) * [pipeline] Allow for batch indexing when using Pipelines fix #1168 * [pipeline] Test case fixed fix #1168 * [file_converter] Path.suffix updated #1168 * [file_converter] meta can be one of these three cases: A single dict that is applied to all files One dict for each file being converted None #1168 * [file_converter] mypy error fixed. * [file_converter] mypy error fixed. * [rest_api] batch file upload introduced in indexing API. * [test_case] Test_api file upload parameter name updated. * [ui] Streamlit file upload parameter updated.	2021-06-30 14:13:46 +02:00
vblagoje	02fc4c7783	Improve document stores unit test parametrization (#1202 )	2021-06-22 16:08:23 +02:00
vblagoje	2a5882578a	Add Longform-QA (LFQA), Seq2SeqGenerator for generative QA and Retribert Retriever (#1086 ) * Integrate LFQA with Haystack * Integrate LFQA with Haystack - unit tests * Properly initialize conftest default value for vector_dim * Update PR after initial feedback * Fix conftest.py import * Seq2SeqGenerator uses Callables instead of subclasses for custom model input * Update docstring * Fix Callable use * Add LFQA tutorials * Improve type error reporting for invalid input converter Callable * Generate docstrings * Format comments in tutorial script * Generate tutorial md * Add usage page Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai> Co-authored-by: brandenchan <brandenchan@icloud.com>	2021-06-14 17:53:43 +02:00
venuraja79	49886f88f0	Integrate Weaviate as another DocumentStore (#1064 ) * Annotation Tool: data is not persisted when using local version #853 * First version of weaviate * First version of weaviate * First version of weaviate * Updated comments * Updated comments * ran query, get and write tests * update embeddings, dynamic schema and filters implemented * Initial set of tests and fixes * Tests added for update_embeddings and delete documents * introduced duplicate documents fix * fixed mypy errors * Added Weaviate to requirements * Fix the weaviate docker env variables * Fixing test dependencies for now * Created weaviate test marker and fixed query * Update docstring * Add documentation * Bump up weaviate version * Bump up weaviate version in documentation * Bump up weaviate version in documentation * Upgrade weaviate version Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-06-10 09:43:53 +02:00
Shahrukh Khan	545c625a37	Add QueryClassifier incl. baseline models (#1099 ) * restructure query classifier code and add s3 based pickles * make model and vectorizer optional in query classifier * update query classifier as per init style * add query classifiers sklearn/hf * update docstrings for query classifiers * add unit test for query classifier * add type patch for sklearn classifier * fix mypy type issue * revert to pure formatting * add query classifiers * resolve conflict * add output names for query classifier * revert output and update docstring queryclassifier * Update docstring for SklearnQueryClassifier * update transformer query classifier docstring * fix typo * change arg names in query classifier classes * add set_config(). rename attributes * fix set_config() Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-06-08 15:20:13 +02:00
Branden Chan	c513865566	Add L2 support for FAISS HNSW (#1138 )	2021-06-04 11:05:18 +02:00
Branden Chan	09ba75073c	Improve Milvus HNSW Performance (#1127 ) * Add simplified script * Optimize HNSW index creation * Adjust benchmark order * Rename script	2021-06-02 13:17:35 +02:00
Branden Chan	9356f637d4	Update Milvus benchmarks (#1128 ) * Update Milvus benchmarks * Add sentence transformers * Update sentence transformers index results * Remove duplicate row	2021-06-02 13:09:45 +02:00
Branden Chan	aa6f768efa	Prevent merge of same questions on different documents during evaluation (#1119 ) * Fix duplicate question in Reader.eval() * Add duplicate question support in document store * Support duplicate questions in retriever eval * Update tutorial * Rename key_tuple * Change error message * Add warning when more than 6 labels * Allow for label grouping options * Add support for aggregating by label meta * Satisfy mypy * Fix duplicate question in Reader.eval() * Add duplicate question support in document store * Support duplicate questions in retriever eval * Update tutorial * Rename key_tuple * Change error message * Add warning when more than 6 labels * Allow for label grouping options * Add support for aggregating by label meta * Satisfy mypy * Make label field flexible, add docstrings * Satisfy mypy * Fix failing tests * Adjust docstring * Fix tutorial Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-06-02 12:09:03 +02:00
Julian Risch	84c34295a1	Re-ranking component for document search without QA (#1025 ) * Adding ranker similar to retriever and reader * Sort documents according to query-document similarity scores * Reranking and model training runs for small example * Added EvalRanker node * Calculate recall@k in EvalRetriever and EvalRanker nodes * Renaming EvalRetriever to EvalDocuments and EvalReader to EvalAnswers * Added mean reciprocal rank as metric for EvalDocuments * Fix bug that appeared when ranking documents with same score * Remove commented code for unimplemented eval() of Ranker node * Add documentation of k parameter in EvalDocuments * Add Ranker docu and renaming top_k param	2021-05-31 15:31:36 +02:00
Ikram Ali	b76ed4c5a4	Add options for handling duplicate documents (skip, fail, overwrite) (#1088 ) * [document_stores] Duplicate document implementation added for memorystore. * [document_stores]duplicate documents implementation done for faiss store. * [document_store] Duplicate document feature added for elasticsearch document store fixed #1069 * [document_store] Duplicate documents feature added for milvus document store and bug fixed in faiss document store fixed #1069 * [document_store] Code refactored fixed #1069 * [document_store]Test cases refactored. * [document_store] mypy issue fixed. * [test_case] faiss and milvus test case refactored to support duplicate documents implementation. fixed #1069 * [document_store] duplicate_documents_options code refactored. * [document_store] Code refactored.	2021-05-25 13:30:06 +02:00
Ikram Ali	4ab1bc3c3e	Improve the progress bar in update_embeddings() + Fix filters in update_embeddings() (#1063 ) * [document_stores]Add the progressbar in update_embeddings() to track the overall documents progress closed #1037 * change 2nd level loop to docs. switch to tqdm.auto. * [document_stores] Elasticsearch new method get_document_without_embedding_count() added. * [test_case] Elasticsearch documentstore get_document_without_embedding_count() test case added. * [document_stores] Add new bool arg in get_document_count() method and fixed #1082 * [document_stores] typo fixed #1082 Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-05-21 14:18:07 +02:00
Lalit Pagaria	f46b09c756	Using text hash as id to prevent document duplication (#1000 ) * using text hash as id to prevent document duplication. Also providing a way customize it. * Add latest docstring and tutorial changes * Fixing duplicate value test when text is same * Adding test for duplicate ids in document store * Changing exception to generic Exception type * add exception for inmemory. update docstring Document. remove id_hash_keys from object attribute * Add latest docstring and tutorial changes * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>	2021-05-17 17:51:52 +02:00
Ikram Ali	a06e4450d1	Rename delete_all_documents() method to delete_documents() (#1047 )	2021-05-10 13:37:08 +02:00
Julian Risch	bf4563e5d2	Filtering duplicate answers (#1021 ) * Allow filtering of duplicate answers as implemented in FARM * Changed default behavior to filtering exact duplicates * Change expected test result due to filtering of duplicate answers by default * Rounding expected test results for comparison with predictions	2021-05-03 17:18:10 +02:00
oryx1729	99990e7249	Add export of Pipeline YAML config (#1003 )	2021-04-30 12:23:29 +02:00
oryx1729	8a57f6b16a	Update tests for FAISSDocumentStore (#999 )	2021-04-27 09:55:31 +02:00
oryx1729	7269530e45	Add validation for root node in Pipeline (#987 )	2021-04-21 12:18:33 +02:00
oryx1729	8c1e411380	Fix update_embeddings() for FAISSDocumentStore (#978 )	2021-04-21 09:56:35 +02:00

240 Commits