1204 Commits

Author SHA1 Message Date
Malte Pietsch
eb95f0e8aa
Add more flexible options for model downloads (Proxies, resume_download, local_files_only...) (#1256)
* allow passing more options for model/tokenizer download from remote

* temporarily change dependency to current farm master

* Add latest docstring and tutorial changes

* add kwargs

* add docstrings

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-18 15:47:36 +02:00
Malte Pietsch
3d58e81b5e
Switch from dataclass to pydantic dataclass & Fix Swagger API Docs (#1598)
* test pydantic dataclasses

* Add latest docstring and tutorial changes

* enable pydantic mypy plugin

* switch to pydantic dataclasses and implement custom to_json from_json

* clean up

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-18 14:38:14 +02:00
bogdankostic
655d721371
Add Table Reader (#1446)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* Add first draft of TableReader

* rename label.question -> label.query. Allow sorting of Answers.

* Add calculation of answer scores

* WIP primitives

* Adapt input and output to new primitives

* Add doc strings

* Add tests

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* Adapt answer format

* Add latest docstring and tutorial changes

* keep meta data of answers in reader.run()

* Fix mypy

* fix meta handling

* address review feedback

* Add latest docstring and tutorial changes

* Allow inference on GPU

* Remove automatic aggregation

* Add automatic aggregation

* Add latest docstring and tutorial changes

* Add torch-scatter dependency

* Add wheel to torch-scatter dependency

* Fix requirements

* Fix requirements

* Fix requirements

* Adapt setup.py to allow for wheels

* Fix requirements

* Fix requirements

* Add type hints and code snippet

* Add latest docstring and tutorial changes

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-15 16:34:48 +02:00
Julian Risch
5ec29a5283
Limit generator tests to memory doc store; split pipeline tests (#1602)
* Limit generator tests to memory doc store; split pipeline tests

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-15 15:37:46 +02:00
Malte Pietsch
4a6c9302b3
Redesign primitives - Document, Answer, Label (#1398)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* rename label.question -> label.query. Allow sorting of Answers.

* WIP primitives

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* keep meta data of answers in reader.run()

* fix meta handling

* address review feedback

* Add latest docstring and tutorial changes

* make document=None for open domain labels

* add import

* fix print utils

* fix rest api

* address review feedback

* Add latest docstring and tutorial changes

* fix mypy

Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-13 14:23:23 +02:00
Sara Zan
6354528336
Add /documents/get_by_filters endpoint (#1580)
* Add endpoint to get documents by filter

* Add test for /documents/get_by_filter and extend the delete documents test

* Add rest_api/file-upload to .gitignore

* Make sure the document store is empty for each test

* Improve docstrings of delete_documents_by_filters and get_documents_by_filters

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-12 10:53:54 +02:00
Sara Zan
25d76f508d
Create EntityExtractor (#1573)
* Create extractor/entity.py

* Aggregate NER words into entities

* Support indexing

* Add doc strings

* Add utility for printing

* Update signature of run() to match BaseComponent

* Add test

* Modify simplify_ner_for_qa to return the dictionary and add its test

Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-10-11 11:04:11 +02:00
Sara Zan
54947cb840
Return intermediate nodes output in pipelines (#1558)
* First rough implementation

* Add a flag to dump the debug logs to the console as well

* Typing run() and _dispatch_run()

* Allow debug and debug_logs to be passed as arguments of run()

* Avoid overwriting _debug, later we might want to store other objects in it

* Put logs under a separate key of the _debug dictionary and add input and output of the node alongside it

* Introduce global arguments for pipeline.run() that get applied to every node when defined

* Change default values of debug variables to None, otherwise their default would override the params values

* Remove a potential infinite recursion on the overridden __getattr__

* Do not append the output of the last node in the _debug key, it causes infinite recursion

* Add tests

* Move the input/output collection into _dispatch_run to gather only relevant info

* Add partial Pipeline.run() docstring

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-10-07 22:13:25 +02:00
Vladimir Blagojevic
72168eddaf
Add BatchEncoding flatten (#1562)
* Add BatchEncoding flatten

* Rename BatchEncoding flatten to flatten_rename

* Unit test for BatchEncoding flatten_rename
2021-10-07 15:29:57 +02:00
Sara Zan
3539e6b041
Fix circular import in the REST API (#1556)
* Fix circular import in the REST API

* remove unneeded import in test

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-10-04 21:18:23 +02:00
Sara Zan
af4a44fcbd
WIP Add rest api endpoint to delete documents by filter (#1546)
* Add rest api endpoint to delete documents by filter.

* Remove parametrization of rest api tests

* Make the paths in rest_api/config.py absolute

* Fix path to pipelines.yaml

* Restructuring test_rest_api.py to be able to test only my endpoint (and to make the suite more structured)

* Convert DELETE /documents into POST /documents/delete_by_filters

Co-authored-by: sarthakj2109 <54064348+sarthakj2109@users.noreply.github.com>
2021-10-04 11:21:00 +02:00
Julian Risch
24483d7bad
TransformersDocumentClassifier replacing FARMClassifier (#1540)
* Initial draft of TransformersClassifier

* Add transformers classifier implementation

* Add test for SentenceTransformersClassifier

* Add truncation and corresponding test case to Classifier

* Add zero-shot classification and test

* Add document classifier documentation

* Add latest docstring and tutorial changes

* print meta data with print_documents()

* Add latest docstring and tutorial changes

* Remove top_k param from Classifier usage example

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-01 11:22:56 +02:00
Sara Zan
a30a826c6c
Standardize delete_documents(filter=...) across all document stores (#1509)
* Make InMemoryDocumentStore accept and apply filters in delete_documents()

* Modify test_document_store.py to test the filtered deletion in memory, sql and milvus too

* Make FAISSDocumentStore accept and properly apply filters in delete_documents()

* Add latest docstring and tutorial changes

* Remove accidentally duplicated test

* Remove unnecessary decorators from test/test_document_store.py::test_delete_documents_with_filters

* Add embeddings count test for FAISS and Milvus; Milvus fails it.

* Fixed a bug that prevented Milvus from deleting embeddings

* Remove batch size parametrization in tests & update all documentstore's docstrings with a filter example

* Add latest docstring and tutorial changes

Co-authored-by: prafgup <prafulgupta6@gmail.com>
2021-09-29 09:27:06 +02:00
Malte Pietsch
2df1aa8713
Fix document_store_type flag for tests with multiple fixtures that get parametrized. (#1526) 2021-09-28 16:38:21 +02:00
Julian Risch
f9d2f786ca
Replace FARM import statements; add dependencies (#1492)
* Replace FARM import statements; add dependencies

* Add InferenceProc., TextCl.Proc., TextPairCl.Proc.

* Remove FARMRanker, add type annotations, rename max_sample

* Add sample_to_features_text for InferenceProc.

* Fix type annotations: model_name_or_path is str not Path

* Fix mypy errors: implement _create_dataset in TextCl.Proc.

* Add task_type "embeddings" in Inferencer

* Allow loading AdaptiveModel for embedding task

* Add SQuAD eval metrics; enable InferenceProc for embedding task

* Add baskets as param to log_samples and handle empty basket list in log_samples

* Remove unused dependencies

* Remove FARMClassifier (doc classifier) due to ref to TextClassificationHead

* Remove FARMRanker and Classifier from doc generation scripts

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-28 16:34:24 +02:00
Sara Zan
1cd17022af
Fix bug when loading FAISS from supplied config file path (#1506)
* Fix the bug found in issue 135

* Add a test for the custom path
2021-09-27 11:25:05 +02:00
Malte Pietsch
183fd5ae5a
Simplify tests & allow running on individual doc stores (#1487)
* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* update readme and contributing.md

* update contributing

* adjust example

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-27 10:52:07 +02:00
Julian Risch
60471cecdf
Add inferencer for QA only (#1484)
* Add inferencer for QA only

* Add latest docstring and tutorial changes

* Add QA inferencer tests

* Add type annotations for inferencer

* Fix type annotations, move util functions

* Fix type annotations

* Move fixtures to the top of the file

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-22 16:56:51 +02:00
Sara Zan
21513532e5
Improve save/load of FAISS document store by saving its configuration alongside the index (#1459)
* Saves the FAISSDocumentStore init params to JSON at save() and loads them at load() if they're found. First draft, to be tested.

* Fixing issue with string/Path objects in a few string operations, thanks mypy

* Leverage self.set_config instead of saving the parameters in a separate attribute

* Modify test_faiss_and_milvus:test_faiss_index_save_and_load to test that init params are preserved

* Add assert to verify that the SQL doc count and FAISS vector count is equal. Needs to always specify the name of the SQL db for this to work

* Simplified the implementation a bit, add better comments

* Forgot a return at the end of the file

* Fixing some of the suggestions from the review

* Add a try-catch in the load method and fix the tests

* Typo
2021-09-20 08:32:14 +02:00
mathislucka
9c4e67d9b6
Enable cosine similarity metric in FAISSDocumentStore (#1352)
* feat: normalize embeddings for cosine sim

* WIP add test case for faiss cosine

* input to faiss normalize needs to be an array of vectors

* fix: test should compare correct result embedding to original embedding

* add sanity check for cosine sim

* fix typo

* normalize cosine score

* Update docstring

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-09-20 07:54:26 +02:00
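The commit above enables cosine similarity by normalizing embeddings. The idea it relies on can be sketched in plain Python: the inner product of two L2-normalized vectors equals the cosine similarity of the original vectors, so an inner-product index can presumably serve cosine queries once everything is normalized. The helper names (`l2_normalize`, `inner_product`, `cosine_similarity`) and the sample vectors are illustrative assumptions, not code from the repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the definition."""
    denom = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / denom


# Hypothetical embeddings, used only for illustration.
query = [1.0, 2.0, 3.0]
doc = [2.0, 0.5, 1.0]

# Inner product on normalized vectors vs. cosine on the raw vectors.
via_normalized_ip = inner_product(l2_normalize(query), l2_normalize(doc))
direct = cosine_similarity(query, doc)
```

The two values agree up to floating-point error, which is why normalizing once at write time and once at query time is sufficient.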
Timo Moeller
172de1c05f
Merge pull request #1422 from deepset-ai/farm_merging_base
Farm merging base
2021-09-16 11:32:41 +02:00
Timo Moeller
d804861fb2 Fix tests 2021-09-13 20:00:22 +02:00
Timo Moeller
537204e8c9
Fix tests and adjust folder structure
* Add type annotations in QuestionAnsweringHead

* Fix test by increasing max_seq_len

* Add SampleBasket type annotation

* Remove prediction head param from adaptive model init

* Add type ignore for AdaptiveModel init

* Fix and rename tests

* Adjust folder structure

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2021-09-13 18:38:14 +02:00
Ikram Ali
f186d6327d
Add MostSimilarDocumentsPipeline (#1413)
* [pipeline] MostSimilarDocumentsPipeline added

* [pipeline] mypy bug fixed.

* [pipeline] mypy bug fixed.

* [pipeline] test cases added.

* [pipeline] test cases added.

* [pipeline] set return_embedding back to false.

* [pipeline] return a list of Documents

* [pipeline] define the ids

* [pipeline] code refactor.

* [pipeline] code refactor.

* [pipeline] test case improved.

* Update docstring

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-09-13 12:43:45 +02:00
MichelBartels
da2e8da561
Adding multi gpu support for DPR inference (#1414)
* Added support for Multi-GPU inference to DPR including benchmark

* fixed multi gpu

* added batch size to benchmark to better reflect multi gpu capabilities

* remove unnecessary entry in config.json

* fixed typos

* fixed config name

* update benchmark to use DEVICES constant

* changed multi gpu parameters and updated docstring

* adds silent fallback on cpu

* update doc string, warning and config

Co-authored-by: Michel Bartels <kontakt@michelbartels.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-09-10 13:25:02 +02:00
oryx1729
9dd7c74f4f
Refactor communication between Pipeline Components (#1321) 2021-09-10 11:41:16 +02:00
Julian Risch
4a64c50c7e Merge branch 'farm_merging_base' of github.com:deepset-ai/haystack into farm_merging_base 2021-09-09 13:03:38 +02:00
Julian Risch
ba1fe0ec61 Add fixture distilbert_squad 2021-09-09 13:02:35 +02:00
bogdankostic
2626388961
Fix DPR tests + add Tokenizer tests (#1429)
* Fix DPR tests

* Add Tokenizer tests
2021-09-09 12:56:44 +02:00
Julian Risch
23338f1b74 Add tests: prediction head, processor load/save, qa from FARM 2021-09-09 11:54:47 +02:00
Timo Moeller
b4fd08a296
Add testdata, add tests for qa processor, add dpr tests (some failing) 2021-09-08 12:02:08 +02:00
Shahrukh Khan
4822536886
Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR (#1349)
* add image.py converter

* add PDFtoImageConverter

* add init to PDFtoImageConverter and classes to __init__

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* revert change in base.py in file_conv

* Update base.py

* Update pdf.py

* add ocr file_converter testcase & update dockerfile

* fix tesseract exception message typo

* fix _image_to_text docstring

* add tesseract installation to CI

* add tesseract installation to CI

* add content test for PDF OCR converter

* update PDFToTextOCRConverter constructor docstring

* replace image files with tmp paths for image.py convert

* replace image files with tmp paths for image.py convert

* Update README.md

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-09-01 16:42:25 +02:00
oryx1729
a71180a2ca
Refactor replicas config for Ray Pipelines (#1378) 2021-08-31 10:14:55 +02:00
ramgarg102
51f0a56e5d
delete_all_documents() replaced by delete_documents() (#1377)
* [UPDT] delete_all_documents() replaced by delete_documents()

* [UPDT] warning logs to be fixed

* [UPDT] delete_all_documents() renamed and the same method added

Co-authored-by: Ram Garg <ramgarg102@gmai.com>
2021-08-30 15:18:28 +02:00
Markus Paff
be8d305190
Editing docs read.me for new docs website workflow (#1372)
* editing docs read.me for new docs website workflow

* added new links to docs
2021-08-30 14:59:40 +02:00
Ikram Ali
ead96730d3
Add Crawler support for indexing pipeline (#1360) 2021-08-24 14:25:22 +02:00
Ikram Ali
ef27f0d386
Add tests for Crawler (#1339) 2021-08-18 14:05:44 +02:00
Julian Risch
eb990c9688
Removing probability field from answers in favor of score field (#1340)
* Removing probability field from reader and from test cases

* Add switch to FARMReader to choose score/probability

* Remove probability field from doc returned by doc store

* Relax assertion testing joined es and dpr predictions

* Use switch for confidence scores also for no_answer

* Add test that checks switching to old answer scores > 10

* Normalize score in elastic doc store and reset reader.md

* Scale weights of JoinDocuments to sum to 1 and adapt test case
2021-08-17 10:27:11 +02:00
Timo Moeller
07bd3c50ea
Add new QA eval metric: Semantic Answer Similarity (SAS) (#1338)
* init

* Add type annotation

* Add test case, fix mypy

* Add german model to docstring

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-08-12 14:31:48 +02:00
Malte Pietsch
a0921f0c35
Remove Finder (#1326)
* deprecate finder

* remove import

* add doc section for moving from finder to pipelines
2021-08-09 13:41:40 +02:00
oryx1729
bafa1b46de
Add Ray integration for Pipelines (#1255) 2021-08-02 14:51:24 +02:00
Branden Chan
937247d628
Add QuestionGenerator (#1267)
* Create basic Question Generation

* Split texts into 50 word chunks

* Allow prompt to be changed

* Implement iteration functionality in DS

* Add docstrings, create pipelines

* Make pipelines work

* Add comments

* Add tests

* Add tutorials and docs

* Add doc string
2021-07-26 17:20:43 +02:00
Branden Chan
363be65a78
Implement OpenSearch ANN (#1225)
* Simplify ODES init

* Add arguments to ES init and create script

* Rename similarity_fn_name and add util fn

* Create OpenSearchDocumentStore

* Specify params of Open Search HNSW

* Add better argument handling

* Update opensearch index mapping

* Edit opensearch default port

* Fix HNSW mapping

* Force small HNSW params

* Implement auto start and stopping of document store services

* Fix starting and stopping of ds service

* Restore HNSW params

* Add opensearch query benchmarks

* Add write wait time

* Revert wait time

* Add timeout

* Update benchmarks

* Update benchmarks

* Update benchmarks json

* Update documentation

* Update documentation

* Fix similarity name

* Improve argument passing

* Improve stopping and starting of service
2021-07-26 10:52:52 +02:00
Julian Risch
4e6f7f349d
Add FARMClassifier node for Document Classification (#1265)
* Add FARM classification node

* Add classification output to meta field of document

* Update usage example

* Add test case for FARMClassifier

* Replace FARMRanker with FARMClassifier in documentation strings

* Remove base method not implemented by any child class, etc.
2021-07-13 21:44:26 +02:00
Julian Risch
dbb9efbd39
Add SentenceTransformersRanker with pre-trained Cross-Encoder (#1209)
* Add SentenceTransformersRanker with pre-trained Cross-Encoder

* Add test cases for Ranker nodes and update documentation

* update docstring

* Update docstring

* Update __init__.py

* update import for test

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-07-07 17:31:45 +02:00
Ikram Ali
29e140196b
[pipeline] Allow for batch indexing when using Pipelines fix #1168 (#1231)
* [pipeline] Allow for batch indexing when using Pipelines fix #1168

* [pipeline] Test case fixed fix #1168

* [file_converter] Path.suffix updated #1168

* [file_converter] meta can be one of these three cases:
                 A single dict that is applied to all files
                 One dict for each file being converted
                 None #1168

* [file_converter] mypy error fixed.

* [file_converter] mypy error fixed.

* [rest_api] batch file upload introduced in indexing API.

* [test_case] Test_api file upload parameter name updated.

* [ui] Streamlit file upload parameter updated.
2021-06-30 14:13:46 +02:00
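The three `meta` cases listed in the commit above (a single dict for all files, one dict per file, or None) can be sketched as a small normalization step. This is a minimal illustration under stated assumptions: `expand_meta` is a hypothetical helper name, and the actual converter code in the repository may handle these cases differently.

```python
from typing import Dict, List, Optional, Union

Meta = Dict[str, str]


def expand_meta(
    n_files: int, meta: Optional[Union[Meta, List[Meta]]]
) -> List[Optional[Meta]]:
    """Return exactly one meta entry per file for the three accepted input shapes."""
    if meta is None:
        # Case 3: no meta at all.
        return [None] * n_files
    if isinstance(meta, dict):
        # Case 1: a single dict is applied (as a copy) to every file.
        return [dict(meta) for _ in range(n_files)]
    # Case 2: one dict per file; the lengths must match.
    if len(meta) != n_files:
        raise ValueError(
            f"Got {len(meta)} meta dicts for {n_files} files; counts must match."
        )
    return list(meta)
```

For example, `expand_meta(2, {"source": "upload"})` yields two independent copies of the same dict, while a list input is passed through after a length check.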
vblagoje
02fc4c7783
Improve document stores unit test parametrization (#1202) 2021-06-22 16:08:23 +02:00
vblagoje
2a5882578a
Add Longform-QA (LFQA), Seq2SeqGenerator for generative QA and Retribert Retriever (#1086)
* Integrate LFQA with Haystack

* Integrate LFQA with Haystack - unit tests

* Properly initialize conftest default value for vector_dim

* Update PR after initial feedback

* Fix conftest.py import

* Seq2SeqGenerator uses Callables instead of subclasses for custom model input

* Update docstring

* Fix Callable use

* Add LFQA tutorials

* Improve type error reporting for invalid input converter Callable

* Generate docstrings

* Format comments in tutorial script

* Generate tutorial md

* Add usage page

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-06-14 17:53:43 +02:00
venuraja79
49886f88f0
Integrate Weaviate as another DocumentStore (#1064)
* Annotation Tool: data is not persisted when using local version #853

* First version of weaviate

* First version of weaviate

* First version of weaviate

* Updated comments

* Updated comments

* ran query, get and write tests

* update embeddings, dynamic schema and filters implemented

* Initial set of tests and fixes

* Tests added for update_embeddings and delete documents

* introduced duplicate documents fix

* fixed mypy errors

* Added Weaviate to requirements

* Fix the weaviate docker env variables

* Fixing test dependencies for now

* Created weaviate test marker and fixed query

* Update docstring

* Add documentation

* Bump up weaviate version

* Bump up weaviate version in documentation

* Bump up weaviate version in documentation

* Upgrade weaviate version

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-06-10 09:43:53 +02:00
Shahrukh Khan
545c625a37
Add QueryClassifier incl. baseline models (#1099)
* restructure query classifier code and add s3 based pickles

* make model and vectorizer optional in query classifier

* update query classifier as per init style

* add query classifiers sklearn/hf

* update docstrings for query classifiers

* add unit test for query classifier

* add type patch for sklearn classifier

* fix mypy type issue

* revert to pure formatting

* add query classifiers

* resolve conflict

* add output names for query classifier

* revert output and update docstring queryclassifier

* Update docstring for SklearnQueryClassifier

* update transformer query classifier docstring

* fix typo

* change arg names in query classifier classes

* add set_config(). rename attributes

* fix set_config()

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-06-08 15:20:13 +02:00