89 Commits

Author SHA1 Message Date
Kristof Herrmann
a8c2cdc565
private hugging face models for retrievers (#1785)
* private dpr

* Add latest docstring and tutorial changes

* added parameters to child functions

* Add latest docstring and tutorial changes

* added tableextractor

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-22 09:24:02 +01:00
Kristof Herrmann
8aa4ca29c2
Huggingface private model support via API tokens (FARMReader) (#1775)
* passed kwargs to model loading

* Pass Auth token explicitly

* add use_auth_token to get_language_model_class

* added use_auth_token parameter at FARMReader

* Add latest docstring and tutorial changes

* added docs for parameter `use_auth_token`

* Add latest docstring and tutorial changes

* adding docs link

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-19 16:48:31 +01:00
tstadel
956d5bba43
Split summarizer tests in order to make windows CI work again (#1757)
* separate testfile for summarizer with translation

* Add latest docstring and tutorial changes

* import SPLIT_DOCS from test_summarizer

* add workflow_dispatch to windows_ci

* add worflow_dispatch to linux_ci

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-15 18:49:49 +01:00
tstadel
59e04cba05
Multi query eval (#1746)
* add eval() to pipeline

* Add latest docstring and tutorial changes

* support multiple queries in eval()

* Add latest docstring and tutorial changes

* keep single query test

* fix EvaluationResult node_results default

* adjust docstrings

* Add latest docstring and tutorial changes

* minor improvements from comments

* Add latest docstring and tutorial changes

* move EvaluationResult and calculate_metrics to schema

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-15 14:51:11 +01:00
Sara Zan
09a462d756
Fix print_answers (#1743)
* Fix a specific path of print_answers that was assuming answers are dictionaries

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-15 09:50:09 +01:00
Branden Chan
81f82b1b95
Update API Reference Pages for v1.0 (#1729)
* Create new API pages and update existing ones

* Create query classifier page

* Remove Objects suffix
2021-11-11 12:44:29 +01:00
tstadel
158460504b
Make FAISSDocumentStore work with yaml (#1727)
* add faiss_index_path and faiss_config_path

* Add latest docstring and tutorial changes

* remove duplicate cleaning stuff

* refactoring + test for invalid param combination

* adjust type hints

* Add latest docstring and tutorial changes

* add documentation to @preload_index

* Add latest docstring and tutorial changes

* recursive __init__ instead of decorator

* Add latest docstring and tutorial changes

* validate instead of check

* combine ifs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-11 11:02:22 +01:00
Sara Zan
42c8edca54
Simplify logs management (#1696)
* Move each haystack module's logger configuration into the respective file and configure the handlers properly

* Implement most changes from #1714

* Remove accidentally committed git merge tags ':D

* Remove the debug logs capture feature

* Remove more references to debug_logs

* Fix issue with FARMReader that somehow made it to master

* Add devices parameter to Inferencer

* Change log of APEX message to DEBUG and lower the 'Starting <docstore>...' messages to DEBUG as well

* Change log level of a few logs from modeling

* Silence the transformers warning

* Remove empty line below the workers :)

* Fix two more levels in the tutorials logs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2021-11-11 10:16:25 +01:00
tstadel
14515a861b
Tutorial for DocumentClassifier at Index Time (#1697)
* basic example of document classifier in preprocessing logic

* add batch_size to TransformersDocumentClassifier

* complete tutorial16

* Add latest docstring and tutorial changes

* fix missing batch_size

* add notebook

* test for batch_size use added

* add tutorial 16 to headers.py

* Add latest docstring and tutorial changes

* make DocumentClassifier indexing pipeline rdy

* Add latest docstring and tutorial changes

* flexibility improvements for DocumentClassifier in Pipelines

* Add latest docstring and tutorial changes

* fix index time usage

* remove query from documentclassifier tests

* improve classification_field resolving + minor fixes

* Add latest docstring and tutorial changes

* tutorial 16 extended with zero shot and pipelines

* Add latest docstring and tutorial changes

* install graphviz in notebook

* Add latest docstring and tutorial changes

* remove convert_to_dicts

* Add latest docstring and tutorial changes

* Fix typo

* Add latest docstring and tutorial changes

* remove retriever from indexing pipeline

* Add latest docstring and tutorial changes

* fix save_to_yaml when using FileTypeClassifier

* emphasize the impact with zero shot classification

* Add latest docstring and tutorial changes

* adjust use_gpu to boolean in test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-11-09 18:43:00 +01:00
bogdankostic
cd8666f904
Standardize initialisation of device settings (#1683)
* Use initialize_device_settings in all nodes

* Set StreamHandler level to INFO

* Add latest docstring and tutorial changes

* work in progress

* Standardize device initialization

* Add latest docstring and tutorial changes

* Adapt device initialization in Reader's train method

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-09 12:44:20 +01:00
Julian Risch
892ce4a760
Make weaviate more compliant to other doc stores (UUIDs and dummy embedddings) (#1656)
* create uuid and dummy embeddding in weaviate doc store

* handle and test for duplicate non-uuid-formatted ids in weaviate

* add uuid and dummy embedding to doc strings

* Add latest docstring and tutorial changes

* Upgrade weaviate

* Include weaviate in common doc store test cases

* Add latest docstring and tutorial changes

* Exclude weaviate doc store from eval tests

* Incorporate index name in uuid generation

* Ignore mypy error

* Fix typo

* Restore DOCS without uuid and embeddings generated by weaviate

* Supply docs for retriever tests as fixture

* Limit scope of fixture to function instead of session

* Add comments

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-04 09:27:12 +01:00
Branden Chan
4ca1937775
Standardize similarity argument description (#1684)
* Standardize argument similarity argument description

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-02 14:53:26 +01:00
fingoldo
27793814cf
Cosine similarity for the rest of DocStores. (#1569)
* Added uniform normalization method to each of the DocStores (implemented), so that now Milvus and Weaviate doc stores can use cosine similarity, plus future method for making existing embeddings normaziled (empty for now).

* Fixed a typo.

* Fixed lots of stuff. Performed local tests.

* Fixed scores representation for cosine. Assuming Weavieate's rep needs no change.

* fixes as per discussion

* Trigger CI

* resolving conflicts

* small typo

* fixed a param type

* cleaned some conflicts resolving left overs

* commented out fastmath for a moment

* fixing tests

* added docstore for small vectors

* test

* fixed document_store_cosine_small

* cosine tests fixes

* fixed document_store_cosine_small

* fixed weaviate index name and lowered rtol for ES

* increased rtol

* added explicit doc_ids for weaviate, excluded ES, included Inmemory

* resolving mismatch

* fixing a typo

* flatten normalize_embedding()

* fix import for test

* standardize normalize_embeddings across doc stores

* Add latest docstring and tutorial changes

* going for the faster plain dot prod

Co-authored-by: fingoldo <fingoldo@gmail.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-01 13:42:32 +01:00
Julian Risch
33b2663fdc
ensure tf-idf matrix calculation before retrieval (#1665)
* ensure tf-idf matrix calculation before retrieval

* Run fit() automatically if new documents have been added

* Add latest docstring and tutorial changes

* Fix type error

* Add test case for tfidf retriever yaml pipeline

* Use InMemoryDocStore and add 2nd test case

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-28 16:48:06 +02:00
Sara Zan
eab475bb5d
Rename every occurrence of 'embed_passages' with 'embed_documents' (#1667)
* Rename every occurrence of 'embed_passages' with 'embed_documents'

* Remove aliased method embed_documents

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-28 12:17:56 +02:00
bogdankostic
0c80ac9e62
Truncate too large tables for TableReader (#1662)
* Truncate too large tables for TableReader

* Add documentation

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-27 15:46:59 +02:00
Timo Moeller
1d3f63ac2e
Allow setting of scroll param in ElasticsearchDocumentStore (#1645)
* remove scroll param in ES call

* Add scroll param to ES init

* Add latest docstring and tutorial changes

* Add scroll to set_config

* remove trailing comma

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-10-27 11:07:13 +02:00
Sara Zan
13510aa753
Refactoring of the haystack package (#1624)
* Files moved, imports all broken

* Fix most imports and docstrings into

* Fix the paths to the modules in the API docs

* Add latest docstring and tutorial changes

* Add a few pipelines that were lost in the inports

* Fix a bunch of mypy warnings

* Add latest docstring and tutorial changes

* Create a file_classifier module

* Add docs for file_classifier

* Fixed most circular imports, now the REST API can start

* Add latest docstring and tutorial changes

* Tackling more mypy issues

* Reintroduce  from FARM and fix last mypy issues hopefully

* Re-enable old-style imports

* Fix some more import from the top-level  package in an attempt to sort out circular imports

* Fix some imports in tests to new-style to prevent failed class equalities from breaking tests

* Change document_store into document_stores

* Update imports in tutorials

* Add latest docstring and tutorial changes

* Probably fixes summarizer tests

* Improve the old-style import allowing module imports (should work)

* Try to fix the docs

* Remove dedicated KnowledgeGraph page from autodocs

* Remove dedicated GraphRetriever page from autodocs

* Fix generate_docstrings.sh with an updated list of yaml files to look for

* Fix some more modules in the docs

* Fix the document stores docs too

* Fix a small issue on Tutorial14

* Add latest docstring and tutorial changes

* Add deprecation warning to old-style imports

* Remove stray folder and import Dict into dense.py

* Change import path for MLFlowLogger

* Add old loggers path to the import path aliases

* Fix debug output of convert_ipynb.py

* Fix circular import on BaseRetriever

* Missed one merge block

* re-run tutorial 5

* Fix imports in tutorial 5

* Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base

* Add latest docstring and tutorial changes

* Fix typo in utils __init__

* Fix a few more imports

* Fix benchmarks too

* New-style imports in test_knowledge_graph

* Rollback setup.py

* Rollback squad_to_dpr too

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-25 15:50:23 +02:00
bogdankostic
51acf779f2
Add TableTextRetriever (#1529)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* renam label.question -> label.query. Allow sorting of Answers.

* WIP primitives

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* Add MultimodalRetriever

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add methods to MultimodalRetriever

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* keep meta data of answers in reader.run()

* fix meta handling

* adress review feedback

* Add latest docstring and tutorial changes

* make document=None for open domain labels

* add import

* fix print utils

* fix rest api

* Add methods and tests

* Add latest docstring and tutorial changes

* Fix mypy

* Add latest docstring and tutorial changes

* Add type hints and doc strings

* Make use of initialize_device_settings

* Move serialization of pd.DataFrame to schema.py

* Fix mypy

* Adapt Document's from_dict method

* Update docstrings

* Add latest docstring and tutorial changes

* Fix mypy

* Fix mypy

* Fix Document's from_dict method

* Fix Document's to_dict method

* Change handling of table metadata

* Add latest docstring and tutorial changes

* Change naming from Multimodal to TableText

* Turn off tokenizers_parallelism in retriever tests

* Add latest docstring and tutorial changes

* Remove turning off tokenizers_parallelism in retriever tests

* Adapt convert_es_hit_to_document

* Change embed_surrounding_context to embed_meta_fields

* Add latest docstring and tutorial changes

* Add check if torch.distributed is available

* Set n_gpu to 0 in training test

* Set HIP_LAUNCH_BLOCKING to 1

* Set HIP_LAUNCH_BLOCKING to "1"

* Set use_gpu to False

* Use DataParallel only if more than one device

* Remove --find-links=https://download.pytorch.org/whl/torch_stable.html

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-25 12:27:02 +02:00
Timo Moeller
9dc125df9d
Bugfix Tutorial 5 parameters, adjust default split length (#1635)
Bugfix parameters, adjust default split length, add sentencetransformers
2021-10-22 16:03:12 +02:00
Julian Risch
4ed2b90bca
Add delete_labels() except for weaviate doc store (#1604)
* Add delete_labels() except for weaviate doc store

* Add latest docstring and tutorial changes

* Add test for delete_labels()

* Adapt filter for label deletion to different doc stores in test

* Allow delete labels by _id in elasticsearch

* Add latest docstring and tutorial changes

* Add latest docstring and tutorial changes

* re-add bugfix after merge

* Add ids as optional parameter

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-19 17:20:28 +02:00
Sara Zan
9722bbf1e1
DPR training: Rename TransformersAdamW to AdamW (#1613)
* Rename TransformersAdamW into simply AdamW (probably changed in transformers at some point)

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-19 16:18:30 +02:00
Sara Zan
575e64333c
Delete documents by ID in all document stores (#1606)
* Modify BaseDocumentStore.delete_documents() signature, implement ElasticSearch, and add tests

* Add implementation for InMemory

* Implement for SQL, FAISS and Milvus too

* Add tests for faiss and milvus

* Fix delete_all_documents

* Implement deletion by ID for weaviate

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Co-authored-by: sarthakj2109 <54064348+sarthakj2109@users.noreply.github.com>

Co-authored-by: prafgup <prafulgupta6@gmail.com>

Co-authored-by: ankh6 <andynzemokalumu@live.be>
2021-10-19 12:30:15 +02:00
Malte Pietsch
eb95f0e8aa
Add more flexible options for model downloads (Proxies, resume_download, local_files_only...) (#1256)
* allow passing more options for model/tokenizer download from remote

* temporarily change dependency to current farm master

* Add latest docstring and tutorial changes

* add kwargs

* add docstrings

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-18 15:47:36 +02:00
bogdankostic
655d721371
Add Table Reader (#1446)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* Add first draft of TableReader

* renam label.question -> label.query. Allow sorting of Answers.

* Add calculation of answer scores

* WIP primitives

* Adapt input and output to new primitives

* Add doc strings

* Add tests

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* Adapt answer format

* Add latest docstring and tutorial changes

* keep meta data of answers in reader.run()

* Fix mypy

* fix meta handling

* adress review feedback

* Add latest docstring and tutorial changes

* Allow inference on GPU

* Remove automatic aggregation

* Add automatic aggregation

* Add latest docstring and tutorial changes

* Add torch-scatter dependency

* Add wheel to torch-scatter dependency

* Fix requirements

* Fix requirements

* Fix requirements

* Adapt setup.py to allow for wheels

* Fix requirements

* Fix requirements

* Add type hints and code snippet

* Add latest docstring and tutorial changes

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-15 16:34:48 +02:00
Malte Pietsch
4a6c9302b3
Redesign primitives - Document, Answer, Label (#1398)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* renam label.question -> label.query. Allow sorting of Answers.

* WIP primitives

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* keep meta data of answers in reader.run()

* fix meta handling

* adress review feedback

* Add latest docstring and tutorial changes

* make document=None for open domain labels

* add import

* fix print utils

* fix rest api

* adress review feedback

* Add latest docstring and tutorial changes

* fix mypy

Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-13 14:23:23 +02:00
Malte Pietsch
9650f7aed1
Add debug and debug_logs params to standard pipelines (#1586)
* add debug and debug_logs to standard pipelines

* Add latest docstring and tutorial changes

* fix params

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-12 16:00:48 +02:00
Sara Zan
6354528336
Add /documents/get_by_filters endpoint (#1580)
* Add endpoint to get documents by filter

* Add test for /documents/get_by_filter and extend the delete documents test

* Add rest_api/file-upload to .gitignore

* Make sure the document store is empty for each test

* Improve docstrings of delete_documents_by_filters and get_documents_by_filters

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-12 10:53:54 +02:00
Malte Pietsch
38652dd4dd
Enable GPU usage for QuestionGenerator (#1571)
* enable GPU usage for question generator

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-08 12:17:48 +02:00
Sara Zan
54947cb840
Return intermediate nodes output in pipelines (#1558)
* First rough implementation

* Add a flag to dump the debug logs to the console as well

* Typing run() and _dispatch_run()

* Allow debug and debug_logs to be passed as arguments of run()

* Avoid overwriting _debug, later we might want to store other objects in it

* Put logs under a separate key of the _debug dictionary and add input and output of the node alongside it

* Introduce global arguments for pipeline.run() that get applied to every node when defined

* Change default values of debug variables to None, otherwise their default would override the params values

* Remove a potential infinite recursion on the overridden __getattr__

* Do not append the output of the last node in the _debug key, it causes infinite recursion

* Add tests

* Move the input/output collection into _dispatch_run to gather only relevant info

* Add partial Pipeline.run() docstring

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-10-07 22:13:25 +02:00
Julian Risch
7e063b77d2
Format doc classifier usage example (#1550)
* Format doc classifier usage example

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-01 15:01:19 +02:00
Julian Risch
24483d7bad
TransformersDocumentClassifier replacing FARMClassifier (#1540)
* Initial draft of TransformersClassifier

* Add transformers classifier implementation

* Add test for SentenceTransformersClassifier

* Add truncation and corresponding test case to Classifier

* Add zero-shot classification and test

* Add document classifier documentation

* Add latest docstring and tutorial changes

* print meta data with print_documents()

* Add latest docstring and tutorial changes

* Remove top_k param from Classifier usage example

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-01 11:22:56 +02:00
Julian Risch
0e7338f0c6
Remove mentions of FARM from Ranker comments (#1535)
* Remove mentions of FARM from Ranker comments

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-29 11:57:30 +02:00
Sara Zan
a30a826c6c
Standardize delete_documents(filter=...) across all document stores (#1509)
* Make InMemoryDocumentStore accept and apply filters in delete_documents()

* Modify test_document_store.py to test the filtered deletion in memory, sql and milvus too

* Make FAISSDocumentStore accept and properly apply filters in delete_documents()

* Add latest docstring and tutorial changes

* Remove accidentally duplicated test

* Remove unnecessary decorators from test/test_document_store.py::test_delete_documents_with_filters

* Add embeddings count test for FAISS and Milvus; Milvus fails it.

* Fixed a bug that made Milvus not deleting embeddings

* Remove batch size parametrization in tests & update all documentstore's docstrings with a filter example

* Add latest docstring and tutorial changes

Co-authored-by: prafgup <prafulgupta6@gmail.com>
2021-09-29 09:27:06 +02:00
Julian Risch
f9d2f786ca
Replace FARM import statements; add dependencies (#1492)
* Replace FARM import statements; add dependencies

* Add InferenceProc., TextCl.Proc., TextPairCl.Proc.

* Remove FARMRanker, add type annotations, rename max_sample

* Add sample_to_features_text for InferenceProc.

* Fix type annotations: model_name_or_path is str not Path

* Fix mypy errors: implement _create_dataset in TextCl.Proc.

* Add task_type "embeddings" in Inferencer

* Allow loading AdaptiveModel for embedding task

* Add SQuAD eval metrics; enable InferenceProc for embedding task

* Add baskets as param to log_samples and handle empty basket list in log_samples

* Remove unused dependencies

* Remove FARMClassifier (doc classificer) due to ref to TextClassificationHead

* Remove FARMRanker and Classifier from doc generation scripts

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-28 16:34:24 +02:00
Malte Pietsch
183fd5ae5a
Simplify tests & allow running on individual doc stores (#1487)
* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* update readme and contributing.md

* update contributing

* adjust example

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-27 10:52:07 +02:00
Branden Chan
2c4baa7f4e
Regenerate API and Tutorial md files (#1480)
* Change punctuation

* Add latest docstring and tutorial changes

* Change punctuation

* Add documentation for Docs2Answer

* Add latest docstring and tutorial changes

* Generate new API docs

* Replace Finder with Pipeline

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-21 14:42:18 +02:00
Markus Paff
39845c0624
Automate updates docstrings tutorials (#1461)
* remove not needed githab actions and reactivate docstrings and tutorial generation

* test workflow

* update pydoc version

* update python version

* update watchdog

* move to latest version pydoc-markdown

* remove version check

* Add latest docstring and tutorial changes

* remove test workflow

* test for param docstrings

* pin pydoc-markdown version

* add test workflow

* pin watchdog version

* Add latest docstring and tutorial changes

* update original workflow and delete test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-17 13:44:31 +02:00
Ikram Ali
3fc7f3f695
[docs] crawler api docs updated. (#1388) 2021-09-01 12:07:32 +02:00
Branden Chan
10e332dabb
Fix Links (#1199)
* Fix link highlight

* Regen md files

* Remove duplicate

* Fix whitespace

* fixing strings for website

* Fix link

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2021-06-23 19:07:54 +02:00
Markus Paff
6cd49105e7
update api markdown files and add markdown file for ranker (#1198)
* update api markdown files and add markdown file for ranker

* added docstrings for weaviate

* new version of pydoc-markdown does not render arguments correctly. We used pydoc-markdown==3.11.0
2021-06-15 17:50:08 +02:00
vblagoje
2a5882578a
Add Longform-QA (LFQA), Seq2SeqGenerator for generative QA and Retribert Retriever (#1086)
* Integrate LFQA with Haystack

* Integrate LFQA with Haystack - unit tests

* Properly initialize conftest default value for vector_dim

* Update PR after inital feedback

* Fix conftest.py import

* Seq2SeqGenerator uses Callables instead of subclasses for custom model input

* Update docstring

* Fix Callable use

* Add LFQA tutorials

* Improve type error reporting for invalid input converter Callable

* Generate docstrings

* Format comments in tutorial script

* Generate tutorial md

* Add usage page

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-06-14 17:53:43 +02:00
Branden Chan
5f0f85989a
Refresh API docs (#1152) 2021-06-09 16:13:58 +02:00
Lalit Pagaria
f46b09c756
Using text hash as id to prevent document duplication (#1000)
* using text hash as id to prevent document duplication. Also providing a way customize it.

* Add latest docstring and tutorial changes

* Fixing duplicate value test when text is same

* Adding test for duplicate ids in document store

* Changing exception to generic Exception type

* add exception for inmemory. update docstring Document. remove id_hash_keys from object attribute

* Add latest docstring and tutorial changes

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-05-17 17:51:52 +02:00
Branden Chan
869b493b61
Regen api docs (#1015) 2021-04-30 12:35:13 +02:00
Branden Chan
9626c0d65e
Update Documentation (#976)
* Add api pages

* Add latest docstring and tutorial changes

* First sweep of usage docs

* Add link to conversion script

* Add import statements

* Add summarization page

* Add web crawler documentation

* Add confidence scores usage

* Add crawler api docs

* Regenerate api docs

* Update summarizer and translator api

* Add api pages

* Add latest docstring and tutorial changes

* First sweep of usage docs

* Add link to conversion script

* Add import statements

* Add summarization page

* Add web crawler documentation

* Add confidence scores usage

* Add crawler api docs

* Regenerate api docs

* Update summarizer and translator api

* Add indentation (pydoc-markdown 3.10.1)

* Comment out metadata

* Remove Finder deprecation message

* Remove Finder in FAQ

* Update tutorial link

* Incorporate reviewer feedback

* Regen api docs

* Add type annotations

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-22 16:45:29 +02:00
Branden Chan
77d4c2ca1c
Benchmark milvus (#850)
* Add milvus benchmarking support

* Add latest docstring and tutorial changes

* Edit config

* Disable docker interactive mode

* Add milvus index type support

* Adjust FAISS and Milvus node branching

* Remove duplicate in config

* Revert method for speedup

* Add latest docstring and tutorial changes

* Add latest benchmark run

* Add latest docstring and tutorial changes

* Add json files

* Revert "Add latest docstring and tutorial changes"

This reverts commit e2efa5f08aa4fb55bbeeed42aa76817d63fc8923.

* Add latest docstring and tutorial changes

* Revert "Add latest docstring and tutorial changes"

This reverts commit b085a679b9d5f175e91c2c59565e73c5dec1374b.

* Fix typo

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-13 14:54:15 +02:00
Markus Paff
dfb0282b74
Update milvus links and docstrings (#959)
* update milvus links and docstrings

* Add latest docstring and tutorial changes

* new milvus version

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-12 14:38:57 +02:00
Timo Moeller
837dea4e6d
Integrate sentence transformers into benchmarks (#843)
* Integrate sentence transformers into benchmarks

* Add doc store asserts

* switch data downloads from s3 client to https. add license info

* Fix mypy, revert config

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-09 17:24:16 +02:00
oryx1729
8c68699e1c
Refactor REST APIs to use Pipelines (#922) 2021-04-07 17:53:32 +02:00