21 Commits

Author SHA1 Message Date
MichelBartels
0b0b9689a4
Add TinyBERT data augmentation (#1923)
* add tinybert data augmentation

* don't reload glove in tinybert data augmentation

* fix unnecessary load_glove call

* fix type hints

* add comments and type hints

* add batch_size argument

* don't predict subwords as alternative for words

* fix subword predictions

* limit sequence length

* actually limit sequence length

* improve performance by calculating nearest glove vector on gpu

* add model and tokenizer parameter

* fix type hints

* improve data augmentation performance

* explained limits of script

* corrected comment

* added data augmentation test

* don't label every question in augmented dataset as impossible

* add sample glove

* better handling of downloading of glove

* fix typo of last commit
2022-01-04 18:34:16 +01:00
tstadel
fc8df2163d
Fix Windows CI OOM (#1878)
* set fixture scope to "function"

* run FARMReader without multiprocessing

* dispose off ray after tests

* run most expensive tasks first in test files

* run expensive tests first

* run garbage collector between tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-12-22 17:20:23 +01:00
tstadel
158460504b
Make FAISSDocumentStore work with yaml (#1727)
* add faiss_index_path and faiss_config_path

* Add latest docstring and tutorial changes

* remove duplicate cleaning stuff

* refactoring + test for invalid param combination

* adjust type hints

* Add latest docstring and tutorial changes

* add documentation to @preload_index

* Add latest docstring and tutorial changes

* recursive __init__ instead of decorator

* Add latest docstring and tutorial changes

* validate instead of check

* combine ifs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-11 11:02:22 +01:00
tstadel
14515a861b
Tutorial for DocumentClassifier at Index Time (#1697)
* basic example of document classifier in preprocessing logic

* add batch_size to TransformersDocumentClassifier

* complete tutorial16

* Add latest docstring and tutorial changes

* fix missing batch_size

* add notebook

* test for batch_size use added

* add tutorial 16 to headers.py

* Add latest docstring and tutorial changes

* make DocumentClassifier indexing pipeline rdy

* Add latest docstring and tutorial changes

* flexibility improvements for DocumentClassifier in Pipelines

* Add latest docstring and tutorial changes

* fix index time usage

* remove query from documentclassifier tests

* improve classification_field resolving + minor fixes

* Add latest docstring and tutorial changes

* tutorial 16 extended with zero shot and pipelines

* Add latest docstring and tutorial changes

* install graphviz in notebook

* Add latest docstring and tutorial changes

* remove convert_to_dicts

* Add latest docstring and tutorial changes

* Fix typo

* Add latest docstring and tutorial changes

* remove retriever from indexing pipeline

* Add latest docstring and tutorial changes

* fix save_to_yaml when using FileTypeClassifier

* emphasize the impact with zero shot classification

* Add latest docstring and tutorial changes

* adjust use_gpu to boolean in test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-11-09 18:43:00 +01:00
Julian Risch
33b2663fdc
ensure tf-idf matrix calculation before retrieval (#1665)
* ensure tf-idf matrix calculation before retrieval

* Run fit() automatically if new documents have been added

* Add latest docstring and tutorial changes

* Fix type error

* Add test case for tfidf retriever yaml pipeline

* Use InMemoryDocStore and add 2nd test case

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-28 16:48:06 +02:00
bogdankostic
51acf779f2
Add TableTextRetriever (#1529)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* renam label.question -> label.query. Allow sorting of Answers.

* WIP primitives

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* Add MultimodalRetriever

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add methods to MultimodalRetriever

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* keep meta data of answers in reader.run()

* fix meta handling

* adress review feedback

* Add latest docstring and tutorial changes

* make document=None for open domain labels

* add import

* fix print utils

* fix rest api

* Add methods and tests

* Add latest docstring and tutorial changes

* Fix mypy

* Add latest docstring and tutorial changes

* Add type hints and doc strings

* Make use of initialize_device_settings

* Move serialization of pd.DataFrame to schema.py

* Fix mypy

* Adapt Document's from_dict method

* Update docstrings

* Add latest docstring and tutorial changes

* Fix mypy

* Fix mypy

* Fix Document's from_dict method

* Fix Document's to_dict method

* Change handling of table metadata

* Add latest docstring and tutorial changes

* Change naming from Multimodal to TableText

* Turn off tokenizers_parallelism in retriever tests

* Add latest docstring and tutorial changes

* Remove turning off tokenizers_parallelism in retriever tests

* Adapt convert_es_hit_to_document

* Change embed_surrounding_context to embed_meta_fields

* Add latest docstring and tutorial changes

* Add check if torch.distributed is available

* Set n_gpu to 0 in training test

* Set HIP_LAUNCH_BLOCKING to 1

* Set HIP_LAUNCH_BLOCKING to "1"

* Set use_gpu to False

* Use DataParallel only if more than one device

* Remove --find-links=https://download.pytorch.org/whl/torch_stable.html

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-25 12:27:02 +02:00
Sara Zan
6354528336
Add /documents/get_by_filters endpoint (#1580)
* Add endpoint to get documents by filter

* Add test for /documents/get_by_filter and extend the delete documents test

* Add rest_api/file-upload to .gitignore

* Make sure the document store is empty for each test

* Improve docstrings of delete_documents_by_filters and get_documents_by_filters

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-12 10:53:54 +02:00
bogdankostic
2626388961
Fix DPR tests + add Tokenizer tests (#1429)
* Fix DPR tests

* Add Tokenizer tests
2021-09-09 12:56:44 +02:00
Timo Moeller
b4fd08a296
Add testdata, add tests for qa processor, add dpr tests (some failing) 2021-09-08 12:02:08 +02:00
oryx1729
a71180a2ca
Refactor replicas config for Ray Pipelines (#1378) 2021-08-31 10:14:55 +02:00
Markus Paff
be8d305190
Editing docs read.me for new docs website workflow (#1372)
* editing docs read.me for new docs website workflow

* added new links to docs
2021-08-30 14:59:40 +02:00
oryx1729
bafa1b46de
Add Ray integration for Pipelines (#1255) 2021-08-02 14:51:24 +02:00
oryx1729
8c68699e1c
Refactor REST APIs to use Pipelines (#922) 2021-04-07 17:53:32 +02:00
Lalit Pagaria
e904deefa7
Add Markdown file convertor (#875) 2021-03-23 16:31:26 +01:00
Tanay Soni
07907f9eac
Add support for indexing pipelines (#816) 2021-02-16 16:24:28 +01:00
Tanay Soni
8a5dc8f826
Load Pipeline with YAML config file (#785) 2021-02-02 17:32:17 +01:00
Timo Moeller
4803da009a
Using PreProcessor functions on eval data (#751)
* Add eval data splitting

* Adjust for split by passage, add test and test data, adjust docstrings, add max_docs to highler level fct
2021-01-20 14:40:10 +01:00
Malte Pietsch
29a15c0d59
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback (#243) 2020-07-31 11:34:06 +02:00
Anirban Saha
6b217732f5
Add basic support for Docx Files (#225) 2020-07-14 12:28:19 +02:00
Tanay Soni
ef9e4f4467
Add PDF text extraction (#109) 2020-06-08 11:07:19 +02:00
Malte Pietsch
7400abe327 add test 2019-11-27 17:53:42 +01:00