MichelBartels
0b0b9689a4
Add TinyBERT data augmentation ( #1923 )
...
* add tinybert data augmentation
* don't reload glove in tinybert data augmentation
* fix unnecessary load_glove call
* fix type hints
* add comments and type hints
* add batch_size argument
* don't predict subwords as alternative for words
* fix subword predictions
* limit sequence length
* actually limit sequence length
* improve performance by calculating nearest glove vector on gpu
* add model and tokenizer parameter
* fix type hints
* improve data augmentation performance
* explained limits of script
* corrected comment
* added data augmentation test
* don't label every question in augmented dataset as impossible
* add sample glove
* better handling of downloading of glove
* fix typo of last commit
2022-01-04 18:34:16 +01:00
tstadel
fc8df2163d
Fix Windows CI OOM ( #1878 )
...
* set fixture scope to "function"
* run FARMReader without multiprocessing
* dispose of ray after tests
* run most expensive tasks first in test files
* run expensive tests first
* run garbage collector between tests
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-12-22 17:20:23 +01:00
tstadel
158460504b
Make FAISSDocumentStore work with yaml ( #1727 )
...
* add faiss_index_path and faiss_config_path
* Add latest docstring and tutorial changes
* remove duplicate cleaning stuff
* refactoring + test for invalid param combination
* adjust type hints
* Add latest docstring and tutorial changes
* add documentation to @preload_index
* Add latest docstring and tutorial changes
* recursive __init__ instead of decorator
* Add latest docstring and tutorial changes
* validate instead of check
* combine ifs
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-11 11:02:22 +01:00
tstadel
14515a861b
Tutorial for DocumentClassifier at Index Time ( #1697 )
...
* basic example of document classifier in preprocessing logic
* add batch_size to TransformersDocumentClassifier
* complete tutorial16
* Add latest docstring and tutorial changes
* fix missing batch_size
* add notebook
* test for batch_size use added
* add tutorial 16 to headers.py
* Add latest docstring and tutorial changes
* make DocumentClassifier indexing pipeline ready
* Add latest docstring and tutorial changes
* flexibility improvements for DocumentClassifier in Pipelines
* Add latest docstring and tutorial changes
* fix index time usage
* remove query from documentclassifier tests
* improve classification_field resolving + minor fixes
* Add latest docstring and tutorial changes
* tutorial 16 extended with zero shot and pipelines
* Add latest docstring and tutorial changes
* install graphviz in notebook
* Add latest docstring and tutorial changes
* remove convert_to_dicts
* Add latest docstring and tutorial changes
* Fix typo
* Add latest docstring and tutorial changes
* remove retriever from indexing pipeline
* Add latest docstring and tutorial changes
* fix save_to_yaml when using FileTypeClassifier
* emphasize the impact with zero shot classification
* Add latest docstring and tutorial changes
* adjust use_gpu to boolean in test
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-11-09 18:43:00 +01:00
Julian Risch
33b2663fdc
ensure tf-idf matrix calculation before retrieval ( #1665 )
...
* ensure tf-idf matrix calculation before retrieval
* Run fit() automatically if new documents have been added
* Add latest docstring and tutorial changes
* Fix type error
* Add test case for tfidf retriever yaml pipeline
* Use InMemoryDocStore and add 2nd test case
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-28 16:48:06 +02:00
bogdankostic
51acf779f2
Add TableTextRetriever ( #1529 )
...
* first draft / notes on new primitives
* wip label / feedback refactor
* rename doc.text -> doc.content. add doc.content_type
* add datatype for content
* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field
* update converters for . Add warning for empty
* rename label.question -> label.query. Allow sorting of Answers.
* WIP primitives
* update ui/reader for new Answer format
* Improve Label. First refactoring of MultiLabel. Adjust eval code
* fixed workflow conflict with introducing new one (#1472 )
* Add latest docstring and tutorial changes
* make add_eval_data() work again
* fix reader formats. WIP fix _extract_docs_and_labels_from_dict
* fix test reader
* Add latest docstring and tutorial changes
* fix another test case for reader
* fix mypy in farm reader.eval()
* fix mypy in farm reader.eval()
* WIP ORM refactor
* Add latest docstring and tutorial changes
* fix mypy weaviate
* make label and multilabel dataclasses
* bump mypy env in CI to python 3.8
* WIP refactor Label ORM
* WIP refactor Label ORM
* simplify tests for individual doc stores
* WIP refactoring markers of tests
* test alternative approach for tests with existing parametrization
* WIP refactor ORMs
* fix skip logic of already parametrized tests
* fix weaviate behaviour in tests - not parametrizing it in our general test cases.
* Add latest docstring and tutorial changes
* fix some tests
* remove sql from document_store_types
* fix markers for generator and pipeline test
* remove inmemory marker
* remove unneeded elasticsearch markers
* add dataclasses-json dependency. adjust ORM to just store JSON repr
* ignore type as dataclasses_json seems to miss functionality here
* update readme and contributing.md
* update contributing
* adjust example
* fix duplicate doc handling for custom index
* Add latest docstring and tutorial changes
* fix some ORM issues. fix get_all_labels_aggregated.
* update drop flags where get_all_labels_aggregated() was used before
* Add latest docstring and tutorial changes
* add to_json(). add + fix tests
* fix no_answer handling in label / multilabel
* fix duplicate docs in memory doc store. change primary key for sql doc table
* fix mypy issues
* fix mypy issues
* haystack/retriever/base.py
* fix test_write_document_meta[elastic]
* fix test_elasticsearch_custom_fields
* fix test_labels[elastic]
* fix crawler
* fix converter
* fix docx converter
* fix preprocessor
* fix test_utils
* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations
* Add latest docstring and tutorial changes
* fix crawler test. fix ocrconverter attribute
* fix test_elasticsearch_custom_query
* fix generator pipeline
* fix ocr converter
* fix ragenerator
* Add latest docstring and tutorial changes
* fix test_load_and_save_yaml for elasticsearch
* fixes for pipeline tests
* fix faq pipeline
* fix pipeline tests
* Add latest docstring and tutorial changes
* Add MultimodalRetriever
* Add latest docstring and tutorial changes
* fix weaviate
* Add latest docstring and tutorial changes
* trigger CI
* satisfy mypy
* Add latest docstring and tutorial changes
* satisfy mypy
* Add latest docstring and tutorial changes
* trigger CI
* fix question generation test
* fix ray. fix Q-generation
* fix translator test
* satisfy mypy
* wip refactor feedback rest api
* fix rest api feedback endpoint
* fix doc classifier
* remove relation of Labels -> Docs in SQL ORM
* fix faiss/milvus tests
* fix doc classifier test
* fix eval test
* fixing eval issues
* Add latest docstring and tutorial changes
* fix mypy
* WIP replace dataclasses-json with manual serialization
* Add methods to MultimodalRetriever
* Add latest docstring and tutorial changes
* revert to dataclasses-json serialization for now. remove debug prints.
* update docstrings
* fix extractor. fix Answer Span init
* fix api test
* keep meta data of answers in reader.run()
* fix meta handling
* address review feedback
* Add latest docstring and tutorial changes
* make document=None for open domain labels
* add import
* fix print utils
* fix rest api
* Add methods and tests
* Add latest docstring and tutorial changes
* Fix mypy
* Add latest docstring and tutorial changes
* Add type hints and doc strings
* Make use of initialize_device_settings
* Move serialization of pd.DataFrame to schema.py
* Fix mypy
* Adapt Document's from_dict method
* Update docstrings
* Add latest docstring and tutorial changes
* Fix mypy
* Fix mypy
* Fix Document's from_dict method
* Fix Document's to_dict method
* Change handling of table metadata
* Add latest docstring and tutorial changes
* Change naming from Multimodal to TableText
* Turn off tokenizers_parallelism in retriever tests
* Add latest docstring and tutorial changes
* Remove turning off tokenizers_parallelism in retriever tests
* Adapt convert_es_hit_to_document
* Change embed_surrounding_context to embed_meta_fields
* Add latest docstring and tutorial changes
* Add check if torch.distributed is available
* Set n_gpu to 0 in training test
* Set HIP_LAUNCH_BLOCKING to 1
* Set HIP_LAUNCH_BLOCKING to "1"
* Set use_gpu to False
* Use DataParallel only if more than one device
* Remove --find-links=https://download.pytorch.org/whl/torch_stable.html
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-25 12:27:02 +02:00
Sara Zan
6354528336
Add /documents/get_by_filters endpoint ( #1580 )
...
* Add endpoint to get documents by filter
* Add test for /documents/get_by_filters and extend the delete documents test
* Add rest_api/file-upload to .gitignore
* Make sure the document store is empty for each test
* Improve docstrings of delete_documents_by_filters and get_documents_by_filters
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-12 10:53:54 +02:00
bogdankostic
2626388961
Fix DPR tests + add Tokenizer tests ( #1429 )
...
* Fix DPR tests
* Add Tokenizer tests
2021-09-09 12:56:44 +02:00
Timo Moeller
b4fd08a296
Add testdata, add tests for qa processor, add dpr tests (some failing)
2021-09-08 12:02:08 +02:00
oryx1729
a71180a2ca
Refactor replicas config for Ray Pipelines ( #1378 )
2021-08-31 10:14:55 +02:00
Markus Paff
be8d305190
Editing docs read.me for new docs website workflow ( #1372 )
...
* editing docs read.me for new docs website workflow
* added new links to docs
2021-08-30 14:59:40 +02:00
oryx1729
bafa1b46de
Add Ray integration for Pipelines ( #1255 )
2021-08-02 14:51:24 +02:00
oryx1729
8c68699e1c
Refactor REST APIs to use Pipelines ( #922 )
2021-04-07 17:53:32 +02:00
Lalit Pagaria
e904deefa7
Add Markdown file converter ( #875 )
2021-03-23 16:31:26 +01:00
Tanay Soni
07907f9eac
Add support for indexing pipelines ( #816 )
2021-02-16 16:24:28 +01:00
Tanay Soni
8a5dc8f826
Load Pipeline with YAML config file ( #785 )
2021-02-02 17:32:17 +01:00
Timo Moeller
4803da009a
Using PreProcessor functions on eval data ( #751 )
...
* Add eval data splitting
* Adjust for split by passage, add test and test data, adjust docstrings, add max_docs to higher level function
2021-01-20 14:40:10 +01:00
Malte Pietsch
29a15c0d59
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback ( #243 )
2020-07-31 11:34:06 +02:00
Anirban Saha
6b217732f5
Add basic support for Docx Files ( #225 )
2020-07-14 12:28:19 +02:00
Tanay Soni
ef9e4f4467
Add PDF text extraction ( #109 )
2020-06-08 11:07:19 +02:00
Malte Pietsch
7400abe327
add test
2019-11-27 17:53:42 +01:00