1024 Commits

Author SHA1 Message Date
Sara Zan
9af1292cda
Remove stray requirements.txt files and update README.md (#2075)
* Remove stray requirements.txt files and update README.md

* Remove requirement files

* Add details about pip bug and link to setup.cfg

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-27 11:22:14 +01:00
AhmedIdr
488c3e9e52
pass faiss batch_size to sqldocumentstore (#2061) 2022-01-26 19:35:16 +01:00
Julian Risch
5079c6847a
Convert doc embedding from ndarray to list of float for REST API (#1901)
* convert ndarray doc embedding to list of float

* check type of embedding of each doc individually

* Fix in case documents is None
2022-01-26 18:20:44 +01:00
Sara Zan
d470b9d0bd
Improve dependency management (#1994)
* First attempt at using setup.cfg for dependency management

* Trying the new package on the CI and in Docker too

* Add composite extras_require

* Add the safe_import function for document store imports and add some try-catch statements on rest_api and ui imports

* Fix bug on class import and rephrase error message

* Introduce typing for optional modules and add type: ignore in sparse.py

* Include importlib_metadata backport for py3.7

* Add colab group to extras_require

* Fix pillow version

* Fix grpcio

* Separate out the crawler as another extra

* Make paths relative in rest_api and ui

* Update the test matrix in the CI

* Add try catch statements around the optional imports too to account for direct imports

* Never mix direct deps with self-references and add ES deps to the base install

* Refactor several paths in tests to make them insensitive to the execution path

* Include tstadel review and re-introduce Milvus1 in the tests suite, to fix

* Wrap pdf conversion utils into safe_import

* Update some tutorials and revert Milvus1 as default for now, see #2067

* Fix mypy config


Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-26 18:12:55 +01:00
MichelBartels
4cc37548e3
Fix finetuning notebook augmentation (#2071)
* fix data augmentation path in finetuning notebook

* Add latest docstring and tutorial changes

* make distillation possible with other models than BERT

* use smaller dataset for distillation in finetuning tutorial

* Add latest docstring and tutorial changes

* make data augmentation in finetuning faster

* update language models forward doc strings

* fix return type of language models

* remove debug output

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-26 17:49:14 +01:00
Sowmiya Jaganathan
c4fff19018
Supported Highlighting in Elasticsearch (#1930)
* Supported Highlighting

* Review changes

* add example to docstrings

* Add latest docstring and tutorial changes

* Add latest docstring and tutorial changes

Co-authored-by: sowmiya-emplay <sowmiya.j@emplay.net>
Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
2022-01-26 17:35:33 +01:00
Adrien Wald
2edc421a09
Add top_k_join parameter to JoinDocuments.run (#2065)
* add top_k_join parameter to JoinDocuments.run

* test JoinDocuments concatenate with top_k_join parameter

* test two different top_k_join parameters
2022-01-26 17:30:16 +01:00
mathislucka
5b7e906e85
fix: get_documents_by_id should return docs for all passed ids (#2064)
* doc store should return all documents matching ids passed to get_documents_by_id

* test for get_document_by_id should be named correctly

* add test for get_documents_by_id

* Add latest docstring and tutorial changes

* document es query limit

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-26 12:39:04 +01:00
Julian Risch
0f34983f74
fix answer is not subscriptable error (#2069)
* fix answer is not subscriptable error

* fix answer is not subscriptable in script
2022-01-26 11:45:45 +01:00
tstadel
8a32d8da92
Introduce readonly DCDocumentStore (without labels support) (#1991)
* minimal DCDocumentStore

* support filters

* implement get_documents_by_id

* handle not existing documents

* add docstrings

* auth added

* add tests

* generate docs

* Add latest docstring and tutorial changes

* add responses to dev dependencies

* fix tests

* support query() and query_by_embedding()

* Add latest docstring and tutorial changes

* query tests added

* read api_key and api_endpoint from env

* Add latest docstring and tutorial changes

* support query() and query_by_embedding()

* query tests added

* Add latest docstring and tutorial changes

* Add latest docstring and tutorial changes

* support dynamic similarity and return_embedding values

* Add latest docstring and tutorial changes

* adjust KeywordDocumentStore description

* refactoring

* Add latest docstring and tutorial changes

* implement get_document_count and raise on all not implemented methods

* Add latest docstring and tutorial changes

* don't use abbreviation DC in comments and errors

* Add latest docstring and tutorial changes

* docstring added to KeywordDocumentStore

* Add latest docstring and tutorial changes

* enhanced api key set

* split tests into two parts

* change setup.py in order to work around build cache

* added link

* Add latest docstring and tutorial changes

* rename DCDocumentStore to DeepsetCloudDocumentStore

* Add latest docstring and tutorial changes

* remove dc.py

* reinsert link to docs

* fix imports

* Add latest docstring and tutorial changes

* better test structure

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ArzelaAscoIi <kristof.herrmann@rwth-aachen.de>
2022-01-25 20:36:28 +01:00
Sara Zan
d147443cb1
Pin Milvus to <2.0.0 (#2063) 2022-01-25 17:12:56 +01:00
MichelBartels
5b6b0cef77
Add UnlabeledTextProcessor (#2054)
* add UnlabeledTextProcessor

* allow choosing processor when finetuning or distilling

* fix type hint

* Add latest docstring and tutorial changes

* improve segment id computation for UnlabeledTextProcessor

* add text and documentation

* change batch size parameter for intermediate layer distillation

* Add latest docstring and tutorial changes

* fix distillation dim mapping

* remove unnecessary changes

* removed confusing parameter

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-25 14:54:34 +01:00
Julian Risch
c6f23dce88
upgrade haystack version number to 1.1.0 (#2039)
* upgrade haystack version number to 1.1.0

* copy docs to new version folder
v1.1.0
2022-01-20 13:45:38 +01:00
tstadel
50317d74bd
Add ndcg and eval_mode to docs (#2038)
* add ndcg and eval_mode to docstrings and reorder dataframe columns in docs

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-20 13:02:46 +01:00
MichelBartels
e8cd5ea943
Add distillation to finetuning tutorial (#2025)
* Add finetuning tutorial

* Add latest docstring and tutorial changes

* fix typo

* Add latest docstring and tutorial changes

* improve distillation explanation in finetuning tutorial

* Add latest docstring and tutorial changes

* allow augment_squad.py to be easier to call from within python

* Update Tutorial2_Finetune_a_model_on_your_data.py

* fix squad augmentation test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-20 12:18:32 +01:00
oryx1729
cb881b6fa9
Disable pip cache for Dockerfiles (#2015) 2022-01-19 10:26:17 +01:00
Kristof Herrmann
6267476015
Bugfix - save_to_yaml for OpenSearchDocumentStore (#2017)
* fix save_to_yaml

* add link to issue

* added generic implementation

* added type

* remove not used imports
2022-01-19 10:10:50 +01:00
Yorick van Zweeden
ea10d011ab
Replace SessionState with Streamlit built-in (#2006)
* Replace SessionState with Streamlit built-in

* Set session state to default if absent

Co-authored-by: Yorick van Zweeden <git@yorickvanzweeden.nl>
2022-01-18 14:59:42 +01:00
MichelBartels
0cca2b97cd
distinguish intermediate layer & prediction layer distillation phases with different parameters (#2001)
* add parameters to allow for different hyperparameters in stage 1 and 2 of tinybert distillation

* Add latest docstring and tutorial changes

* improve default parameters

* Add latest docstring and tutorial changes

* split up distillation method

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-14 20:40:38 +01:00
tstadel
f42d2e8ba0
Add nDCG to pipeline.eval()'s document metrics (#2008)
* add ndcg metric

* fix merge

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-14 18:36:41 +01:00
Julian Risch
2c063e960e
Extend Tutorial 5 with Upper Bound Reader Eval Metrics (#1995)
* print report for closed-domain eval

* Add latest docstring and tutorial changes

* rename parameter and rewrite docs

* Add latest docstring and tutorial changes

* print eval report in separate cell

* Add latest docstring and tutorial changes

* explain when to eval individual components

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-14 16:29:18 +01:00
Julian Risch
5695d721aa
update link to annotation tool docu (#2005)
* update link to annotation tool docu

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-14 16:10:59 +01:00
Julian Risch
a3147cae47
Add isolated node eval mode in pipeline eval (#1962)
* run predictions on ground-truth docs in reader

* build dataframe for closed/open domain eval

* fix looping through multilabel

* fix looping through multilabel's list of labels

* simplify collecting relevant docs

* switch closed-domain eval off by default

* Add latest docstring and tutorial changes

* handle edge case params not given

* renaming & generate pipeline eval report

* add test case for closed-domain eval metrics

* Add latest docstring and tutorial changes

* test report of closed-domain eval

* report closed-domain metrics only for answer metrics not doc metrics

* refactoring

* fix mypy & remove comment

* add second for-loop & use answer as method input

* renaming & add separate loop building docs eval df

* Add latest docstring and tutorial changes

* change column order for evaluation dataframe (#1957)

* change column order for evaluation dataframe

* added missing eval column node_input

* generic order for both document and answer returning nodes; ensure no columns get lost

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

* fix column reordering after renaming of node_input

* simplify tests & add docu

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ju-gu <87523290+ju-gu@users.noreply.github.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai>
2022-01-14 14:37:16 +01:00
Sara Zan
e28bf618d7
Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL (#1990)
* Properly fix MetaDocumentORM and MetaLabelORM with composite foreign key constraints

* update_document_meta() was not using index properly

* Exclude ES and Memory from the cosine_sanity_check test

* move ensure_ids_are_correct_uuids in conftest and move one test back to faiss & milvus suite

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-14 13:48:58 +01:00
MichelBartels
3e4dbbb32c
Align similarity scores across document stores (#1967)
* align document store similarity functions

* remove unnecessary imports

* undone accidental change

* stopped weaviate from pretending to support dot product similarity

* stopped weaviate from pretending to support dot product similarity

* Add latest docstring and tutorial changes

* fix fixture params for document stores

* use cosine similarity for most tests

* fix cosine similarity test

* fix faiss test

* fix weaviate test

* fix accidental deletion

* fix document_store fixture

* test fix; shouldn't be merged

* fix test_normalize_embeddings_diff_shapes

* probably a better fix

* fix for parameter combinations

* revert new pytest_generate_tests functionality

* simplify pytest_generate_tests

* normalize embeddings for test_dpr_embedding

* add to faiss doc that embeddings are normalized

* Add latest docstring and tutorial changes

* remove unnecessary parameters and add comments

* simplify two lines of memory.py into one

* test similarity scores with smaller language model

* fix test_similarity_score


Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-12 19:28:20 +01:00
Manos Papathanasiou
965b9614db
Upgrade pillow version to 9.0.0 (#1992) 2022-01-12 09:59:51 +01:00
Dmitry Goryunov
79fdda8a7c
Remove hard-coded variables from the Tutorial 15 (#1984)
* Remove hard-coded variables from the Tutorial 15

* Fix missing comma

* Add latest docstring and tutorial changes

* Fix formatting in Tutorial15_TableQA.ipynb

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-11 17:55:20 +01:00
tstadel
c861fdb2ce
Enable batch mode for SAS cross encoders (#1987)
* enable batch mode for sas cross encoders

* fix mypy

* comment on top_1 values added
2022-01-11 17:54:43 +01:00
Sara Zan
9c3d9b4885
Add models to demo docker image (#1978)
* Add utility to cache models and nltk data & modify Dockerfiles to use it

* Fix punkt data not being cached
2022-01-11 16:37:45 +01:00
tstadel
192e03be33
Fix elasticsearch scores if they are 0.0 (#1980)
* fix elasticsearch zero scores

* remove unnecessary None check
2022-01-11 09:35:02 +01:00
Mathew Kuriakose
a44b6c18c0
Unify vector_dim and embedding_dim parameter in Document Store (#1922)
* Refactored code to unify vector_dim and embedding_dim parameter in DocumentStores

* Unit test cases updated to use `embedding_dim` instead of `vector_dim`

* Unit test case update to use embedding_dim instead of vector_dim

* Add latest docstring and tutorial changes

* Put usage of `vector_dim` param in same if-block as corresponding warning

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-01-10 18:10:32 +01:00
Benjamin Bossan
00dc30ae54
Use scikit-learn, not sklearn, in requirements.txt (#1974) 2022-01-10 09:56:34 +01:00
ju-gu
b7041941df
change column order for evaluation dataframe (#1957)
* change column order for evaluation dataframe

* added missing eval column node_input

* generic order for both document and answer returning nodes; ensure no columns get lost

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
2022-01-07 14:13:28 +01:00
oryx1729
5b3f693562
Fix Dockerfile-GPU (#1969) 2022-01-06 11:13:04 +01:00
mathislucka
db76a5c5c6
fix UserWarning from slow tensor conversion (#1948)
* fix UserWarning from slow tensor conversion

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-05 22:42:54 +01:00
Julian Risch
30ea1d475d
check multiprocessing sharing strategy is available (#1965)
* check multiprocessing sharing strategy is available

* Change default of multiprocessing strategy to None

* Change default sharing strategy to None in retriever

* Add latest docstring and tutorial changes

* Make logging message easier to understand

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-05 18:22:09 +01:00
oryx1729
e075663feb
Upgrade torch version (#1960) 2022-01-05 18:14:14 +01:00
Yorick van Zweeden
65cd39b533
Fix vector_id collision in FAISS (#1961)
* Fix FAISS vector_id count

* Fix mypy errors

Co-authored-by: Yorick van Zweeden <git@yorickvanzweeden.nl>
2022-01-05 18:10:47 +01:00
MichelBartels
0b0b9689a4
Add TinyBERT data augmentation (#1923)
* add tinybert data augmentation

* don't reload glove in tinybert data augmentation

* fix unnecessary load_glove call

* fix type hints

* add comments and type hints

* add batch_size argument

* don't predict subwords as alternative for words

* fix subword predictions

* limit sequence length

* actually limit sequence length

* improve performance by calculating nearest glove vector on gpu

* add model and tokenizer parameter

* fix type hints

* improve data augmentation performance

* explained limits of script

* corrected comment

* added data augmentation test

* don't label every question in augmented dataset as impossible

* add sample glove

* better handling of downloading of glove

* fix typo of last commit
2022-01-04 18:34:16 +01:00
oryx1729
854af92dc5
Update docker_build.yml 2022-01-04 17:46:34 +01:00
oryx1729
2910f67718
Use long Commit ID for Docker tags (#1946) 2022-01-04 17:39:49 +01:00
Yorick van Zweeden
180befd07a
Propagate duplicate_documents to base class initialization (#1936)
* Add duplicate_documents to base class initialization

* Remove redundant assignment in subclasses

Co-authored-by: Yorick van Zweeden <git@yorickvanzweeden.nl>
2022-01-04 15:04:15 +01:00
oryx1729
00c823cdff
Add GitHub Action for Docker Build for GPU (#1916) 2022-01-04 14:33:13 +01:00
Alon Eirew
7a4fa42fda
Fix #1927 - RuntimeError when loading data using data_silo due to many open file descriptors from multiprocessing (#1928)
* fix #1687

* fix RuntimeError: received 0 items of ancdata

* Add an arg multiprocessing_strategy to DataSilo and DPR.train()

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-04 13:29:26 +01:00
bogdankostic
381fc302cb
Fix loading a saved FAISSDocumentStore (#1937)
* Remove faiss_index param from config

* Add Tests

* Add assertions to tests
2022-01-04 12:22:31 +01:00
bogdankostic
3e0ef1cc8a
Fix Numba TypingError in normalize_embedding for cosine similarity (#1933)
* Fix Numba TypingError

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-03 17:14:51 +01:00
bogdankostic
202ef276ee
Make sure content_type exists (#1938) 2022-01-03 17:00:31 +01:00
bogdankostic
c85ac2baec
Update Ray to version 1.9.1 (#1934) 2022-01-03 16:59:58 +01:00
bogdankostic
45df18c416
Add RCIReader for TableQA (#1909)
* Add RCIReader

* Add latest docstring and tutorial changes

* Add Doc Strings

* Add latest docstring and tutorial changes

* Add Tests

* Add Doc Strings

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-03 16:59:24 +01:00
Kristof Herrmann
6e8e3c68d9
Custom id hashing on documentstore level (#1910)
* adding dynamic id hashing

* Add latest docstring and tutorial changes

* added pr review

* Add latest docstring and tutorial changes

* fixed tests

* fix mypy error

* fix mypy issue

* ignore typing

* fixed correct check

* fixed tests

* try fixing the tests

* set id hash keys only if not none

* dont store id_hash_keys

* fix tests

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-03 16:58:19 +01:00