3803 Commits

Author SHA1 Message Date
Julian Risch
5695d721aa
update link to annotation tool docu (#2005)
* update link to annotation tool docu

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-14 16:10:59 +01:00
Julian Risch
a3147cae47
Add isolated node eval mode in pipeline eval (#1962)
* run predictions on ground-truth docs in reader

* build dataframe for closed/open domain eval

* fix looping through multilabel

* fix looping through multilabel's list of labels

* simplify collecting relevant docs

* switch closed-domain eval off by default

* Add latest docstring and tutorial changes

* handle edge case params not given

* renaming & generate pipeline eval report

* add test case for closed-domain eval metrics

* Add latest docstring and tutorial changes

* test report of closed-domain eval

* report closed-domain metrics only for answer metrics not doc metrics

* refactoring

* fix mypy & remove comment

* add second for-loop & use answer as method input

* renaming & add separate loop building docs eval df

* Add latest docstring and tutorial changes

* change column order for evaluation dataframe (#1957)

* change column order for evaluation dataframe

* added missing eval column node_input

* generic order for both document and answer returning nodes; ensure no columns get lost

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

* fix column reordering after renaming of node_input

* simplify tests & add docu

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ju-gu <87523290+ju-gu@users.noreply.github.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai>
2022-01-14 14:37:16 +01:00
Sara Zan
e28bf618d7
Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL (#1990)
* Properly fix MetaDocumentORM and MetaLabelORM with composite foreign key constraints

* update_document_meta() was not using index properly

* Exclude ES and Memory from the cosine_sanity_check test

* move ensure_ids_are_correct_uuids in conftest and move one test back to faiss & milvus suite

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-14 13:48:58 +01:00
MichelBartels
3e4dbbb32c
Align similarity scores across document stores (#1967)
* align document store similarity functions

* remove unnecessary imports

* undone accidental change

* stopped weaviate from pretending to support dot product similarity

* Add latest docstring and tutorial changes

* fix fixture params for document stores

* use cosine similarity for most tests

* fix cosine similarity test

* fix faiss test

* fix weaviate test

* fix accidental deletion

* fix document_store fixture

* test fix; shouldn't be merged

* fix test_normalize_embeddings_diff_shapes

* probably a better fix

* fix for parameter combinations

* revert new pytest_generate_tests functionality

* simplify pytest_generate_tests

* normalize embeddings for test_dpr_embedding

* add to faiss doc that embeddings are normalized

* Add latest docstring and tutorial changes

* remove unnecessary parameters and add comments

* simplify two lines of memory.py into one

* test similarity scores with smaller language model

* fix test_similarity_score

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-12 19:28:20 +01:00
Manos Papathanasiou
965b9614db
Upgrade pillow version to 9.0.0 (#1992) 2022-01-12 09:59:51 +01:00
Dmitry Goryunov
79fdda8a7c
Remove hard-coded variables from the Tutorial 15 (#1984)
* Remove hard-coded variables from the Tutorial 15

* Fix missing comma

* Add latest docstring and tutorial changes

* Fix formatting in Tutorial15_TableQA.ipynb

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-11 17:55:20 +01:00
tstadel
c861fdb2ce
Enable batch mode for SAS cross encoders (#1987)
* enable batch mode for sas cross encoders

* fix mypy

* comment on top_1 values added
2022-01-11 17:54:43 +01:00
Sara Zan
9c3d9b4885
Add models to demo docker image (#1978)
* Add utility to cache models and nltk data & modify Dockerfiles to use it

* Fix punkt data not being cached
2022-01-11 16:37:45 +01:00
tstadel
192e03be33
Fix elasticsearch scores if they are 0.0 (#1980)
* fix elasticsearch zero scores

* remove unnecessary None check
2022-01-11 09:35:02 +01:00
Mathew Kuriakose
a44b6c18c0
Unify vector_dim and embedding_dim parameter in Document Store (#1922)
* Refactored code to unify vector_dim and embedding_dim parameter in DocumentStores

* Unit test cases updated to use `embedding_dim` instead of `vector_dim`

* Unit test case update to use embedding_dim instead of vector_dim

* Add latest docstring and tutorial changes

* Put usage of `vector_dim` param in same if-block as corresponding warning

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-01-10 18:10:32 +01:00
Benjamin Bossan
00dc30ae54
Use scikit-learn, not sklearn, in requirements.txt (#1974) 2022-01-10 09:56:34 +01:00
ju-gu
b7041941df
change column order for evaluation dataframe (#1957)
* change column order for evaluation dataframe

* added missing eval column node_input

* generic order for both document and answer returning nodes; ensure no columns get lost

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
2022-01-07 14:13:28 +01:00
oryx1729
5b3f693562
Fix Dockerfile-GPU (#1969) 2022-01-06 11:13:04 +01:00
mathislucka
db76a5c5c6
fix UserWarning from slow tensor conversion (#1948)
* fix UserWarning from slow tensor conversion

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-05 22:42:54 +01:00
Julian Risch
30ea1d475d
check multiprocessing sharing strategy is available (#1965)
* check multiprocessing sharing strategy is available

* Change default of multiprocessing strategy to None

* Change default sharing strategy to None in retriever

* Add latest docstring and tutorial changes

* Make logging message easier to understand

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-05 18:22:09 +01:00
oryx1729
e075663feb
Upgrade torch version (#1960) 2022-01-05 18:14:14 +01:00
Yorick van Zweeden
65cd39b533
Fix vector_id collision in FAISS (#1961)
* Fix FAISS vector_id count

* Fix mypy errors

Co-authored-by: Yorick van Zweeden <git@yorickvanzweeden.nl>
2022-01-05 18:10:47 +01:00
MichelBartels
0b0b9689a4
Add TinyBERT data augmentation (#1923)
* add tinybert data augmentation

* don't reload glove in tinybert data augmentation

* fix unnecessary load_glove call

* fix type hints

* add comments and type hints

* add batch_size argument

* don't predict subwords as alternative for words

* fix subword predictions

* limit sequence length

* actually limit sequence length

* improve performance by calculating nearest glove vector on gpu

* add model and tokenizer parameter

* fix type hints

* improve data augmentation performance

* explained limits of script

* corrected comment

* added data augmentation test

* don't label every question in augmented dataset as impossible

* add sample glove

* better handling of downloading of glove

* fix typo of last commit
2022-01-04 18:34:16 +01:00
oryx1729
854af92dc5
Update docker_build.yml 2022-01-04 17:46:34 +01:00
oryx1729
2910f67718
Use long Commit ID for Docker tags (#1946) 2022-01-04 17:39:49 +01:00
Yorick van Zweeden
180befd07a
Propagate duplicate_documents to base class initialization (#1936)
* Add duplicate_documents to base class initialization

* Remove redundant assignment in subclasses

Co-authored-by: Yorick van Zweeden <git@yorickvanzweeden.nl>
2022-01-04 15:04:15 +01:00
oryx1729
00c823cdff
Add GitHub Action for Docker Build for GPU (#1916) 2022-01-04 14:33:13 +01:00
Alon Eirew
7a4fa42fda
Fix #1927 - RuntimeError when loading data using data_silo due to many open file descriptors from multiprocessing (#1928)
* fix #1687

* fix RuntimeError: received 0 items of ancdata

* Add an arg multiprocessing_strategy to DataSilo and DPR.train()

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-04 13:29:26 +01:00
bogdankostic
381fc302cb
Fix loading a saved FAISSDocumentStore (#1937)
* Remove faiss_index param from config

* Add Tests

* Add assertions to tests
2022-01-04 12:22:31 +01:00
bogdankostic
3e0ef1cc8a
Fix Numba TypingError in normalize_embedding for cosine similarity (#1933)
* Fix Numba TypingError

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-03 17:14:51 +01:00
bogdankostic
202ef276ee
Make sure content_type exists (#1938) 2022-01-03 17:00:31 +01:00
bogdankostic
c85ac2baec
Update Ray to version 1.9.1 (#1934) 2022-01-03 16:59:58 +01:00
bogdankostic
45df18c416
Add RCIReader for TableQA (#1909)
* Add RCIReader

* Add latest docstring and tutorial changes

* Add Doc Strings

* Add latest docstring and tutorial changes

* Add Tests

* Add Doc Strings

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-03 16:59:24 +01:00
Kristof Herrmann
6e8e3c68d9
Custom id hashing on documentstore level (#1910)
* adding dynamic id hashing

* Add latest docstring and tutorial changes

* added pr review

* Add latest docstring and tutorial changes

* fixed tests

* fix mypy error

* fix mypy issue

* ignore typing

* fixed correct check

* fixed tests

* try fixing the tests

* set id hash keys only if not none

* dont store id_hash_keys

* fix tests

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-03 16:58:19 +01:00
Julian Risch
a846be99d1
Extend TranslationWrapper to work with QA Generation (#1905)
* draft translationwrapper example

* draft translation of generated qa pairs

* Add latest docstring and tutorial changes

* fixed pass by reference by deepcopy

* delete adapted tutorial 13 (test purposes only)

* adapt method signature and doc string

* Add latest docstring and tutorial changes

* add type ignore

* extend tutorial 13 with TranslationWrapper example

* Add latest docstring and tutorial changes

* removed duplicate code

* indent if statement

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ArzelaAscoIi <kristof.herrmann@rwth-aachen.de>
2022-01-03 13:30:24 +01:00
tstadel
a94c274134
Support custom headers per request in pipeline (#1861)
* chain headers param down to document_stores

* Add latest docstring and tutorial changes

* fix InMemoryDocumentStore params

* Add latest docstring and tutorial changes

* fix TfidfRetriever params

* Add latest docstring and tutorial changes

* fix missing headers

* Add latest docstring and tutorial changes

* fix sparql client and update docs

* Add latest docstring and tutorial changes

* test for documentstores

* pipeline tests added

* update header param in docstrings

* Add latest docstring and tutorial changes

* refactoring: headers as implicit param

* Add latest docstring and tutorial changes

* remove unnecessary imports

* propagate batch_size correctly

* Add latest docstring and tutorial changes

* revert InMemoryDocumentStore.write_documents signature

* Add latest docstring and tutorial changes

* remove #type: ignore

* Add latest docstring and tutorial changes

* replace MutableMapping by Dict

* Add latest docstring and tutorial changes

* improve docstrings

* Add latest docstring and tutorial changes

* get rid of **kwargs

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-01-03 11:38:02 +01:00
el2e10
377c20b8b1
Fix grammatical issue in optimization guides (#1941) 2022-01-03 11:06:13 +01:00
Alon Eirew
a1fb70bbbd
Make ctx_segment_ids a list instead of np.zeros_like
* fix #1687

* fix - UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow..

* fix RuntimeError: received 0 items of ancdata

* Remove set_sharing_strategy from this branch and replace numpy.zeros_like with a Python list
2022-01-03 08:33:55 +01:00
bogdankostic
39573cf0a9
Add ParsrConverter (#1931)
* Add ParsrConverter

* Fix typing error + add Parsr to Linux CI

* Fix valid_language for all converters + fix context generation for ParsrConverter

* Remove ParsrConverter test from WindowsCI

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-12-30 10:15:11 +01:00
Markus Paff
04f3b39ad5
Text for contributor license agreement (#1766)
* text for contributor license agreement

* formatting

* Add details about process

* test
2021-12-28 14:01:20 +01:00
MichelBartels
f33c2b987a
Adding distillation loss functions from TinyBERT (#1879)
* initial tinybertdistill commit

* add tinybert distill loss

* remove teacher caching for tinybert

* add tinybert to distil_from method

* Add latest docstring and tutorial changes

* add dim mapping and fix type hints

* fix type hints

* fix dummy input

* fix dim mapping for tinybert loss and add comments/doc strings

* add test for tinybert loss

* Add latest docstring and tutorial changes

* add comment

* fix BERT forward parameters

* add doc string to AdaptiveModel forward method

* remove unnecessary data silo

* fix farm import

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-12-23 14:54:02 +01:00
tstadel
fc8df2163d
Fix Windows CI OOM (#1878)
* set fixture scope to "function"

* run FARMReader without multiprocessing

* dispose of ray after tests

* run most expensive tasks first in test files

* run expensive tests first

* run garbage collector between tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-12-22 17:20:23 +01:00
tstadel
7bdb782871
Raise exception if Elasticsearch search_fields have wrong datatype (#1913) 2021-12-20 16:10:55 +01:00
Dmitry Goryunov
42a0fc3860
Include ray version compatible with M1 processor (#1906) 2021-12-20 10:16:59 +01:00
Johnny-KP
51e84b805b
Changed export to csv method to new answer format (#1907) 2021-12-17 16:10:29 +01:00
bogdankostic
74c80e0c71
Set mypy version to 0.910 (#1899) 2021-12-16 14:02:04 +01:00
javier ramírez
5c7f3c234e
Fix minor typo in readme (#1900)
I just added a missing "r" to the word "contributions" in the "Overview and Usage" section
2021-12-16 13:31:27 +01:00
bogdankostic
4edec04c2c
Add improvements to AzureConverter (#1896)
* Add some improvements to AzureConverter

* Adapt docstring + use Path instead of str

* Fix mypy version to 0.910
2021-12-16 12:45:24 +01:00
Alberto Villa
e4aec4661d
Improved version of print_answers (#1891)
* Improved version of print_answers

* Changed the way max_text_len is checked
2021-12-15 17:16:33 +01:00
Alberto Villa
1bb6244a63
Exchanged minimal with minimum in print_answers function call (#1890) 2021-12-14 15:27:37 +01:00
Alberto Villa
2396f0cd3a
Correct bug with encoding when generating Markdown documentation; linked with issue #1880 (#1881) 2021-12-14 10:50:25 +01:00
tstadel
57a04631df
introduce node_input param (#1854)
* introduce node_input param

* Add latest docstring and tutorial changes

* prediction and label as node_input values

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-12-14 10:34:35 +01:00
Ivan Lopez
86f5688f47
fix wrong branch and repo, add cloudwatch agent (#1877) 2021-12-13 20:32:25 +01:00
Sara Zan
de71b944d7
Fix typo in the Windows CI UI deps (#1876)
* Fix typo in the WindowsCI UI deps

* Force a deps cache miss
2021-12-13 15:49:44 +01:00
Malte Pietsch
7084a24794
Bump version to 1.0 in REST api (#1875) 2021-12-13 12:39:59 +01:00