3803 Commits

Author SHA1 Message Date
Vladimir Blagojevic
a2905d05f7
Bump version to next release candidate (#2765)
* Bump version to next release candidate

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-06 11:26:42 +02:00
Vladimir Blagojevic
c80336c424
Upgrade to v1.6.0 and copy docs folder (#2764)
* Upgrade to v1.6.0 and copy docs folder

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
v1.6.0
2022-07-06 10:25:15 +02:00
tstadel
2a7c0139f5
double max heap size for elasticsearch in CI (#2756)
2022-07-05 13:53:32 +02:00
bogdankostic
353da8b1c1
Add Tutorials 16, 17 and 18 to README (#2758)
2022-07-05 12:04:58 +02:00
Julian Risch
f70f4e90fd
correct docstring parameter name (#2757)
2022-07-05 12:00:40 +02:00
Patrick Deutschmann
1db3fd0942
Add support for Multi-Hop Dense Retrieval (#2571)
* Implement MDR

* Adapt conftest to new MDR signature

* Update Documentation & Code Style

* Change signature of queries param in batch methods of MDR like in #2575

* Update Documentation & Code Style

* Rename MultihopDenseRetriever to MultihopEmbeddingRetriever

* Fix filters in retrieve_batch

* Add docstring for MultihopEmbeddingRetriever.__init__

* Update Documentation & Code Style

* Revert forward signature of TextSimilarityHead

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-05 11:31:11 +02:00
bogdankostic
dc48c444d4
Fix loading of tokenizers in DPR (#2755)
2022-07-04 18:18:14 +02:00
Tuana Celik
2a8b129bae
first version of save_to_remote for HF from FarmReader (#2618)
* first version of save_to_remote for HF from FarmReader

* Update Documentation & Code Style

* Changes based on comments

* Update Documentation & Code Style

* imports order

* making small changes to pydoc

* indent fix

* Update Documentation & Code Style

* keyword arguments instead of positional

* Changing to repo_id

huggingface-hub package would have to be v0.5 or higher - checking how to handle with Thomas

* Update Documentation & Code Style

* adding huggingface-hub dependency 0.5 or above

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-07-04 15:39:56 +02:00
Julian Risch
f7d00476f9
Reduce logging messages and simplify logging (#2682)
* change log levels to debug and use torch.div

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 14:02:55 +02:00
tstadel
322d964679
Remove rapidfuzz version pin (#2730)
* remove rapidfuzz version pin

* exclude malicious version 2.0.14

* update rapidfuzz version restrictions
2022-07-04 13:53:39 +02:00
Francesco Castelli
31dcd55c24
Validate max_seq_length in SquadProcessor (#2740)
* added max_seq_len validation in SquadProcessor

* fixed string formatting

* added tests for invalid max_seq_len

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 13:35:45 +02:00
Vladimir Blagojevic
ffb7e4e4bd
GPL tutorial - add GPU header and open in colab button (#2736)
* GPL tutorial - add GPU header and open in colab button

* Add GPL tutorial to run exclusion list
2022-07-04 05:23:39 -04:00
Julian Risch
1c1faa4742
Make check of document & embedding count optional in FAISS and Pinecone (#2677)
* make validation optional & add method call in pinecone init

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 10:12:31 +02:00
Julian Risch
1781e88802
Upgrade torch to 1.12 (#2741)
* Upgrade torch to 1.12

* upgrade torch-scatter

* add explicit torch-scatter installation

* set torch dependency to range >1.9,<1.13
2022-07-01 20:23:32 +02:00
Daniel Augustus Bichuetti Silva
e3b2ee956a
Improved crawler support for dynamically loaded pages (#2710)
* Improved crawler support for dynamically loaded pages

* Reduced scope of StaleElementReferenceException and removed deprecated code from WebDriver initialization

* Improvements on crawler testing code

* Code format and style applied on f028331948c170448613e86dfdfa222f7c2043fd

* Update Documentation & Code Style

* Remove unused imports/parameters

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-01 10:47:33 +02:00
Massimiliano Pippi
1e01cd0efb
pin es client to include bugfixes (#2735)
2022-06-27 15:13:34 +02:00
mathislucka
8d65bc5f9b
Update document scores based on ranker node (#2048)
* ranker should return scores for later usage

* fix wrong tuple order

* adjust ranker scores; add tests

* Update Documentation & Code Style

* fix mypy

* Update Documentation & Code Style

* fix mypy

* Update Documentation & Code Style

* relax ranker test tolerance

* update ranker test score

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2022-06-27 12:17:18 +02:00
Julian Risch
46c9c8c562
Upgrade transformers to 4.20.1 (#2702)
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-06-27 11:56:58 +02:00
Vladimir Blagojevic
b08c5f81d1
Add GPL adaptation tutorial (#2632)
* Add GPL adaptation tutorial

* Latest round of Aga's corrections

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-26 02:44:57 -04:00
Sara Zan
426f49979b
Change repo with repository in python_cache (#2731)
* Change repo with repository

* remove name

* using owner and name

* use owner name

* replace name with login

* Trying with the PR context instead
2022-06-24 18:36:19 +02:00
Sara Zan
6a7152044e
add repo name as well (#2729)
2022-06-24 17:08:28 +02:00
Stefano Fiorucci
42b1a5c3a4
fix error in log message (#2719)
* fix error in log message

* Update Documentation & Code Style

* pass index to _drop_duplicate_documents

* make the use of index in logging more explicit

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-24 16:53:52 +02:00
Sara Zan
13514f960d
Specify ref in action (#2727)
2022-06-24 15:56:17 +02:00
Massimiliano Pippi
3207f372ee
Fix bugs in loading code from yaml (#2705)
* fix bug in loading code from yaml
2022-06-24 14:52:13 +02:00
tstadel
ab443aab28
Fix match_context tests in test_utils.py (#2725)
* fix match_context tests

* fix naming of test

* pin rapidfuzz to 2.0.13
2022-06-24 13:23:00 +02:00
Sara Zan
e8546e2124
Replace deprecated Selenium methods (#2724)
* Fix crawler.py

* Fix test_connector.py

* unused import

Co-authored-by: danielbichuetti <daniel.bichuetti@gmail.com>
2022-06-24 12:05:32 +02:00
Sara Zan
400d2cdf77
Fix audio tests on CI (#2718)
* Update Documentation & Code Style

* fix huggingface-hub version

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-24 11:36:31 +02:00
tstadel
1168f6365d
Fix using id_hash_keys as pipeline params (#2717)
* Fix using id_hash_keys as pipeline params

* Update Documentation & Code Style

* add tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-24 09:55:09 +02:00
tstadel
a084a982c4
Show warning in reader.eval() about differences compared to pipeline.eval() (#2477)
* deprecate reader.eval

* Update Documentation & Code Style

* update warning to describe differences between pipeline.eval()

* remove empty lines

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-23 18:40:17 +02:00
Sara Zan
e69492a28f
Tutorial 14 doc changes (#2714)
* let the bot apply changes in this pr

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-23 12:36:12 +02:00
Stefano Fiorucci
b01a7c2259
Add InMemoryKnowledgeGraph (#2678)
* draft for InMemoryKnowledgeGraph

* remove comments

* Update Documentation & Code Style

* fix import and signature

* Fix dependencies for in_memory_knowledge_graph

* updated tutorials

* Update Documentation & Code Style

* fix bug in notebook

* fix other notebook bug

* Update Documentation & Code Style

* improved tutorial notebook

* Update Documentation & Code Style

* better implementation of InMemoryKnowledgeGraph

* fix

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-22 19:16:33 +02:00
Rob Pasternak
b87c0c950b
Tutorial 14 edit (#2663)
* Rewrite Tutorial 14 for increased user-friendliness

* Update Tutorial14 .py file to match .ipynb file

* Update Documentation & Code Style

* unblock the ci

* ignore error in jitterbit/get-changed-files

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-06-22 13:03:07 +02:00
Julian Risch
325bc5466a
Revert "Upgrade transformers to 4.20.0 (#2694)" (#2700)
This reverts commit 4a63707f1a177123c13929eb316d3ecaa7fd6c5f.
2022-06-21 21:17:21 +02:00
Julian Risch
4a63707f1a
Upgrade transformers to 4.20.0 (#2694)
2022-06-21 17:23:31 +02:00
Sara Zan
505ababf43
Skip Pinecone tests (#2696)
* comment out Pinecone tests block

* Add comment
2022-06-21 14:49:36 +02:00
Massimiliano Pippi
5d255f0e4a
replace question issue with link to discussions (#2697)
2022-06-21 14:10:11 +02:00
Sara Zan
a6c06ee376
Update contributor's checklists in PR template (#2659)
* Split contributor's and reviewer's checklists

* contributor-centric checklist

* Move issues at the top and split entry

* phrasing
2022-06-21 10:11:18 +02:00
tstadel
da5ea73339
Fix EvaluationSetClient.get_labels() (#2690)
* fix EvaluationSetClient.get_labels()

* Update Documentation & Code Style

* fix tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-20 19:16:09 +02:00
bogdankostic
b16430b61e
Tutorial 4: Set similarity to "cosine" in DocStore initialization (#2673)
* Set similarity to cosine in DocStore initialization

* Update Documentation & Code Style

* Set `scale_score` to `False`

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-20 18:47:09 +02:00
Massimiliano Pippi
79b287b568
Extract common code for ES and OS into a base class (#2664)
* extract common code for ES and OS into a base class

* Update Documentation & Code Style

* give the base class a more obvious name

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-20 09:47:44 +02:00
MichelBartels
964e6cdafb
Fix JoinAnswer/JoinNode (#2612)
* fix join nodes

* Update Documentation & Code Style

* fix unused import

* change arg order

* Update Documentation & Code Style

* fix kwargs check

* add warning when there is only one input node

* Update Documentation & Code Style

* fix type hint

* fix wrong import order

* Update Documentation & Code Style

* undo kwargs

* add accidentally deleted newline

* fix type hint

* fix type hint

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-17 16:29:15 +02:00
Sara Zan
a26c042994
Fix typo in code_and_docs.sh (#2662)
* Fix typo in code_and_docs.sh & install ffmpeg in autoformat.yml

* apt update to get ffmpeg

* Update Documentation & Code Style

* Add header and better error message

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 13:50:55 +02:00
Aleksander Smywiński-Pohl
642229255f
Use AutoTokenizer by default, to easily adapt to new models and token… (#1902)
* Use AutoTokenizer by default, to easily adapt to new models and tokenizers

* Add missing AutoTokenizer import

* Apply Black

* Missing import

* Fix DPR tests

* Remove tests on max length

* Update Documentation & Code Style

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 13:13:48 +02:00
Sara Zan
776eba0cd1
Remove pull_request from triggers (#2661)
2022-06-15 10:14:22 +02:00
Sara Zan
584e046642
AnswerToSpeech (#2584)
* Add new audio answer primitives

* Add AnswerToSpeech

* Add dependency group

* Update Documentation & Code Style

* Extract TextToSpeech in a helper class, create DocumentToSpeech and primitives

* Add tests

* Update Documentation & Code Style

* Add ability to compress audio and more tests

* Add audio group to test, all and all-gpu

* fix pylint

* Update Documentation & Code Style

* Accidental git tag

* Try pleasing mypy

* Update Documentation & Code Style

* fix pylint

* Add warning for missing OS library and support in CI

* Try fixing mypy

* Update Documentation & Code Style

* Add docs, simplify args for audio nodes and add tutorials

* Fix mypy

* Fix run_batch

* Feedback on tutorials

* fix mypy and pylint

* Fix mypy again

* Fix mypy yet again

* Fix the ci

* Fix dicts merge and install ffmpeg on CI

* Make the audio nodes import safe

* Trying to increase tolerance in audio test

* Fix import paths

* fix linter

* Update Documentation & Code Style

* Add audio libs in unit tests

* Update _text_to_speech.py

* Update answer_to_speech.py

* Use dedicated dataset & update telemetry

* Remove  and use distilled roberta

* Revert special primitives so that the nodes run in indexing

* Improve tutorials and fix smaller bugs

* Update Documentation & Code Style

* Fix serialization issue

* Update Documentation & Code Style

* Improve tutorial

* Update Documentation & Code Style

* Update _text_to_speech.py

* Minor lg updates

* Minor lg updates to tutorial

* Making indexing work in tutorials

* Update Documentation & Code Style

* Improve docstrings

* Try to use GPU when available

* Update Documentation & Code Style

* Fix mypy and pylint

* Try to pass the device correctly

* Update Documentation & Code Style

* Use type of device

* use .cpu()

* Improve .ipynb

* update apt index to be able to download libsndfile1

* Fix SpeechDocument.from_dict()

* Change pip URL

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-06-15 10:13:18 +02:00
Sara Zan
735ffa635b
[CI refactoring] Tutorials on CI (#2547)
* Experimental Ci workflow for running tutorials

* Run on every push for now

* Not starting?

* Disabling paths temporarily

* Sort tutorials in natural order

* Install ipython

* remove ipython install

* Try running ipython with sudo

* env.pythonLocation

* Skipping tutorial2 and 9 for speed

* typo

* Use one runner per tutorial, for now

* Typo in dependent job

* Missing quotes broke scripts matrix

* Simplify setup for the tutorials, try to prevent containers conflict

* Remove needless job dependencies

* Try prevent cache issues, fix small Tut10 bug

* Missing deps for running notebook tutorials

* Create three groups of tutorials excluding the longest among them

* remove deps

* use proper bash loop

* Try with a single string

* Fix typo in echo

* Forgot do

* Typo

* Try to make the GraphDB tutorial without launching its own container

* Run notebook and script together

* Whitespace

* separate scripts and notebooks execution

* Run notebooks first

* Try caching the GoT data before running the scripts

* add note

* fix mkdir

* Fix path

* Update Documentation & Code Style

* missing -r

* Fix folder numbering

* Run notebooks as well

* Typo in notebook command

* complete path in notebook command

* Try with TIKA_LOG_PATH

* Fix folder naming

* Do not use cached data in Tut9

* extracting the number better

* Small tweaks

* Same fix on Tut10 on the notebook

* Exclude GoT cache for tut5 too

* Remove faiss files after tutorial run

* Layout

* fix remove command

* Fix path in tut10 notebook

* Fix typo in node name in tut14

* Third block was too long, rebalancing

* Reduce GoT dataset even more, why wasting time after all...

* Fix paths in tut10 again

* do git clean to make sure to cleanup everything (breaks post Python)

* Remove ES file with bad permission at the end of the run

* Split first block, takes >30mins

* take out tut15 for a moment, has an actual bug

* typo

* Forgot rm option

* Simply remove all ES files

* Improve logs of GoT reduction

* Exclude also tut16 from cache to try fix bug

* Replace ll with ls

* Reintroduce 15_TableQA

* Small regrouping

* regrouping to make the min num of runners go for about 30mins

* Add cron schedule and PR paths conditions

* Add some timing information

* Separate tutorials by diff and tutorials by cron

* temp add pull_request to tutorials nightly

* Add badge in README to keep track of the nightly tutorials run

* Remove prefixes from data folder names

* Add fetch depth to get diff with master

* Fix paths again

* typo

* Exclude long-running ones

* Typo

* Fix tutorials.yml as well

* Use head_ref

* Using an action for now

* exclude other files

* Use only the correct command to run the tutorial

* Add long running tutorials in separate runners, just for experiment

* Factor out the complex bash script

* Pass the python path to the bash script

* Fix paths

* adding log statement

* Missing dollarsign

* Resetting variable in loop

* using mini GoT dataset and improving bash script

* change dataset name

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 09:53:36 +02:00
James Briggs
2688135481
Pinecone unary queries upgrade (#2657)
* update query and response process for unary query update

* added metadata_config parameter

* Update Documentation & Code Style

Co-authored-by: James Briggs <jamesbriggs@Jamess-MacBook-Pro-2.local>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 09:45:39 +02:00
tstadel
66c7d1a7ee
Add execute_eval_run example to Tutorial 5 (#2459)
* add execute_eval_run.ipynb

* update Tutorial 5

* Update Documentation & Code Style

* change experiment name

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-13 09:19:12 +02:00
Massimiliano Pippi
bb729ab95f
wait for postgres to be ready before data migrations (#2654)
2022-06-10 19:30:57 +02:00
Sara Zan
54518ac790
[CI Refactoring] Refactor Document fixtures in tests (#2577)
* Refactor document fixtures

* Add embedding files

* Update Documentation & Code Style

* Indentation issue

* Update Documentation & Code Style

* Fix type conversion in conftest.py

* Update Documentation & Code Style

* mypy on sql.py

* mypy on crawler.py

* mypy on pinecone.py

* Adapt retriever tests

* Update Documentation & Code Style

* mypy on crawler.py

* Update Documentation & Code Style

* mypy on crawler.py again

* Update Documentation & Code Style

* mypy fix was too rough

* Fix some more tests

* Update Documentation & Code Style

* Skip meaningless test on FilterRetriever

* Make embedding values less specific

* Update Documentation & Code Style

* Use stable IDs in retriever tests that depend on it

* Remove needless fixtures

* docs_with_ids

* Update Documentation & Code Style

* Typo

* Fix retriever tests

* Fix reader tests

* Update Documentation & Code Style

* Workaround #2626

* Update Documentation & Code Style

* Fix label generator tests

* Reorder vectors

* remove print

* Update Documentation & Code Style

* Update Documentation & Code Style

* git tags leftover

* Update Documentation & Code Style

* fix last failing test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-10 18:22:48 +02:00