2539 Commits

Author SHA1 Message Date
Stefano Fiorucci
7dcef68685
Handle invalid metadata for SQLDocumentStore (#2868)
* modify notebook

* skip invalid metadata

* Update Documentation & Code Style

* fix nonetype

* fix nonetype

* drop nonetype from valid types

* drop nonetype from valid types

* fix

* Update sql.py

* sqlalchemy validation

* removed newlines

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-25 14:57:21 +02:00
Sara Zan
5119acb260
Raise timeout on integration tests (#2880) 2022-07-25 06:43:20 -04:00
Sara Zan
4e45062a00
Simplify language_modeling.py and tokenization.py (#2703)
* Simplification of language_model.py and tokenization.py to remove code duplication

Co-authored-by: vblagoje <dovlex@gmail.com>
2022-07-22 16:29:30 +02:00
Massimiliano Pippi
8ee2b6b403
Add a custom pydoc renderer for Readme.io (#2825)
* add custom pydoc renderer

* create an example

* revert example code
2022-07-22 10:43:51 +02:00
tstadel
11c46006df
Fix corrupted csv from EvaluationResult.save() (#2854)
* fix corrupted csv if text contains \r chars; make csv serialization configurable

* Update Documentation & Code Style

* incorporate feedback

* Update Documentation & Code Style

* adjust columns to be converted during loading

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-21 16:31:07 +02:00
Stefano Fiorucci
e350781825
Exclude docker from Tutorial 15 (#2861)
* modify notebook

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-21 10:01:25 +02:00
Daniel Bichuetti
3948b997b2
Add support for custom trained PunktTokenizer in PreProcessor (#2783)
* Add support for model folder into BasePreProcessor

* First draft of custom model on PreProcessor

* Update Documentation & Code Style

* Update tests to support custom models

* Update Documentation & Code Style

* Test for wrong models in custom folder

* Default to ISO names on custom model folder

Use long names only when needed

* Update Documentation & Code Style

* Refactoring language names usage

* Update fallback logic

* Check unpickling error

* Updated tests using parametrize

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* Refactored common logic

* Add format control to NLTK load

* Tests improvements

Add a sample for specialized model

* Update Documentation & Code Style

* Minor log text update

* Log model format exception details

* Change pickle protocol version to 4 for 3.7 compat

* Removed unnecessary model folder parameter

Changed logic comparisons

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* Update Documentation & Code Style

* Removed unused import

* Change errors with warnings

* Change to absolute path

* Rename sentence tokenizer method

Co-authored-by: tstadel

* Check document content is a string before process

* Change to log errors and not warnings

* Update Documentation & Code Style

* Improve split sentences method

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* Update Documentation & Code Style

* Empty commit - trigger workflow

* Remove superfluous parameters

Co-authored-by: tstadel

* Explicit None checking

Co-authored-by: tstadel

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-07-21 09:50:45 +02:00
Kristof Herrmann
f51587b4ad
🐛 fix: update deployment status codes (#2713)
* 🐛 fix: update deployment status codes

* Update Documentation & Code Style

* adjust error log

* added tests for failed state

* added valid initial states

* fix

* fix tests

* add test

* updated comments

* uncommented code again

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai>
2022-07-21 09:04:45 +02:00
Stefano Fiorucci
de6b9c3d3e
Remove deprecated method prepare_seq2seq_batch (#2852)
* Remove deprecated method prepare_seq2seq_batch
2022-07-20 16:49:54 +02:00
James Briggs
a4e197c21a
changed mock pinecone to use dict rather than list index (#2845) 2022-07-19 15:28:22 +02:00
kekayan
925eeddf0a
remove unnecessary if else block #2835 (#2842) 2022-07-19 15:25:40 +02:00
Stefano Fiorucci
baf5ef81f7
Validate OpenAI response (#2844)
* openai response check

* Update Documentation & Code Style

* Update haystack/nodes/answer_generator/openai.py

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

* Update Documentation & Code Style

* correct indentation

* add OpenAIError

* raise OpenAIError

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-07-19 11:54:50 +02:00
tstadel
9ad90b2e23
fix healthcheck cmds for annotation tool postgres (#2840) 2022-07-18 18:31:22 +02:00
Sara Zan
48644b23fb
Enable CI on tutorials (#2801)
* enable ci on tutorials

* Disable all path restrictions for safety

* actually comment out the paths block

* remove comment
2022-07-18 17:59:55 +02:00
Massimiliano Pippi
632cd1c141
Allow values that are not dictionaries in the request params in the /search endpoint (#2720)
* let params contain something else than dictionaries

* rewrite the test same style as the main branch
2022-07-15 13:24:29 +02:00
Sara Zan
6b39fbd39c
Mocking Pinecone tests (#2778)
* Integrating the mock into conftest.py

* re-enable workflow

* delete_all

* Update Documentation & Code Style

* remove ValueError

* Add empty response

* wrong condition

* return response

* revert removal of delete_all

* change mock

* Update Documentation & Code Style

* test for rest api, to revert

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-14 20:03:33 +02:00
tstadel
e6d8bcdf9b
Fix gold_contexts_similarity for table retrieval evaluation (#2815)
* fix gold_contexts_similarity for table documents

* check for type of gold_context
2022-07-14 17:59:20 +02:00
Massimiliano Pippi
82df677ebf
API tests (#2738)
* clean up tests and run earlier

* use change detection

* better naming, skip ES

* more cleanup

* fix job name

* dummy commit to trigger the CI

* mock away the PDF converter

* make the test compatible with 3.7

* removed leftover

* always run the api tests, use a matrix for the OS

* refactor all the tests

* remove outdated dependency

* pylint

* new abstract method

* adjust for older python versions

* rename pipeline file

* address PR comments
2022-07-14 15:36:28 +02:00
Branden Chan
0388284d71
Clean OpenAIAnswerGenerator docstrings (#2797)
* Clean OpenAIAnswerGenerator docstrings

* Incorporate reviewer feedback

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-07-14 09:35:30 +02:00
Vladimir Blagojevic
2a7e333d9a
Tutorial 12: add introduction (#2798)
* Tutorial 12: add introduction

* PR review for Tutorial 12: add introduction

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-13 17:44:19 +02:00
Julian Risch
f599ce9458
Change "text" to "content" as dict key (#2800)
* change "text" to "content" as dict key

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-13 16:36:06 +02:00
Sara Zan
d8e7aaeacc
API key check in OpenAIAnswerGenerator (#2791)
* api key check in node and tests

* Clarify skip message

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-12 14:05:47 +02:00
Sara Zan
4d2a06989d
Fix YAML validation for ElasticsearchDocumentStore.custom_query (#2789)
* Add exception for  in the validation code

* Update Documentation & Code Style

* Add tests

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-12 13:49:06 +02:00
Sara Zan
091711b8c4
Fix Tutorials and Tutorials (nightly) (#2737)
* Remove caching and install audio deps

* Fix `Tutorials` as well

* Run all tutorials even though some fail

* Forgot fi

* fix failure condition

* proper bash string equality

* Enable debug logs

* remove audio files

* Update Documentation & Code Style

* Use the setup action in the Tutorial CI as well

* Try with a file that exists

* Update Documentation & Code Style

* Fix the comments in the tutorials

* Update Documentation & Code Style

* Fix tutorials.sh

* Remove debug logging

* import pprint and try editable install

* Update Documentation & Code Style

* extract no run list

* Add tutorial18 to no run list nightly

* import pprint correctly

* Update Documentation & Code Style

* try making site-packages editable

* Make pythonpath editable every time Tut17 is run on CI

* typo

* fix imports in tut5

* add git clean

* Update Documentation & Code Style

* add comments and remove `-e`

* accidentally deleted a line

* Update .github/utils/tutorials.sh

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2022-07-12 11:22:17 +02:00
Sowmiya Jaganathan
4d8f40425b
Passing the meta-data in the summarizer response (#2179)
* Passing all the meta-data in the summarizer

* Disable metadata forwarding if `generate_single_summary` is `True`

* Update Documentation & Code Style

* simplify tests

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 17:28:36 +02:00
Daniel Augustus Bichuetti Silva
1706729e26
Prevent PDFToTextConverter from failing on PDFs with spaces in their names (#2786)
* Change split logic to list

* Fix wrong parameter for run

* Fix mypy error

* Fix layout/raw parameter

* Add test for filename with whitespaces on PDFToText

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 13:30:33 +02:00
Daniel Augustus Bichuetti Silva
77a513fe49
Fix crawler long file names (#2723)
* Changing the name that crawled page is saved to avoid long file names error on some file systems

* Custom naming function for saving crawled files

* Update Documentation & Code Style

* Remove bad characters on file name and prefix

* Add test for naming function

* Update Documentation & Code Style

* Fix expensive regex recalculation and linter warns

* Check for exceptions on file dump

* Remove param_naming variable

* Fix file paths on Windows, Linux and Mac

* Update Documentation & Code Style

* Test using one of the docstrings examples

* Change default naming function

Update docstrings

* Applying formatting rules

* Update Documentation & Code Style

* Fix mypy incompatible assignment error

* Remove unused type declaration

* Fix typo

* Update tests for naming function

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 12:16:32 +02:00
Malte Pietsch
ba08fc86f5
Add node to use OpenAI's GPT-3 for QA (#2605)
* first draft of openai node for QA

* Update Documentation & Code Style

* fix mypy. add node to inits

* Update Documentation & Code Style

* fix linter

* Adapt OpenAIGenerator to completions endpoint

* Update Documentation & Code Style

* Fix pylint

* Fix doc strings

* Make use of temperature

* Make use of api key in tests

* Adapt doc strings

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-07-08 13:59:27 +02:00
Agnieszka Marzec
425da1fd31
Fix load_from_yaml example in the Pipelines tutorial (#2774)
* Fix load from yaml example and image

* Update Documentation & Code Style

* Fixed pipeline example

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-08 11:22:11 +02:00
James Briggs
ea40387b97
added mock pinecone client (#2770) 2022-07-07 19:51:30 +02:00
tstadel
d21b066fc7
fix pipeline run loop on joined pipelines without debug flag (#2777)
* fix pipeline run loop on joined pipelines without debug flag

* use .keys() consistently
2022-07-07 16:47:59 +02:00
bogdankostic
195aed942f
Add update_document_meta to InMemoryDocumentStore (#2689)
* Add update_document_meta to InMemoryDocumentStore

* Fix typo

* Update Documentation & Code Style

* Add update_document_meta to BaseDocumentStore

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

* Add update_document_meta to MockDocumentStore

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-07 15:44:07 +02:00
tstadel
45136badfe
Fix _debug info getting lost for previous nodes when using join nodes (#2776)
* fix debug output for pipelines with join nodes

* add test

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-07 15:10:13 +02:00
Vladimir Blagojevic
a766b70a8f
Tutorial 18: Open in Colab doesn't work in Firefox (#2767)
* Tutorial 18: Open in Colab doesn't work in Firefox

* Tutorial 18: Open in Colab doesn't work in Firefox v2

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-06 10:51:09 -04:00
Tuana Celik
917afb1530
Trying out some smaller images for docs (#2772) 2022-07-06 16:11:23 +02:00
tstadel
e9219f4dc2
Fix confusing elasticsearch exception (#2763)
* convert confusing exception to warning and add no docs case.

* blacken

* fix test
2022-07-06 15:40:51 +02:00
Vladimir Blagojevic
a2905d05f7
Bump version to next release candidate (#2765)
* Bump version to next release candidate

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-06 11:26:42 +02:00
Vladimir Blagojevic
c80336c424
Upgrade to v1.6.0 and copy docs folder (#2764)
* Upgrade to v1.6.0 and copy docs folder

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
v1.6.0
2022-07-06 10:25:15 +02:00
tstadel
2a7c0139f5
double max heap size for elasticsearch in CI (#2756) 2022-07-05 13:53:32 +02:00
bogdankostic
353da8b1c1
Add Tutorials 16, 17 and 18 to README (#2758) 2022-07-05 12:04:58 +02:00
Julian Risch
f70f4e90fd
correct docstring parameter name (#2757) 2022-07-05 12:00:40 +02:00
Patrick Deutschmann
1db3fd0942
Add support for Multi-Hop Dense Retrieval (#2571)
* Implement MDR

* Adapt conftest to new MDR signature

* Update Documentation & Code Style

* Change signature of queries param in batch methods of MDR like in #2575

* Update Documentation & Code Style

* Rename MultihopDenseRetriever to MultihopEmbeddingRetriever

* Fix filters in retrieve_batch

* Add docstring for MultihopEmbeddingRetriever.__init__

* Update Documentation & Code Style

* Revert forward signature of TextSimilarityHead

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-05 11:31:11 +02:00
bogdankostic
dc48c444d4
Fix loading of tokenizers in DPR (#2755) 2022-07-04 18:18:14 +02:00
Tuana Celik
2a8b129bae
first version of save_to_remote for HF from FarmReader (#2618)
* first version of save_to_remote for HF from FarmReader

* Update Documentation & Code Style

* Changes based on comments

* Update Documentation & Code Style

* imports order

* making small changes to pydoc

* indent fix

* Update Documentation & Code Style

* keyword arguments instead of positional

* Changing to repo_id

huggingface-hub package would have to be v0.5 or higher - checking how to handle with Thomas

* Update Documentation & Code Style

* adding huggingface-hub dependency 0.5 or above

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
2022-07-04 15:39:56 +02:00
Julian Risch
f7d00476f9
Reduce logging messages and simplify logging (#2682)
* change log levels to debug and use torch.div

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 14:02:55 +02:00
tstadel
322d964679
Remove rapidfuzz version pin (#2730)
* remove rapidfuzz version pin

* exclude malicious version 2.0.14

* update rapidfuzz version restrictions
2022-07-04 13:53:39 +02:00
Francesco Castelli
31dcd55c24
Validate max_seq_length in SquadProcessor (#2740)
* added max_seq_len validation in SquadProcessor

* fixed string formatting

* added tests for invalid max_seq_len

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 13:35:45 +02:00
Vladimir Blagojevic
ffb7e4e4bd
GPL tutorial - add GPU header and open in colab button (#2736)
* GPL tutorial - add GPU header and open in colab button

* Add GPL tutorial to run exclusion list
2022-07-04 05:23:39 -04:00
Julian Risch
1c1faa4742
Make check of document & embedding count optional in FAISS and Pinecone (#2677)
* make validation optional & add method call in pinecone init

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-04 10:12:31 +02:00
Julian Risch
1781e88802
Upgrade torch to 1.12 (#2741)
* Upgrade torch to 1.12

* upgrade torch-scatter

* add explicit torch-scatter installation

* set torch dependency to range >1.9,<1.13
2022-07-01 20:23:32 +02:00