1524 Commits

Author SHA1 Message Date
ZanSara
024332f98f
refactor: simplify registration of PromptModelInvocationLayer (#4339)
* use __init_subclass__ and remove registering functions
2023-03-07 20:53:48 +01:00
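A minimal sketch of the registration pattern this commit names, assuming nothing about the project's real code: `InvocationLayer`, `LocalLayer`, `RemoteLayer`, `supports`, and `find_layer` are hypothetical names used only for illustration. It shows how `__init_subclass__` can record every subclass when it is defined, so no separate registering function is needed.

```python
from typing import List, Type


class InvocationLayer:
    """Hypothetical base class; every subclass is recorded automatically."""

    # Registry of all subclasses, filled in by __init_subclass__.
    providers: List[Type["InvocationLayer"]] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass definition, so no explicit register() call is needed.
        InvocationLayer.providers.append(cls)

    @classmethod
    def supports(cls, model_name: str) -> bool:
        # The base class claims no model; subclasses override this.
        return False


class LocalLayer(InvocationLayer):
    @classmethod
    def supports(cls, model_name: str) -> bool:
        return model_name.startswith("local/")


class RemoteLayer(InvocationLayer):
    @classmethod
    def supports(cls, model_name: str) -> bool:
        return model_name.startswith("remote/")


def find_layer(model_name: str) -> Type[InvocationLayer]:
    """Return the first registered subclass that claims the model name."""
    for layer in InvocationLayer.providers:
        if layer.supports(model_name):
            return layer
    raise ValueError(f"No invocation layer supports {model_name!r}")
```

Defining a new subclass is enough to make it discoverable; `find_layer("remote/some-model")` would return `RemoteLayer` here without any registration call.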
Sebastian
7d5e7c089c
refactor: Use TableQuestionAnsweringPipeline from transformers (#4303)
* Added changes from table-qa-pipeline

* Moved classes around to make diff to main look nicer.

* Cleaned things up. Removed option to return_no_answer (not needed), added docs and added integration marks.

* Remove unneeded code

* Added fix for test

* Add check for document_ids in answer

* Prevent passing of empty list to np.mean

* Batching doesn't work with TableQAPipeline b/c of HF issue

* Cleanup of table reader tests, added check for document ids.

* Fixing pylint

* More pylint

* PR comments

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-03-07 11:46:50 +01:00
Daniel Bichuetti
af6efbdcb0
refactor: Allow flexible document id generation (#4326) 2023-03-07 07:25:27 +01:00
Zoltan Fedor
4dea9db01e
feat: Report execution time for pipeline components in _debug (#4197)
* Adding execution time to the debug output of pipeline components

* Linting issue fix

* [EMPTY] Re-trigger CI

* fixed test

---------

Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
2023-03-07 04:45:31 +05:30
tstadel
19311119db
fix: EvalResult load migration (#4289)
* fix evalresult load migration

* handle none values correctly

* better None check

* improve logic and add test
2023-03-06 20:05:02 +01:00
ZanSara
c802305ccf
test: move tests on standard pipelines in e2e/ (#4309)
* move out standard pipelines e2e

* fixing unit tests

* add test data

* feedback

* pylint

* black
2023-03-06 17:26:19 +01:00
Vladimir Blagojevic
348e7d2dfe
refactor: Separate PromptModelInvocationLayers in providers.py (#4327)
* Refactor PromptNode, separate PromptModelInvocationLayers in providers.py
2023-03-06 16:34:59 +01:00
Daniel Bichuetti
1548c5ba0f
feat: Add Azure OpenAI embeddings support (#4332)
* feat: add Azure OpenAI as embedding option

* feat: Add Azure OpenAI embeddings support

* refactor: check api key

* refactor: better type checking for Azure

* refactor: enable parallelism + separate and update tests

* refactor: string reformat

* refactor: explicit typing

* refactor: update refs and remove unused code
2023-03-06 13:37:20 +01:00
Sebastian
1a42166978
fix: Prevent going past token limit in OpenAI calls in PromptNode (#4179)
* Refactoring to remove duplicate code when using OpenAI API

* Adding docstrings

* Fix mypy issue

* Moved retry mechanism to openai_request function in openai_utils

* Migrate OpenAI embedding encoder to use the openai_request util function.

* Adding docstrings.

* pylint import errors

* More pylint import errors

* Move construction of headers into openai_request and api_key as input variable.

* Made _openai_text_completion_tokenization_details so it can be reused in PromptNode and OpenAIAnswerGenerator

* Add prompt truncation to the PromptNode.

* Removed commented out test.

* Bump version of tiktoken to 0.2.0 so we can use MODEL_TO_ENCODING to automatically determine correct tokenizer for the requested model

* Change one method back to public

* Fixed bug in token length truncation. Included answer length into truncation amount. Moved truncation higher up to PromptNode level.

* Pylint error

* Improved warning message

* Added _ensure_token_limit for HFLocalInvocationLayer. Had to remove max_length from base PromptModelInvocationLayer to ensure that max_length has a default value.

* Adding tests

* Expanded on doc strings

* Updated tests

* Update docstrings

* Update tests, and go back to how USE_TIKTOKEN was used before.

* Update haystack/nodes/prompt/prompt_node.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/prompt/prompt_node.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/prompt/prompt_node.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/retriever/_openai_encoder.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/utils/openai_utils.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/utils/openai_utils.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updated docstrings, and added integration marks

* Remove comment

* Update test

* Fix test

* Update test

* Updated openai_request function to work with the azure api

* Fixed error in _openai_encoder.py

---------

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2023-03-03 13:49:21 +01:00
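The truncation idea described in this commit (the answer length counts against the token limit) can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `truncate_prompt` and its parameters are hypothetical, and the prompt is assumed to be already split into a list of tokens rather than encoded with a real tokenizer.

```python
from typing import List


def truncate_prompt(tokens: List[str], answer_length: int, max_tokens: int) -> List[str]:
    """Keep the prompt short enough that prompt + answer fits in max_tokens.

    All names here are hypothetical; a real tokenizer would replace the
    plain list of strings used as tokens.
    """
    # Reserve room for the generated answer before measuring the prompt.
    budget = max_tokens - answer_length
    if budget <= 0:
        raise ValueError("answer_length leaves no room for the prompt")
    if len(tokens) <= budget:
        return tokens
    # Drop tokens from the end so the remaining prompt fits the budget.
    return tokens[:budget]
```

With a limit of 5 tokens and 2 reserved for the answer, a 6-token prompt would be cut to its first 3 tokens, while a prompt that already fits is returned unchanged.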
Vladimir Blagojevic
79bf25aaea
feat: Add Azure as OpenAI endpoint (#4170)
* Add Azure as OpenAI endpoint
---------

Co-authored-by: Sebastian Lee <sebastian.lee@deepset.ai>
2023-03-02 09:55:09 +01:00
Daniel Bichuetti
7c49fffc71
feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation (#4226)
* refactor: isolate PDF converters

* refactor: remove xpdf dependency and fix tests

* refactor: add min. version

* feat: enable multiprocessing and add tests

* fix: remove unused imports

* fix: regression when moved code

* refactor: use itertools

* fix: mypy claims

* refactor: double tool support

* refactor: add fallback to xpdf

* refactor: black formatting

* refactor: make superclass signature compatible

* refactor: complete removal of xPdf

* refactor: regroup Haystack imports and fix regression

* refactor: remove original declaration

* docs: fix docstrings

* tests: add [pdf] to [all]

* refactor: remove redundant checks, avoid extra processes

* refactor: add deprecation warning

* refactor: add pytest mark

* tests: change PDF test file

* fix: correct pytest mark

* refactor: deprecate parameter and add new

* tests: change pdf sample

* Add minor lg changes to docstrings

* Fix default value in doc strings

* Update test/nodes/test_file_converter.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* tests: fix page count

* refactor: add imported function

* refactor: change default value

* tests: change parameters and fix typo

* Unify sort_by_position parameter names

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-03-01 22:34:38 +01:00
ZanSara
ae04ce3c6a
test: mock all Summarizer tests and move a few into e2e (#4299)
* stub e2e folders

* simplify pipeline test

* mocking

* unit tests fixed

* clean up e2e

* pipeline tests work

* pylint

* leftover

* small fix from #2994 and additional tests

* review feedback

* change summaries

* black

* revert models and summaries
2023-03-01 17:30:55 +01:00
ZanSara
165a0a5faa
test: mock all Translator tests and move one to e2e (#4290)
* mock all translator tests and move one to e2e

* typo

* extract pipeline tests using translator

* remove duplicate test

* move generator test in e2e

* Update e2e/pipelines/test_extractive_qa.py

* pytest.mark.unit

* black

* remove model name as well

* remove unused fixture

* rename original and improve pipeline tests

* fixes

* pylint
2023-03-01 14:52:05 +01:00
Stefano Fiorucci
e8f9b1b65d
test: replace ElasticsearchDS with InMemoryDS when it makes sense; support scale_score in InMemoryDS (#4283)
* replace elasticds with imds - first draft

* fix

* fix tests and implement scale_score in imds bm25

* add docstrings for scale_score
2023-03-01 11:35:10 +01:00
Malte Pietsch
2a1d73e16d
refactor: Make extraction of "Tool" and "Tool input" for Agent more robust and user-friendly (#4269)
* adjust [] in prompt template. Add error+docs for Tool name.

* fix test

* update error message
2023-02-28 20:01:34 +01:00
Massimiliano Pippi
c3a38a59c0
Update test_prompt_node.py (#4281) 2023-02-28 09:37:40 +01:00
Julian Risch
662441a62b
fix: FARMReader produces Answers with negative start and end position (#4248) 2023-02-28 09:27:42 +01:00
Sebastian
040d806b42
test: Added integration test for using EntityExtractor in query pipeline (#4117)
* Added new test for using EntityExtractor in query node and made some fixtures to reduce code duplication.

* Reuse ner_node fixture

* Added pytest unit markings and swapped over to in memory doc store.

* Change to integration tests
2023-02-28 09:20:44 +01:00
Massimiliano Pippi
4b8d195288
refact: mark unit tests under the test/nodes/** path (#4235)
* document merger

* mark unit tests

* revert
2023-02-27 15:00:19 +01:00
Sebastian
efe46b1214
Fix: Allow torch_dtype="auto" in PromptNode (#4166)
* Fix for allowing torch_dtype="auto"

* Fix to logic of torch_dtype detection

* separate test for dtype
2023-02-27 09:59:27 +01:00
Silvano Cerza
4a93517eb4
test: Fix deprecation fixture (#4219)
* Fix deprecation fixture

* Update docstring

* Update docstring

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-02-27 09:55:03 +01:00
ZanSara
13c4ff1b52
refactor: remove direct logging without a logger (#4253)
* remove direct logging without a logger

* add custom pylint checker

* add test

* pylint

* improve checker message

* mypy

* remove test

* add checker for basicConfig

* more logging missed

* ignore basicConfig

* move out logger

* move out statement

* remove logging configuration
2023-02-23 20:42:42 +01:00
Stefano Fiorucci
5e85f33bd3
refactor: Remove deprecated nodes EvalDocuments and EvalAnswers (#4194)
* remove deprecated classes and update test

* remove deprecated classes and update test

* remove unused code

* remove unused import

* remove empty evaluator node

* unused import :-)

* move sas to metrics
2023-02-23 15:26:17 +01:00
Massimiliano Pippi
722dead1b2
fix agents tests (#4237) 2023-02-23 13:03:45 +01:00
Massimiliano Pippi
764eaa035f
skip summarizer tests to reduce pressure (#4241) 2023-02-23 09:50:24 +01:00
ZanSara
f816efa50c
feat: reduce and focus telemetry (#4087)
* simplified telemetry and docker containers detection

* pylint

* mypy

* mypy

* Add new credentials and metadata

* remove prints

* mypy

* remove comment

* simplify inout len measurement

* black

* removed old telemetry, to revert

* reintroduce env function

* reintroduce old telemetry

* fix telemetry selection

* telemetry for promptnode

* telemetry for some training methods

* telemetry for eval and distillation

* mypy & pylint

* review

* Update lg

* mypy

* improve docstrings

* pylint

* mypy

* fix test

* linting

* remove old tests

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-02-22 19:02:47 +01:00
Daniel Bichuetti
e0b0fe1bc3
feat!: Increase Crawler standardization regarding Pipelines (#4122)
* feat!(Crawler): Integrate Crawler in the Pipeline.

+Output Documents
+Optional file saving
+Optional Document meta about file path

* refactor: add Optional decl.

* chore: dummy commit

* chore: dummy commit

* refactor: improve overwrite flow

* refactor: change custom file path meta logic + add test

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-02-22 17:34:19 +01:00
tstadel
32b2abf9d5
fix: add option to not override results by Shaper (#4231)
* add option to Shaper and support answers

* remove publish restrictions on outputs

* support list
2023-02-22 14:36:58 +01:00
Massimiliano Pippi
262c9771f4
relax test assertion (#4229) 2023-02-22 12:37:09 +01:00
Massimiliano Pippi
40f772a9b0
refact: move the first batch of unit tests into the proper job (#4216)
* move the first batch of unit tests into the proper job

* leftover
2023-02-21 17:00:02 +01:00
Julian Risch
5ce7a404ac
feat: Add Agent (#4148)
* initial Agent implementation

* mypy and pylint fixes

* add missing ABC import

* improved prompt template

* refactor and shorten run method

* refactor and shorten run method

* add tests for extracting

* fix mixed up tool_input/observation & make tests more robust

* fix bug with max_iterations and update prompt template

* allow setting prompt_template in Agent init

* remove example yml for agent

* add final prediction to transcript

* add transcript to errors and accept PromptTemplate in init

* simplify if else to elif

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* add checks for max_iter<2 and empty list returned by prompt node

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-02-21 14:27:40 +01:00
Sebastian
bde01cbf1f
Checking if output keys and output_values are same length and fix bug in storing output keys (#4223) 2023-02-21 13:36:15 +01:00
Sebastian
2bedb80ba5
Fix for custom template in OpenAIAnswerGenerator (#4220) 2023-02-21 13:35:17 +01:00
Bijay Gurung
d4b822646e
feat: Add JsonConverter node (#4130)
* Add JsonConverter node

* Update language

* JsonConverter: Remove id_hash_keys overwrite when it's None

Also, changes in docstring based on review

* Update docstring for JsonConverter

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
Co-authored-by: Sebastian Lee <sebastian.lee@deepset.ai>
2023-02-21 09:23:42 +01:00
bogdankostic
18e7b8399b
refactor: Remove id_hash_keys parameter in from_dict method (#4207)
* Remove id_hash_keys parameter in from_dict method

* Remove unused import

* Adapt `from_dict` of `SpeechDocument`

* Revert "Adapt `from_dict` of `SpeechDocument`"

This reverts commit 309cbeb7fbb3094c43be76d9e431db9391913144.

* Adapt `from_dict` of `SpeechDocument`
2023-02-20 17:37:35 +01:00
tstadel
14578aa54f
feat: add top_k to PromptNode (#4159)
* add top_k to PromptNode

* fix OpenAI

* fix openai test
2023-02-20 14:51:45 +01:00
Sebastian
d129598203
Prompt node/run batch (#4072)
* Starting to implement first pass at run_batch

* Started to add _flatten_input function

* First pass at run_batch method.

* Fixed bug

* Adding tests for run_batch

* Update doc strings

* Pylint and mypy

* Pylint

* Fixing mypy

* Restructuring of run_batch tests

* Add minor lg updates

* Adding more tests

* Update dev comments and call static method differently

* Fixed the setting of output variable

* Set output_variable in __init__ of PromptNode

* Make a one-liner

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-02-20 11:58:13 +01:00
Massimiliano Pippi
83d615a32b
feat: include testing facilities into haystack package (#4182) 2023-02-17 19:38:03 +01:00
bogdankostic
7eeb3e07bf
feat: Add IVF and Product Quantization support for OpenSearchDocumentStore (#3850)
* Add IVF and Product Quantization support for OpenSearchDocumentStore

* Remove unused import statement

* Fix mypy

* Adapt doc strings and error messages to account for PQ

* Adapt validation of indices

* Adapt existing tests

* Fix pylint

* Add tests

* Update lg

* Adapt based on PR review comments

* Fix Pylint

* Adapt based on PR review

* Add request_timeout

* Adapt based on PR review

* Adapt based on PR review

* Adapt tests

* Pin tenacity

* Unpin tenacity

* Adapt based on PR comments

* Add match to tests

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-02-17 10:28:36 +01:00
Daniel Bichuetti
5187cc1801
refactor: Remove the pin from the espnet module and fix the audio node tests. (#4128)
* fix: fix audio tests + unbound some dependencies

* fix: update for Python 3.8

* refactor: change numpy assertion

* feat: add voice recog. support on audio tests

* fix: fix var assignment

* chore: dummy commit

* fix: fix sndfile error

* refactor: change skip reason

* refactor: hardcode variable

* refactor: unpin numpy

* fix: pin numpy only for audio
2023-02-16 22:12:17 +05:30
Massimiliano Pippi
ec72dd73fc
refactor: complete the document stores test refactoring (#4125)
* add e2e tests

* move tests to their own module

* add e2e workflow

* pylint

* remove from job

* fix index field name

* skip test on sql

* removed unused code

* fix embedding tests

* adjust test for pinecone

* adjust assertions to the new documents

* bad copypasta

* test

* fix tests

* fix tests

* fix test

* fix tests

* pylint

* update milvus version

* remove debug

* move graphdb tests under e2e
2023-02-16 09:43:25 +01:00
Sebastian
9a26942952
feat: Add model_kwargs option to PromptNode (#4151)
* Add input option to PromptNode to allow the passing of default kwargs

* Add yaml test for model_kwargs parameter
2023-02-15 18:46:26 +01:00
Stefano Fiorucci
24405f851c
refactor: InMemoryDocumentStore - manage documents without embedding & fix mypy errors (#4113)
* refactoring and test

* try to replace error with warning

* more expressive and robust get_scores methods

* make get_scores methods internal
2023-02-14 17:43:11 +01:00
Sebastian
75ef959678
feat: Update OpenAIAnswerGenerator defaults and with learnings from PromptNode (#4038)
* added instruction_prompt and update defaults

* Change back max_tokens

* Code formatting

* Starting to update instruction_prompt to be a PromptTemplate

* Using PromptTemplate in OpenAIAnswerGenerator

* Removed hardcoded value

* pylint and make examples and examples_context optional prompt parameters

* Added new test for when prompt length goes past max token limit

* Improve doc strings.

* Make "text-davinci-003" the new default model

* Renaming variable to prompt_template and name to question-answering-with-examples

* Reduced repetitive code.

* Added some comments to explain key logic for future debuggers

* Update docs for max_tokens and increase default

* Updating variable name to prompt_template and docs.

* Updated test and handled Answer case where no documents are used.

* Slight update to docs.

* Adding more doc strings

* lg updates

* Blackify

---------

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-02-12 00:08:07 +01:00
Vladimir Blagojevic
d839b9314f
Update PromptTemplate tests (#4131) 2023-02-10 15:24:01 +01:00
bogdankostic
05950719ba
fix: Deduplicate same Documents in isolated evaluation of Reader (#4114)
* Deduplicate same Documents in one MultiLabel

* Add tests

* Update label

* Update label

* Update test

* Update test

* Revert change to check CI

* Revert reversion

* Use deepcopy

* Update tests
2023-02-10 13:55:14 +01:00
Jack Butler
e6b6f70ae2
fix: Fix TableTextRetriever for input consisting of tables only (#4048)
* fix: update kwargs for TriAdaptiveModel

* fix: squeeze batch for TTR inference

* test: add test for ttr + dataframe case

* test: update and reorganise ttr tests

* refactor: make triadaptive model handle shapes

* refactor: remove duplicate reshaping

* refactor: rename test with duplicate name

* fix: add device assignment back to TTR

* fix: remove duplicated vars in test

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-02-09 11:38:16 +01:00
bogdankostic
986472c26f
feat: Add BM25 support for tables in InMemoryDocumentStore (#4090)
* Add BM25 support for tables in InMemoryDocumentStore

* Add table type to query method

* Fix import order

* Adapt tests
2023-02-09 10:47:35 +01:00
Silvano Cerza
274746db07
style: Update black (#4101)
* Update black version

* Format file with new black style

* Update black pre-commit hook version
2023-02-08 15:34:43 +01:00
Sebastian
1bbf10a376
Remove double batching in retrieve_batch (#4014)
* Removed double batching around embed_queries

* Add back tests for retrieve_batch for dpr and embedding retrievers

* Updated table-text-retriever to not double batch

* Fixing pylint

* Update to test

* Remove code breaking test

* Updating dev comment to be clearer
2023-02-08 14:39:20 +01:00