3803 Commits

Author SHA1 Message Date
Stefano Fiorucci
7d17ca7391
add DocumentLanguageClassifier API (#4401) 2023-03-14 09:12:03 +01:00
Vladimir Blagojevic
98256ecf57
Add Whisper node (#4335)
* Add Whisper node

* Add support for audio path, improve tests

* Add docs

* Improve tests
2023-03-13 16:17:07 +01:00
Daniel Bichuetti
28724e2e25
feat: add automatic OCR detection mechanism and improve performance (#4329)
* feat: add automatic OCR detection mechanism and improve performance

* refactor: add error message

* refactor: ignore pdftoppm bad typing

* refactor: add Tesseract install. docstrings

* fix: check if OCR var. assigned on mp

* tests: add path to windows/linux tests

* tests: add tessdata path

* tests: include matrix ref.

* tests: custom Tesseract matrix install

* refactor: improve user guide

* tests: fix macos path

* tests: remove brew formulae version

* fix: macos paths

* tests: fix macos path

* tests: add Tesseract to Windows Path

* tests: pytesseract path

* tests: macos path

* refactor: fix path message and remove extra path from tests

* refactor: raise exception when path not found

* refactor: expression simplification

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* refactor: check ocr parameter

* tests: mark as integration

* tests: mock deprecation warning

* refactor: simplify code

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* refactor: change deprecation test

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* refactor: add unit patch

* refactor: black formatting

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
2023-03-13 20:19:22 +05:30
ZanSara
fd3f3143d4
feat: LanguageClassifier (#2994)
* add language classifier node

* Fix a few bugs and general code style

* whitespace

* first draft and refactoring

* draft of classes separation

* improve base class

* fix invisible character; add some tests

* fix and more tests

* more docs and tests

* move __init__ to base

* add transformers node; improve tests

* incorporate feedback; little fix to other node

* labels_to_languages mapping

* better docstrings

* use logger instead of logging

---------

Co-authored-by: Stanislav Zamecnik <stanislav.zamecnik@telekom.com>
Co-authored-by: anakin87 <44616784+anakin87@users.noreply.github.com>
Co-authored-by: stazam <zamecnik.stanislav@gmail.com>
2023-03-13 10:30:03 +01:00
Mahipal Singh Rathore
405aee0cfa
Update table.py (#4376)
Answer should be checked if it is not none before adding id to it
2023-03-13 10:27:59 +01:00
ZanSara
8ea7ba3a94
proposal: drop BaseComponent and re-implement Pipeline (#4284)
* draft proposal

* pr number

* reminder for an agent pipeline example

* proposal number

* add real query pipeline

* add paragraph on validation

* wording

* add_store

* decorator

* add rollout process and parameter's hierarchy examples

* rename project into application

* feedback from the meeting

* defer evaluation to another proposal

* smaller changes

* remove applications for now

* u-turn on pipeline.connect()

* typo

* connect_from/to

* update with Malte's feedback
2023-03-13 10:05:59 +01:00
Vladimir Blagojevic
95a48c6c9d
refactor: Simplify agent and tool interaction (#4362)
* Simplify agent and tool interaction
2023-03-10 18:07:44 +01:00
Stefano Fiorucci
444a3116c4
docs: TransformersImageToText- inform about supported models, better exception handling (#4310)
* better docs, exception handling and tests

* Update lg

* fix little error

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-03-09 15:35:17 +01:00
Mayank Jobanputra
39a20c37fd
fix: hf-tiny-roberta model loading from disk and mypy errors (#4363)
* Fix mypy failures

* Fix try 1 hf model on windows

* Fix try 2 hf model on windows

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-03-09 18:06:09 +05:30
Vítor Bernardes
95851b82fb
fix: Fix print_answers for output of query run_batch (#4273)
* fix: Fix `print_answers` for output of query `run_batch` (#4255)

* fix: print "Answers" label even with no query list

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* test: add unit tests for `print_answers` on `run`, `run_batch` output (#4255)

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-03-09 12:10:50 +01:00
bogdankostic
e3503a92c9
build: Use uvicorn instead of gunicorn as server in REST API's Dockerfile (#4304)
* Use uvicorn instead of gunicorn as server

* Added comments and changed service names

* comments improved

---------

Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
2023-03-09 01:46:07 +05:30
Stefano Fiorucci
f90ffb6851
increase MetaDocumentORM value length (#4333) 2023-03-08 03:15:27 +05:30
Bilge Yücel
9198d5ec42
chore: add topic:promptnode label (#4347) 2023-03-07 21:23:40 +01:00
ZanSara
024332f98f
refactor: simplify registration of PromptModelInvocationLayer (#4339)
* use __init_subclass__ and remove registering functions
2023-03-07 20:53:48 +01:00
Sebastian
7d5e7c089c
refactor: Use TableQuestionAnsweringPipeline from transformers (#4303)
* Added changes from table-qa-pipeline

* Moved classes around to make diff to main look nicer.

* Cleaned things up. Removed option to return_no_answer (not needed), added docs and added integration marks.

* Remove unneeded code

* Added fix for test

* Add check for document_ids in answer

* Prevent passing of empty list to np.mean

* Batching doesn't work with TableQAPipeline b/c of HF issue

* Cleanup of table reader tests, added check for document ids.

* Fixing pylint

* More pylint

* PR comments

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-03-07 11:46:50 +01:00
tstadel
d096f03230
proposal: Shapers in Prompt Templates (#4172)
* add proposal

* Update 0000-shaper-in-prompt-template.md

* rename proposal file

* update proposal according to feedback

* add clarification about the number of prompts generated

* add section about parsing logic

* Revert "add section about parsing logic"

This reverts commit 904713558706206637eefe1579420d89663f58b8.

* add section about parsing logic

* fix typo

* improved the detailed design section

* fix code section

* chore formatting

* chore formatting

* updated adoption strategy

* final typo and expression changes
2023-03-07 09:52:18 +01:00
Tuana Çelik
8cd8ff6cbb
Update README.md (#4340) 2023-03-07 08:34:21 +01:00
Daniel Bichuetti
af6efbdcb0
refactor: Allow flexible document id generation (#4326) 2023-03-07 07:25:27 +01:00
Zoltan Fedor
4dea9db01e
feat: Report execution time for pipeline components in _debug (#4197)
* Adding execution time to the debug output of pipeline components

* Linting issue fix

* [EMPTY] Re-trigger CI

* fixed test

---------

Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
2023-03-07 04:45:31 +05:30
tstadel
19311119db
fix: EvalResult load migration (#4289)
* fix evalresult load migration

* handle none values correctly

* better None check

* improve logic and add test
2023-03-06 20:05:02 +01:00
Silvano Cerza
9253990bdf
Add workflow to push CI metrics to Datadog (#4336) 2023-03-06 18:02:24 +01:00
ZanSara
c802305ccf
test: move tests on standard pipelines in e2e/ (#4309)
* move out standard pipelines e2e

* fixing unit tests

* add test data

* feedback

* pylint

* black
2023-03-06 17:26:19 +01:00
Vladimir Blagojevic
348e7d2dfe
refactor: Separate PromptModelInvocationLayers in providers.py (#4327)
* Refactor PromptNode, separate PromptModelInvocationLayers in providers.py
2023-03-06 16:34:59 +01:00
Daniel Bichuetti
1548c5ba0f
feat: Add Azure OpenAI embeddings support (#4332)
* feat: add Azure OpenAI as embedding option

* feat: Add Azure OpenAI embeddings support

* refactor: check api key

* refactor: better type checking for Azure

* refactor: enable parallelism + separate and update tests

* refactor: string reformat

* refactor: explicit typing

* refactor: update refs and remove unused code
2023-03-06 13:37:20 +01:00
Daniel Bichuetti
c7dddfeaea
chore: add intelijus (#4330) 2023-03-06 13:12:04 +01:00
Sebastian
1a42166978
fix: Prevent going past token limit in OpenAI calls in PromptNode (#4179)
* Refactoring to remove duplicate code when using OpenAI API

* Adding docstrings

* Fix mypy issue

* Moved retry mechanism to openai_request function in openai_utils

* Migrate OpenAI embedding encoder to use the openai_request util function.

* Adding docstrings.

* pylint import errors

* More pylint import errors

* Move construction of headers into openai_request and api_key as input variable.

* Made _openai_text_completion_tokenization_details so it can be reused in PromptNode and OpenAIAnswerGenerator

* Add prompt truncation to the PromptNode.

* Removed commented out test.

* Bump version of tiktoken to 0.2.0 so we can use MODEL_TO_ENCODING to automatically determine correct tokenizer for the requested model

* Change one method back to public

* Fixed bug in token length truncation. Included answer length into truncation amount. Moved truncation higher up to PromptNode level.

* Pylint error

* Improved warning message

* Added _ensure_token_limit for HFLocalInvocationLayer. Had to remove max_length from base PromptModelInvocationLayer to ensure that max_length has a default value.

* Adding tests

* Expanded on doc strings

* Updated tests

* Update docstrings

* Update tests, and go back to how USE_TIKTOKEN was used before.

* Update haystack/nodes/prompt/prompt_node.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/prompt/prompt_node.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/prompt/prompt_node.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/retriever/_openai_encoder.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/utils/openai_utils.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/utils/openai_utils.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Updated docstrings, and added integration marks

* Remove comment

* Update test

* Fix test

* Update test

* Updated openai_request function to work with the azure api

* Fixed error in _openai_encoder.py

---------

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2023-03-03 13:49:21 +01:00
Silvano Cerza
18e83b3ed4
Pin requests-cache test dependency to <1.0.0 (#4325) 2023-03-03 12:47:15 +01:00
bogdankostic
f33829fabf
Remove xpdf dependencies (#4314) 2023-03-02 11:12:03 +01:00
Vladimir Blagojevic
79bf25aaea
feat: Add Azure as OpenAI endpoint (#4170)
* Add Azure as OpenAI endpoint
---------

Co-authored-by: Sebastian Lee <sebastian.lee@deepset.ai>
2023-03-02 09:55:09 +01:00
Daniel Bichuetti
7c49fffc71
feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation (#4226)
* refactor: isolate PDF converters

* refactor: remove xpdf dependency and fix tests

* refactor: add min. version

* feat: enable multiprocessing and add tests

* fix: remove unused imports

* fix: regression when moved code

* refactor: use itertools

* fix: mypy claims

* refactor: double tool support

* refactor: add fallback to xpdf

* refactor: black formatting

* refactor: make superclass signature compatible

* refactor: complete removal of xPdf

* refactor: regroup Haystack imports and fix regression

* refactor: remove original declaration

* docs: fix docstrings

* tests: add [pdf] to [all]

* refactor: remove redundant checks, avoid extra processes

* refactor: add deprecation warning

* refactor: add pytest mark

* tests: change PDF test file

* fix: correct pytest mark

* refactor: deprecate parameter and add new

* tests: change pdf sample

* Add minor lg changes to docstrings

* Fix default value in doc strings

* Update test/nodes/test_file_converter.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* tests: fix page count

* refactor: add imported function

* refactor: change default value

* tests: change parameters and fix typo

* Unify sort_by_position parameter names

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-03-01 22:34:38 +01:00
Silvano Cerza
90da7bf4f8
Fix docstring-labeler.yml workflow (#4307) 2023-03-01 17:49:04 +01:00
ZanSara
ae04ce3c6a
test: mock all Summarizer tests and move a few into e2e (#4299)
* stub e2e folders

* simplify pipeline test

* mocking

* unit tests fixed

* clean up e2e

* pipeline tests work

* pylint

* leftover

* small fix from #2994 and additional tests

* review feedback

* change summaries

* black

* revert models and summaries
2023-03-01 17:30:55 +01:00
bogdankostic
583d2d8244
Fix search path for Shaper API docs (#4306) 2023-03-01 16:10:39 +01:00
ZanSara
165a0a5faa
test: mock all Translator tests and move one to e2e (#4290)
* mock all translator tests and move one to e2e

* typo

* extract pipeline tests using translator

* remove duplicate test

* move generator test in e2e

* Update e2e/pipelines/test_extractive_qa.py

* pytest.mark.unit

* black

* remove model name as well

* remove unused fixture

* rename original and improve pipeline tests

* fixes

* pylint
2023-03-01 14:52:05 +01:00
Agnieszka Marzec
7e0f9715ba
Docs: Add shaper API (#4288)
* Add shaper and update category id

* Fix the category id

* Update category
2023-03-01 14:02:47 +01:00
Stefano Fiorucci
e8f9b1b65d
test: replace ElasticsearchDS with InMemoryDS when it makes sense; support scale_score in InMemoryDS (#4283)
* replace elasticds with imds - first draft

* fix

* fix tests and implement scale_score in imds bm25

* add docstrings for scale_score
2023-03-01 11:35:10 +01:00
Silvano Cerza
ee74421212
ci: Refactor docs config and generation (#4280)
* Change docs yml category config

* Update docs renderers to fetch categories from Readme.io

* Update readme_sync.yml to handle new docs rendering

* Remove unnecessary script and related workflow step

* Fix sys.exits
2023-03-01 09:51:02 +01:00
Silvano Cerza
6e241262ad
ci: Change docker_release.yml workflow to run after successful PyPi release (#4293)
* Change docker_release.yml workflow to run after successful PyPi release

* Add warning on name change in pypi_release.yml
2023-03-01 09:50:47 +01:00
tstadel
d1c9407a25
fix opensearch delete_index (#4295) 2023-03-01 08:40:38 +01:00
Malte Pietsch
2a1d73e16d
refactor: Make extraction of "Tool" and "Tool input" for Agent more robust and user-friendly (#4269)
* adjust [] in prompt template. Add error+docs for Tool name.

* fix test

* update error message
2023-02-28 20:01:34 +01:00
Massimiliano Pippi
c3a38a59c0
Update test_prompt_node.py (#4281) 2023-02-28 09:37:40 +01:00
Julian Risch
662441a62b
fix: FARMReader produces Answers with negative start and end position (#4248) 2023-02-28 09:27:42 +01:00
Sebastian
040d806b42
test: Added integration test for using EntityExtractor in query pipeline (#4117)
* Added new test for using EntityExtractor in query node and made some fixtures to reduce code duplication.

* Reuse ner_node fixture

* Added pytest unit markings and swapped over to in memory doc store.

* Change to integration tests
2023-02-28 09:20:44 +01:00
Silvano Cerza
5678bb6375
Parallelize Docker build job (#4268) 2023-02-27 16:03:24 +01:00
Massimiliano Pippi
4b8d195288
refact: mark unit tests under the test/nodes/** path (#4235)
* document merger

* mark unit tests

* revert
2023-02-27 15:00:19 +01:00
Sebastian
efe46b1214
Fix: Allow torch_dtype="auto" in PromptNode (#4166)
* Fix for allowing torch_dtype="auto"

* Fix to logic of torch_dtype detection

* separate test for dtype
2023-02-27 09:59:27 +01:00
Silvano Cerza
4a93517eb4
test: Fix deprecation fixture (#4219)
* Fix deprecation fixture

* Update docstring

* Update docstring

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-02-27 09:55:03 +01:00
Kshitij Pawar
3d3e9c9b32
Fix: Issue of failure to initialize input_converter in Seq2SeqGenerator when model_file_path is given as folder path on local disk after manual model download (#4213)
* test

* test documentation commit:

* added original return statement for linting

* removed empty lines

* formatted code using black

* made changes based on suggestions
2023-02-26 18:13:26 +01:00
Silvano Cerza
2c9e4c5ff9
Remove unnecessary operations in minor_version_release.yml (#4267) 2023-02-24 14:29:42 +01:00
Silvano Cerza
280414e5c6
Fix OpenAPI specs upload (#4266) 2023-02-24 10:50:59 +01:00