181 Commits

Author SHA1 Message Date
Vladimir Blagojevic
0cc9ce7522
fix: WebRetriever top_k is ignored in a pipeline (#5106)
* Initial changes

* Add WebSearch, WebRetriever top_k unit tests

* Add exact integration test that failed for Tuana

* PR review
2023-06-09 10:42:37 +02:00
Sebastian
1777b22fcb
fix: Ensure eval mode for farm and transformer models for predictions (#3791)
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-06-06 13:06:30 +02:00
Michael Feil
6ea8ae01a2
feat: Allow setting custom api_base for OpenAI nodes (#5033)
* add changes for api_base

* format retriever

* Update haystack/nodes/retriever/dense.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/nodes/audio/whisper_transcriber.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/preview/components/audio/whisper_remote.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/nodes/answer_generator/openai.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update test_retriever.py

* Update test_whisper_remote.py

* Update test_generator.py

* Update test_retriever.py

* reformat with black

* Update haystack/nodes/prompt/invocation_layer/chatgpt.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Add unit tests

* apply docstring suggestions

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
Co-authored-by: michaelfeil <me@michaelfeil.eu>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
2023-06-05 11:32:06 +02:00
Massimiliano Pippi
929b8d1fb0
ci: run Elasticsearch 8.6 in compatibility mode (#3853)
* bump ES version in CI

disable ssl

wait for service to start

set env vars

do not use choco to install ES

re-enable jobs deps

skip test on windows CI because of OOM

allocate more memory for ES

uniform ES installation and use default heap size

skip tests causing OOM

increase job timeout

restore memory limit for ES8

* Use latest elasticsearch version
2023-05-24 18:53:54 +02:00
Massimiliano Pippi
68924161df
chore: remove deprecated node PDFToTextOCRConverter (#4982)
* remove deprecated node

* remove related test
2023-05-23 16:55:54 +02:00
ZanSara
949b1b63b3
PromptHub integration in PromptNode (#4879)
* initial integration

* upgrade of prompthub

* fix get_prompt_template

* feedback

* add prompthub-py to dependencies

* tests

* mypy

* stray changes

* review feedback

* missing init

* fix test

* move logic in prompttemplate

* linting

* bugfixes

* fix unit tests

* fix cache

* simplify prompttemplate init

* remove unused function

* removing wrong params

* try remove all instances of prompt names

* more tests

* fix agent tests

* more tests

* fix tests

* pylint

* comma

* black

* fix test

* docstring

* review feedback

* review feedback

* fix mocks

* mypy

* fix mocks

* fix reference to missing templates

* feedback

* remove direct references to default template var

* tests

* Update haystack/nodes/prompt/prompt_node.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-05-23 15:22:58 +02:00
Massimiliano Pippi
c6ea542b57
chore: remove BaseKnowledgeGraph (#4953)
* remove BaseKnowledgeGraph

* fix pylint
2023-05-21 10:42:02 +02:00
Massimiliano Pippi
4974bf7ab3
chore: remove deprecated MilvusDocumentStore (#4951)
* remove deprecated MilvusDocumentStore

* remove leftovers

* fix pylint
2023-05-19 16:37:38 +02:00
Vladimir Blagojevic
5d7ee2e5e6
feat: Add max_tokens to BaseGenerator params (#4168)
* Add max_tokens to BaseGenerator params

* Make mypy happy

* Rebase and resolve conflicts

* Fix signature issues

* Update lg

* Add a mocked unit test method

* end-of-file-fixer corrected file

* Convert to unit test

* Mark test as integration

* make the test unit

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-05-18 15:19:29 +02:00
Massimiliano Pippi
3ea784464a
add test case for #4929 (#4936)
2023-05-18 09:12:03 +02:00
bogdankostic
df46e7fadd
fix: Use AutoTokenizer instead of DPR specific tokenizer (#4898)
* Use AutoTokenizer instead of DPR specific tokenizer

* Adapt TableTextRetriever

* Adapt tests

* Adapt tests
2023-05-17 18:54:34 +02:00
Stefano Fiorucci
6e0000732d
feat: add BLIP support in TransformersImageToText (#4912)
* add blip support

* fix typo

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-05-16 10:57:41 +02:00
bogdankostic
5b2ef2afd6
Revert "refactor!: Deprecate name param in PromptTemplate and introduce template_name instead (#4810)" (#4834)
This reverts commit f660f41c0615e6b3064ef3e321f1e5a295fafc1b.
2023-05-08 11:31:04 +02:00
ZanSara
6e982e9283
fix: preserve root_node in JoinNode's output (#4820)
* preserve root_node and add tests

* Added if statement to fix failing tests

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <sjrl423@gmail.com>
2023-05-08 10:17:36 +02:00
bogdankostic
f660f41c06
refactor!: Deprecate name param in PromptTemplate and introduce template_name instead (#4810)
* Deprecate name parameter

* Adapt existing tests and uses of PromptTemplate

* Move parameter `name` to end

* Adapt existing tests

* lg update

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
2023-05-08 10:12:29 +02:00
Pouyan
75ff768c21
Pouyanpi/feat/search engine/providers/google api (#4722)
* feat: implement google api search engine provider

Signed-off-by: Pouyan <prezakhanipr@gmail.com>

---------

Signed-off-by: Pouyan <prezakhanipr@gmail.com>
2023-05-02 17:09:17 +02:00
Mayank Jobanputra
dcf3ddddff
Added deprecation tests for seq2seq generator and RAG Generator (#4782)
2023-05-02 13:30:22 +05:30
Mayank Jobanputra
896eb6a2ea
chore: fixed reader loading test for hf-hub starting 0.14.0 (#4607)
* fixed test base for hub 0.13.3

* check if test succeed from branch

* 2nd check if test succeed from branch

* removed dependency changes

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-05-02 08:22:44 +02:00
Vladimir Blagojevic
dcaf3002f1
fix: SentenceTransformersRanker's predict_batch returns wrong number of documents (#4756)
* Fix SentenceTransformersRanker's predict_batch returning wrong number of documents

* Julian's feedback
2023-04-27 15:24:39 +02:00
Vladimir Blagojevic
aebc22d27e
Upgrade transformers to 4.28.1 (#4665)
* Upgrade to transformers 4.28.1

* Commenting out failing piece of test

* trailing-whitespace

* Adjust regex for error match - it changed between releases

* Remove RAG tests failing with transformers update
2023-04-27 12:55:21 +02:00
ZanSara
1b57b96210
refactor!: extract elasticsearch (#4668)
* extract elasticsearch

* update pyproject.toml

* make more imports optional

* move MockBaseRetriever in conftest

* install es in the es integration tests
2023-04-26 10:14:20 +02:00
Sebastian
8d9136bad4
feat: Implementation of Table Cell Proposal (#4616)
* Starting adding support for TableCell

* Update tests to use row and col

* Added schema test to check to_dict and from_dict works for Table documents. Also updated Doc.__eq__ to work for tables.

* Update eval test to use TableCell

* Added more schema tests for table docs, labels and answers.

* Add boolean to toggle between Span and TableCell

* Add deprecation message

* Test that table answers work as responses in the rest API

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-04-19 13:14:49 +02:00
Sebastian
8c4176bdb2
feat: More flexible routing for RouteDocuments node (#4690)
* Added warning messages for documents that are skipped by RouteDocuments. Began adding support for new option return_remaining and List of List support for metadata value splitting.

* Simplify _split_by_content_type

* Added new unit test and updated _calculate_outgoing_edges

* Added some TODOs and turned assert into raising an error.

* Update logging messages and make new fixture in tests

* Update _split_by_metadata_values to work with return_remaining

* Remove unneeded code

* Documentation

* Add proper support for list of lists

* Fix mypy errors

* Added assert to make mypy happy

* Update haystack/nodes/other/route_documents.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* PR comments

* Remove check for logging level

* make mypy happy

* Update docstring of metadata_values

* Removed duplicate check. Make explicit check for metadata_values

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-04-18 15:18:13 +02:00
Fernando Pereira
5d41e60d89
fix: ParsrConverter list element added (#4562)
* fix: list element and mapping logic around it added to ParsrConverter convert step + unit test covering the specific mapping of list content from Parsr's to Haystack's

* Code review changes

* changed the samples path after conftest changes

* added samples_path to function arg

---------

Co-authored-by: Namoush <fmpereira22@gmail.com>
Co-authored-by: Fernando Pereira <fernando.pereira@criticalsoftware.com>
Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-04-12 18:38:21 +05:30
Ben Heckmann
2d65742443
feat: arbitrary crawler_depth for Crawler class (#4623)
* #3674 implemented iterative crawler depth

* #3674 added two tests for increased crawler depth

* removed old comment
2023-04-11 10:39:17 +02:00
Silvano Cerza
5ac3dffbef
test: Rework conftest (#4614)
* Split root conftest into multiple ones and remove unused fixtures

* Remove some constants and make them fixtures

* Remove unnecessary fixture scoping

* Fix failing whisper tests

* Fix image_file_paths fixture
2023-04-11 10:33:43 +02:00
Silvano Cerza
e85dc79eaa
test: Add pytest fixture to block requests in unit tests (#4433)
* Add pytest fixture to block requests in unit tests

* Mark test correctly as integration

* Fix crawler unit test failing because it tries to install chromedriver
2023-04-06 18:04:57 +02:00
Julian Risch
57415ef8ab
test: Remove duplicate test and edit docstring (#4567)
2023-03-31 12:39:18 +02:00
Stefano Fiorucci
57f87e24a3
refactor: OpenAIAnswerGenerator - avoid tokenizing all documents several times (#4504)
2023-03-29 22:38:27 +02:00
Zoltan Fedor
32091d66cb
Adding filtering support for Weaviate when used for BM25 querying (#4385)
2023-03-29 16:51:22 +02:00
Vladimir Blagojevic
7c9f719496
refactor: Adjust WhisperTranscriber to pipeline run methods (#4510)
* Retrofit WhisperTranscriber run methods
* Add pipeline unit test
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-03-28 13:52:21 +02:00
bogdankostic
ed1837c0c9
feat: Deduplicate duplicate Answers resulting from overlapping Documents in FARMReader (#4470)
* Deduplicate answers resulting from document split overlap

* Add tests

* Fix Pylint

* Adapt existing test

* Incorporate PR feedback
2023-03-27 20:04:59 +02:00
Vladimir Blagojevic
be25655663
feat: Add agent tools (#4437)
* Initial commit, add search_engine

* Add TopPSampler

* Add more TopPSampler unit tests

* Remove SearchEngineSampler (converted to TopPSampler)

* Add some basic WebSearch unit tests

* Rename unit tests

* Add WebRetriever into agent_tools

* Adjust to WebRetriever

* Add WebRetriever mode [snippet|document]

* Minor changes

* SerperDev: add peopleAlsoAsk search results

* First agent for hotpotqa

* Making WebRetriever work on hotpotqa

* refactor: minor WebRetriever improvements (#4377)

* refactor: remove doc ids rebuild + anticipate cache

* refactor: improve caching, fix Document ids

* Minor WebRetriever improvements

* Overlooked minor fixes

* feat: add Bing API as search engine

* refactor: let kwargs pass-through

* feat: increase search context

* check sampler result, improve batch typing

* refactor: increase mypy compliance

* Fix mypy

* Minor example fixes

* Fix the descriptions

* PR feedback updates

* More fixes

* TopPSampler: handle top p None value, add unit test

* Add top_k to WebSearch

* Use boilerpy3 instead of trafilatura

* Remove date finding

* Add more WebRetriever docs

* Refactor long methods

* making the preprocessor optional

* hide WebSearch and make NeuralWebSearch a pipeline

* remove unused imports

* add WebQAPipeline and split example into two

* change example search engine to SerperDev

* Turn off progress bars in WebRetriever's PreProcessor

* Agent tool examples - final updates

* Add webqa test, search results ranking scores

* Better answer box handling for SerperDev and SerpAPI

* Minor fixes

* pylint

* pylint fixes

* extract TopPSampler from WebRetriever

* use sampler only for WebRetriever modes other than snippet

* add web retriever tests

* add web retriever tests

* exclude rdflib@6.3.2 due to license issues

* add test for preprocessed docs and kwargs examples in docstrings

* Move test_webqa_pipeline to test/pipelines

* change docstring for join_documents_and_scores

* Use WebQAPipeline in examples/web_lfqa.py

* Use WebQAPipeline in examples/web_lfqa.py

* Move test_webqa_pipeline to e2e

* Updated lg

* Sampler added automatically in WebQAPipeline, no need to add it

* Updated lg

* Updated lg

* :ignore Update agent tools examples to new templates (#4503)

* Update examples to new templates

* Add print back

* fix linting and black format issues

---------

Co-authored-by: Daniel Bichuetti <daniel.bichuetti@gmail.com>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-03-27 18:14:58 +02:00
Silvano Cerza
5b63c2086e
refactor: Deprecate BaseKnowledgeGraph, GraphDBKnowledgeGraph, InMemoryKnowledgeGraph and Text2SparqlRetriever (#4500)
* Deprecate BaseKnowledgeGraph and InMemoryKnowledgeGraph

* Deprecate GraphDBKnowledgeGraph

* Fix mypy

* Deprecate Text2SparqlRetriever
2023-03-27 15:31:22 +02:00
tstadel
382ca8094e
feat: PromptTemplate extensions (#4378)
* use outputshapers in prompttemplate

* fix pylint

* first iteration on regex

* implement new promptnode syntax based on f-strings

* finish fstring implementation

* add additional tests

* add security tests

* fix mypy

* fix pylint

* fix test_prompt_templates

* fix test_prompt_template_repr

* fix test_prompt_node_with_custom_invocation_layer

* fix test_invalid_template

* more security tests

* fix test_complex_pipeline_with_all_features

* fix agent tests

* refactor get_prompt_template

* fix test_prompt_template_syntax_parser

* fix test_complex_pipeline_with_all_features

* allow functions in comprehensions

* break out of fstring test

* fix additional tests

* mark new tests as unit tests

* fix agents tests

* convert missing templates

* proper use of get_prompt_template

* refactor and add docstrings

* fix tests

* fix pylint

* fix agents test

* fix tests

* refactor globals

* make allowed functions configurable via env variable

* better dummy variable

* fix special alias

* don't replace special char variables

* more special chars, better docstrings

* cherrypick fix audio tests

* fix test

* rework shapers

* fix pylint

* fix tests

* add new templates

* add reference parsing

* add more shaper tests

* add tests for join and to_string

* fix pylint

* fix pylint

* fix pylint for real

* auto fill shaper function params

* fix reference parsing for multiple references

* fix output variable inference

* consolidate qa prompt template output and make shaper work per-document

* fix types after merge

* introduce output_parser

* fix tests

* better docstring

* rename RegexAnswerParser to AnswerParser

* better docstrings
2023-03-27 12:14:11 +02:00
Silvano Cerza
1b5df55dbb
Skip flaky test (#4444)
2023-03-16 16:32:28 +01:00
Silvano Cerza
3591fc02e1
Mark Crawler tests correctly (#4435)
2023-03-16 09:26:19 +01:00
Vladimir Blagojevic
2538b4cbc9
Make promptnode test unit (#4420)
2023-03-15 22:17:23 +01:00
Silvano Cerza
b59cf76093
refactor: Remove AnswerToSpeech and DocumentToSpeech nodes (#4391)
* Remove AnswerToSpeech and DocumentToSpeech nodes

* Remove unused dataclasses

* Remove unnecessary dependencies

* Remove unused error class and imports
2023-03-15 19:31:13 +01:00
Vladimir Blagojevic
f13501309e
OpenAI streaming support (#4397)
2023-03-15 18:24:47 +01:00
Silvano Cerza
b3a659cd4a
test: Fix audio tests failing (#4418)
* Fix audio tests failing

* Disable local whisper tests
2023-03-15 15:26:30 +01:00
Vladimir Blagojevic
98256ecf57
Add Whisper node (#4335)
* Add Whisper node

* Add support for audio path, improve tests

* Add docs

* Improve tests
2023-03-13 16:17:07 +01:00
Daniel Bichuetti
28724e2e25
feat: add automatic OCR detection mechanism and improve performance (#4329)
* feat: add automatic OCR detection mechanism and improve performance

* refactor: add error message

* refactor: ignore pdftoppm bad typing

* refactor: add Tesseract install. docstrings

* fix: check if OCR var. assigned on mp

* tests: add path to windows/linux tests

* tests: add tessdata path

* tests: include matrix ref.

* tests: custom Tesseract matrix install

* refactor: improve user guide

* tests: fix macos path

* tests: remove brew formulae version

* fix: macos paths

* tests: fix macos path

* tests: add Tesseract to Windows Path

* tests: pytesseract path

* tests: macos path

* refactor: fix path message and remove extra path from tests

* refactor: raise exception when path not found

* refactor: expression simplification

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* refactor: check ocr parameter

* tests: mark as integration

* tests: mock deprecation warning

* refactor: simplify code

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* refactor: change deprecation test

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* refactor: add unit patch

* refactor: black formatting

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
2023-03-13 20:19:22 +05:30
ZanSara
fd3f3143d4
feat: LanguageClassifier (#2994)
* add language classifier node

* Fix a few bugs and general code style

* whitespace

* first draft and refactoring

* draft of classes separation

* improve base class

* fix invisible character; add some tests

* fix and more tests

* more docs and tests

* move __init__ to base

* add transformers node; improve tests

* incorporate feedback; little fix to other node

* labels_to_languages mapping

* better docstrings

* use logger instead of logging

---------

Co-authored-by: Stanislav Zamecnik <stanislav.zamecnik@telekom.com>
Co-authored-by: anakin87 <44616784+anakin87@users.noreply.github.com>
Co-authored-by: stazam <zamecnik.stanislav@gmail.com>
2023-03-13 10:30:03 +01:00
Stefano Fiorucci
444a3116c4
docs: TransformersImageToText- inform about supported models, better exception handling (#4310)
* better docs, exception handling and tests

* Update lg

* fix little error

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-03-09 15:35:17 +01:00
Mayank Jobanputra
39a20c37fd
fix: hf-tiny-roberta model loading from disk and mypy errors (#4363)
* Fix mypy failures

* Fix try 1 hf model on windows

* Fix try 2 hf model on windows

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-03-09 18:06:09 +05:30
ZanSara
024332f98f
refactor: simplify registration of PromptModelInvocationLayer (#4339)
* use __init_subclass__ and remove registering functions
2023-03-07 20:53:48 +01:00
Sebastian
7d5e7c089c
refactor: Use TableQuestionAnsweringPipeline from transformers (#4303)
* Added changes from table-qa-pipeline

* Moved classes around to make diff to main look nicer.

* Cleaned things up. Removed option to return_no_answer (not needed), added docs and added integration marks.

* Remove unneeded code

* Added fix for test

* Add check for document_ids in answer

* Prevent passing of empty list to np.mean

* Batching doesn't work with TableQAPipeline b/c of HF issue

* Cleanup of table reader tests, added check for document ids.

* Fixing pylint

* More pylint

* PR comments

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-03-07 11:46:50 +01:00
Vladimir Blagojevic
348e7d2dfe
refactor: Separate PromptModelInvocationLayers in providers.py (#4327)
* Refactor PromptNode, separate PromptModelInvocationLayers in providers.py
2023-03-06 16:34:59 +01:00
Daniel Bichuetti
1548c5ba0f
feat: Add Azure OpenAI embeddings support (#4332)
* feat: add Azure OpenAI as embedding option

* feat: Add Azure OpenAI embeddings support

* refactor: check api key

* refactor: better type checking for Azure

* refactor: enable parallelism + separate and update tests

* refactor: string reformat

* refactor: explicit typing

* refactor: update refs and remove unused code
2023-03-06 13:37:20 +01:00