1524 Commits

Author SHA1 Message Date
Vladimir Blagojevic
6e86f4e26a
Update embedding integration tests (#6823) 2024-01-24 15:22:47 +01:00
Vladimir Blagojevic
c47b82c54f
Remove pipeline_utils package and dependent code (#6806) 2024-01-23 18:40:43 +01:00
Ashwin Mathur
a238c6dd51
feat: Add Exact Match metric (#6696)
* Add exact match metric

* Add release notes

* Cleanup comments in test_eval_exact_match.py

* Create separate preprocessing function; Add output_key parameter

* Update release note

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2024-01-22 09:57:04 +01:00
Silvano Cerza
d4f6531c52
feat: Refactor Pipeline.run() (#6729)
* First rough implementation of refactored run

* Further improve run logic

* Properly handle variadic input in run

* Further work

* Enhance names and add more documentation

* Fix issue with output distribution

* This works

* Enhance run comments

* Mark Multiplexer as greedy

* Remove MergeLoop in favour of Multiplexer in tests

* Remove FirstIntSelector in favour of Multiplexer

* Handle corner when waiting for input is stuck

* Remove unused import

* Handle mutable input data in run and misbehaving components

* Handle run input validation

* Test validation

* Fix pylint

* Fix mypy

* Call warm_up in run to fix tests
2024-01-18 17:53:47 +01:00
Vladimir Blagojevic
0b177b3bc6
feat: Improve OpenAPIServiceConnector service response serialization (#6772)
* Better service response json -> str serialization

* Add unit test
2024-01-18 16:49:48 +01:00
Vladimir Blagojevic
fea1428e84
feat: Add HuggingFaceLocalChatGenerator (#6751) 2024-01-18 15:53:12 +01:00
Madeesh Kannan
5d66d040cc
feat: Add serde methods to HTMLToDocument (#6758) 2024-01-18 10:02:01 +01:00
Sebastian Husch Lee
c0b67432e4
feat: Add page breaks to default PDF to Document converter (#6755)
* Speedup tests for PyPDFToDocument

* Added unit test and removed skipping of empty pages

* add release note

* Add back some integration marks
2024-01-18 08:54:59 +01:00
sahusiddharth
a7ac4edd07
feat: added split by page to DocumentSplitter (#6753)
* feat-added-split-by-page-to-DocumentSplitter

* added test case and the suggested changes

* Update document_splitter.py

* Update haystack/components/preprocessors/document_splitter.py

* Update test_document_splitter.py

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-01-17 15:36:29 +01:00
Madeesh Kannan
7376838922
feat!: Framework-agnostic device management (#6748)
* feat: Framework-agnostic device management

* Add release note

* Linting

* Fix test

* Add `first_device` property, expand release notes, validate `ComponentDevice` state
2024-01-17 10:41:34 +01:00
ZanSara
b8b8b5d5c6
feat!: rename model_name_or_path to model in NamedEntityExtractor (#6744)
* rename model_name_or_path to simply model

* fix tests

* reno
2024-01-16 15:32:48 +01:00
Sebastian Husch Lee
20f04f6054
feat: MetaFieldRanker update (#6742)
* Add weight and ranking_mode as params to run for easier experimentation

* renaming of metadata to meta

* Use logger.warning instead of warnings

* Add another unit test

* Add support for sort_order and fix formatting of error messages

* Make MetaFieldRanker more robust. Doesn't crash pipeline if some Documents are missing keys.

* Don't print same warning message twice

* Add another test

* Making MetaFieldRanker more robust

* Move up if return statement to earlier in the function

* Setting up infer_type

* Remove infer_type for now

* Release notes

* Add init file

* Update releasenotes/notes/metafieldranker_sort-order_refactor-2000d89dc40dc15a.yaml

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-01-16 08:52:58 +01:00
Vladimir Blagojevic
8cafff0645
refactor: Extract HF stop words handling in hf_utils.py (#6745)
* Move StopWordsCriteria to hf_utils.py

* Raise ValueError for invalid StopWordsCriteria tokenizer

* StopWordsCriteria, make sure padding token exists

* Use proper torch types

* Update unit tests
2024-01-15 17:42:29 +01:00
ZanSara
96c0b59aaa
feat!: Rename model_name_or_path to model in ExtractiveReader (#6736)
* rename model parameter and internal model attribute in ExtractiveReader

* fix tests for ExtractiveReader

* fix e2e

* reno

* another fix

* review feedback

* Update releasenotes/notes/rename-model-param-reader-b8cbb0d638e3b8c2.yaml
2024-01-15 14:48:33 +01:00
Stefano Fiorucci
8eba053dbc
fix pipeline test (#6741) 2024-01-15 13:59:11 +01:00
Madeesh Kannan
a5189dd035
fix!: InMemoryBM25Retriever no longer returns documents that have a score of 0.0 (#6717)
* fix!: `InMemoryBM25Retriever` no longer returns documents that have a score of 0.0

Also update tests to accommodate the new behavior.

* Remove superfluous code
2024-01-12 17:50:55 +01:00
Madeesh Kannan
4647f2a506
fix: ComponentMeta.__call__ handles keyword- and positional-only parameters correctly (#6701)
* fix: `ComponentMeta.__call__` handles keyword- and positional-only parameters correctly

* Update release note
2024-01-12 17:16:03 +01:00
ZanSara
0616197b44
feat!: Rename model_name_or_path to model in TransformersSimilarityRanker (#6734)
* rename model parameter in transformers ranker

* fix tests for transformers ranker

* reno

* reno

* typo
2024-01-12 17:09:12 +01:00
ZanSara
288ed150c9
feat!: Rename model_name or model_name_or_path to model in all Embedder classes (#6733)
* rename model parameter in the openai doc embedder

* fix tests for openai doc embedder

* rename model parameter in the openai text embedder

* fix tests for openai text embedder

* rename model parameter in the st doc embedder

* fix tests for st doc embedder

* rename model parameter in the st backend

* fix tests for st backend

* rename model parameter in the st text embedder

* fix tests for st text embedder

* fix docstring

* fix pipeline utils

* fix e2e

* reno

* fix the indexing pipeline _create_embedder function

* fix e2e eval rag pipeline

* pytest
2024-01-12 15:30:17 +01:00
ZanSara
ce7abc9bde
feat!: Rename model_name or model_name_or_path to model in all Transcriber classes (#6731)
* rename model parameter in local transcriber

* fix tests for local transcriber

* rename model parameter in remote transcriber

* fix tests for remote transcriber

* reno

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-01-12 14:40:30 +01:00
Stefano Fiorucci
24c71bd221
rename model_name_or_path to model in test (#6732) 2024-01-12 13:56:14 +01:00
sahusiddharth
dbdeb8259e
feat: rename model_name or model_name_or_path to model in generators (#6715)
* renamed model_name or model_name_or_path to model

* added release notes

* Update releasenotes/notes/renamed-model_name-or-model_name_or_path-to-model-184490cbb66c4d7c.yaml

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2024-01-12 12:58:01 +01:00
Stefano Fiorucci
80c3e6825a
fix: serialize/deserialize torch dtype in the components that need it (#6713)
* first draft for ranker

* same for the reader

* consider also bnb_4bit_compute_dtype

* dtype serialization in hugging_face_local_generator

* add release note

* address dtype defined in huggingface_pipeline_kwargs

* test quantization options in reader

* fix

* serialize quantization_config

* test quantization_config serialization

* address feedback

* fix typo
2024-01-12 12:22:45 +01:00
ZanSara
60780ce897
feat: Tweak CacheChecker output type (#6719)
* specify cache checker output type

* (de)serialization

* tests

* add default value for type

* reno

* mypy

* feedback

* reduce diff

* reduce diff

* reno
2024-01-11 12:33:26 +01:00
Massimiliano Pippi
e1ec4e5e4d
refact!: Remove symbols under the haystack.document_stores namespace (#6714)
* remove symbols under the haystack.document_stores namespace

* Update haystack/document_stores/types/protocol.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* fix

* same for retrievers

* leftovers

* more leftovers

* add relnote

* leftovers

* one more

* fix examples

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2024-01-10 21:20:42 +01:00
Ashwin Mathur
374a937663
feat: Add calculate_metrics and MetricsResult (#6680)
* Add calculate_metrics, MetricsResult, Exact Match

* Add additional tests for metric calculation

* Add release notes

* Add docstring for Exact Match metric

* Remove Exact Match Implementation

* Update release notes

* Remove unnecessary metrics implementation

* Simplify logic to run supported metrics

* Add some evaluation tests

* Fix linting

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-01-10 10:26:44 +01:00
Madeesh Kannan
e6d6ce1c73
feat: Add NamedEntityExtractor component (#6689)
* feat: Add `NamedEntityExtractor` component

This component accepts a list of `Document`s which it annotates with named entities. The annotations are stored in the `meta` dictionary of each `Document` under a specific key.

The component currently supports two backends for the annotation models: Hugging Face `transformers` and spaCy.

* Address comments

* Expand release note

* Add the `[torch]` extra package specifier to the lazy import

* Remove dead code

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2024-01-09 17:56:20 +01:00
ZanSara
abd16ab796
feat: support single metadata dictionary in MarkdownToDocument (#6629)
* support single metadata dict in markdown2document

* reno

* unwrap list

* direct key access

* typing

* add explicit test
2024-01-09 14:44:39 +01:00
Massimiliano Pippi
9ace6bf63d
feat: store input's default value in InputSocket (#6651)
* track default value in sockets

* remove dead code

* include default value in socket description

* add unit test

* add relnote

* unused import

* clarify
2024-01-09 12:17:46 +01:00
ZanSara
175b5baf45
feat: support single metadata dictionary in AzureOCRDocumentConverter (#6635)
* support single metadata dict in azureconverter

* reno

* tests

* Update releasenotes/notes/single-meta-in-azureconverter-ce1cc196a9b161f3.yaml
2024-01-09 10:49:37 +01:00
ZanSara
974d65f30a
feat: support single metadata dictionary in TikaDocumentConverter (#6698)
* reno

* converter

* test

* comment
2024-01-09 09:49:47 +01:00
Massimiliano Pippi
93b2aaee09
chore: move DocumentJoiner to new joiners package (#6692)
* move DocumentJoiner to new joiners package

* relnote

* leftovers

* fix docstrings generation

* fix unrelated pydoc misconfiguration

* more unrelated work, yay!

* fix assertions
2024-01-08 22:06:27 +01:00
Silvano Cerza
9445b2d466
Fix skipif with empty env var (#6704) 2024-01-08 19:19:14 +01:00
Silvano Cerza
607e7d1488
Skip integration tests if env var is missing (#6703) 2024-01-08 17:15:10 +01:00
Vladimir Blagojevic
9e0b58784f
feat: Improve UrlCacheChecker, make it more generic (#6699)
* Rename UrlCacheChecker to CacheChecker, make it field generic

* Add release note
2024-01-08 16:15:27 +01:00
Sebastian Husch Lee
beade1cef9
feat: Add scaling and thresholding of the similarity ranker scores (#6683)
* Add scale_score functionality to the TransformersSimilarityRanker

* Updated test to check scores

* Use pytest approx when comparing floats

* Updated how scale score works and added calibration factor. Started to add score threshold.

* Add support for score_threshold

* Add some parameters to the run method

* Add release notes

* Fix mypy

* Be more tolerant on the score values

* Adding unit test for scale_score=False

* Add unit test for score threshold

* Update tests

* Rename test

* Fix typo

* PR comments
2024-01-08 09:05:24 +01:00
Vladimir Blagojevic
552f0e394b
feat: Add Azure embedders support (#6676)
* Add Azure embedders
---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-01-05 15:49:25 +01:00
Vladimir Blagojevic
b7159ad7c2
feat: Add AzureOpenAIGenerator and AzureOpenAIChatGenerator (#6648)
* Add AzureOpenAIGenerator and AzureOpenAIChatGenerator
2024-01-05 15:48:28 +01:00
Stefano Fiorucci
bb2b1a20f8
refactor: optimize API keys reading (#6655)
* centralize API keys handling

* fix mypy and pylint

* rm utility function, be more explicit
2024-01-05 10:40:03 +01:00
Vladimir Blagojevic
1336456b4f
Update prompt builders examples (#6681) 2024-01-04 16:54:26 +01:00
Vladimir Blagojevic
090d66b531
feat: Update OpenAIChatGenerator to handle both tools and functions calling (#6639)
* Handle tools parameter in OpenAIChatGenerator

* Handle tools/functions parameter in OpenAIChatGenerator streaming mode

* Adjust OpenAPIServiceConnector to handle tools parameter

* We never deal with functions/tools in non-chat generator

* Add release note
2023-12-28 17:29:47 +01:00
Stefano Fiorucci
c773c30c66
refactor!: rename all remaining metadata to meta (#6650)
* change metadata to meta

* release note
2023-12-28 12:18:15 +01:00
Vladimir Blagojevic
ef2f6bd681
feat: Split DynamicPromptBuilder and DynamicChatPromptBuilder (#6557)
* Split DynamicPromptBuilder

* Add release note

* Julian PR feedback

* dynamicchatbuilder lg upd

* dynamicpromptbuilder lg upd

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-12-26 15:27:43 +01:00
Vladimir Blagojevic
506ab81d26
chore: Rename GPT generators, deprecate old names (#6626) 2023-12-22 19:37:29 +01:00
ZanSara
c0f1dab454
feat: support single metadata dictionary in PyPDFToDocument (#6615)
* support single metadata dict in pypdf2document

* improve tests

* tests

* remove line
2023-12-22 14:13:11 +01:00
Stefano Fiorucci
8469c7f702
chore: upgrade transformers to 4.36.2 in test requirements (#6610)
* Update test_requirements.txt

* make tests run when tests requirements change

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-12-21 16:48:24 +01:00
ZanSara
ff55985e2d
feat: support single metadata dictionary in HTMLToDocument (#6613)
* support single metadata in HTMLToDocument

* reno

* docstring
2023-12-21 16:45:31 +01:00
Vladimir Blagojevic
4d08be0c2a
feat: Update OpenAI Python Client in Haystack 2.x (#6584)
* Update openai python client

* Add release note

* Consolidate multiple mock_chat_completion into one

* Ensure all components have api_base_url, organization params

* Update tests

* Enable function calling

* Oversight

* Minor fixes, add streaming test mocks

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* metadata -> meta

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-12-21 16:21:24 +01:00
ZanSara
cf79aa1485
feat: add support for single meta dict in TextFileToDocument (#6606)
* add support for single meta dict

* reno

* reno

* mypy

* extract to function

* docstring

* mypy
2023-12-21 14:21:17 +01:00
Stefano Fiorucci
7cc6080dfa
chore: replace metadata w meta in tests/examples (#6612)
* replace metadata w meta in tests/examples

* do not touch already broken e2e tests

* Revert "do not touch already broken e2e tests"

This reverts commit 1f911920d98954b57daacfe8d8ed02fd77d136db.
2023-12-21 14:09:31 +01:00