1524 Commits

Author SHA1 Message Date
Madeesh Kannan
33675b4caf
chore: Remove deprecated DefaultConverter for PyPDFToDocument (#8501)
* chore: Remove deprecated `DefaultConverter` for `PyPDFToDocument`

* Remove unused imports

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-10-29 16:42:48 +00:00
Bohan Qu
081b143aae
feat!: tracing with concurrency (#8489)
2024-10-29 17:39:41 +01:00
Stefano Fiorucci
2045f6f16a
try test jsonschema (#8496)
2024-10-29 16:21:51 +01:00
Vladimir Blagojevic
28161f7bb9
feat: DOCXToDocument: add table extraction (#8457)
* DOCXToDocument: add table extraction

* Add reno note

* mypy fixes

* add unit tests

* Add csv table support

* Update release note

* Add TableFormat enum

* Add table_format as str init param

* Update docx.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* PR feedback

* PR feedback

---------

Co-authored-by: medsriha <medsriha@gmail.com>
Co-authored-by: Mo Sriha <22803208+medsriha@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-10-29 16:20:27 +01:00
Silvano Cerza
8205724395
feat: Rework Pipeline.run() to better handle cycles (#8431)
* draft

* Enhance

* Almost works

* Simplify some parts and handle intermediate outputs

* Handle connections with default

* Handle cycles with multiple connections from two components

* Update distributed outputs at the correct time

* Remove Component inputs after it runs

* Add agent pipeline test case

* Fix infinite loop test

* Handle some corner cases with loops checking and inputs deletion

* Fix tests

* Add new behavioral test

* Remove unused code in behavioural test

* Fix behavioural test

* Fix max run check

* Simplify outputs distribution

* Simplify subgraph run check

* Remove unused _init_run_queue function

* Remove commented code

* Add some missing type hints

* Simplify cycles breaking

* Fix _distribute_output test

* Fix _find_components_that_will_receive_no_input test

* Fix validation test

* Fix tracer losing Component inputs

* Fix some linting issues

* Remove ignore pylint rule

* Rename method that break cycles and make it raise

* Add docstring to _run_subgraph

* Update Pipeline.run() docstring

* Update comment to clarify cycles execution

* Remove SelfLoop sample Component

* Add behavioural test for unsupported cycles

* Rename behavioural test to be more specific

* Add new behavioural test

* Add release notes

* Remove commented out code and random pass

* Use more efficient function to find cycles

* Simplify _break_supported_cycles_in_graph by using defaultdict

* Stop breaking edges as soon as we make the graph acyclic

* Fix docstring and add some more comments

* Fix _distribute_output docstring

* Fix _find_receivers_from docstring

* More detailed release notes

* Minimize calls to networkx.is_directed_acyclic_graph

* Add some more info on edges keys

* Adjust components_in_cycles comment

* Add new Pipeline behavioural test

* Enhance _find_components_that_will_receive_no_input to cover more cases

* Explain why run_queue is reset after running a subgraph cycle

* Rename _init_inputs_state to _normalize_input_data

* Better explain the subgraph output distribution

* Remove for else

* Fix some comments and docstrings

* Fix linting

* Add missing return type

* Fix typo

* Rename _normalize_input_data to _normalize_varidiac_input_data and add more documentation

* Remove unused import

---------

Co-authored-by: Sebastian Husch Lee <sjrl423@gmail.com>
2024-10-29 15:43:16 +01:00
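The bullets in the commit above mention finding cycles and no longer breaking edges once the graph is acyclic. The sketch below illustrates that general idea with the standard library only. It is not Haystack's implementation: the names `find_cycle`, `break_cycles` and `breakable` are hypothetical, and the rule for which edges may be removed is an assumption made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]
Graph = Dict[str, Set[str]]


def find_cycle(graph: Graph) -> Optional[List[Edge]]:
    """Return the edges of one cycle found by depth-first search, or None."""
    visiting: Set[str] = set()  # nodes on the current DFS path
    done: Set[str] = set()      # nodes fully explored, known to be cycle-free
    path: List[str] = []

    def visit(node: str) -> Optional[List[Edge]]:
        visiting.add(node)
        path.append(node)
        for nxt in sorted(graph.get(node, ())):
            if nxt in visiting:
                # Back edge: the cycle runs from nxt along the path, then back to nxt.
                nodes = path[path.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if nxt not in done:
                found = visit(nxt)
                if found is not None:
                    return found
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for start in sorted(graph):
        if start not in done:
            found = visit(start)
            if found is not None:
                return found
    return None


def break_cycles(graph: Graph, breakable: Set[Edge]) -> Tuple[Graph, List[Edge]]:
    """Remove breakable edges one at a time, stopping as soon as no cycle is left."""
    acyclic = {node: set(targets) for node, targets in graph.items()}  # work on a copy
    removed: List[Edge] = []
    while True:
        cycle = find_cycle(acyclic)
        if cycle is None:
            return acyclic, removed
        candidates = [edge for edge in cycle if edge in breakable]
        if not candidates:
            # Hypothetical policy: a cycle without a breakable edge is unsupported.
            raise ValueError(f"cycle {cycle} has no breakable edge")
        sender, receiver = candidates[0]
        acyclic[sender].discard(receiver)
        removed.append((sender, receiver))


# Hypothetical three-component loop, only the feedback edge may be broken.
pipeline_graph = {"builder": {"llm"}, "llm": {"validator"}, "validator": {"builder"}}
dag, removed_edges = break_cycles(pipeline_graph, breakable={("validator", "builder")})
print(removed_edges)
```

Raising when a cycle has no removable edge loosely mirrors the bullet "Rename method that break cycles and make it raise", but the exact conditions used by the real code are not stated in this log.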
tstadel
d430833f8f
feat: streaming_callback as run param from HF generators (#8406)
* feat: streaming_callback as run param from HF generators

* apply feedback

* add reno

* fix test

* fix test

* fix mypy

* fix excessive linting rule
2024-10-29 15:32:06 +01:00
Stefano Fiorucci
78292422f0
feat: allow passing meta in the run method of FileTypeRouter (#8486)
* initial refactoring

* progress

* refinements

* serde methods + tests

* release note

* comment

* make additional_mimetypes internal attribute
2024-10-24 16:21:15 +02:00
Madeesh Kannan
906177329b
fix: Enforce basic Python types restriction on serialized component data (#8473)
2024-10-22 17:08:36 +02:00
Alper
a556e11bf1
fix: window_size set during run instead of construction (#8463)
* window_size set during runtime

* revert init and update run with window_size

* improved doc, removed print

* adding release notes

* updating tests

* reverting docstring example

* Update haystack/components/retrievers/sentence_window_retriever.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/components/retrievers/sentence_window_retriever.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/components/retrievers/sentence_window_retriever.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2024-10-22 14:01:26 +00:00
David S. Batista
3a50d35f06
feat: allow Generators to run with a system prompt defined at run time (#8423)
* initial import

* Update haystack/components/generators/openai.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* docs: fixing

* supporting the three use cases: no system prompt, using system prompt defined at init, using system prompt defined at run time

* renaming 'run_time_system_prompt' to 'system_prompt'

* adding tests, converting methods to static

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-10-22 11:21:10 +02:00
Stefano Fiorucci
322f63de6d
feat: Logging Tracer (#8447)
* logging tracer: first draft

* progress

* more tests

* license header

* avoid interference with other tests

* release note

* incorporate feedback from review

* Update haystack/tracing/logging_tracer.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-10-21 09:47:46 +02:00
Ajit Singh
6cf13e8b98
enhancement: reduced usage of numpy and substituted built-in libraries (#8418)
* reduced usage of numpy and substituted built-in libraries

* added release note

* edited expit function to support both float and list inputs (the list case was raising an error in CI)

* revert code, numpy can't be removed here

* more cleaning

* fix relnote

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2024-10-18 15:42:19 +02:00
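The commit above mentions an `expit` function that accepts both a float and a list. A minimal standard-library sketch of such a function follows. It assumes `expit` means the logistic sigmoid 1 / (1 + e^(-x)), as in SciPy. The signature and the branch on the sign of `x` are illustrative choices, not the code from the commit.

```python
import math
from typing import List, Union

Number = Union[int, float]


def expit(x: Union[Number, List[Number]]) -> Union[float, List[float]]:
    """Logistic sigmoid for a single number or, element-wise, for a list."""
    if isinstance(x, (int, float)):
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        # For negative x, exp(-x) could overflow, so use the equivalent form.
        e = math.exp(x)
        return e / (1.0 + e)
    return [expit(value) for value in x]


print(expit(0.0))
print(expit([-2.0, 0.0, 2.0]))
```

Splitting on the sign keeps `math.exp` from raising `OverflowError` for large negative inputs, which a direct translation of the formula would do.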
Stefano Fiorucci
dfd339ca2d
ensure compatibility with huggingface_hub==0.26.0 (#8464)
2024-10-18 08:38:48 +00:00
tstadel
8613bb7653
fix: logs containing JSON getting lost (#8456)
* fix: logs getting lost

* add test

* add reno
2024-10-15 14:11:14 +02:00
Alper
b40f0c8b5d
feat: SentenceTransformersTextEmbedder supports config_kwargs (#8432)
* add config_kwargs

* disable PLR0913 for a specific function

* add a release note

* refer to AutoConfig in config_kwargs docstring

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
Co-authored-by: Julian Risch <julianrisch@gmx.de>
2024-10-14 16:08:53 +00:00
David S. Batista
b81abc0c85
feat: SentenceTransformersDocumentEmbedder supports config_kwargs (#8433)
* initial import

* adding release notes
2024-10-14 17:43:04 +02:00
David S. Batista
5867fa1f34
fix: whisper transcription test use github url + update test (#8455)
* adding audio file

* changing URL

* updating tests

* temporary removing failing test

* updating tests

* removing failing test

* typo

* linting

* fixing URL

* updating tests
2024-10-14 16:24:52 +02:00
David S. Batista
a50593ede0
fix: whisper tests using audio file from our github repo (#8454)
* adding audio file

* temporary removing failing test

* removing failing test
2024-10-14 12:56:37 +02:00
Madeesh Kannan
e7bfd80f3b
fix: (Temporarily) Re-add support for pre-2.6.0 YAMLs with PyPDFConverter (#8443)
2024-10-08 14:35:43 +02:00
Madeesh Kannan
ee89f6ad57
fix: PyPDFToDocument correctly serializes custom converters, deprecate DefaultConverter (#8430)
* fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter`

* Remove `auto` prefix from serde util function names, add unit tests
2024-10-01 16:35:38 +02:00
Julian Risch
08686d90af
feat: Add DocumentNDCGEvaluator component (#8419)
* draft new component and tests

* draft new component and tests

* fix tests, replace usage of get_attr

* improve docstrings, refactor tests

* add test for mixed documents w/wo scores

* add test with multiple lists and update docstring

* validate inputs, add tests, make methods static

* change fallback to binary relevance

* rename validate_init_parameters to validate_inputs
2024-10-01 16:15:02 +02:00
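For readers unfamiliar with the metric named in the commit above, the sketch below computes normalized discounted cumulative gain (NDCG) with the standard library. It is not the `DocumentNDCGEvaluator` code. The function names, the dictionary input format, and the exact fallback rule (treating every ground-truth document as relevance 1 when any score is missing) are assumptions, loosely based on the bullets about mixed documents and binary relevance.

```python
import math
from typing import Dict, List, Optional


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(ground_truth: Dict[str, Optional[float]], retrieved: List[str]) -> float:
    """NDCG of a retrieved ranking (list of document ids) against scored ground truth."""
    if any(score is None for score in ground_truth.values()):
        # Assumed fallback: without a full set of scores, use binary relevance.
        gains = {doc_id: 1.0 for doc_id in ground_truth}
    else:
        gains = {doc_id: float(score) for doc_id, score in ground_truth.items()}
    actual = dcg([gains.get(doc_id, 0.0) for doc_id in retrieved])
    ideal = dcg(sorted(gains.values(), reverse=True))
    return actual / ideal if ideal > 0 else 0.0


scores = {"doc_a": 3.0, "doc_b": 2.0, "doc_c": 1.0}
print(ndcg(scores, ["doc_a", "doc_b", "doc_c"]))  # ideal order
print(ndcg(scores, ["doc_c", "doc_b", "doc_a"]))  # reversed order scores lower
```

The ideal ranking here is computed over all ground-truth documents. Truncating it to the length of the retrieved list is another common convention.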
Silvano Cerza
d6f073f9b3
Revert "fix: make pypdf converter more robust (#8427)" (#8428)
This reverts commit d234c75168dcb49866a6714aa232f37d56f72cab.
2024-10-01 11:55:25 +02:00
Tobias Wochinger
d234c75168
fix: make pypdf converter more robust (#8427)
* fix: make `from_dict` of `PyPDFToDocument` more robust

* chore: drop trailing space

* converting method to static and making the comment shorter

* reverting method to static

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-09-30 16:47:23 +00:00
Ajit Singh
2dd8089409
chore: Removed deprecated max_loops_allowed argument from Pipeline init (#8409)
* Added equality check for sender and receiver in connection function of pipeline

* Update base.py

irrelevant changes reverted

* added release note

* removed deprecated param max_loops_allowed from pipeline init

* added release note

* revert non relevant test

* Delete releasenotes/notes/remove-support-to-connect-component-to-self-6eedfb287f2a2a02.yaml

* revert non relevant change

* Remove unused test_pipeline_deprecated.yaml

* Remove PipelineMaxLoops error

* Update release notes

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-09-30 15:58:05 +02:00
Ajit Singh
7ba30d5691
feat: Pipeline.connect() will now raise a PipelineConnectError if sender and receiver are the same Component (#8403)
* Added equality check for sender and receiver in connection function of pipeline

* Update base.py

irrelevant changes reverted

* added release note

* altered a walk with cycle test

* added a test to verify that pipeline raises PipelineConnectError when adding a component to itself

* Update release notes

* Remove self connection feature tests

* Tidy up connect unit test

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-09-30 15:52:36 +02:00
Silvano Cerza
29672d4b42
feat: Add JSONConverter Component (#8397)
* Add JSONConverter Component

* Handle some corner cases

* Add JSONConverter to pydoc config

* Add a way to extract all non content fields as metadata

* Small fix in docstring

* Fix tests

* docstrings upd

* Update json.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2024-09-25 12:34:51 +02:00
Silvano Cerza
0df379e6a2
feat: Deprecate @component decorator is_greedy argument (#8400)
* Deprecate @component decorator is_greedy argument

* Fix some typos and docstrings

* Add _is_lazy_variadic test
2024-09-25 11:28:30 +02:00
Sebastian Husch Lee
74f7c6fdfb
Set max_runs_per_component to 1 for pipelines that are linear (#8393)
2024-09-24 14:44:45 +02:00
Vladimir Blagojevic
09b95746a2
feat: HuggingFaceAPIChatGenerator add token usage data (#8375)
* Ensure HuggingFaceAPIChatGenerator has token usage data

* Add reno note

* Fix release note
2024-09-23 15:40:50 +02:00
Sriniketh J
066e2e3ec5
Make api_key param optional in LLMEvaluator (#8340)
2024-09-20 10:47:13 +02:00
Sebastian Husch Lee
2235ce673f
test: Move pipeline test to behavioral tests (#8377)
2024-09-19 16:59:35 +02:00
Vladimir Blagojevic
514e0abc39
fix: Fix nltk imports (#8381)
2024-09-18 11:25:21 +00:00
Madeesh Kannan
b22014b915
fix: Prevent set_output_types from being called when the output_types decorator is used (#8376)
2024-09-18 13:05:31 +02:00
Vladimir Blagojevic
badd0594cc
feat: Port NLTKDocumentSplitter from dC to Haystack (#8350)
* Port NLTKDocumentSplitter from dC to Haystack

* Improve pydocs

* Use haystack logging

* Add NLTKDocumentSplitter to __init__.py

* Use haystack logging, rename test classes

* Fixing _needs_join return

* Linting

* PR feedback

* More static methods

* Increase test coverage

* Compile pattern

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-09-17 13:59:19 +02:00
Madeesh Kannan
5071e47843
refactor: Rename Component.async_run to Component.run_async for better readability (#8370)
Using a suffix will keep names logically sorted, less noisy and relegate the async aspect to an implementation/API detail.
2024-09-17 10:10:34 +00:00
David S. Batista
97126eb544
fix: changing default model to gpt-4o-mini on OpenAI API calls (#8360)
* changing default model to gpt-4o-mini

* adding release notes

* fixing some missed tests

* fixing some more missed tests

* fixing one last missed test

* fixing linting issues

* making pylint happy about an end2end test

* changing if test to walrus operator

* fixing azure embedder from ada to text-embedding-ada-002
2024-09-17 10:36:42 +02:00
Giovanni Alzetta, PhD
4106e7e8d1
feat: DocumentSplitter, adding the option to split_by function (#8336)
* Adding splitting function

* Adding test for split by function

* Adding release note for feat adding split by function

* Fixing release note for split_by_function

* Fixing issue with splitting_function non callable

* nit: fixing value error in documentsplitter for split_by

* Add custom serde

---------

Co-authored-by: Giovanni Alzetta <giovannialzetta@gmail.com>
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2024-09-12 16:38:37 +02:00
Vladimir Blagojevic
7e9f153e78
chore: Remove all references to old filter syntax (#8342)
* Remove all references to old filter syntax

* More removals

* Lint

* Do not remove test_filter_retriever.py

* Add reno note

* Update ValueError text to match text in haystack-core-integrations
2024-09-12 16:28:31 +02:00
Madeesh Kannan
672bcf7e03
fix: Add constraints to set_input_type(s) based on run method (#8358)
* fix: Prevent the usage of `set_input_type(s)` when the `run` method doesn't have kwargs,
raise if `set_input_type(s)` overrides `run` method parameters

* fix: update components and tests

* reno
2024-09-12 15:58:16 +02:00
Silvano Cerza
5514676b5e
feat: Deprecate max_loops_allowed in favour of new argument max_runs_per_component (#8354)
* Deprecate max_loops_allowed in favour of new argument max_runs_per_component

* Add missing test file

* Some enhancements

* Add version that will remove deprecated stuff
2024-09-12 11:00:12 +02:00
Sebastian Husch Lee
7227bcf9df
feat: TransformersSimilarityRanker add batching across Documents during inference (#8344)
* First pass at adding batch support to TransformersSimilarityRanker

* Add test

* Add reno
2024-09-11 12:47:29 +02:00
Silvano Cerza
4d67b552e1
Fix Pipeline skipping a Component with Variadic input (#8347)
* Fix Pipeline skipping a Component with Variadic input

* Simplify _find_components_that_will_receive_no_input
2024-09-10 14:59:53 +02:00
Ulises M
145ca89a3f
feat: Expose default_headers and add kwargs for Azure Client (#8244)
* default_headers and azure_kwargs added

* update docstrings

* don't forget about chat generator

* Remove azure_kwargs argument

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-09-10 10:29:56 +00:00
jpatra72
b126c14e51
feat: Adds support for zero-shot document classification (#7669) (#8193)
* feat: adds support for zero-shot document classification (#7669)

Also, supports multi-label classification

* pytests for zero shot document classification

* release note

* added licence info to py scripts

* updated the format of licence info

* Added doc string and example code

* added review points highlighted in the PR

* Applied suggestions from doc string review

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* fixed pytest for init

* added output type

* added test for pipeline (de-) serialization

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
2024-09-10 11:00:05 +02:00
Silvano Cerza
da49e782e2
chore: Make arrow an optional dependency (#8345)
* Make arrow an optional dependency

* Fix imports
2024-09-09 16:09:51 +02:00
ArzelaAscoIi
720e54970f
fix: make from dict conditional router more resilient (#8343)
* fix: make from dict conditional router more resilient

* refactor: remove

* docs: add release notes

* fix: format
2024-09-09 15:11:52 +02:00
Mo Sriha
75955922b9
feat: Add current date in UTC to PromptBuilder (#8233)
* initial commit

* add unit tests

* add release notes

* update function name
2024-09-09 09:47:03 +02:00
Sebastian Husch Lee
06dd5c2f37
feat (v2): Update so model_max_length updates max_seq_length for Sentence Transformers (#8334)
* Update so model_max_length does what is expected

* Add release notes

* Some fixes

* Another test
2024-09-06 11:37:56 +02:00
Sriniketh J
e98a6fea04
Converter: CSVToDocument (#8328)
* carry forwarded initial commit

* fix: doc strings

* fix: update docstrings

* fix: docstring update

* fix: csv encoding in actions

* fix: line endings through hooks

* fix: converter docs addition
2024-09-06 10:59:12 +02:00
Vladimir Blagojevic
b2c19a8c7a
feat: ChatPromptBuilder copies entire ChatMessage rather than copying content field only (#8317)
* Initial implementation of ChatMessage copy and deepcopy

* Add reno release note

* Satisfy hawkeye

* Remove copy and deepcopy, no need to complicate things

* Add new reno note

* Add unit test
2024-09-02 18:06:38 +02:00