1558 Commits

Author SHA1 Message Date
David S. Batista
5867fa1f34
fix: whisper transcription test use github url + update test (#8455)
* adding audio file

* changing URL

* updating tests

* temporary removing failing test

* updating tests

* removing failing test

* typo

* linting

* fixing URL

* updating tests
2024-10-14 16:24:52 +02:00
David S. Batista
a50593ede0
fix: whisper tests using audio file from our github repo (#8454)
* adding audio file

* temporary removing failing test

* removing failing test
2024-10-14 12:56:37 +02:00
Madeesh Kannan
e7bfd80f3b
fix: (Temporarily) Re-add suport for pre-2.6.0 YAMLs with PyPDFConverter (#8443) 2024-10-08 14:35:43 +02:00
Madeesh Kannan
ee89f6ad57
fix: PyPDFToDocument correctly serializes custom converters, deprecate DefaultConverter (#8430)
* fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter`

* Remove `auto` prefix from serde util function names, add unit tests
2024-10-01 16:35:38 +02:00
Julian Risch
08686d90af
feat: Add DocumentNDCGEvaluator component (#8419)
* draft new component and tests

* draft new component and tests

* fix tests, replace usage of get_attr

* improve docstrings, refactor tests

* add test for mixed documents w/wo scores

* add test with multiple lists and update docstring

* validate inputs, add tests, make methods static

* change fallback to binary relevance

* rename validate_init_parameters to validate_inputs
2024-10-01 16:15:02 +02:00
Silvano Cerza
d6f073f9b3
Revert "fix: make pypdf converter more robust (#8427)" (#8428)
This reverts commit d234c75168dcb49866a6714aa232f37d56f72cab.
2024-10-01 11:55:25 +02:00
Tobias Wochinger
d234c75168
fix: make pypdf converter more robust (#8427)
* fix: make `from_dict` of `PyPDFToDocument` more robust

* chore: drop trailing space

* converting method to static and making the comment shorter

* reverting method to static

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-09-30 16:47:23 +00:00
Ajit Singh
2dd8089409
chore: Removed deprecated max_loop_allowed argument from Pipeline init (#8409)
* Added equality check for sender and receiver in connection function of pipeline

* Update base.py

irrelevant changes reverted

* added release note

* removed deprecated param max_loops_allowed from pipeline init

* added release note

* revert non relevant test

* Delete releasenotes/notes/remove-support-to-connect-component-to-self-6eedfb287f2a2a02.yaml

* revery non relevant change

* Remove unused test_pipeline_deprecated.yaml

* Remove PipelineMaxLoops error

* Update release notes

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-09-30 15:58:05 +02:00
Ajit Singh
7ba30d5691
feat: Pipeline.connect() will now raise a PipelineConnectError if sender and receiver are the same Component (#8403)
* Added equality check for sender and receiver in connection function of pipeline

* Update base.py

irrelevant changes reverted

* added release note

* altered a walk with cycle test

* added a test to verify that pipeline raises PipelineConnectError when adding a component to itself

* Update release notes

* Remove self connection feature tests

* Tidy up connect unit test

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-09-30 15:52:36 +02:00
Silvano Cerza
29672d4b42
feat: Add JSONConverter Component (#8397)
* Add JSONConverter Component

* Handle some corner cases

* Add JSONConverter to pydoc config

* Add a way to extract all non content fields as metadata

* Small fix in docstring

* Fix tests

* docstrings upd

* Update json.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2024-09-25 12:34:51 +02:00
Silvano Cerza
0df379e6a2
feat: Deprecate @component decorator is_greedy argument (#8400)
* Deprecate @component decorator is_greedy argument

* Fix some typos and docstrings

* Add _is_lazy_variadic test
2024-09-25 11:28:30 +02:00
Sebastian Husch Lee
74f7c6fdfb
Set max_runs_per_component to 1 for pipelines that are linear (#8393) 2024-09-24 14:44:45 +02:00
Vladimir Blagojevic
09b95746a2
feat: HuggingFaceAPIChatGenerator add token usage data (#8375)
* Ensure HuggingFaceAPIChatGenerator has token usage data

* Add reno note

* Fix release note
2024-09-23 15:40:50 +02:00
Sriniketh J
066e2e3ec5
Make api_key param optional in LLMEvaluator (#8340) 2024-09-20 10:47:13 +02:00
Sebastian Husch Lee
2235ce673f
test: Move pipeline test to behavorials (#8377) 2024-09-19 16:59:35 +02:00
Vladimir Blagojevic
514e0abc39
fix: Fix nltk imports (#8381) 2024-09-18 11:25:21 +00:00
Madeesh Kannan
b22014b915
fix: Prevent set_output_types from being called when the output_types decorator is used (#8376) 2024-09-18 13:05:31 +02:00
Vladimir Blagojevic
badd0594cc
feat: Port NLTKDocumentSplitter from dC to Haystack (#8350)
* Port NLTKDocumentSplitter from dC to Haystack

* Improve pydocs

* Use haystack logging

* Add NLTKDocumentSplitter to __init__.py

* Use haystack logging, rename test classes

* Fixing _needs_join return

* Linting

* PR feedback

* More static methods

* Increase test coverage

* Compile pattern

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-09-17 13:59:19 +02:00
Madeesh Kannan
5071e47843
refactor: Rename Component.async_run to Component.run_async for better readablility (#8370)
Using a suffix will keep names logically sorted, less noisy and relegate the async aspect to an implementation/API detail.
2024-09-17 10:10:34 +00:00
David S. Batista
97126eb544
fix: changing default model to gpt-4o-mini on OpenAI API calls (#8360)
* chaning default model to gpt-4o-mini

* adding release notes

* fixing some missed tests

* fixing some more missed tests

* fixing one last missed test

* fixing linting issues

* making pylint happy about an end2end test

* chaning if test to walruss operator

* fixing azure embedder from ada to text-embedding-ada-002
2024-09-17 10:36:42 +02:00
Giovanni Alzetta, PhD
4106e7e8d1
feat : DocumentSplitter, adding the option to split_by function (#8336)
* Adding splitting function

* Adding test for split by function

* Adding release note for feat adding split by function

* Fixing release note for split_by_function

* Fixing issue with splitting_function non callable

* nit: fixing value error in documentsplitter for split_by

* Add custom serde

---------

Co-authored-by: Giovanni Alzetta <giovannialzetta@gmail.com>
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2024-09-12 16:38:37 +02:00
Vladimir Blagojevic
7e9f153e78
chore: Remove all references to old filter syntax (#8342)
* Remove all references to old filter syntax

* More removals

* Lint

* Do not remove test_filter_retriever.py

* Add reno note

* Update ValueError text to match text in haystack-core-integrations
2024-09-12 16:28:31 +02:00
Madeesh Kannan
672bcf7e03
fix: Add constraints to set_input_type(s) based on run method (#8358)
* fix: Prevent the usage of `set_input_type(s)` when the `run` method doesn't have kwargs,
raise if `set_input_type(s)` overrides `run` method parameters

* fix: update components and tests

* reno
2024-09-12 15:58:16 +02:00
Silvano Cerza
5514676b5e
feat: Deprecate max_loops_allowed in favour of new argument max_runs_per_component (#8354)
* Deprecate max_loops_allowed in favour of new argument max_runs_per_component

* Add missing test file

* Some enhancements

* Add version that will remove deprecate stuff
2024-09-12 11:00:12 +02:00
Sebastian Husch Lee
7227bcf9df
feat: TransformerSimilarityRanker add batching across Documents during inference (#8344)
* First pass at adding batch support to TransformersSimilarityRanker

* Add test

* Add reno
2024-09-11 12:47:29 +02:00
Silvano Cerza
4d67b552e1
Fix Pipeline skipping a Component with Variadic input (#8347)
* Fix Pipeline skipping a Component with Variadic input

* Simplify _find_components_that_will_receive_no_input
2024-09-10 14:59:53 +02:00
Ulises M
145ca89a3f
feat: Expose default_headers and add kwargs for Azure Client (#8244)
* default_headers and azure_kwargs added

* update docstrings

* dont forget about chat generator

* Remove azure_kwargs argument

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-09-10 10:29:56 +00:00
jpatra72
b126c14e51
feat: Adds support for zero-shot document classification (#7669) (#8193)
* feat: adds support for zero short document classification (#7669)

Also, supports multi-label classification

* pytests for zero shot document classification

* release note

* added licence info to py scripts

* updated the format of licence info

* Added doc string and example code

* added review points highlighted in the PR

* feat: adds support for zero short document classification (#7669)

Also, supports multi-label classification

* pytests for zero shot document classification

* release note

* added licence info to py scripts

* updated the format of licence info

* Added doc string and example code

* added review points highlighted in the PR

* Applied suggestions from doc string review

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* fixed pytest for init

* added output type

* added test for pipeline (de-) serialization

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
2024-09-10 11:00:05 +02:00
Silvano Cerza
da49e782e2
chore: Make arrow an optional dependency (#8345)
* Make arrow an optional dependency

* Fix imports
2024-09-09 16:09:51 +02:00
ArzelaAscoIi
720e54970f
fix: make from dict conditional router more resilient (#8343)
* fix: make from dict conditional router more resilient

* refactor: remove

* dos: add release notes

* fix: format
2024-09-09 15:11:52 +02:00
Mo Sriha
75955922b9
feat: Add current date in UTC to PromptBuilder (#8233)
* initial commit

* add unit tests

* add release notes

* update function name
2024-09-09 09:47:03 +02:00
Sebastian Husch Lee
06dd5c2f37
feat (v2): Update so model_max_length updates max_seq_length for Sentence Transformers (#8334)
* Update so model_max_length does what is expected

* Add release notes

* Some fixes

* Another test
2024-09-06 11:37:56 +02:00
Sriniketh J
e98a6fea04
Convertor: CSVToDocument (#8328)
* carry forwarded initial commit

* fix: doc strings

* fix: update docstrings

* fix: docstring update

* fix: csv encoding in actions

* fix: line endings through hooks

* fix: converter docs addition
2024-09-06 10:59:12 +02:00
Vladimir Blagojevic
b2c19a8c7a
feat: ChatPromptBuilder copies entire ChatMessage rather than copying content field only (#8317)
* Initial implementation of ChatMessage copy and deepcopy

* Add reno release note

* Satisfy hawkeye

* Remove copy and deepcopy, no need to complicate things

* Add new reno note

* Add unit test
2024-09-02 18:06:38 +02:00
Silvano Cerza
3e3f79b928
feat: Add unsafe init arg in ConditionalRouter and OutputAdapter to enable previous behaviour (#8176)
* Add unsafe behaviour to OutputAdapter

* Add unsafe behaviour to ConditionalRouter

* Add release notes

* Fix mypy

* Add documentation links

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-09-02 14:14:54 +00:00
Alper
e614fa0c62
refactor: Rename deserialize_document_store_in_init_parameters (#8302)
* 8259

* update function name

* rename and update docstring

* fix linting

* add a release note
2024-09-02 11:42:23 +02:00
Stefano Fiorucci
842a7b80a8
rm sentence_window_retrieval (#8303) 2024-08-28 10:51:07 +02:00
David S. Batista
2f3257b77a
chore: removing deprecated SentenceWindowRetrieval (#8294)
* removing deprecated SentenceWindowRetrieval

* adding release notes

* Rename TestSentenceWindowRetrieval to TestSentenceWindowRetriever

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2024-08-28 10:04:52 +02:00
Madeesh Kannan
f0b45c873f
feat: Extend core component machinery to support an async run method (experimental) (#8279)
* feat: Extend core component machinery to support an async run method

* Add reno

* Fix incorrect docstring

* Make `async_run` a coroutine

* Make `supports_async` a dunder field
2024-08-27 14:20:13 +02:00
Madeesh Kannan
1fa30d4aaa
chore: Remove deprecated debug param from Pipeline.run (#8288)
* chore: Remove deprecated `debug` param from `Pipeline.run`

* Fix tests
2024-08-27 11:27:38 +02:00
David S. Batista
b411c14414
feat: The SentenceWindowRetriever has now an extra output key containing all the documents belonging to the context window (#8283)
* initial import

* adding release notes

* linting

* improving docs and release notes

* updating example
2024-08-27 10:30:12 +02:00
Stefano Fiorucci
2e619f06c8
fix: make meta produced by DOCXToDocument JSON serializable (#8263)
* make meta from DOCXToDocument JSON serializable

* unused import

* update docstrings
2024-08-22 12:24:32 +00:00
Jon Strutz
471f07c8fe
fix: extract page breaks from .docx files (#8232)
* fix: extract page breaks from .docx files

Context: Currently, DOCXToDocument does not extract page breaks from
word documents. This makes it impossible to do things like split by page
or get correct page number metadata after using something like
DocumentSplitter. For example, if you split by word, the 'page_number'
metadata field will be 1 for all documents.

Solution: Added a method to DOCXToDocument that extracts page breaks
from word documents as '\f' characters so that they are recognized by
DocumentSplitter.

Caveat: Due to the way the python-docx library is set up, you can only
accurately determine the location of the first page break for a given
paragraph. In the rare case that a paragraph contains more than one page
break (which means it is an extremely long paragraph spanning multiple
pages), the 2nd, 3rd, etc. page break locations are not known. To sort
of fix this, I just appended the page break characters to the end of
the paragraph text to keep the overall page number values for the
document consistent.

* Apply suggestions from code review

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-08-21 09:48:02 +00:00
Sebastian Husch Lee
7fd0b6a013
feat: Add min_top_k to TopPSampler (#8228)
* Add feature to Top P Sampler

* Add release notes

* Fix zip call

* Fix mypy

* Restore doc string and make mypy happy hopefully

* Make mypy happy

* PR comment

* Revert change to make mypy happy

* Add back type ignore

* try to fix typing

* Update haystack/components/samplers/top_p.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Update haystack/components/samplers/top_p.py

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2024-08-21 11:29:23 +02:00
Madeesh Kannan
cf5fd2a821
chore: Remove deprecated ChatMessage.to_openai_format (#8242)
* chore: Remove deprecated `ChatMessage.to_openai_format`

* lint
2024-08-16 10:34:44 +02:00
Stefano Fiorucci
bcc4104729
refactor: utility function for docstore deserialization (#8226)
* refactor docstore deserialization

* more tests

* reno; headers

* expose key
2024-08-14 13:29:27 +02:00
Silvano Cerza
ab7eb25856
Add utility then step in feature testing to draw pipeline to file (#8209) 2024-08-13 14:49:13 +02:00
Vladimir Blagojevic
3318d894c0
Add sede_with_list_output_type_in_pipeline unit test (#8196) 2024-08-13 14:37:24 +02:00
Amna Mubashar
373de97426
Deprecate SentenceWindowRetrieval (#8206) 2024-08-13 13:49:41 +02:00
Vladimir Blagojevic
21c507331c
feat: Implement apply_filter_policy and FilterPolicy.MERGE for the new filters (#8042) 2024-08-09 12:04:24 +02:00