* draft new component and tests
* fix tests, replace usage of get_attr
* improve docstrings, refactor tests
* add test for mixed documents w/wo scores
* add test with multiple lists and update docstring
* validate inputs, add tests, make methods static
* change fallback to binary relevance
* rename validate_init_parameters to validate_inputs
* fix: make `from_dict` of `PyPDFToDocument` more robust
* chore: drop trailing space
* converting method to static and making the comment shorter
* reverting method to static
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* Added equality check for sender and receiver in connection function of pipeline
* Update base.py
irrelevant changes reverted
* added release note
* altered a walk with cycle test
* added a test to verify that pipeline raises PipelineConnectError when adding a component to itself
* Update release notes
* Remove self connection feature tests
* Tidy up connect unit test
---------
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
* Add JSONConverter Component
* Handle some corner cases
* Add JSONConverter to pydoc config
* Add a way to extract all non content fields as metadata
* Small fix in docstring
* Fix tests
* docstrings upd
* Update json.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Port NLTKDocumentSplitter from dC to Haystack
* Improve pydocs
* Use haystack logging
* Add NLTKDocumentSplitter to __init__.py
* Use haystack logging, rename test classes
* Fixing _needs_join return
* Linting
* PR feedback
* More static methods
* Increase test coverage
* Compile pattern
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* changing default model to gpt-4o-mini
* adding release notes
* fixing some missed tests
* fixing some more missed tests
* fixing one last missed test
* fixing linting issues
* making pylint happy about an end2end test
* changing if test to walrus operator
* fixing azure embedder from ada to text-embedding-ada-002
* Adding splitting function
* Adding test for split by function
* Adding release note for feat adding split by function
* Fixing release note for split_by_function
* Fixing issue with splitting_function non callable
* nit: fixing value error in documentsplitter for split_by
* Add custom serde
---------
Co-authored-by: Giovanni Alzetta <giovannialzetta@gmail.com>
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* Remove all references to old filter syntax
* More removals
* Lint
* Do not remove test_filter_retriever.py
* Add reno note
* Update ValueError text to match text in haystack-core-integrations
* fix: Prevent the usage of `set_input_type(s)` when the `run` method doesn't have kwargs,
raise if `set_input_type(s)` overrides `run` method parameters
* fix: update components and tests
* reno
* Deprecate max_loops_allowed in favour of new argument max_runs_per_component
* Add missing test file
* Some enhancements
* Add version that will remove deprecated stuff
* feat: adds support for zero-shot document classification (#7669)
Also supports multi-label classification
* pytests for zero shot document classification
* release note
* added licence info to py scripts
* updated the format of licence info
* Added doc string and example code
* added review points highlighted in the PR
* Applied suggestions from doc string review
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* fixed pytest for init
* added output type
* added test for pipeline (de-) serialization
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Initial implementation of ChatMessage copy and deepcopy
* Add reno release note
* Satisfy hawkeye
* Remove copy and deepcopy, no need to complicate things
* Add new reno note
* Add unit test
* feat: Extend core component machinery to support an async run method
* Add reno
* Fix incorrect docstring
* Make `async_run` a coroutine
* Make `supports_async` a dunder field
* fix: extract page breaks from .docx files
Context: Currently, DOCXToDocument does not extract page breaks from
Word documents. This makes it impossible to do things like split by page
or get correct page number metadata after using something like
DocumentSplitter. For example, if you split by word, the 'page_number'
metadata field will be 1 for all documents.
Solution: Added a method to DOCXToDocument that extracts page breaks
from Word documents as '\f' characters so that they are recognized by
DocumentSplitter.
Caveat: Due to the way the python-docx library is set up, you can only
accurately determine the location of the first page break for a given
paragraph. In the rare case that a paragraph contains more than one page
break (which means it is an extremely long paragraph spanning multiple
pages), the 2nd, 3rd, etc. page break locations are not known. As a
workaround, the page break characters are appended to the end of the
paragraph text to keep the overall page number values for the document
consistent.
* Apply suggestions from code review
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
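
The page break handling described in the DOCXToDocument entry above can be illustrated with a minimal, library-free sketch. It is not the actual Haystack or python-docx code. `ParagraphInfo`, `render_paragraph`, and `page_number_at` are hypothetical names, and the sketch assumes the first break is placed at its known offset while any further breaks are appended to the end of the paragraph text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParagraphInfo:
    """Hypothetical stand-in for what a DOCX parser reports about one paragraph."""

    text: str
    first_break_offset: Optional[int] = None  # offset of the first page break, if known
    total_breaks: int = 0  # number of page breaks inside the paragraph


def render_paragraph(par: ParagraphInfo) -> str:
    """Return the paragraph text with page breaks encoded as form feeds."""
    if par.total_breaks == 0:
        return par.text
    # Assumption: if the offset is unknown, fall back to the end of the text.
    offset = len(par.text) if par.first_break_offset is None else par.first_break_offset
    head, tail = par.text[:offset], par.text[offset:]
    # Only the first break has a known location; the rest go at the end so
    # that the total number of form feeds still matches the page count.
    extra = "\f" * (par.total_breaks - 1)
    return head + "\f" + tail + extra


def page_number_at(text: str, position: int) -> int:
    """1-based page number at a character position, counting earlier form feeds."""
    return text.count("\f", 0, position) + 1


paragraphs = [
    ParagraphInfo("Intro on page one."),
    ParagraphInfo("Starts on page one, ends on page two.", first_break_offset=20, total_breaks=1),
    ParagraphInfo("A very long paragraph.", first_break_offset=7, total_breaks=3),
    ParagraphInfo("Final paragraph."),
]
document_text = "\n".join(render_paragraph(p) for p in paragraphs)
final_page = page_number_at(document_text, document_text.index("Final"))
```

With the breaks encoded this way, a splitter that counts '\f' characters can assign a page number to any chunk, even though positions inside a multi-break paragraph are only approximate.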