* fix: Prevent the usage of `set_input_type(s)` when the `run` method doesn't have kwargs,
raise if `set_input_type(s)` overrides `run` method parameters
* fix: update components and tests
* reno
* Deprecate max_loops_allowed in favour of new argument max_runs_per_component
* Add missing test file
* Some enhancements
* Add the version that will remove the deprecated argument
* feat: adds support for zero-shot document classification (#7669)
Also supports multi-label classification
* pytests for zero shot document classification
* release note
* added licence info to py scripts
* updated the format of licence info
* Added docstring and example code
* addressed review points highlighted in the PR
* Applied suggestions from doc string review
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* fixed pytest for init
* added output type
* added test for pipeline (de-) serialization
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Initial implementation of ChatMessage copy and deepcopy
* Add reno release note
* Satisfy hawkeye
* Remove copy and deepcopy, no need to complicate things
* Add new reno note
* Add unit test
* feat: Extend core component machinery to support an async run method
* Add reno
* Fix incorrect docstring
* Make `async_run` a coroutine
* Make `supports_async` a dunder field
* fix: extract page breaks from .docx files
Context: Currently, DOCXToDocument does not extract page breaks from
Word documents. This makes it impossible to split by page or to get
correct page number metadata after using a component such as
DocumentSplitter. For example, if you split by word, the 'page_number'
metadata field will be 1 for all documents.
Solution: Added a method to DOCXToDocument that extracts page breaks
from Word documents as '\f' characters so that they are recognized by
DocumentSplitter.
Caveat: Due to the way the python-docx library is set up, only the
location of the first page break in a given paragraph can be determined
accurately. In the rare case that a paragraph contains more than one
page break (which means it is an extremely long paragraph spanning
multiple pages), the locations of the 2nd, 3rd, etc. page breaks are
not known. As a workaround, the remaining page break characters are
appended to the end of the paragraph text so that the overall page
number values for the document stay consistent.
* Apply suggestions from code review
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
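The page break handling described in the .docx fix above can be illustrated with a minimal sketch. It does not use python-docx and is not the DOCXToDocument implementation: the `Paragraph` dataclass, its fields, and both helper functions are hypothetical names introduced only to show the idea that the first break is placed at its known offset while any further breaks are appended to the end of the paragraph.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paragraph:
    """Hypothetical stand-in for a parsed .docx paragraph (not a python-docx type)."""

    text: str
    first_break_offset: Optional[int] = None  # offset of the first page break, if known
    total_breaks: int = 0  # number of page breaks inside the paragraph


def paragraph_with_page_breaks(p: Paragraph) -> str:
    """Insert the first page break at its offset and append the rest at the end."""
    if p.total_breaks == 0 or p.first_break_offset is None:
        return p.text
    head = p.text[: p.first_break_offset]
    tail = p.text[p.first_break_offset :]
    # Only the first break has a known position; the others go to the end
    # so that the total number of '\f' characters stays correct.
    return head + "\f" + tail + "\f" * (p.total_breaks - 1)


def document_text(paragraphs: List[Paragraph]) -> str:
    """Join paragraphs into one string with '\f' page break markers."""
    return "\n".join(paragraph_with_page_breaks(p) for p in paragraphs)


paragraphs = [
    Paragraph("Page one text."),
    Paragraph("Ends page one. Starts page two.", first_break_offset=15, total_breaks=1),
    Paragraph("A very long paragraph spanning pages.", first_break_offset=7, total_breaks=3),
]
text = document_text(paragraphs)
page_count = text.count("\f") + 1  # one more page than there are breaks
```

Under these assumptions, the total page count stays correct even though the breaks after the first one in a paragraph are not at their true positions.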
* fix ChatPromptBuilder `from_dict` if template=None
* leave template None
---------
Co-authored-by: Marie-Luise Klaus <marieluise.klaus@deepset.ai>
* feat: add unicode normalization & ascii_only mode for DocumentCleaner.
* feat: add unicode_normalization parameter validation to DocumentCleaner.
* test: fix the unit test to work after code linting.
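As a rough illustration of what Unicode normalization with parameter validation and an ASCII-only mode can look like, here is a sketch built only on the standard library `unicodedata` module. It is not the DocumentCleaner code: the function names `normalize_text` and `to_ascii_only` and the `form` argument are assumptions made for this example.

```python
import unicodedata

# The four normalization forms accepted by unicodedata.normalize.
VALID_FORMS = {"NFC", "NFKC", "NFD", "NFKD"}


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Normalize text to the given Unicode form, rejecting unknown forms."""
    if form not in VALID_FORMS:
        raise ValueError(f"form must be one of {sorted(VALID_FORMS)}, got {form!r}")
    return unicodedata.normalize(form, text)


def to_ascii_only(text: str) -> str:
    """Decompose characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


ligature_fixed = normalize_text("\ufb01le", "NFKC")  # 'fi' ligature becomes "fi"
ascii_text = to_ascii_only("caf\u00e9")  # accent is stripped
```

Note that the ASCII-only step is lossy: characters with no ASCII decomposition are removed rather than transliterated.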
* Start adding model and tokenizer kwargs support
* Add model and tokenizer kwargs to doc embedder
* Some updates and fixes in tests
* Fix more tests
* Fix tests
* Add release note
* Fix test
* Add from_dict tests
* Fix TikaConverter not emitting the \f page tag: parse in HTML mode, then convert the HTML to text, using the old Haystack 1.x integration as a template.
* Add Reno
* Fix test by making Mock Tika return XML (before parsing)
* refinements and test
---------
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
* Fix issue that could lead to remote code execution (RCE) when using insecure Jinja templates
* Add comment explaining exception suppression
* Update release note
* Update release note
* Fix bug in DocumentSplitter and expand tests to catch said bug
* Fix split overlap information calculation and add a test for it
* Add release notes
* Remove comments
* Same fix in SentenceWindowRetrieval
---------
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* Pin structlog to 24.2.0 due to unit test failures
* Remove object init parameter in huggingface_hub unit tests
* Use less restrictive structlog pin
* Add release note