* feat: Extend core component machinery to support an async run method
* Add reno
* Fix incorrect docstring
* Make `async_run` a coroutine
* Make `supports_async` a dunder field
* fix: extract page breaks from .docx files
Context: Currently, DOCXToDocument does not extract page breaks from
Word documents. This makes it impossible to split by page or to get
correct page number metadata after using a component such as
DocumentSplitter. For example, if you split by word, the 'page_number'
metadata field will be 1 for all documents.
Solution: Added a method to DOCXToDocument that extracts page breaks
from Word documents as '\f' characters so that they are recognized by
DocumentSplitter.
Caveat: Because of how the python-docx library exposes page breaks, only
the location of the first page break in a given paragraph can be
determined accurately. In the rare case that a paragraph contains more
than one page break (meaning it is an extremely long paragraph spanning
multiple pages), the locations of the 2nd, 3rd, etc. page breaks are not
known. As a workaround, I appended the page break characters to the end
of the paragraph text so that the overall page number values for the
document stay consistent.
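A minimal sketch of why the '\f' markers matter downstream, assuming plain-Python helpers: `split_by_page` and `page_number_at` are hypothetical names used only for illustration, not part of DOCXToDocument or DocumentSplitter. Counting the form feeds that precede an offset is enough to recover a 1-based page number.

```python
from typing import List

PAGE_BREAK = "\f"  # form feed, the page separator described above


def split_by_page(text: str) -> List[str]:
    """Split converted text into one string per page (hypothetical helper)."""
    return text.split(PAGE_BREAK)


def page_number_at(text: str, offset: int) -> int:
    """Return the 1-based page number of the character at `offset`.

    Every form feed before `offset` marks the end of one page.
    """
    return text.count(PAGE_BREAK, 0, offset) + 1


# Illustrative text, as it might look after conversion.
sample = "Intro paragraph.\fSecond page text.\fThird page text."
pages = split_by_page(sample)
third_page_number = page_number_at(sample, sample.index("Third"))
```

Because only the number of '\f' characters before an offset is counted, appending the extra breaks to the end of a long paragraph may misplace text within that paragraph, but every later paragraph still lands on the right page.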
* Apply suggestions from code review
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* fix ChatPromptBuilder from dict if template=None
* leave template None
---------
Co-authored-by: Marie-Luise Klaus <marieluise.klaus@deepset.ai>
* feat: add unicode normalization & ascii_only mode for DocumentCleaner.
* feat: add unicode_normalization parameter validation to DocumentCleaner.
* test: fix the unit test to work after code linting.
* Start adding model and tokenizer kwargs support
* Add model and tokenizer kwargs to doc embedder
* Some updates and fixes in tests
* Fix more tests
* Fix tests
* Add release note
* Fix test
* Add from_dict tests
* Fix TikaConverter not emitting \f page separators: parse in HTML mode, then convert the HTML to text, using the old Haystack 1.x integration as a template.
* Add Reno
* Fix test by making Mock Tika return XML (before parsing)
* refinements and test
---------
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
* Fix issue that could lead to remote code execution (RCE) when using insecure Jinja templates
* Add comment explaining exception suppression
* Update release note
* Fix bug in DocumentSplitter and expand tests to catch said bug
* Fix split overlap information calc and actually test it
* Add release notes
* Remove comments
* Same fix in SentenceWindowRetrieval
---------
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* Pin structlog to 24.2.0 due to unit test failures
* Remove object init parameter in huggingface_hub unit tests
* Use less restrictive structlog pin
* Add release note
* Fix bug in Pipeline.run() that executed Components in an unexpected order
* Update haystack/core/pipeline/base.py
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Move utility functions from _enqueue_next_runnable_component (#7895)
* Isolate logic to check if we're stuck in a loop
* Simplify for else
* Add missing return in docstring
* Emit warning when stuck in a loop
* Fix docstring
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Add utility function to move Components in queues
* Add function to find next Component to run
* Comment update
* Add missing break in loop
* Make _add_missing_input_defaults less error prone and add tests
* Fix tests
* Update docstring
* Simplify enqueue logic
* Remove unused _enqueue_next_runnable_component function
* Add method to find Component with lazy variadic input or all inputs with defaults
* Simplify _find_next_runnable_lazy_variadic_or_default_component
* Remove unnecessary type ignore
* Split _dequeue_components_that_received_no_input into separate functions
* Fix linting
* Simplify variadic check when running Component
* Simplify code
* Reorganize functions used by Pipeline.run
* Rename variables used in Pipeline.run() for clarity
* Add comment clarifying last_waiting_queue and before_last_waiting_queue
* Add functions to easily update waiting_queue
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* initial support for api_params
* add tests and reno
* resolve suggestions and add integration test
* fix mypy
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* initial import
* adding tests
* adding license and release notes
* adding missing release notes
* working with any type of doc store
* nit
* adding get_class_object to serialization package
* nit
* refactoring get_class_object()
* changing type and var names
* more refactoring
* Update haystack/core/serialization.py
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* Update haystack/core/serialization.py
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* Update test/core/test_serialization.py
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* more refactoring
* Pydoc syntax
---------
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* Fix from_dict to work if device isn't provided in init params
* Minor refactoring of from_dict for components that load HF models
* Add tests
* Update tests to test loading with all default parameters
* Add more tests
* Add release notes
* Add unit test for whisper local
* Update reno
* Add fix for ExtractiveReader
* Fix NamedEntityExtractor
* Fix default value for huggingface_pipeline_kwargs
* Add reno note
* Update HuggingFaceLocalGenerator.from_dict to use the same logic as HuggingFaceLocalChatGenerator.from_dict
* Update tests slightly
* Update release note
max_retries: if not set, it is read from the OPENAI_MAX_RETRIES
env variable, falling back to 5.
timeout: if not set, it is read from the OPENAI_TIMEOUT
env variable, falling back to 30.
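The fallback order described above (explicit argument, then environment variable, then default) can be sketched as follows. This is an illustration under assumptions, not the actual component code: `resolve_max_retries` and `resolve_timeout` are hypothetical helper names, and treating the timeout as a float is an assumption.

```python
import os
from typing import Optional


def resolve_max_retries(max_retries: Optional[int] = None) -> int:
    """Explicit value wins; else OPENAI_MAX_RETRIES; else 5 (hypothetical helper)."""
    if max_retries is None:
        max_retries = int(os.environ.get("OPENAI_MAX_RETRIES", "5"))
    return max_retries


def resolve_timeout(timeout: Optional[float] = None) -> float:
    """Explicit value wins; else OPENAI_TIMEOUT; else 30 (hypothetical helper)."""
    if timeout is None:
        # Assumption: the timeout is parsed as a float.
        timeout = float(os.environ.get("OPENAI_TIMEOUT", "30"))
    return timeout
```

Checking `is None` rather than truthiness lets a caller pass an explicit 0 without it being replaced by the environment value or the default.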
Signed-off-by: Nitanshu Vashistha <nitanshu.vzard@gmail.com>