* Initial commit for csv cleaner
* Add release notes
* Update lineterminator
* Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* alphabetize
* Use lazy import
* Some refactoring
* Some refactoring
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* add component checks
* pipeline should run deterministically
* add FIFOQueue
* add agent tests
* add order dependent tests
* run new tests
* remove code that is not needed
* test: intermediate outputs from cycle are available outside cycle
* add tests for component checks (Claude)
* adapt tests for component checks (o1 review)
* chore: format
* remove tests that aren't needed anymore
* add _calculate_priority tests
* revert accidental change in pyproject.toml
* test format conversion
* adapt to naming convention
* chore: proper docstrings and type hints for PQ
* format
* add more unit tests
* rm unneeded comments
* test input consumption
* lint
* fix: docstrings
* lint
* format
* format
* fix license header
* fix license header
* add component run tests
* fix: pass correct input format to tracing
* fix types
* format
* format
* types
* add defaults from Socket instead of signature
- otherwise components with dynamic inputs would fail
* fix test names
* still wait for optional inputs on greedy variadic sockets
- mirrors previous behavior
* fix format
* wip: warn for ambiguous running order
* wip: alternative warning
* fix license header
* make code more readable
Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
* Introduce content tracing to a behavioral test
* Fixing linting
* Remove debug print statements
* Fix tracer tests
* remove print
* test: test for component inputs
* test: remove testing for run order
* chore: update component checks from experimental
* chore: update pipeline and base from experimental
* refactor: remove unused method
* refactor: remove unused method
* refactor: outdated comment
* refactor: inputs state is updated as side effect
- to prepare for AsyncPipeline implementation
* format
* test: add file conversion test
* format
* fix: original implementation deepcopies outputs
* lint
* fix: from_dict was updated
* fix: format
* fix: test
* test: add test for thread safety
* remove unused imports
* format
* test: FIFOPriorityQueue
* chore: add release note
* fix: resolve merge conflict with mermaid changes
* fix: format
* fix: remove unused import
* refactor: rename to avoid accidental conflicts
* chore: remove unused inputs, add missing license header
* chore: extend release notes
* Update releasenotes/notes/fix-pipeline-run-2fefeafc705a6d91.yaml
Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
* fix: format
* fix: format
* Update release note
---------
Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode
* refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility reasons
* docs: added explanation for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder
* test: added tests for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder
* doc: removed empty lines from docstrings of SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder
* refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility (part II.)
* HF API Embedders: refactoring
* rename variables
* rm leftovers
* rm pin
* rm unused import
* relnote
* warning with truncate/normalize and serverless inference API
* test that warnings are raised
* compress graph data to support pako endpoint
* support mermaid.ink parameters and custom servers
* don't try to resolve conflicts with the github web ui...
* avoid double graph copy
* fixing typing, improving docstrings and release notes
* reverting type
* nit - force type checker no cache
* nit - force type checker no cache
---------
Co-authored-by: Ulises M <ulises@lbux.org>
Co-authored-by: Ulises M <30765968+lbux@users.noreply.github.com>
* fix: callables can be deserialized from fully qualified import path
* fix: license header
* fix: format
* fix: types
* fix? types
* test: extend test case
* format
* add release notes
* compress graph data to support pako endpoint
* Update haystack/core/pipeline/draw.py
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* Update haystack/core/pipeline/draw.py
Co-authored-by: David S. Batista <dsbatista@gmail.com>
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
The pyright language server is now able to resolve the import and provide completions for the component.
Co-authored-by: Michele Pangrazzi <xmikex83@gmail.com>
* updated DocumentSplitter
issue #8741
* release note
* updated DocumentSplitter
in the _create_docs_from_splits function, initialize a new variable copied_meta instead of overwriting meta
* added test
test_duplicate_pages_get_different_doc_id
* fix fmt
---------
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
* initial import
* adding double new lines between container_texts so that passages can be detected
* reducing type specification to avoid import error
* adding release notes
* renaming variable
* fix: PDFMinerToDocument initializes documents with content and meta
* add release note
* Apply suggestions from code review
Co-authored-by: David S. Batista <dsbatista@gmail.com>
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* initial import
* adding initial version + tests
* adding more tests
* more tests
* incorporating SentenceSplitter based on NLTK
* adding more tests
* adding release notes
* adding LICENSE header
* removing unused imports
* fixing example docstring
* adding docstrings
* fixing tests and returning a dictionary
* updating release notes
* attending PR comments
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip: updating tests for split_idx_start and _split_overlap
* adding tests for split_idx and split_start and overlaps
* adjusting file for LICENSE checking
* adding more tests
* adding tests for page numbering
* adding tests for min split lengths and falling back to character-level chunking based on size
* fixing linting issue
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip
* wip
* updating tests
* wip: fixing all tests after changes
* more tests
* wip: debugging sentence overlap
* wip: debugging page number
* wip
* wip; fixed bug with sentence tokenizer, needs to keep white spaces
* adding tests for counting pages on different split approaches
* NLTK checks done on SentenceSplitter
* fixing types
* adding detection for full overlap with previous chunks
* fixing types
* improving docstring
* improving docstring
* adding custom length, 'character' use case
* customising overlap function for word and adding a few tests
* updating docstring
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip: adding more tests for word unit length
* fix
* feat: `Tool` dataclass - unified abstraction to represent tools (#8652)
* draft
* del HF token in tests
* adaptations
* progress
* fix type
* import sorting
* more control on deserialization
* release note
* improvements
* support name field
* fix chatpromptbuilder test
* port Tool from experimental
* release note
* docs upd
* Update tool.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* fix: fix deserialization issues in multi-threading environments (#8651)
* adding 'word' as default length
* fixing types
* handling both default strategies
* wip
* \f was not being counted properly
* updating tests
* fixing the overlap bug
* adding more tests
* refactoring _apply_overlap
* further refactoring
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* adding ticks to close code block
* fixing comments
* applying changes: split with space and force keep_white_spaces=True
* fixing some tests and replacing count words approach in more places
* keep_white_spaces = True only if not defined
* cleaning docs
* handling some more edge cases, when split is still too big and all separators have been tried
* fixing fallback whitespaces count to fixed word/char split based on split size
* cleaning
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>
* Add draft of the Excel To Document converter
* Add license header
* Add release note
* Use Union instead of pipe
* Add openpyxl as additional dep
* Fix zip issue
* few updates from Bijay
* Update deps
* Add markdown test
* Adding more example excels and expanding tests
* Added more tests
* Fix windows test by setting lineterminator
* Addressing PR comments
* PR comments
* Fix linting
* reorganize docstore test suite to isolate dataframe tests
* improve docstring
* include FilterDocumentsTestWithDataframe in InMemoryDocumentStore tests
* message conversion function
* hfapi w tools
* right test file + hf_hub version
* release note
* fix for new chatmessage; serialize chat_template
* feedback
* draft
* del HF token in tests
* adaptations
* progress
* fix type
* import sorting
* more control on deserialization
* release note
* improvements
* support name field
* fix chatpromptbuilder test
* Update chat_message.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>