* Add draft of the Excel To Document converter
* Add license header
* Add release note
* Use Union instead of pipe
* Add openpyxl as additional dep
* Fix zip issue
* few updates from Bijay
* Update deps
* Add markdown test
* Adding more example excels and expanding tests
* Added more tests
* Fix windows test by setting lineterminator
* Addressing PR comments
* PR comments
* Fix linting
* draft
* del HF token in tests
* adaptations
* progress
* fix type
* import sorting
* more control on deserialization
* release note
* improvements
* support name field
* fix chatpromptbuilder test
* port Tool from experimental
* release note
* docs upd
* Update tool.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Add JSONConverter Component
* Handle some corner cases
* Add JSONConverter to pydoc config
* Add a way to extract all non content fields as metadata
* Small fix in docstring
* Fix tests
* docstrings upd
* Update json.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Port NLTKDocumentSplitter from dC to Haystack
* Improve pydocs
* Use haystack logging
* Add NLTKDocumentSplitter to __init__.py
* Use haystack logging, rename test classes
* Fixing _needs_join return
* Linting
* PR feedback
* More static methods
* Increase test coverage
* Compile pattern
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Pin structlog to 24.2.0 due to unit test failures
* Remove object init parameter in huggingface_hub unit tests
* Use less restrictive structlog pin
* Add release note
* ruff settings
enable ruff format and re-format outdated files
feat: `EvaluationRunResult` add parameter to specify columns to keep in the comparative `Dataframe` (#7879)
* adding param to explicitly state which cols to keep
* adding param to explicitly state which cols to keep
* adding param to explicitly state which cols to keep
* updating tests
* adding release notes
* Update haystack/evaluation/eval_run_result.py
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Update releasenotes/notes/add-keep-columns-to-EvalRunResult-comparative-be3e15ce45de3e0b.yaml
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* updating docstring
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
add format-check
fail on format and linting failures
fix string formatting
reformat long lines
fix tests
fix typing
linter
pull from main
* reformat
* lint -> check
* lint -> check
* clean up default env and add reno script
* update contributions guidelines
* use test script
* format
* re-add missing dep
* remove black in favour of ruff
* first functioning DocxFileToDocument
* fix lazy import message
* add reno
* Add license header
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* change DocxFileToDocument to DocxToDocument
* Update library install to the maintained version
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* clean try-except to only take non haystack errors into account
* Add warning on docstring of component ignoring page breaks, mark test as skip
* make warnings lazy evaluations
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* make warnings lazy evaluations
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Make warnings lazy evaluated
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Solve f bug
* Get more metadata from docx files
* add 'python-docx' dependency and docs
* Change logging import
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Fix typo
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* remake metadata extraction for docx
* solve bug regarding _get_docx_metadata method
* Update haystack/components/converters/docx.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/converters/docx.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Delete unused test
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Add first pass at PPTXToDocument converter
* Add test and update code
* Add doc string
* Update docstrings
* Add release notes
* remove unused imports, add to api docs, update pyproject.toml
* Add a new test
* Add dep so tests can run
* incorporating better bm25 impl without breaking interface
* all three bm25 algos
* 1. setting algo post-init not allowed; 2. remove extra underscore for naming consistency; 3. remove unused import
* 1. rename attribute name for IDF computation 2. organize document statistics as a dataclass instead of tuple to improve readability
* fix score type initialization (int -> float) to pass mypy check
* release note included
* fixing linting issues and mypy
* fixing tests
* removing heapq import and cleaning up logging
* changing indexing order
* adding more tests
* increasing tests
* removing rank_bm25 from pyproject.toml
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* Update huggingface_hub classes used after library upgrade
* Fix chat tests
* Update lazy import guard and other references to huggingface_hub>=0.23.0
* In huggingface_hub 0.23.0 TextGenerationOutput property details is now optional
* More fixes
* Add reno note
* Initial commit pdfminer converter
* Revert back naming of argument all_text per pdfminer documentation
* Add the component decorator
* Add release notes
* Reformat code with black
* Remove LTPage and comments
* Update dependencies in pyproject.toml
* Added some tests and incorporated reference doc in docstring
* Added some tests and incorporated reference doc in docstring
* initial import
* wip
* cleaning up tests
* fixing tests
* adding context relevance
* reverting some wrong changes due to PyCharm error in refactoring
* building eval pipeline only once
* handling mypy issues
* tests: import test for missing libraries
* build: add missing dependencies
* refactor: use glob instead of tree walk
* test: extract constants + more documentation