* Add draft of the Excel To Document converter
* Add license header
* Add release note
* Use Union instead of pipe
* Add openpyxl as additional dep
* Fix zip issue
* few updates from Bijay
* Update deps
* Add markdown test
* Adding more example excels and expanding tests
* Added more tests
* Fix windows test by setting lineterminator
* Addressing PR comments
* PR comments
* Fix linting
* Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter
* Add logging statement for PDFMinerToDocument as well
* Add tests
* Remove unused line
* Remove unused line
* add reno
* Add in PDF file
* Update checks in PDF converters and add tests for document splitter
* Revert
* Remove line
* Fix comment
* Make mypy happy
* Make mypy happy
* fix: make `from_dict` of `PyPDFToDocument` more robust
* chore: drop trailing space
* converting method to static and making the comment shorter
* reverting method to static
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* Add JSONConverter Component
* Handle some corner cases
* Add JSONConverter to pydoc config
* Add a way to extract all non content fields as metadata
* Small fix in docstring
* Fix tests
* docstrings upd
* Update json.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* fix: extract page breaks from .docx files
Context: Currently, DOCXToDocument does not extract page breaks from
word documents. This makes it impossible to do things like split by page
or get correct page number metadata after using something like
DocumentSplitter. For example, if you split by word, the 'page_number'
metadata field will be 1 for all documents.
Solution: Added a method to DOCXToDocument that extracts page breaks
from word documents as '\f' characters so that they are recognized by
DocumentSplitter.
Caveat: Due to the way the python-docx library is set up, you can only
accurately determine the location of the first page break for a given
paragraph. In the rare case that a paragraph contains more than one page
break (which means it is an extremely long paragraph spanning multiple
pages), the 2nd, 3rd, etc. page break locations are not known. To sort
of fix this, I just appended the page break characters to the end of
the paragraph text to keep the overall page number values for the
document consistent.
* Apply suggestions from code review
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Fix TikaConverter not having \f page tag by using HTML mode of parsing and then parsing the HTML to text using the old Haystack 1.X integration as template.
* Add Reno
* Fix test by making Mock Tika return XML (before parsing)
* refinements and test
---------
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
* Some changes
Use tests file path
* Update tests
* Add another unit test
* Shorten _get_docx_metadata
* Update tests
* Remove try block
* Add a dataclass
* Add a to dict unit test
* Remove unused import
* Add release notes
* Update docstrings
* Use optional instead of pipe
* Update docstring
* Remove file
* first fucntioning DocxFileToDocument
* fix lazy import message
* add reno
* Add license headder
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* change DocxFileToDocument to DocxToDocument
* Update library install to the maintained version
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* clan try-exvept to only take non haystack errors into account
* Add wanring on docstring of component ignoring page brakes, mark test as skip
* make warnings lazy evaluations
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* make warnings lazy evaluations
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Make warnings lazy evaluated
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Solve f bug
* Get more metadata from docx files
* add 'python-docx' dependency and docs
* Change logging import
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Fix typo
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* remake metadata extraction for docx
* solve bug regarding _get_docx_metadata method
* Update haystack/components/converters/docx.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/converters/docx.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Delete unused test
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Add first pass at PPTXToDocument converter
* Add test and update code
* Add doc string
* Update docstrings
* Add release notes
* remove unused imports, add to api docs, update pyproject.toml
* Add a new test
* Add dep so tests can run
* Initial commit pdfminer converter
* Revert back naming of argument all_text per pdfminer documentation
* Add the component decorator
* Add release notes
* Reformat code with black
* Remove LTPage and comments
* Update dependencies in pyproject.toml
* Added some tests and incorporated reference doc in docstring
* Added some tests and incorporated reference doc in docstring
* Initial commit
* Remove old mock tests
* Fix current_last_page_number calculation
* Carry over unit tests from the other side
* Update pydocs, skip failing tests
* Fix pylint and mypy
* Minor adjustments
* Add release note
* Minor touch ups
* Resolve Document unique id issue by using custom id calculation
* Better hashing, add unit tests
* Small fixes