62 Commits

Author SHA1 Message Date
Sebastian Husch Lee
28ad78c73d
feat: Add XLSXToDocument converter (#8522)
* Add draft of the Excel To Document converter

* Add license header

* Add release note

* Use Union instead of pipe

* Add openpyxl as additional dep

* Fix zip issue

* few updates from Bijay

* Update deps

* Add markdown test

* Adding more example excels and expanding tests

* Added more tests

* Fix windows test by setting lineterminator

* Addressing PR comments

* PR comments

* Fix linting
2025-01-09 09:03:19 +01:00
Michele Pangrazzi
21d53d0ec6
update default value of 'store_full_path' to False in converters (#8619) 2024-12-10 16:03:38 +01:00
Michele Pangrazzi
b32f85cca2
remove deprecated 'converter' init parameter from PyPDFToDocument component (#8609) 2024-12-06 15:43:43 +01:00
Amna Mubashar
4c8eb54049
feat: Add store_full_path to converters (3/3) (#8585)
* Add store_full_path params
2024-12-03 13:48:56 +05:00
Stefano Fiorucci
fb42c035c5
feat: PyPDFToDocument - add new customization parameters (#8574)
* deprecate converter in pypdf

* fix linting of MetaFieldGroupingRanker

* linting

* pypdftodocument: add customization params

* fix mypy

* incorporate feedback
2024-11-26 16:37:59 +01:00
Amna Mubashar
9302d3d9f0
feat: Add store_full_path to converters (2/3) (#8573) 2024-11-25 15:22:19 +05:00
Amna Mubashar
21906d0558
feat: Add store_full_path to converters (1/3) (#8566)
* Add store_full_path param to 3 converters
2024-11-22 13:55:08 +01:00
Sebastian Husch Lee
911f3523ab
feat: Increase logging transparency for empty Documents during conversion (#8509)
* Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter

* Add logging statement for PDFMinerToDocument as well

* Add tests

* Remove unused line

* Remove unused line

* add reno

* Add in PDF file

* Update checks in PDF converters and add tests for document splitter

* Revert

* Remove line

* Fix comment

* Make mypy happy

* Make mypy happy
2024-11-04 09:26:57 +01:00
Madeesh Kannan
33675b4caf
chore: Remove deprecated DefaultConverter for PyPDFToDocument (#8501)
* chore: Remove deprecated `DefaultConverter` for `PyPDFToDocument`

* Remove unused imports

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-10-29 16:42:48 +00:00
Vladimir Blagojevic
28161f7bb9
feat: DOCXToDocument: add table extraction (#8457)
* DOCXToDocument: add table extraction

* Add reno note

* mypy fixes

* add unit tests

* Add csv table support

* Update release note

* Add TableFormat enum

* Add table_format as str init param

* Update docx.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* PR feedback

* PR feedback

---------

Co-authored-by: medsriha <medsriha@gmail.com>
Co-authored-by: Mo Sriha <22803208+medsriha@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-10-29 16:20:27 +01:00
Madeesh Kannan
e7bfd80f3b
fix: (Temporarily) Re-add support for pre-2.6.0 YAMLs with PyPDFConverter (#8443) 2024-10-08 14:35:43 +02:00
Madeesh Kannan
ee89f6ad57
fix: PyPDFToDocument correctly serializes custom converters, deprecate DefaultConverter (#8430)
* fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter`

* Remove `auto` prefix from serde util function names, add unit tests
2024-10-01 16:35:38 +02:00
Silvano Cerza
d6f073f9b3
Revert "fix: make pypdf converter more robust (#8427)" (#8428)
This reverts commit d234c75168dcb49866a6714aa232f37d56f72cab.
2024-10-01 11:55:25 +02:00
Tobias Wochinger
d234c75168
fix: make pypdf converter more robust (#8427)
* fix: make `from_dict` of `PyPDFToDocument` more robust

* chore: drop trailing space

* converting method to static and making the comment shorter

* reverting method to static

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-09-30 16:47:23 +00:00
Silvano Cerza
29672d4b42
feat: Add JSONConverter Component (#8397)
* Add JSONConverter Component

* Handle some corner cases

* Add JSONConverter to pydoc config

* Add a way to extract all non content fields as metadata

* Small fix in docstring

* Fix tests

* docstrings upd

* Update json.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2024-09-25 12:34:51 +02:00
Sriniketh J
e98a6fea04
Converter: CSVToDocument (#8328)
* carry forwarded initial commit

* fix: doc strings

* fix: update docstrings

* fix: docstring update

* fix: csv encoding in actions

* fix: line endings through hooks

* fix: converter docs addition
2024-09-06 10:59:12 +02:00
Silvano Cerza
3e3f79b928
feat: Add unsafe init arg in ConditionalRouter and OutputAdapter to enable previous behaviour (#8176)
* Add unsafe behaviour to OutputAdapter

* Add unsafe behaviour to ConditionalRouter

* Add release notes

* Fix mypy

* Add documentation links

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-09-02 14:14:54 +00:00
Stefano Fiorucci
2e619f06c8
fix: make meta produced by DOCXToDocument JSON serializable (#8263)
* make meta from DOCXToDocument JSON serializable

* unused import

* update docstrings
2024-08-22 12:24:32 +00:00
Jon Strutz
471f07c8fe
fix: extract page breaks from .docx files (#8232)
* fix: extract page breaks from .docx files

Context: Currently, DOCXToDocument does not extract page breaks from
Word documents. This makes it impossible to split by page or to get
correct page number metadata after using something like
DocumentSplitter. For example, if you split by word, the 'page_number'
metadata field will be 1 for all documents.

Solution: Added a method to DOCXToDocument that extracts page breaks
from Word documents as '\f' characters so that they are recognized by
DocumentSplitter.

Caveat: Due to the way the python-docx library is set up, only the
location of the first page break in a given paragraph can be determined
accurately. In the rare case that a paragraph contains more than one page
break (which means it is an extremely long paragraph spanning multiple
pages), the locations of the 2nd, 3rd, etc. page breaks are not known.
As a workaround, I appended the page break characters to the end of
the paragraph text to keep the overall page number values for the
document consistent.

* Apply suggestions from code review

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-08-21 09:48:02 +00:00
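Note on #8232: the page-break handling described in that commit message can be sketched roughly as below. This is a minimal illustration, not the code from the commit. The paragraph model (text, offset of the first break, total break count) and the names `paragraph_with_page_breaks`, `join_paragraphs`, and `page_number_at` are hypothetical, and it assumes the first break is inserted at its known offset while any further breaks are appended to the end of the paragraph.

```python
from typing import List, Optional, Tuple

# Hypothetical paragraph model (not the python-docx API):
# (text, offset of the first page break or None, total number of page breaks)
Paragraph = Tuple[str, Optional[int], int]


def paragraph_with_page_breaks(text: str, first_break: Optional[int], total_breaks: int) -> str:
    """Return the paragraph text with page breaks encoded as form feeds."""
    if total_breaks == 0 or first_break is None:
        return text
    # Only the first break has a known position, so place it exactly there.
    marked = text[:first_break] + "\f" + text[first_break:]
    # Assumption: the positions of later breaks are unknown, so append them
    # at the end to keep the overall page count of the document correct.
    return marked + "\f" * (total_breaks - 1)


def join_paragraphs(paragraphs: List[Paragraph]) -> str:
    """Join all paragraphs into one content string."""
    return "\n".join(paragraph_with_page_breaks(*p) for p in paragraphs)


def page_number_at(content: str, offset: int) -> int:
    """Page number (1-based) of a character offset, counting preceding form feeds."""
    return content[:offset].count("\f") + 1


paragraphs: List[Paragraph] = [
    ("Intro text.", None, 0),
    ("Page one ends here. Page two starts.", 20, 1),
    ("Long paragraph", 4, 3),
]
content = join_paragraphs(paragraphs)
```

Counting form feeds before an offset is one simple way a downstream splitter could derive a page number. With appended breaks, page numbers inside a very long paragraph may be too low, but every paragraph that follows is numbered correctly.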
Vladimir Blagojevic
3318d894c0
Add sede_with_list_output_type_in_pipeline unit test (#8196) 2024-08-13 14:37:24 +02:00
Marie-Luise Klaus
ec02817f14
fix: OutputAdapter from_dict with custom_filters None (#8173)
Co-authored-by: Marie-Luise Klaus <marieluise.klaus@deepset.ai>
2024-08-08 14:02:40 +02:00
Stefano Fiorucci
3d1ad10385
fix html test (#8127) 2024-07-31 10:59:53 +02:00
Corentin Meyer
1c53aae8f0
fix: Tika converter not yielding page break tags (\f) (#8082)
* Fix TikaConverter not emitting the \f page-break tag: parse in HTML mode, then convert the HTML to text, using the old Haystack 1.X integration as a template.

* Add Reno

* Fix test by making Mock Tika return XML (before parsing)

* refinements and test

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2024-07-26 20:13:47 +02:00
Madeesh Kannan
8faa3fa465
Revert "fix: make PyPDF backward compatible (#7996)" (#8014)
This reverts commit 58b48e36eb56a896365133ab4a9d8e327989948c.
2024-07-11 13:06:08 +00:00
Tobias Wochinger
58b48e36eb
fix: make PyPDF backward compatible (#7996)
* fix: make PyPDF backward compatible

* Add release note

---------

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2024-07-09 10:08:37 +02:00
Vladimir Blagojevic
0255422eb3
chore: Mark AzureOCRDocumentConverter test_run_with_pdf_file flaky (#7978)
* Disable AzureOCRDocumentConverter test_run_with_pdf_file on osx

* Mark test flaky instead

* Remove import
2024-07-04 16:36:32 +02:00
tstadel
aa46466894
fix: meta from ByteStream input for AzureOCRDocumentConverter (#7955)
* fix: meta from ByteStream input for AzureOCRDocumentConverter

* add test

* add reno

* fix test
2024-07-04 14:42:30 +02:00
Sebastian Husch Lee
6836079686
chore: Capitalize DOCX in DOCXToDocument converter (#7931)
* Capitalize DOCX in DOCXToDocument converter

* Update docstrings

* Update test class name

* add release notes
2024-06-27 08:19:01 +02:00
Stefano Fiorucci
c51f8ffb86
PyPDFToDocument: remove deprecated converter_name and CONVERTERS_REGISTRY (#7910) 2024-06-21 16:52:03 +02:00
Sebastian Husch Lee
3db56d9066
refactor: DocxToDocument update (#7857)
* Some changes

Use tests file path

* Update tests

* Add another unit test

* Shorten _get_docx_metadata

* Update tests

* Remove try block

* Add a dataclass

* Add a to dict unit test

* Remove unused import

* Add release notes

* Update docstrings

* Use optional instead of pipe

* Update docstring

* Remove file
2024-06-19 15:48:31 +02:00
Stefano Fiorucci
8de639bd70
DocxDocument forward reference (#7852) 2024-06-13 11:29:31 +02:00
Carlos Fernández
c1c339923f
feat: add DocxToDocument converter (#7838)
* first functioning DocxFileToDocument

* fix lazy import message

* add reno

* Add license header

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* change DocxFileToDocument to DocxToDocument

* Update library install to the maintained version

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* clean try-except to only take non-Haystack errors into account

* Add warning to docstring about the component ignoring page breaks, mark test as skip

* make warnings lazy evaluations

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* make warnings lazy evaluations

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Make warnings lazy evaluated

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Solve f bug

* Get more metadata from docx files

* add 'python-docx' dependency and docs

* Change logging import

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Fix typo

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* remake metadata extraction for docx

* solve bug regarding _get_docx_metadata method

* Update haystack/components/converters/docx.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/converters/docx.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Delete unused test

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-06-12 11:58:36 +02:00
Sebastian Husch Lee
2c2c7c9f56
feat: Add PPTXToDocument converter (#7808)
* Add first pass at PPTXToDocument converter

* Add test and update code

* Add doc string

* Update docstrings

* Add release notes

* remove unused imports, add to api docs, update pyproject.toml

* Add a new test

* Add dep so tests can run
2024-06-07 09:43:29 +00:00
Stefano Fiorucci
7181f6b7e9
feat: change HTML conversion backend from boilerpy3 to Trafilatura (#7705)
* change HTML conversion backend to Trafilatura

* rm unused var
2024-05-17 10:38:47 +02:00
Massimiliano Pippi
10c675d534
chore: add license header to all modules (#7675)
* add license header to modules
* check license header at linting time
2024-05-09 13:40:36 +00:00
Mo
2e35f13085
feat: add converter based on pdfminer (#7607)
* Initial commit pdfminer converter

* Revert back naming of argument all_text per pdfminer documentation

* Add the component decorator

* Add release notes

* Reformat code with black

* Remove LTPage and comments

* Update dependencies in pyproject.toml

* Added some tests and incorporated reference doc in docstring

* Added some tests and incorporated reference doc in docstring
2024-05-02 10:36:54 +02:00
Vladimir Blagojevic
988c360b6d
feat: Azure converter updates (#7409)
* Initial commit

* Remove old mock tests

* Fix current_last_page_number calculation

* Carry over unit tests from the other side

* Update pydocs, skip failing tests

* Fix pylint and mypy

* Minor adjustments

* Add release note

* Minor touch ups

* Resolve Document unique id issue by using custom id calculation

* Better hashing, add unit tests

* Small fixes
2024-04-09 09:45:06 +02:00
Vladimir Blagojevic
c3b96392fd
feat: Use all HTMLToDocument extractors until content is extracted (#7452)
* Use all HTMLToDocument extractors until content is extracted

* Add release note

* Minor doc update

* Improvements, unit test fixes

* Add try_others init param, update tests

* Update haystack/components/converters/html.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* PR feedback - Stefano

* Improve reno release note, add reference

* little fixes

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-04-05 16:02:34 +02:00
Stefano Fiorucci
6925e3a2e1
refactor!: Improve PyPDFToDocument (#7362)
* first draft

* rm kwargs from protocol

* Simplify

* no breaking changes

* reno

* one more test of the deprecated registry
2024-03-26 10:09:29 +01:00
Vladimir Blagojevic
0e7c41be5e
feat: Improve OpenAPIServiceToFunctions signature (#7257)
* Convert OpenAPIServiceToFunctions run interface
---------
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2024-03-04 14:38:58 +01:00
Vladimir Blagojevic
d871bbbfbd
feat: Add complex types in OpenAPI support (#7065)
* Add complex types OpenAPI support

* Add release note
---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2024-02-27 18:11:06 +01:00
Vladimir Blagojevic
cb6389d7a2
feat: Improve OpenAPI integration (#7034)
* Simplify and improve OpenAPIServiceConnector and OpenAPIServiceToFunctions, add unit tests

* Add reno note

* Add flask test dependency

* Initial PR feedback - Julian

* Remove indirection - Silvano

* Remove flask end-to-end tests

* Remove unused import

* Add mixed body unit test

* Update unit test, mock properly
2024-02-22 14:03:50 +01:00
Vladimir Blagojevic
8d46a2883e
feat: Make system_messages optional in OpenAPIServiceToFunctions run (#6825)
* Make system_messages optional in OpenAPIServiceToFunctions run

* Adjust unit test

* PR feedback Massi
2024-02-14 16:04:35 +01:00
Vladimir Blagojevic
6a776e672f
Add OutputAdapter serde for custom filters (#6985) 2024-02-13 16:56:43 +01:00
Vladimir Blagojevic
97a0df66d2
feat: Add OutputAdapter (#6936)
* Add OutputAdapter component
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2024-02-13 13:03:50 +01:00
Madeesh Kannan
27d1af3068
feat!: Use Secret for passing authentication secrets to components (#6887)
* feat!: Use `Secret` for passing authentication secrets to components

* Add comment to clarify type ignore
2024-02-05 13:17:01 +01:00
Madeesh Kannan
5d66d040cc
feat: Add serde methods to HTMLToDocument (#6758) 2024-01-18 10:02:01 +01:00
Sebastian Husch Lee
c0b67432e4
feat: Add page breaks to default PDF to Document converter (#6755)
* Speedup tests for PyPDFToDocument

* Added unit test and removed skipping of empty pages

* add release note

* Add back some integration marks
2024-01-18 08:54:59 +01:00
ZanSara
abd16ab796
feat: support single metadata dictionary in MarkdownToDocument (#6629)
* support single metadata dict in markdown2document

* reno

* unwrap list

* direct key access

* typing

* add explicit test
2024-01-09 14:44:39 +01:00
ZanSara
175b5baf45
feat: support single metadata dictionary in AzureOCRDocumentConverter (#6635)
* support single metadata dict in azureconverter

* reno

* tests

* Update releasenotes/notes/single-meta-in-azureconverter-ce1cc196a9b161f3.yaml
2024-01-09 10:49:37 +01:00