David S. Batista
2c84266d8f
test: adding test for PyPDF to extract passages so that they are detected by DocumentSplitter ( #8739 )
2025-01-17 10:56:16 +01:00
Julian Risch
dd9660f90d
fix: PyPDFToDocument initializes documents with content and meta ( #8698 )
...
* initialize document with content and meta
* update test
* add test checking that not only content is used for id generation
2025-01-09 19:12:10 +00:00
Michele Pangrazzi
21d53d0ec6
update default value of 'store_full_path' to False in converters ( #8619 )
2024-12-10 16:03:38 +01:00
Michele Pangrazzi
b32f85cca2
remove deprecated 'converter' init parameter from PyPDFToDocument component ( #8609 )
2024-12-06 15:43:43 +01:00
Amna Mubashar
4c8eb54049
feat: Add store_full_path to converters (3/3) ( #8585 )
...
* Add store_full_path params
2024-12-03 13:48:56 +05:00
Stefano Fiorucci
fb42c035c5
feat: PyPDFToDocument - add new customization parameters ( #8574 )
...
* deprecate converter in pypdf
* fix linting of MetaFieldGroupingRanker
* linting
* pypdftodocument: add customization params
* fix mypy
* incorporate feedback
2024-11-26 16:37:59 +01:00
Sebastian Husch Lee
911f3523ab
feat: Increase logging transparency for empty Documents during conversion ( #8509 )
...
* Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter
* Add logging statement for PDFMinerToDocument as well
* Add tests
* Remove unused line
* Remove unused line
* add reno
* Add in PDF file
* Update checks in PDF converters and add tests for document splitter
* Revert
* Remove line
* Fix comment
* Make mypy happy
* Make mypy happy
2024-11-04 09:26:57 +01:00
Madeesh Kannan
33675b4caf
chore: Remove deprecated DefaultConverter for PyPDFToDocument ( #8501 )
...
* chore: Remove deprecated `DefaultConverter` for `PyPDFToDocument`
* Remove unused imports
---------
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2024-10-29 16:42:48 +00:00
Madeesh Kannan
e7bfd80f3b
fix: (Temporarily) Re-add support for pre-2.6.0 YAMLs with PyPDFConverter ( #8443 )
2024-10-08 14:35:43 +02:00
Madeesh Kannan
ee89f6ad57
fix: PyPDFToDocument correctly serializes custom converters, deprecate DefaultConverter ( #8430 )
...
* fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter`
* Remove `auto` prefix from serde util function names, add unit tests
2024-10-01 16:35:38 +02:00
Silvano Cerza
d6f073f9b3
Revert "fix: make pypdf converter more robust ( #8427 )" ( #8428 )
...
This reverts commit d234c75168dcb49866a6714aa232f37d56f72cab.
2024-10-01 11:55:25 +02:00
Tobias Wochinger
d234c75168
fix: make pypdf converter more robust ( #8427 )
...
* fix: make `from_dict` of `PyPDFToDocument` more robust
* chore: drop trailing space
* converting method to static and making the comment shorter
* reverting method to static
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-09-30 16:47:23 +00:00
Madeesh Kannan
8faa3fa465
Revert "fix: make PyPDF backward compatible ( #7996 )" ( #8014 )
...
This reverts commit 58b48e36eb56a896365133ab4a9d8e327989948c.
2024-07-11 13:06:08 +00:00
Tobias Wochinger
58b48e36eb
fix: make PyPDF backward compatible ( #7996 )
...
* fix: make PyPDF backward compatible
* Add release note
---------
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2024-07-09 10:08:37 +02:00
Stefano Fiorucci
c51f8ffb86
PyPDFToDocument: remove deprecated converter_name and CONVERTERS_REGISTRY ( #7910 )
2024-06-21 16:52:03 +02:00
Massimiliano Pippi
10c675d534
chore: add license header to all modules ( #7675 )
...
* add license header to modules
* check license header at linting time
2024-05-09 13:40:36 +00:00
Stefano Fiorucci
6925e3a2e1
refactor!: Improve PyPDFToDocument ( #7362 )
...
* first draft
* rm kwargs from protocol
* Simplify
* no breaking changes
* reno
* one more test of the deprecated registry
2024-03-26 10:09:29 +01:00
Sebastian Husch Lee
c0b67432e4
feat: Add page breaks to default PDF to Document converter ( #6755 )
...
* Speedup tests for PyPDFToDocument
* Added unit test and removed skipping of empty pages
* add release note
* Add back some integration marks
2024-01-18 08:54:59 +01:00
ZanSara
c0f1dab454
feat: support single metadata dictionary in PyPDFToDocument ( #6615 )
...
* support single metadata dict in pypdf2document
* improve tests
* tests
* remove line
2023-12-22 14:13:11 +01:00
sahusiddharth
3d17e6ff76
changed metadata to meta ( #6605 )
2023-12-21 12:39:58 +01:00
Stefano Fiorucci
2f034d3c97
refactor!: Converters - standardize inputs ( #6540 )
...
* standardize converters inputs: first draft
* fix precommit
* fix precommit 2
* fix precommit 3
* add default for optional param
* rm leftover
* install boilerpy in linting workflow
* add boilerpy3 to the core dependencies
* add reno
* remove boilerpy3 installation from test workflow
* fix pylint: import order and unused import
* fix import order
* add release note
* better Tika docstring
* rm boilerpy from linting
* leftover
* md link brackets
* feat: Converters - allow passing `meta` in the `run` method ( #6554 )
* first impl for html
* progressing on other components
* fix test
* add tests - run with meta
* release note
* reintroduce patches wrongly deleted
* add patch in test
* fix tika test
* Update haystack/components/converters/azure.py
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* Update releasenotes/notes/converters-standardize-inputs-ed2ba9c97b762974.yaml
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* simplify test
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-12-15 16:41:35 +01:00
Massimiliano Pippi
7c05f37a53
remove unit marker ( #6450 )
2023-11-29 19:24:25 +01:00
Silvano Cerza
e6637f5ec2
Fix all tests
2023-11-24 14:48:43 +01:00
Massimiliano Pippi
8adb8bbab8
Remove preview folder in test/
...
---------
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 11:52:55 +01:00