Daniel Bichuetti
7c49fffc71
feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation (#4226)
* refactor: isolate PDF converters
* refactor: remove xpdf dependency and fix tests
* refactor: add min. version
* feat: enable multiprocessing and add tests
* fix: remove unused imports
* fix: regression introduced when code was moved
* refactor: use itertools
* fix: mypy claims
* refactor: double tool support
* refactor: add fallback to xpdf
* refactor: black formatting
* refactor: make superclass signature compatible
* refactor: complete removal of xPdf
* refactor: regroup Haystack imports and fix regression
* refactor: remove original declaration
* docs: fix docstrings
* tests: add [pdf] to [all]
* refactor: remove redundant checks, avoid extra processes
* refactor: add deprecation warning
* refactor: add pytest mark
* tests: change PDF test file
* fix: correct pytest mark
* refactor: deprecate parameter and add new
* tests: change pdf sample
* Add minor language changes to docstrings
* Fix default value in doc strings
* Update test/nodes/test_file_converter.py
Co-authored-by: bogdankostic <bogdankostic@web.de>
* tests: fix page count
* refactor: add imported function
* refactor: change default value
* tests: change parameters and fix typo
* Unify sort_by_position parameter names
---------
Co-authored-by: bogdankostic <bogdankostic@web.de>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-03-01 22:34:38 +01:00
bogdankostic
60224412bc
feat: Add headline extraction to ParsrConverter (#3488)
* Add headline extraction to ParsrConverter
* Add sample PDF file
* Add test
* Use extract_headlines if set in convert method
* Integrate PR feedback
2022-10-31 19:00:02 +01:00
Daniel Bichuetti
df1f4205b6
feat: add public layout-based extraction support on PDFToTextConverter (#3137)
* feat(PDFToTextConverter): add option to get text in physical layout order
* test: add physical layout extraction test to PDFToTextConverter
* refactor: change layout parameter attribution places
* docs: manually trigger pre-commits
* docs: generate new docs to comply with pydoc-markdown style
2022-09-13 16:55:21 +02:00
Daniel Augustus Bichuetti Silva
1706729e26
Prevent PDFToTextConverter from failing on PDFs with spaces in their names (#2786)
* Change split logic to list
* Fix wrong parameter for run
* Fix mypy error
* Fix layout/raw parameter
* Add test for filename with whitespace on PDFToText
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 13:30:33 +02:00
Tanay Soni
ef9e4f4467
Add PDF text extraction (#109)
2020-06-08 11:07:19 +02:00