5 Commits

Author SHA1 Message Date
Daniel Bichuetti
7c49fffc71
feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation (#4226)
* refactor: isolate PDF converters

* refactor: remove xpdf dependency and fix tests

* refactor: add min. version

* feat: enable multiprocessing and add tests

* fix: remove unused imports

* fix: regression when moved code

* refactor: use itertools

* fix: mypy claims

* refactor: double tool support

* refactor: add fallback to xpdf

* refactor: black formatting

* refactor: make superclass signature compatible

* refactor: complete removal of xPdf

* refactor: regroup Haystack imports and fix regression

* refactor: remove original declaration

* docs: fix docstrings

* tests: add [pdf] to [all]

* refactor: remove redundant checks, avoid extra processes

* refactor: add deprecation warning

* refactor: add pytest mark

* tests: change PDF test file

* fix: correct pytest mark

* refactor: deprecate parameter and add new

* tests: change pdf sample

* Add minor lg changes to docstrings

* Fix default value in doc strings

* Update test/nodes/test_file_converter.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* tests: fix page count

* refactor: add imported function

* refactor: change default value

* tests: change parameters and fix typo

* Unify sort_by_position parameter names

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-03-01 22:34:38 +01:00
bogdankostic
60224412bc
feat: Add headline extraction to ParsrConverter (#3488)
* Add headline extraction to ParsrConverter

* Add sample PDF file

* Add test

* Use extract_headlines if set in convert method

* Integrate PR feedback
2022-10-31 19:00:02 +01:00
Daniel Bichuetti
df1f4205b6
feat: add public layout-base extraction support on PDFToTextConverter (#3137)
* feat(PDFToTextConverter): add option to get text in physical layout order

* test: add physical layout extraction test to PDFToTextConverter

* refactor: change layout parameter attribution places

* docs: manually trigger pre-commits

* docs: generate new docs to comply with pydoc-markdown style
2022-09-13 16:55:21 +02:00
Daniel Augustus Bichuetti Silva
1706729e26
Prevent PDFToTextConverter from failing on PDFs with spaces in their names (#2786)
* Change split logic to list

* Fix wrong parameter for run

* Fix mypy error

* Fix layout/raw parameter

* Add test for filename with whitespaces on PDFToText

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 13:30:33 +02:00
Tanay Soni
ef9e4f4467
Add PDF text extraction (#109)
2020-06-08 11:07:19 +02:00