haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-12-30 08:37:20 +00:00

Author	SHA1	Message	Date
Daniel Bichuetti	7c49fffc71	feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation (#4226) * refactor: isolate PDF converters * refactor: remove xpdf dependency and fix tests * refactor: add min. version * feat: enable multiprocessing and add tests * fix: remove unused imports * fix: regression when moved code * refactor: use itertools * fix: mypy claims * refactor: double tool support * refactor: add fallback to xpdf * refactor: black formatting * refactor: make superclass signature compatible * refactor: complete removal of xPdf * refactor: regroup Haystack imports and fix regression * refactor: remove original declaration * docs: fix docstrings * tests: add [pdf] to [all] * refactor: remove redundant checks, avoid extra processes * refactor: add deprecation warning * refactor: add pytest mark * tests: change PDF test file * fix: correct pytest mark * refactor: deprecate parameter and add new * tests: change pdf sample * Add minor lg changes to docstrings * Fix default value in doc strings * Update test/nodes/test_file_converter.py Co-authored-by: bogdankostic <bogdankostic@web.de> * tests: fix page count * refactor: add imported function * refactor: change default value * tests: change parameters and fix typo * Unify sort_by_position parameter names --------- Co-authored-by: bogdankostic <bogdankostic@web.de> Co-authored-by: agnieszka-m <amarzec13@gmail.com>	2023-03-01 22:34:38 +01:00
bogdankostic	60224412bc	feat: Add headline extraction to `ParsrConverter` (#3488) * Add headline extraction to ParsrConverter * Add sample PDF file * Add test * Use extract_headlines if set in convert method * Integrate PR feedback	2022-10-31 19:00:02 +01:00
Daniel Bichuetti	df1f4205b6	feat: add public layout-base extraction support on PDFToTextConverter (#3137) * feat(PDFToTextConverter): add option to get text in physical layout order * test: add physical layout extraction test to PDFToTextConverter * refactor: change layout parameter attribution places * docs: manually trigger pre-commits * docs: generate new docs to comply with pydoc-markdown style	2022-09-13 16:55:21 +02:00
Daniel Augustus Bichuetti Silva	1706729e26	Prevent `PDFToTextConverter` from failing on PDFs with spaces in their names (#2786) * Change split logic to list * Fix wrong parameter for run * Fix mypy error * Fix layout/raw parameter * Add test for filename with whitespaces on PDFToText * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-11 13:30:33 +02:00
Tanay Soni	ef9e4f4467	Add PDF text extraction (#109)	2020-06-08 11:07:19 +02:00

5 Commits