4 Commits

Author SHA1 Message Date
bogdankostic
60224412bc
feat: Add headline extraction to ParsrConverter (#3488)
* Add headline extraction to ParsrConverter

* Add sample PDF file

* Add test

* Use extract_headlines if set in convert method

* Integrate PR feedback
2022-10-31 19:00:02 +01:00
Daniel Bichuetti
df1f4205b6
feat: add public layout-base extraction support on PDFToTextConverter (#3137)
* feat(PDFToTextConverter): add option to get text in physical layout order

* test: add physical layout extraction test to PDFToTextConverter

* refactor: change layout parameter attribution places

* docs: manually trigger pre-commits

* docs: generate new docs to comply with pydoc-markdown style
2022-09-13 16:55:21 +02:00
Daniel Augustus Bichuetti Silva
1706729e26
Prevent PDFToTextConverter from failing on PDFs with spaces in their names (#2786)
* Change split logic to list

* Fix wrong parameter for run

* Fix mypy error

* Fix layout/raw parameter

* Add test for filename with whitespaces on PDFToText

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 13:30:33 +02:00
Tanay Soni
ef9e4f4467
Add PDF text extraction (#109)
2020-06-08 11:07:19 +02:00