- revert the layout parser fast pdf file to original with just two pages
- add a new file that has one empty page and one page says "this page is
intentionally left blank" for tests
This PR resolves#1247 by using the matching elements and bbox for
coordinate computation.
This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us testing:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )
Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all starts
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.
This change also improves partition performance for multi-page pdfs as
it reduces the amount of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky
docker shows it reduces partition time for DA-619p.pdf file from around
1min to around 23s.
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py
* feat: new partition to call the document image analysis API
* fix: remove duplicated dependency on partition.py
* fix: linting error due to line-lenght > 100
* test: add test to call partition_pdf brick
* chore: new short example-doc pdf for speed up in test X8
* fix: add missing return statement to _read to pass check
* feat: new partitioning brick to call doc parse API
* docs: version update fix in CHANGELOG
* refactor: no nested ifs
* docs: documentation for new brick partition_pdf
* refactor: made tidy
* docs: minor doc refactor
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>