unstructured/example-docs/layout-parser-paper-fast.pdf
Yao You 27773132b7
[issue 1247] fix element and bbox mismatch bug (#1250)
This PR resolves #1247 by using the matching elements and bbox for
coordinate computation.

This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us testing:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )

Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all starts
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.

This change also improves partition performance for multi-page pdfs as
it reduces the amount of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky
docker shows it reduces partition time for DA-619p.pdf file from around
1min to around 23s.
2023-08-30 23:34:55 +00:00

278 KiB