unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-13 00:53:15 +00:00

Christine Straub 69d0ee1aea

Refactor: support merging extracted layout with inferred layout (#2158)

### Summary
This PR is the second part of the `pdfminer` refactor, which moves
`pdfminer` from the `unstructured-inference` repo to the `unstructured`
repo. The first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (creating a temp file/dir as needed) and splits
the document by pages
  * if `is_image` is `True`, return the passed `inferred_layout`
  (DocumentLayout)
  * if `is_image` is `False`:
    * get `extracted_layout` (TextRegions) from the passed document
    (data/filename) with `pdfminer`
    * merge `extracted_layout` (TextRegions) with the passed
    `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements
    (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR`
module, which first opens the document (creating a temp file/dir as
needed) and splits the document by pages (converting PDF pages to image
pages for a PDF file)
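
The merge step in the workflow above can be sketched roughly as follows.
This is an illustration only, not the code added by this PR: `Region`,
`overlap_ratio`, `merge_page`, and `merge_layouts` are hypothetical names,
and the overlap-threshold rule is an assumption, since the description does
not state how extracted and inferred regions are matched.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for a TextRegion / LayoutElement."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""
    type: Optional[str] = None
    source: str = "inferred"


def overlap_ratio(inner: Region, outer: Region) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    w = min(inner.x2, outer.x2) - max(inner.x1, outer.x1)
    h = min(inner.y2, outer.y2) - max(inner.y1, outer.y1)
    inner_area = (inner.x2 - inner.x1) * (inner.y2 - inner.y1)
    if w <= 0 or h <= 0 or inner_area <= 0:
        return 0.0
    return (w * h) / inner_area


def merge_page(inferred: List[Region], extracted: List[Region],
               threshold: float = 0.5) -> List[Region]:
    """Merge one page. The threshold rule is an assumption, not the PR's logic."""
    merged = [replace(r) for r in inferred]  # copy so the inputs stay untouched
    n_inferred = len(merged)
    for ext in extracted:
        # Only inferred regions are candidates for absorbing extracted text.
        candidates = merged[:n_inferred]
        best = max(candidates, key=lambda r: overlap_ratio(ext, r), default=None)
        if best is not None and overlap_ratio(ext, best) >= threshold:
            best.text = f"{best.text} {ext.text}".strip()
        else:
            # No inferred region covers this text: keep it as its own element.
            merged.append(replace(ext, source="extracted"))
    return merged


def merge_layouts(inferred_pages: List[List[Region]],
                  extracted_pages: List[List[Region]],
                  is_image: bool) -> List[List[Region]]:
    """Image input has no embedded text, so the inferred layout is returned as-is."""
    if is_image:
        return inferred_pages
    return [merge_page(i, e) for i, e in zip(inferred_pages, extracted_pages)]
```

In this sketch, an extracted region that mostly falls inside an inferred
region contributes its text to that region, and any other extracted region
is kept as a separate element so that no embedded text is dropped.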

### Note
This PR also fixes issue #2164 by using functionality similar to that
implemented in the `fast` strategy workflow when extracting elements with
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`

2023-12-01 20:56:31 +00:00

test_config.py

Chore (refactor): support table extraction with pre-computed ocr data (#1801)

2023-10-21 00:24:23 +00:00

test_processing_elements.py

Refactor: support merging extracted layout with inferred layout (#2158)

2023-12-01 20:56:31 +00:00

test_sorting.py

feat: shrink bboxes by top left (#1633)

2023-10-06 05:16:11 +00:00

test_xycut.py

refactor: partition_pdf() for ocr_only strategy (#1811)

2023-10-30 20:13:29 +00:00