mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-22 21:44:28 +00:00

Christine Straub 69d0ee1aea

Refactor: support merging extracted layout with inferred layout (#2158)

### Summary
This PR is the second part of the `pdfminer` refactor, which moves it from
the `unstructured-inference` repo to the `unstructured` repo. The first
part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
  `inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
  document (as data/filename) to the `pdfminer_processing` module, which
  first opens the document (creating a temp file/dir as needed) and splits
  the document by pages
  * if `is_image` is `True`, return the passed `inferred_layout`
    (DocumentLayout)
  * if `is_image` is `False`:
    * get `extracted_layout` (TextRegions) from the passed document
      (data/filename) by `pdfminer`
    * merge `extracted_layout` (TextRegions) with the passed
      `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements
      (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR`
  module, which first opens the document (creating a temp file/dir as
  needed) and splits the document by pages (converting PDF pages to image
  pages for a PDF file)
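
The branching in the `pdfminer_processing` step can be sketched roughly as follows. This is only an illustration under stated assumptions: `Region`, `covered_fraction`, `merge_page`, and `merge_document` are hypothetical names, and the overlap-threshold merge rule is an assumed stand-in, not the logic this PR implements.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Region:
    """Hypothetical stand-in for a TextRegion / LayoutElement."""
    bbox: BBox
    text: str = ""
    source: str = "inferred"  # "inferred" (model) or "extracted" (pdfminer)


def covered_fraction(inner: BBox, outer: BBox) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    x1, y1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    x2, y2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return intersection / area if area > 0 else 0.0


def merge_page(inferred: List[Region], extracted: List[Region],
               threshold: float = 0.5) -> List[Region]:
    """Merge one page of extracted regions into the inferred regions.

    ASSUMED rule (not the library's): an extracted region that is mostly
    covered by an inferred region donates its text to that region;
    otherwise it is kept as a separate element.
    """
    hosts = [Region(r.bbox, r.text, r.source) for r in inferred]  # copies
    extras: List[Region] = []
    for ext in extracted:
        host = next((h for h in hosts
                     if covered_fraction(ext.bbox, h.bbox) >= threshold), None)
        if host is None:
            extras.append(Region(ext.bbox, ext.text, "extracted"))
        else:
            host.text = f"{host.text} {ext.text}".strip()
    return hosts + extras


def merge_document(inferred_pages: List[List[Region]],
                   extracted_pages: List[List[Region]],
                   is_image: bool) -> List[List[Region]]:
    """Mirror the branch above: images skip pdfminer, PDFs merge per page."""
    if is_image:
        return inferred_pages  # nothing to extract from an image
    return [merge_page(inf, ext)
            for inf, ext in zip(inferred_pages, extracted_pages)]
```

With one inferred block and two extracted regions, the region inside the block contributes its text to it and the other is appended as a separate element; passing `is_image=True` returns the inferred pages unchanged.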

### Note
This PR also fixes issue #2164 by using functionality similar to that
implemented in the `fast` strategy workflow when extracting elements with
`pdfminer`.

### TODO
* refactor image extraction to move it from the `unstructured-inference`
repo to the `unstructured` repo
* improve natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`

2023-12-01 20:56:31 +00:00

README.md

enhancement: add visualization script to annotate elements (#1613)

2023-10-05 12:53:16 -07:00

requirements.txt

enhancement: add visualization script to annotate elements (#1613)

2023-10-05 12:53:16 -07:00

visualization.py

Refactor: support merging extracted layout with inferred layout (#2158)

2023-12-01 20:56:31 +00:00

README.md

Analyzing Layout Elements

This directory contains examples of how to analyze layout elements.

How to run

Run `pip install -r requirements.txt` to install the Python dependencies.

Visualization

Python script (visualization.py)

$ PYTHONPATH=. python examples/layout-analysis/visualization.py <file_path> <strategy>

The strategy can be one of "auto", "hi_res", "ocr_only", or "fast". For example,

$ PYTHONPATH=. python examples/layout-analysis/visualization.py example-docs/loremipsum.pdf hi_res
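
To run the script over several documents, the invocation above could be wrapped in a small helper. The following is a minimal sketch: `build_command` is a hypothetical helper (not part of this repository) that checks the strategy against the four values listed above and sets `PYTHONPATH=.` as the shell command does.

```python
import os
import sys
from typing import Dict, List, Tuple

# Strategy names and script path are taken from the README above.
STRATEGIES = ("auto", "hi_res", "ocr_only", "fast")
SCRIPT = "examples/layout-analysis/visualization.py"


def build_command(file_path: str, strategy: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for one run of the visualization script.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    if strategy not in STRATEGIES:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {STRATEGIES}"
        )
    env = dict(os.environ, PYTHONPATH=".")  # same as the PYTHONPATH=. prefix
    return [sys.executable, SCRIPT, file_path, strategy], env


# Usage, from the repository root:
#   import subprocess
#   cmd, env = build_command("example-docs/loremipsum.pdf", "hi_res")
#   subprocess.run(cmd, env=env, check=True)
```

Rejecting an unknown strategy before launching the subprocess reports a typo immediately rather than after the script has started loading its dependencies.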