mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-22 21:44:28 +00:00

Christine Straub 69d0ee1aea

Refactor: support merging extracted layout with inferred layout (#2158)

### Summary
This PR is the second part of the `pdfminer` refactor, which moves
`pdfminer` from the `unstructured-inference` repo to the `unstructured`
repo. The first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
  `inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
  document (as data/filename) to the `pdfminer_processing` module, which
  first opens the document (creating a temp file/dir as needed) and splits
  the document by pages
  * if `is_image` is `True`, return the passed `inferred_layout`
    (DocumentLayout)
  * if `is_image` is `False`:
    * get `extracted_layout` (TextRegions) from the passed document
      (data/filename) by pdfminer
    * merge `extracted_layout` (TextRegions) with the passed
      `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements
      (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR`
  module, which first opens the document (creating a temp file/dir as
  needed) and splits the document by pages (converting PDF pages to image
  pages for a PDF file)
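
The branching in the `pdfminer_processing` step can be sketched roughly as follows. This is only an illustration of the control flow described above, not the code from the PR: `PageLayout`, `process_with_pdfminer`, `naive_merge`, and the use of plain strings for regions are all hypothetical stand-ins, and the real merge logic for `TextRegions` and `LayoutElements` is not shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


# Hypothetical stand-ins: the real DocumentLayout / TextRegion classes
# carry bounding boxes and more; plain strings keep the sketch small.
@dataclass
class PageLayout:
    elements: List[str] = field(default_factory=list)


@dataclass
class DocumentLayout:
    pages: List[PageLayout] = field(default_factory=list)


MergeFn = Callable[[List[str], List[str]], List[str]]


def naive_merge(extracted: List[str], inferred: List[str]) -> List[str]:
    """Placeholder merge policy (an assumption, not the PR's algorithm):
    keep the inferred regions and append extracted ones not already present."""
    merged = list(inferred)
    for region in extracted:
        if region not in merged:
            merged.append(region)
    return merged


def process_with_pdfminer(
    inferred_layout: DocumentLayout,
    extracted_pages: List[List[str]],
    is_image: bool,
    merge_fn: MergeFn = naive_merge,
) -> DocumentLayout:
    """Return the inferred layout untouched for images; otherwise merge the
    per-page extracted regions into it and return it as the merged layout."""
    if is_image:
        # Images have no embedded text for pdfminer to extract.
        return inferred_layout
    for page, extracted in zip(inferred_layout.pages, extracted_pages):
        page.elements = merge_fn(extracted, page.elements)
    return inferred_layout
```

In this sketch the merged layout is the same `DocumentLayout` object with its page elements replaced, mirroring the "return the `inferred_layout` with updated elements" step; the result would then be handed to the OCR stage.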

### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`

2023-12-01 20:56:31 +00:00

| File | Last commit | Date |
| --- | --- | --- |
| evaluate_natural_reading_order.py | Refactor: support merging extracted layout with inferred layout (#2158) | 2023-12-01 20:56:31 +00:00 |
| README.md | feat: improve natural reading order by filtering OCR results (#1768) | 2023-10-16 23:05:55 +00:00 |
| visualise_and_reorder.ipynb | Chore (refactor): support table extraction with pre-computed ocr data (#1801) | 2023-10-21 00:24:23 +00:00 |

README.md

Custom Layout Sorting

This directory contains examples of how element sorting works.

Running the example

Running the script (.py)

export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py <file_path> <strategy>

Here, `<file_path>` must point to a file under the project root directory, and `<strategy>` is the partitioning strategy to use. For example,

export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf fast
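
For readers who prefer to call the library directly, the following is a minimal sketch of a script that accepts the same two arguments. It does not reproduce `evaluate_natural_reading_order.py`; `parse_args`, `main`, and `KNOWN_STRATEGIES` are hypothetical names, and the set of accepted strategies is an assumption about what a caller would want to validate. Only `partition_pdf(filename=..., strategy=...)` comes from the `unstructured` library.

```python
from typing import List, Tuple

# Assumed set of strategy names to validate against; the actual script
# may accept a different set.
KNOWN_STRATEGIES = {"auto", "fast", "hi_res", "ocr_only"}


def parse_args(argv: List[str]) -> Tuple[str, str]:
    """Extract <file_path> and <strategy> from an argv-style list."""
    if len(argv) != 3:
        raise SystemExit("usage: script.py <file_path> <strategy>")
    file_path, strategy = argv[1], argv[2]
    if strategy not in KNOWN_STRATEGIES:
        raise SystemExit(f"unknown strategy: {strategy!r}")
    return file_path, strategy


def main(argv: List[str]) -> None:
    file_path, strategy = parse_args(argv)
    # Imported lazily so argument errors surface before the heavy import.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=file_path, strategy=strategy)
    # Print elements in the order returned, i.e. the computed reading order.
    for index, element in enumerate(elements):
        print(index, type(element).__name__, str(element)[:60])


# Example call (requires the unstructured package and the sample file):
# main(["script.py", "example-docs/multi-column-2p.pdf", "fast"])
```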

Running the Jupyter notebook

The Google Colab version of the notebook can be found here: https://colab.research.google.com/drive/1HgBvHNPnY-dXO043DftvvMeynlLPgQ_p