mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-08 01:22:43 +00:00

### Summary

Some `OCR` elements with only spaces in the text have full-page width in the bounding box, which causes the `xycut` sorting to not work as expected. Now the logic to parse OCR results removes any elements with only spaces (more than one space).

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
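To illustrate why a whitespace-only element with a full-page-width bounding box interferes with a cut-based sort, here is a minimal sketch. It is not the library's implementation: the `Box` class, `drop_whitespace_only`, and `has_vertical_gap` are hypothetical names introduced for this example, the coordinates are made up, and the filter assumes that "only spaces" can be approximated with `str.strip()`.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical OCR element: text plus a bounding box (x1, y1, x2, y2)."""
    text: str
    x1: float
    y1: float
    x2: float
    y2: float


def drop_whitespace_only(boxes):
    """Keep only boxes whose text has at least one non-whitespace character."""
    return [b for b in boxes if b.text.strip()]


def has_vertical_gap(boxes, min_gap=1.0):
    """Return True if the x-projections of the boxes leave a gap of at
    least min_gap, i.e. the page can be cut into left and right columns."""
    intervals = sorted((b.x1, b.x2) for b in boxes)
    if not intervals:
        return False
    reach = intervals[0][1]  # right-most x covered so far
    for left, right in intervals[1:]:
        if left - reach >= min_gap:
            return True
        reach = max(reach, right)
    return False


# Illustrative data: two columns plus one whitespace-only, full-width box.
boxes = [
    Box("Left column", 50, 100, 280, 120),
    Box("Right column", 320, 100, 550, 120),
    Box("   ", 0, 300, 612, 310),
]

print(has_vertical_gap(boxes))                        # False
print(has_vertical_gap(drop_whitespace_only(boxes)))  # True
```

With the full-width box present, its x-projection covers the gap between the two columns, so no column cut is found. Once the whitespace-only box is dropped, the gap reappears.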
Custom Layout Sorting
This directory contains examples of how element sorting works.
Running the example
Running the script (.py)
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py <file_path> <strategy>
Here, the file given as `<file_path>` should be under the project root directory. For example:
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf fast
Running the Jupyter notebook
The Google Colab version of the notebook can be found here: https://colab.research.google.com/drive/1HgBvHNPnY-dXO043DftvvMeynlLPgQ_p