mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-08 09:33:43 +00:00

# Custom Layout Sorting

This directory contains examples of how element sorting works.
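
As a rough illustration of what element sorting means, the sketch below orders elements by their bounding-box position, top to bottom and then left to right. It is a simplified stand-in and not the library's actual sorting logic; the `Box` type, the `sort_reading_order` function, and the `row_tolerance` parameter are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical element: text plus the top-left corner of its bounding box."""

    text: str
    x: float
    y: float


def sort_reading_order(boxes, row_tolerance=5.0):
    """Naive reading order: group boxes into rows by y, then sort each row by x."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        # Stay in the current row when the box is vertically close to the
        # first box of that row; otherwise start a new row.
        if rows and abs(box.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b.x)]


boxes = [
    Box("right, line 1", x=300, y=12),
    Box("left, line 2", x=20, y=40),
    Box("left, line 1", x=20, y=10),
]
ordered = [b.text for b in sort_reading_order(boxes)]
print(ordered)  # ['left, line 1', 'right, line 1', 'left, line 2']
```

On a multi-column page, this naive row-wise ordering interleaves the columns, which is the kind of case a layout-aware sort has to handle.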
## Running the example

### Running the script (.py)

```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py <file_path> <strategy>
```
Here, `<file_path>` must point to a file under the project root directory, and `<strategy>` is the strategy to run, such as `fast`. For example:

```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf fast
```
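
If you would rather launch the script from Python than from a shell, the same invocation can be sketched with the standard library's `subprocess` module. This is only a minimal sketch: `build_command`, `build_env`, and `run_example` are hypothetical helper names that are not part of the repository, and the code assumes it is run from the project root.

```python
import os
import subprocess
import sys

# Script path taken from the command above, relative to the project root.
SCRIPT = "examples/custom-layout-order/evaluate_natural_reading_order.py"


def build_command(file_path, strategy):
    """Return the argument list equivalent to the shell command above."""
    return [sys.executable, SCRIPT, file_path, strategy]


def build_env(base_env=None):
    """Copy the environment and prepend '.' to PYTHONPATH, as the export does."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = "." + (os.pathsep + existing if existing else "")
    return env


def run_example(file_path, strategy):
    """Run the script; raises subprocess.CalledProcessError if it fails."""
    return subprocess.run(
        build_command(file_path, strategy), env=build_env(), check=True
    )


# Show the command that would be executed, without running it.
print(build_command("example-docs/multi-column-2p.pdf", "fast"))
```

Calling `run_example("example-docs/multi-column-2p.pdf", "fast")` from the project root would then execute the script with the same arguments as the shell example.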
### Running the Jupyter notebook

The Google Colab version of the notebook can be found here: https://colab.research.google.com/drive/1HgBvHNPnY-dXO043DftvvMeynlLPgQ_p