Christine Straub 237d04c896
feat: improve natural reading order by filtering OCR results (#1768)
### Summary
Some `OCR` elements with only spaces in the text have full-page width in
the bounding box, which causes the `xycut` sorting to not work as
expected. Now the logic to parse OCR results removes any elements with
only spaces (more than one space).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-16 23:05:55 +00:00

19 lines
661 B
Markdown

# Custom Layout Sorting
This directory contains examples of how element sorting works.
## Running the example
### Running script(.py)
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf fast
```
### Running jupyter notebook
The Google Colab version of the notebook can be found here: https://colab.research.google.com/drive/1HgBvHNPnY-dXO043DftvvMeynlLPgQ_p