unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2026-01-07 12:50:54 +00:00

Fix bbox coordinates for ocr_only strategy (#1325)

### Summary
Duplicate of PR #1259, reopened because of issues with its checks.
Closes #1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out of `add_pytesseract_bbox_to_elements` into the new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text, so as to avoid the `~` characters that
`pytesseract.image_to_boxes` includes but that are not present in the
output of `pytesseract.image_to_string`.
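
The first-character check described above can be illustrated with a minimal sketch. This is not the code in this PR: `sketch_element_box` and `flip_to_top_left` are hypothetical names, the character boxes are hand-written stand-ins for parsed `pytesseract.image_to_boxes` output, and the sketch assumes that each non-whitespace character of the text has exactly one box. The y-flip assumes the boxes use a bottom-left origin, as `image_to_boxes` reports them.

```python
from typing import List, Optional, Tuple

# (char, left, bottom, right, top), mimicking one parsed line of image_to_boxes output.
CharBox = Tuple[str, int, int, int, int]
Box = Tuple[int, int, int, int]  # (min_x, min_y, max_x, max_y)


def sketch_element_box(
    boxes: List[CharBox], text: str, start: int = 0
) -> Tuple[Optional[Box], int]:
    """Hypothetical helper: bound the character boxes that belong to `text`.

    Returns the bounding box and the index of the next unused character box.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return None, start
    i = start
    # Skip stray boxes (such as "~") until one matches the first character.
    while i < len(boxes) and boxes[i][0] != chars[0]:
        i += 1
    # Assumption: one box per non-whitespace character from here on.
    span = boxes[i : i + len(chars)]
    if not span:
        return None, start
    min_x = min(b[1] for b in span)
    min_y = min(b[2] for b in span)
    max_x = max(b[3] for b in span)
    max_y = max(b[4] for b in span)
    return (min_x, min_y, max_x, max_y), i + len(span)


def flip_to_top_left(box: Box, image_height: int) -> Box:
    """Convert a bottom-left-origin box to a top-left-origin (pixel) box."""
    min_x, min_y, max_x, max_y = box
    return (min_x, image_height - max_y, max_x, image_height - min_y)


# Hand-written example data: two words, each preceded by a stray "~" box.
boxes = [
    ("~", 0, 0, 1, 1),
    ("H", 10, 20, 18, 32),
    ("i", 19, 20, 22, 33),
    ("~", 0, 0, 1, 1),
    ("O", 10, 5, 18, 15),
    ("k", 19, 5, 25, 16),
]
first_box, next_index = sketch_element_box(boxes, "Hi")
second_box, end_index = sketch_element_box(boxes, "Ok", start=next_index)
first_box_pixels = flip_to_top_left(first_box, image_height=40)
```

Because the stray `~` boxes are skipped before any coordinates are collected, they do not stretch the bounding box of either word.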

### Testing
```python
from PIL import Image, ImageDraw

from unstructured.partition.image import partition_image

filename = "example-docs/layout-parser-paper-with-table.jpg"
output = "example-docs/box-layout-parser-paper-with-table.jpg"

# Partition with the OCR-only strategy, the code path this PR fixes.
elements = partition_image(filename=filename, strategy="ocr_only")

# Draw each element's bounding box on the source image for visual inspection.
with Image.open(filename) as image:
    draw = ImageDraw.Draw(image)
    for i, element in enumerate(elements):
        print(i, element.metadata.coordinates)
        if element.metadata.coordinates:
            draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
    image.save(output)
```

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>

2023-09-15 15:11:16 -05:00

csv

fix: update test_json to not use auto partition (#1187)

2023-08-29 16:59:26 -04:00

docx

Fix bbox coordinates for ocr_only strategy (#1325)

2023-09-15 15:11:16 -05:00

epub

fix: updating element types (#1394)

2023-09-15 11:51:22 -05:00

markdown

fix: updating element types (#1394)