3 Commits

Author SHA1 Message Date
Yao You
d51fb134e6
Feat/improve iou speed (#3582)
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.

## test

This PR adds unit test to the new vectorized IOU and subregion
computation functions.

In addition, running partition on large files with many elements like
this slide:

[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)

shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.

Profiling results show that the new implementation greatly reduces the
time cost of computation and now most of the time is spend on getting
the coordinates from a list of bboxes.

![Screenshot 2024-08-30 at 9 29
27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)
2024-09-03 00:06:18 +00:00
Christine Straub
b0d8a779da
feat: partiton_pdf() set inferred elements text (#3061)
This PR adds the ability to fill inferred elements text from embedded
text (`pdfminer`) without depending on `unstructured-inference` library.
This PR is the second part of moving embedded text related code from
`unstructured-inference` to `unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/349.
2024-05-21 19:43:38 +00:00
Christine Straub
ac5048bf30
enhancement: remove duplicate embedded images (#2897)
This PR aims to remove duplicate embedded images taken by `PDFminer`.

### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded
images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the
bounding boxes of two embedded images as the same region
- refactor: reorganzie `clean_pdfminer_inner_elements()`
2024-04-18 23:07:47 +00:00