unstructured

yujunjun/unstructured


mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-07 00:10:05 +00:00

Commit Graph

Author SHA1 Message Date


Yao You

d51fb134e6

Feat/improve iou speed (#3582)

This PR vectorizes the computation of element overlap to speed up the
deduplication of extracted elements.
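
For readers unfamiliar with the technique, here is a minimal sketch of a vectorized intersection-over-union (IoU) computation using NumPy broadcasting. It is not the code from this PR: the function name `pairwise_iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """Return an (N, M) matrix of IoU values for two sets of boxes.

    Hypothetical helper: boxes are assumed to be (x1, y1, x2, y2) rows.
    """
    a = np.asarray(boxes_a, dtype=float)  # shape (N, 4)
    b = np.asarray(boxes_b, dtype=float)  # shape (M, 4)

    # Broadcast (N, 1) against (1, M) to get every pair at once,
    # instead of looping over pairs in Python.
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])

    # Negative widths or heights mean the boxes do not overlap.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)

    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter

    # Avoid division by zero for degenerate (zero-area) pairs.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
```

The whole pairwise matrix is produced by a handful of array operations, which is generally where the speedup over a nested Python loop comes from.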

## test

This PR adds unit tests for the new vectorized IOU and subregion
computation functions.

In addition, running partition on large files with many elements like
this slide:

[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)

shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.

Profiling results show that the new implementation greatly reduces the
time spent on computation; most of the time is now spent on getting
the coordinates from a list of bboxes.

![Screenshot 2024-08-30 at 9 29 27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)

2024-09-03 00:06:18 +00:00

Christine Straub

b0d8a779da

feat: `partition_pdf()` set inferred elements text (#3061)

This PR adds the ability to fill the text of inferred elements from
embedded text (`pdfminer`) without depending on the
`unstructured-inference` library. It is the second part of moving
embedded-text-related code from `unstructured-inference` to
`unstructured`, and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/349.
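
As a rough illustration of the idea (not the implementation in this PR), the sketch below assigns embedded text to inferred layout regions by checking how much of each embedded box lies inside a region. The `Box` class, `inside_fraction()`, `fill_text()`, and the `0.5` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box with optional text."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""


def inside_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies within `outer`."""
    w = min(inner.x2, outer.x2) - max(inner.x1, outer.x1)
    h = min(inner.y2, outer.y2) - max(inner.y1, outer.y1)
    if w <= 0 or h <= 0:
        return 0.0
    inner_area = (inner.x2 - inner.x1) * (inner.y2 - inner.y1)
    return (w * h) / inner_area if inner_area > 0 else 0.0


def fill_text(inferred, embedded, threshold=0.5):
    """Set each inferred region's text from the embedded boxes it mostly contains.

    Embedded boxes are assumed to be in reading order already.
    """
    for region in inferred:
        words = [e.text for e in embedded if inside_fraction(e, region) >= threshold]
        region.text = " ".join(words)
    return inferred
```

For example, a region covering the top of a page would pick up the words located there and ignore a footer lower down.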

2024-05-21 19:43:38 +00:00

Christine Straub

ac5048bf30

enhancement: remove duplicate embedded images (#2897)

This PR removes duplicate embedded images extracted by `PDFminer`.

### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to decide when the bounding boxes of two embedded images count as the same region
- refactor: reorganize `clean_pdfminer_inner_elements()`
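
The deduplication rule described above can be sketched as follows. This is an illustration under stated assumptions, not the body of `clean_pdfminer_duplicate_image_elements()`: the helper names, the use of IoU as the similarity measure, the `(bbox, text)` tuple layout, and the `0.6` default threshold are all hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def drop_duplicate_images(images, same_region_threshold=0.6):
    """Keep the first of any images that share text and nearly the same region.

    `images` is a list of (bbox, text) tuples; the threshold default is
    illustrative only and is not taken from the library's configuration.
    """
    kept = []
    for bbox, text in images:
        is_duplicate = any(
            text == kept_text and iou(bbox, kept_bbox) >= same_region_threshold
            for kept_bbox, kept_text in kept
        )
        if not is_duplicate:
            kept.append((bbox, text))
    return kept
```

Requiring both conditions means that two images in the same place with different text, or the same text in different places, are both kept.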

2024-04-18 23:07:47 +00:00

3 Commits