mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-05 16:12:30 +00:00

This PR removes duplicate embedded images extracted by `PDFminer`.

### Summary

- add `clean_pdfminer_duplicate_image_elements()` to remove embedded images that have similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD`, the threshold at which the bounding boxes of two embedded images are considered the same region
- refactor: reorganize `clean_pdfminer_inner_elements()`
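
The deduplication rule described in the summary (similar bounding boxes plus identical text) can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: `ImageElement`, `iou`, and `drop_duplicate_images` are hypothetical names, and reading the threshold as an intersection-over-union (IoU) cutoff with a value of `0.9` is an assumption. The actual semantics and default of `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
BBox = Tuple[float, float, float, float]


@dataclass
class ImageElement:
    """Hypothetical stand-in for an embedded image element."""
    bbox: BBox
    text: str


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_duplicate_images(
    elements: List[ImageElement],
    same_region_threshold: float = 0.9,  # assumed value, not the PR's default
) -> List[ImageElement]:
    """Keep the first image of each group that shares text and region."""
    kept: List[ImageElement] = []
    for el in elements:
        is_duplicate = any(
            el.text == k.text and iou(el.bbox, k.bbox) >= same_region_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(el)
    return kept


images = [
    ImageElement((0, 0, 100, 100), "logo"),
    ImageElement((1, 1, 100, 100), "logo"),       # near-identical box, same text
    ImageElement((0, 0, 100, 100), "chart"),      # same box, different text
    ImageElement((200, 200, 300, 300), "logo"),   # same text, different region
]
unique_images = drop_duplicate_images(images)
```

In this sketch only the second image is dropped: it overlaps the first almost completely and carries the same text. Requiring both conditions keeps distinct images that happen to share a region or a caption.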