mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-31 04:46:07 +00:00

### Summary

The goal of this PR is to keep all image elements when using the `hi_res` strategy. Previously, `Image` elements with small chunks of text were ignored unless the image block extraction parameters (`extract_images_in_pdf` or `extract_image_block_types`) were specified. Now, all image elements are kept regardless of whether those parameters are specified.

### Testing

- On the `main` branch:

  ```
  from unstructured.documents.elements import ElementType
  from unstructured.partition.pdf import partition_pdf

  # Partition the sample PDF without any image block extraction parameters.
  elements = partition_pdf(
      filename="example-docs/embedded-images.pdf",
      strategy="hi_res",
  )
  # Keep only the elements categorized as images.
  image_elements = [el for el in elements if el.category == ElementType.IMAGE]
  print("number of image elements:", len(image_elements))
  ```

  The above code displays `number of image elements: 0`.

- On this `feature` branch, the same code displays `number of image elements: 3`.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>