fix: Fix for Pillow error when extracting PNG images (#3998)

When I tried to partition a PNG file and extract images, I got an error
from Pillow:

```
WARNING  unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
  File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```

The issue is that a PNG can carry an alpha (transparency) channel, which
Pillow loads as `RGBA` mode, and the JPEG format cannot store it. We can
fix this by converting the image to `RGB` before saving. I added a PNG
test case that now passes with this fix.
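
The failure can be reproduced without any document, assuming only that Pillow is installed. This is a minimal sketch, not code from this commit: the helper name `try_save_as_jpeg` is hypothetical, and because the exception type that reaches the caller may differ between Pillow versions, both `KeyError` and `OSError` are caught.

```python
from io import BytesIO

from PIL import Image


def try_save_as_jpeg(image: Image.Image) -> bool:
    """Return True if Pillow can encode `image` as JPEG, False otherwise.

    Hypothetical helper for illustration only.
    """
    buffered = BytesIO()
    try:
        image.save(buffered, format="JPEG")
    except (KeyError, OSError):
        # JPEG has no alpha channel, so Pillow refuses to write RGBA data.
        return False
    return True


# A small in-memory image standing in for a crop taken from a PNG page.
rgba_image = Image.new("RGBA", (8, 8), (255, 0, 0, 128))

rgba_ok = try_save_as_jpeg(rgba_image)                 # saving RGBA directly
rgb_ok = try_save_as_jpeg(rgba_image.convert("RGB"))   # saving after conversion
```

Under these assumptions the direct save is rejected and the converted image saves cleanly, which is the behavior the fix relies on.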
Austin Walker 2025-05-08 17:57:05 -04:00 committed by GitHub
parent b814ece39f
commit e3417d7e98
4 changed files with 16 additions and 1 deletions

@@ -1,3 +1,12 @@
## 0.17.7-dev0
### Enhancements
### Features
### Fixes
- **Fix image extraction for PNG files.** When `extract_image_block_to_payload` is True, and the image is a PNG, we get a Pillow error. We need to remove the PNG transparency layer before saving the image.
## 0.17.6
### Enhancements

@@ -73,6 +73,7 @@ def test_convert_pdf_to_image_raises_error(filename=example_doc_path("embedded-i
    [
        (example_doc_path("pdf/layout-parser-paper-fast.pdf"), False),
        (example_doc_path("img/layout-parser-paper-fast.jpg"), True),
        (example_doc_path("img/english-and-korean.png"), True),
    ],
)
@pytest.mark.parametrize("element_category_to_save", [ElementType.IMAGE, ElementType.TABLE])

@@ -1 +1 @@
__version__ = "0.17.6" # pragma: no cover
__version__ = "0.17.7-dev0" # pragma: no cover

@@ -204,6 +204,11 @@ def save_elements(
image_path = image_paths[page_index]
image = Image.open(image_path)
cropped_image = image.crop(padded_bbox)
# PNG images with transparency need to be converted before saving
if cropped_image.mode == "RGBA":
    cropped_image = cropped_image.convert("RGB")
if extract_image_block_to_payload:
    buffered = BytesIO()
    cropped_image.save(buffered, format="JPEG")
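
The crop, convert, and save steps in the hunk above can be sketched as one standalone function. This is an illustration under stated assumptions, not the body of `save_elements`: the function name `crop_to_jpeg_bytes`, its parameters, and the synthetic test image are all hypothetical, and only the `RGBA` check mirrors the commit.

```python
from io import BytesIO

from PIL import Image


def crop_to_jpeg_bytes(image: Image.Image, bbox: tuple) -> bytes:
    """Crop `image` to `bbox` (left, upper, right, lower) and encode as JPEG.

    Hypothetical helper mirroring the steps in the diff.
    """
    cropped = image.crop(bbox)
    # Same guard as the fix: drop the alpha channel before JPEG encoding.
    if cropped.mode == "RGBA":
        cropped = cropped.convert("RGB")
    buffered = BytesIO()
    cropped.save(buffered, format="JPEG")
    return buffered.getvalue()


# Synthetic RGBA "page" standing in for an image opened from a PNG file.
page = Image.new("RGBA", (100, 100), (0, 128, 255, 200))
jpeg_bytes = crop_to_jpeg_bytes(page, (10, 10, 60, 40))

# Decode the result again to confirm it is a valid RGB JPEG of the crop size.
decoded = Image.open(BytesIO(jpeg_bytes))
```

Note that `convert("RGB")` simply discards the alpha values rather than blending the image onto a background, so fully transparent regions keep whatever color data they carried.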