mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-04 19:16:03 +00:00
Fix: handle pdf text extraction errors (#2101)
Closes #2084.

### Summary
Certain pdfs throw unexpected errors when being opened by `pdfminer`, causing `partition_pdf()` to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting pdf text and to help determine pdf strategy.

### Testing
PDF: [NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13383215/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf)

```
elements = partition_pdf(
    filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf",
)
```
This commit is contained in:
parent
a589a494f6
commit
9c66eab8a9
@@ -13,7 +13,8 @@
 
 ### Fixes
 
-* **Fix `fast` strategy fall back to `ocr_only`.** The `fast` strategy should not fall back to a more expensive strategy.
+* **Handle errors when extracting PDF text** Certain pdfs throw unexpected errors when being opened by `pdfminer`, causing `partition_pdf()` to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting pdf text and to help determine pdf strategy.
+* **Fix `fast` strategy fall back to `ocr_only`** The `fast` strategy should not fall back to a more expensive strategy.
 * **Remove default user ./ssh folder** The default notebook user during image build would create the known_hosts file with incorrect ownership, this is legacy and no longer needed so it was removed.
 * **Include `languages` in metadata when partitioning strategy='hi_res' or 'fast'** User defined `languages` was previously used for text detection, but not included in the resulting element metadata for some strategies. `languages` will now be included in the metadata regardless of partition strategy for pdfs and images.
 * **Handle a case where Paddle returns a list item in ocr_data as None** In partition, while parsing PaddleOCR data, it was assumed that PaddleOCR does not return None for any list item in ocr_data. Removed the assumption by skipping the text region whenever this happens.
@@ -245,20 +245,23 @@ def partition_pdf_or_image(
     )
 
     extracted_elements = []
+    pdf_text_extractable = False
     if not is_image:
-        extracted_elements = extractable_elements(
-            filename=filename,
-            file=spooled_to_bytes_io_if_needed(file),
-            include_page_breaks=include_page_breaks,
-            languages=languages,
-            metadata_last_modified=metadata_last_modified or last_modification_date,
-            **kwargs,
-        )
-        pdf_text_extractable = any(
-            isinstance(el, Text) and el.text.strip() for el in extracted_elements
-        )
-    else:
-        pdf_text_extractable = False
+        try:
+            extracted_elements = extractable_elements(
+                filename=filename,
+                file=spooled_to_bytes_io_if_needed(file),
+                include_page_breaks=include_page_breaks,
+                languages=languages,
+                metadata_last_modified=metadata_last_modified or last_modification_date,
+                **kwargs,
+            )
+            pdf_text_extractable = any(
+                isinstance(el, Text) and el.text.strip() for el in extracted_elements
+            )
+        except Exception as e:
+            logger.error(e, exc_info=True)
+            logger.warning("PDF text extraction failed, skip text extraction...")
 
     strategy = determine_pdf_or_image_strategy(
         strategy,
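
The pattern in this hunk can be tried in isolation. The following is a minimal sketch, not the library's code: the `Text` dataclass, `extract_or_skip`, and `broken_extractor` are hypothetical stand-ins introduced only for illustration. It shows the same idea as the change: default the flag to `False`, attempt extraction inside `try`, and log rather than raise on failure.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Tuple

logger = logging.getLogger(__name__)


@dataclass
class Text:
    """Hypothetical stand-in for the library's Text element."""

    text: str


def extract_or_skip(extract: Callable[[], List[Text]]) -> Tuple[List[Text], bool]:
    """Run `extract`; on any error, log it and report that no text is extractable."""
    extracted_elements: List[Text] = []
    pdf_text_extractable = False  # this default survives if extraction raises
    try:
        extracted_elements = extract()
        # Extractable only if at least one element carries non-whitespace text.
        pdf_text_extractable = any(
            isinstance(el, Text) and el.text.strip() for el in extracted_elements
        )
    except Exception as e:
        logger.error(e, exc_info=True)
        logger.warning("PDF text extraction failed, skip text extraction...")
    return extracted_elements, pdf_text_extractable


def broken_extractor() -> List[Text]:
    """Simulates an extractor that fails while opening a problematic PDF."""
    raise ValueError("simulated extraction failure")
```

Calling `extract_or_skip(broken_extractor)` returns an empty list and `False` instead of propagating the error, so a caller can still choose an alternative strategy, as the commit message describes.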