Fix: partition pdf overflow error (#2054)

Closes #2050.
### Summary
- set zoom to `1` if zoom is less than or equal to `0` when parsing
Tesseract OCR data
- update `determine_pdf_auto_strategy` to return the `hi_res` strategy
if either `infer_table_structure` or `extract_images_in_pdf` is true
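
The two rules above can be sketched in isolation as follows. This is a minimal illustration and not the library code: `normalize_zoom` and `needs_hi_res` are hypothetical helper names, and the other branches of the real strategy selection are deliberately left out.

```python
def normalize_zoom(zoom: float) -> float:
    """Fall back to a zoom of 1 when the given value is zero or negative."""
    # Hypothetical helper: mirrors the `zoom <= 0` guard added in this commit.
    return 1 if zoom <= 0 else zoom


def needs_hi_res(
    infer_table_structure: bool = False,
    extract_images_in_pdf: bool = False,
) -> bool:
    """Return True when the "auto" strategy should resolve to "hi_res"."""
    # Hypothetical helper: either option alone is enough to require "hi_res".
    return infer_table_structure or extract_images_in_pdf


if __name__ == "__main__":
    print(normalize_zoom(0))    # non-positive zoom is replaced
    print(normalize_zoom(2.0))  # positive zoom is kept
    print(needs_hi_res(extract_images_in_pdf=True))
```

Before this change, only `infer_table_structure` affected the second decision, so requesting image extraction alone did not select `hi_res`.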
### Testing
PDF:
[getty_62-62.pdf](https://github.com/Unstructured-IO/unstructured/files/13322169/getty_62-62.pdf)

Run the following code on both the `main` branch and the current (PR)
branch.

```
from unstructured.partition.pdf import partition_pdf

# Placeholder value: any writable directory for the extracted images.
path = "extracted_images"

elements = partition_pdf(
    filename="getty_62-62.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
```
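
The commit message does not say what to compare between the two runs. One option, offered here only as a suggestion, is to tally the returned elements by class name on each branch and compare the counts. `tally_by_type` is a hypothetical helper, not part of `unstructured`.

```python
from collections import Counter
from typing import Iterable


def tally_by_type(elements: Iterable[object]) -> Counter:
    """Count elements by class name so two runs can be compared at a glance."""
    return Counter(type(element).__name__ for element in elements)
```

Calling `print(tally_by_type(elements))` after the snippet above prints one count per element class.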
Christine Straub 2023-11-10 11:01:46 -08:00 committed by GitHub
parent f8c180a59e
commit b11c546757
5 changed files with 13 additions and 3 deletions

@ -1,4 +1,4 @@
-## 0.10.30-dev5
+## 0.10.30
### Enhancements
@ -12,6 +12,8 @@
### Fixes
+* **Fix logic that determines pdf auto strategy.** Previously, `_determine_pdf_auto_strategy` returned `hi_res` strategy only if `infer_table_structure` was true. It now returns the `hi_res` strategy if either `infer_table_structure` or `extract_images_in_pdf` is true.
+* **Fix invalid coordinates when parsing tesseract ocr data.** Previously, when parsing tesseract ocr data, the ocr data had invalid bboxes if zoom was set to `0`. A logical check is now added to avoid such error.
* **Fix ingest partition parameters not being passed to the api.** When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
* **Support tables in section-less DOCX.** Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
* **Support tables that contain only numbers when partitioning via `ocr_only`** Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats.

@ -1 +1 @@
-__version__ = "0.10.30-dev5" # pragma: no cover
+__version__ = "0.10.30" # pragma: no cover

@ -528,6 +528,9 @@ def parse_ocr_data_tesseract(ocr_data: pd.DataFrame, zoom: float = 1) -> List[Te
data frame will result in its associated bounding box being ignored.
"""
+if zoom <= 0:
+    zoom = 1
text_regions = []
for idtx in ocr_data.itertuples():
text = idtx.text

@ -281,6 +281,7 @@ def partition_pdf_or_image(
file=file,
is_image=is_image,
infer_table_structure=infer_table_structure,
+extract_images_in_pdf=extract_images_in_pdf,
)
!= "ocr_only"
):
@ -304,6 +305,7 @@ def partition_pdf_or_image(
is_image=is_image,
infer_table_structure=infer_table_structure,
pdf_text_extractable=pdf_text_extractable,
+extract_images_in_pdf=extract_images_in_pdf,
)
if strategy == "hi_res":

@ -39,6 +39,7 @@ def determine_pdf_or_image_strategy(
is_image: bool = False,
infer_table_structure: bool = False,
pdf_text_extractable: bool = True,
+extract_images_in_pdf: bool = False,
):
"""Determines what strategy to use for processing PDFs or images, accounting for fallback
logic if some dependencies are not available."""
@ -62,6 +63,7 @@ def determine_pdf_or_image_strategy(
strategy = _determine_pdf_auto_strategy(
pdf_text_extractable=pdf_text_extractable,
infer_table_structure=infer_table_structure,
+extract_images_in_pdf=extract_images_in_pdf,
)
if file is not None:
@ -124,12 +126,13 @@ def _determine_image_auto_strategy():
def _determine_pdf_auto_strategy(
pdf_text_extractable: bool = True,
infer_table_structure: bool = False,
+extract_images_in_pdf: bool = False,
):
"""If "auto" is passed in as the strategy, determines what strategy to use
for PDFs."""
# NOTE(robinson) - Currently "hi_res" is the only strategy where
# infer_table_structure is used.
-if infer_table_structure:
+if infer_table_structure or extract_images_in_pdf:
return "hi_res"
if pdf_text_extractable: