fix languages 500 error with empty string for ocr_languages (#1968)

Closes #1870 
Defining both `languages` and `ocr_languages` raises a ValueError. Because
the API defaults `ocr_languages` to an empty string, users who define
`languages` hit that ValueError automatically.

This fix checks if `ocr_languages` is an empty string and converts it to
`None` to avoid this.
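
The check can be sketched in isolation. This is a minimal approximation, not the library's actual code: `resolve_languages` is a hypothetical helper, and splitting on `+` is only an assumed stand-in for `convert_old_ocr_languages_to_languages`, whose real behavior may differ.

```python
from typing import List, Optional


def resolve_languages(
    languages: Optional[List[str]] = None,
    ocr_languages: Optional[str] = None,
) -> List[str]:
    """Hypothetical stand-in for the language handling inside `partition`."""
    if languages is None:
        languages = ["auto"]  # default value used by the check below

    # The fix: treat the API's empty-string default as "not provided".
    if ocr_languages == "":
        ocr_languages = None

    if ocr_languages is not None:
        # Anything other than the default means both arguments were supplied.
        if languages != ["auto"]:
            raise ValueError(
                "Only one of languages and ocr_languages should be specified."
            )
        # Assumption: old-style values look like "eng+spa".
        languages = ocr_languages.split("+")

    return languages
```

With this normalization, `resolve_languages(["spa"], "")` returns `["spa"]` instead of raising, while passing a non-empty `ocr_languages` together with `languages` still raises the ValueError.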

### Testing
On the main branch, the following raises the ValueError; on this branch it
partitions correctly:
```
from unstructured.partition.auto import partition

filename = "example-docs/category-level.docx"
# ocr_languages="" mirrors the API default that used to trigger the ValueError.
elements = partition(filename, languages=["spa"], ocr_languages="")

print(elements[0].metadata.languages)
```

---------

Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Austin Walker <awalk89@gmail.com>
John 2023-11-01 17:02:00 -05:00 committed by GitHub
parent 1893d5a669
commit b92cab7fbd
4 changed files with 18 additions and 4 deletions


@ -1,4 +1,4 @@
## 0.10.29-dev5
## 0.10.29-dev6
### Enhancements
@ -13,7 +13,7 @@
* **Allow setting table crop parameter** In certain circumstances, adjusting the table crop padding may improve table.
### Fixes
* **Handle empty string for `ocr_languages` with values for `languages`** Some API users ran into an issue with sending `languages` params because the API defaulted to also using an empty string for `ocr_languages`. This update handles situations where `languages` is defined and `ocr_languages` is an empty string.
* **Fix PDF tried to loop through None** Previously the PDF annotation extraction tried to loop through `annots` that resolved out as None. A logical check added to avoid such error.
* **Ingest session handler not being shared correctly** All ingest docs that leverage the session handler should only need to set it once per process. It was recreating it each time because the right values weren't being set nor available given how dataclasses work in python.
* **Ingest download-only fix** Previously the download only flag was being checked after the doc factory pipeline step, which occurs before the files are actually downloaded by the source node. This check was moved after the source node to allow for the files to be downloaded first before exiting the pipeline.


@ -399,6 +399,18 @@ def test_auto_partition_element_metadata_user_provided_languages():
    assert elements[0].metadata.languages == ["eng"]


@pytest.mark.parametrize(
    ("languages", "ocr_languages"),
    [(["auto"], ""), (["eng"], "")],
)
def test_auto_partition_ignores_empty_string_for_ocr_languages(languages, ocr_languages):
    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "book-war-and-peace-1p.txt")
    elements = partition(
        filename=filename, strategy="ocr_only", ocr_languages=ocr_languages, languages=languages
    )
    assert elements[0].metadata.languages == ["eng"]


def test_auto_partition_warns_with_ocr_languages(caplog):
    filename = "example-docs/chevron-page.pdf"
    partition(filename=filename, strategy="hi_res", ocr_languages="eng")


@ -1 +1 @@
__version__ = "0.10.29-dev5" # pragma: no cover
__version__ = "0.10.29-dev6" # pragma: no cover


@ -214,6 +214,9 @@ def partition(
)
    kwargs.setdefault("metadata_filename", metadata_filename)
    if ocr_languages == "":
        ocr_languages = None
    if ocr_languages is not None:
        # check if languages was set to anything not the default value
        # languages and ocr_languages were therefore both provided - raise error
@ -222,7 +225,6 @@ def partition(
                "Only one of languages and ocr_languages should be specified. "
                "languages is preferred. ocr_languages is marked for deprecation.",
            )
        else:
            languages = convert_old_ocr_languages_to_languages(ocr_languages)
            logger.warning(