fix languages 500 error with empty string for ocr_languages (#1968)

Closes #1870

Defining both `languages` and `ocr_languages` raises a ValueError, but the API defaults `ocr_languages` to an empty string, so users who define `languages` automatically hit the ValueError. This fix checks whether `ocr_languages` is an empty string and converts it to `None` to avoid this.

### Testing

On the main branch, the following raises the ValueError, but it partitions correctly on this branch:

```
from unstructured.partition.auto import partition

filename = "example-docs/category-level.docx"
elements = partition(filename, languages=['spa'], ocr_languages="")
elements[0].metadata.languages
```

---------

Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Austin Walker <awalk89@gmail.com>
2026-01-06 12:21:30 +00:00 · 2023-11-01 17:02:00 -05:00 · 2023-11-01 17:02:00 -05:00 · b92cab7fbd
commit b92cab7fbd
parent 1893d5a669
4 changed files with 18 additions and 4 deletions
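
As a rough illustration of the check order this commit sets up, here is a minimal standalone sketch. It does not import `unstructured`. The helper name `resolve_languages`, the `DEFAULT_LANGUAGES` constant, and the `+`-splitting conversion are assumptions made for this example (the diff does not show the default-value comparison or the body of `convert_old_ocr_languages_to_languages`). Only the empty-string normalization and the error message come from the diff itself.

```python
from typing import List, Optional

# Assumption: stand-in for whatever default the library compares against.
DEFAULT_LANGUAGES = ["auto"]


def resolve_languages(
    languages: Optional[List[str]] = None,
    ocr_languages: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper mirroring the order of checks in the patched partition()."""
    if languages is None:
        languages = list(DEFAULT_LANGUAGES)

    # The fix: an empty string is treated the same as "not provided".
    if ocr_languages == "":
        ocr_languages = None

    if ocr_languages is not None:
        # Both parameters were explicitly set: reject the call.
        if languages != DEFAULT_LANGUAGES:
            raise ValueError(
                "Only one of languages and ocr_languages should be specified. "
                "languages is preferred. ocr_languages is marked for deprecation."
            )
        # Simplified stand-in for convert_old_ocr_languages_to_languages.
        languages = ocr_languages.split("+")

    return languages


# An empty ocr_languages no longer conflicts with an explicit languages value.
print(resolve_languages(languages=["spa"], ocr_languages=""))
```

With the normalization in place, the error branch is only reached when `ocr_languages` carries a real value alongside a non-default `languages`.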
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.29-dev5
+## 0.10.29-dev6

 ### Enhancements

@@ -13,7 +13,7 @@
 * **Allow setting table crop parameter** In certain circumstances, adjusting the table crop padding may improve table.

 ### Fixes
-
+* **Handle empty string for `ocr_languages` with values for `languages`** Some API users ran into an issue with sending `languages` params because the API defaulted to also using an empty string for `ocr_languages`. This update handles situations where `languages` is defined and `ocr_languages` is an empty string.
 * **Fix PDF tried to loop through None** Previously the PDF annotation extraction tried to loop through `annots` that resolved out as None. A logical check added to avoid such error.
 * **Ingest session handler not being shared correctly** All ingest docs that leverage the session handler should only need to set it once per process. It was recreating it each time because the right values weren't being set nor available given how dataclasses work in python.
 * **Ingest download-only fix** Previously the download only flag was being checked after the doc factory pipeline step, which occurs before the files are actually downloaded by the source node. This check was moved after the source node to allow for the files to be downloaded first before exiting the pipeline.
--- a/test_unstructured/partition/test_auto.py
+++ b/test_unstructured/partition/test_auto.py
@@ -399,6 +399,18 @@ def test_auto_partition_element_metadata_user_provided_languages():
    assert elements[0].metadata.languages == ["eng"]


+@pytest.mark.parametrize(
+    ("languages", "ocr_languages"),
+    [(["auto"], ""), (["eng"], "")],
+)
+def test_auto_partition_ignores_empty_string_for_ocr_languages(languages, ocr_languages):
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "book-war-and-peace-1p.txt")
+    elements = partition(
+        filename=filename, strategy="ocr_only", ocr_languages=ocr_languages, languages=languages
+    )
+    assert elements[0].metadata.languages == ["eng"]
+
+
 def test_auto_partition_warns_with_ocr_languages(caplog):
    filename = "example-docs/chevron-page.pdf"
    partition(filename=filename, strategy="hi_res", ocr_languages="eng")
--- a/unstructured/version.py
+++ b/unstructured/version.py
@@ -1 +1 @@
-__version__ = "0.10.29-dev5"  # pragma: no cover
+__version__ = "0.10.29-dev6"  # pragma: no cover
--- a/unstructured/partition/auto.py
+++ b/unstructured/partition/auto.py
@@ -214,6 +214,9 @@ def partition(
        )
    kwargs.setdefault("metadata_filename", metadata_filename)

+    if ocr_languages == "":
+        ocr_languages = None
+
    if ocr_languages is not None:
        # check if languages was set to anything not the default value
        # languages and ocr_languages were therefore both provided - raise error
@@ -222,7 +225,6 @@ def partition(
                "Only one of languages and ocr_languages should be specified. "
                "languages is preferred. ocr_languages is marked for deprecation.",
            )
-
        else:
            languages = convert_old_ocr_languages_to_languages(ocr_languages)
            logger.warning(