mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-03 15:11:30 +00:00

### Summary

Closes #2011

`languages` was missing from the metadata when partitioning pdfs via `hi_res` and `fast` strategies and missing from image partitions via `hi_res`. This PR adds `languages` to the relevant function calls so it is included in the resulting elements.

### Testing

On the main branch, `partition_image` will include `languages` when `strategy='ocr_only'`, but not when `strategy='hi_res'`:

```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image

elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages

elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```

For `partition_pdf`, `'ocr_only'` will include `languages` in the metadata, but `'fast'` and `'hi_res'` will not.

```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages

elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages

elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```

On this branch, `languages` is included in the metadata regardless of strategy

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>