unstructured/expected-structured-output at c2853e4ac3afe5172929f8a7f40cd9095ff3b243

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-28 08:10:29 +00:00


shreyanid c2853e4ac3

refactor languages parameter for pdf partition functions (#1334)

### Summary

To support language functionality beyond Tesseract OCR, we want to
represent the languages provided for either partitioning accuracy or
OCR as a standard list of language-code strings.

### Details

Adds `languages` (a list of strings) as a parameter to pdf partitioning
functions. Marks `ocr_languages` for deprecation. Adds a new file
`lang.py` for language-related helper functions.

Coming up: langcode standardization, language detection
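
The parameter handling described above could be sketched roughly as follows. This is not the code from `lang.py`: the helper names `resolve_languages` and `to_tesseract_lang`, the `"eng"` default, and the rule that `languages` wins when both parameters are given are all assumptions made for illustration. The only external fact relied on is that Tesseract joins language codes with `+` (for example `eng+spa`).

```python
import warnings
from typing import List, Optional


def resolve_languages(
    languages: Optional[List[str]] = None,
    ocr_languages: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: fold the deprecated parameter into one list of codes."""
    if ocr_languages is not None:
        # Passing the old parameter at all triggers the deprecation warning.
        warnings.warn(
            "ocr_languages is deprecated; pass languages as a list of strings",
            DeprecationWarning,
            stacklevel=2,
        )
        if languages is None:
            # Tesseract-style strings separate codes with "+".
            languages = [code for code in ocr_languages.split("+") if code]
    # Assumed default; the commit message does not state one.
    return list(languages) if languages else ["eng"]


def to_tesseract_lang(languages: List[str]) -> str:
    """Hypothetical helper: build the "+"-joined string Tesseract expects."""
    return "+".join(languages)
```

Under these assumptions, `resolve_languages(ocr_languages="eng+spa")` returns `["eng", "spa"]` and emits a `DeprecationWarning`, while `resolve_languages(languages=["eng", "spa"])` returns the same list silently.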

### Test

Call `partition_pdf` or `partition_pdf_or_image` with a variety of
strategies, languages, or `ocr_languages`.
- inclusion of `ocr_languages` as a parameter should display a
deprecation warning
- the other valid call outputs should be no different from the current
outputs.

Example:
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"])
print("\n\n".join([str(el) for el in elements]))
```

2023-09-12 16:15:26 +00:00


fix: GH issue 1057 etree parser error (csv) (#1112)

2023-08-14 17:48:57 +00:00

fix: separate elements by <br> tag in partition_html (#1314)

2023-09-07 13:16:31 +00:00

build(deps): PDF images, unstructured-inference==0.5.23 (#1341)

2023-09-08 05:29:53 +00:00

biomed-path/07/07

Feat/1136 elements ordering for pdf (#1161)

2023-08-24 17:46:19 -07:00

refactor languages parameter for pdf partition functions (#1334)

2023-09-12 16:15:26 +00:00

confluence-diff

Adding table extraction to partition_html (#1324)

2023-09-11 11:14:11 -07:00

Roman/delta table connector (#1132)

2023-08-22 10:19:46 -04:00

chore: refactor ingest tests (#814)

2023-06-29 23:13:41 +00:00

Adding table extraction to partition_html (#1324)

2023-09-11 11:14:11 -07:00

elasticsearch/movies

Klaijan/auto paragraph grouper (#994)

2023-08-07 18:37:18 -04:00

Adding table extraction to partition_html (#1324)

2023-09-11 11:14:11 -07:00

fix: separate elements by <br> tag in partition_html (#1314)

2023-09-07 13:16:31 +00:00

Feat/1060 update metadata fields (#1099)

2023-08-16 04:33:06 +00:00

feat: jira connector (cloud) (#1238)

2023-09-06 10:10:48 +00:00

local-single-file

fix: local connector output filename when a single file is being processed (#879)

2023-07-05 14:37:40 -07:00

local-single-file-with-encoding

fix: respect <pre> tag order in partition_html (#1197)

2023-08-25 04:14:48 +00:00

local-single-file-with-pdf-infer-table-structure

build(deps): PDF images, unstructured-inference==0.5.23 (#1341)

2023-09-08 05:29:53 +00:00

Adding table extraction to partition_html (#1324)

2023-09-11 11:14:11 -07:00

onedrive/utic-test-ingest-fixtures

enhancement: Add include_header kwarg for xlsx, default True (#1125)

2023-08-17 04:16:23 +00:00

Feat/1060 update metadata fields (#1099)

2023-08-16 04:33:06 +00:00

pdf-fast-reprocess

Feat/1136 elements ordering for pdf (#1161)

2023-08-24 17:46:19 -07:00

s3/small-pdf-set

fix: avoid PDF sorting error on negative coords (#1361)

2023-09-10 19:29:49 -07:00

feat: add salesforce connector (#1168)

2023-09-02 08:50:31 -07:00

Adding table extraction to partition_html (#1324)

2023-09-11 11:14:11 -07:00

feat: Adds in threaded replies (#1188)

2023-08-24 12:12:29 -07:00