unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-16 18:44:58 +00:00

chore: refactor languages parameter for text_type functions (#1399)

### Summary
To support language functionality other than Tesseract OCR, we want to
represent languages provided for either partitioning accuracy or OCR as
a standard list of langcodes as strings. This change continues that
refactor into the functions that use language checks to identify
potential element types such as NarrativeText and Title.

### Details
Replaces `language` with `languages` (a list of strings) as a parameter
to `is_possible_narrative_text` and `is_possible_title`.
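
As a rough illustration of the list-of-strings representation described above, here is a minimal sketch of coercing a language argument into a list of langcode strings. The helper `normalize_languages` and its `"eng"` default are hypothetical and are not taken from this commit or from the library's API; they only show the shape of a `languages` value.

```python
from typing import List, Optional, Sequence, Union


def normalize_languages(
    languages: Union[str, Sequence[str], None],
    default: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: coerce a language argument into a list of code strings."""
    if languages is None:
        # Assumption: fall back to ["eng"] when the caller supplies nothing.
        return list(default) if default is not None else ["eng"]
    if isinstance(languages, str):
        # A bare string is treated as one code, not iterated character by character.
        languages = [languages]
    # Trim whitespace, lowercase, and skip empty entries.
    cleaned = [code.strip().lower() for code in languages if code and code.strip()]
    # Drop duplicates while preserving first-seen order.
    return list(dict.fromkeys(cleaned))
```

A caller that previously passed a single `language` string could route it through a helper like this before handing a `languages` list to the text-type checks.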


### Test
Call `is_possible_narrative_text` and `is_possible_title` with text in a
variety of languages and different inputs for `languages`. The resulting
element classifications should be no different from the current outputs.

For example, see `test_text_type_handles_multi_language_examples` in
`test_unstructured/partition/test_text_type.py`.

2023-09-13 19:46:36 +00:00

csv

fix: update test_json to not use auto partition (#1187)

2023-08-29 16:59:26 -04:00

docx

chunk_by_title decorator (#1304)

2023-09-11 21:00:14 +00:00

epub

Table processing test for RTF (#1388)

2023-09-12 18:27:05 -07:00

markdown

chunk_by_title decorator (#1304)