### Description
Set language to None by default. Update ingest test to use local file
used in language unit tests to validate.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Closes#1534 and #1535
Detects document language using `langdetect` package.
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
### Summary
Uses `langdetect` to detect all languages present in the input document.
### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization
### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']
---------
Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph
* chore: add docstring
* chore: fix lint errors
* feat: ignore spaces when extracting emphasized texts from a paragraph
* feat: add functionality to track emphasized text (`bold/italic` formatting) from table
* test: add test case for grabbing emphasized texts from element metadata
* chore: fix lint errors
* chore: update changelog & version
* Update ingest test fixtures (#1047)
- Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth
- Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory
- Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository
- Construct paths as reusable variables declared at top of scripts
- Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance)
- OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind
- Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true
- Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)