### Summary
Uses `langdetect` to detect all languages present in the input document.
### Details
- Converts all language codes (whether user-supplied or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to represent simplified vs. traditional Chinese
scripts internally with nonstandard codes (separate PR).
- Updates ingest test results to add the `languages` field to documents.
Side effects include changes in the order of some elements and in
element categorization.
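
The conversion step described above can be illustrated with a minimal sketch. It is not the code from this PR: the lookup table, `to_iso639_3`, and `standardize_languages` are hypothetical names, the table covers only a handful of languages, and dropping region suffixes such as `zh-cn` is an assumption made for the example.

```python
# Hypothetical, deliberately tiny lookup table (assumption for illustration);
# a real implementation would rely on a complete ISO 639 database.
ISO_639_1_TO_3 = {"en": "eng", "pl": "pol", "cs": "ces", "sk": "slk", "zh": "zho"}
ISO_639_3_CODES = set(ISO_639_1_TO_3.values())


def to_iso639_3(code):
    """Return the ISO 639-3 code for a 2- or 3-letter code, or None if unknown."""
    normalized = code.strip().lower().replace("_", "-")
    # Assumption: region or script suffixes (e.g. "zh-cn") are dropped.
    base = normalized.split("-")[0]
    if base in ISO_639_3_CODES:
        return base  # already a three-letter code we know
    return ISO_639_1_TO_3.get(base)


def standardize_languages(codes):
    """Convert codes, skipping unknown ones and duplicates, preserving order."""
    result = []
    for code in codes:
        converted = to_iso639_3(code)
        if converted is not None and converted not in result:
            result.append(converted)
    return result


# Attach the standardized codes to a plain metadata dict.
metadata = {"filename": "example.txt"}
metadata["languages"] = standardize_languages(["cs", "pl", "SK", "pol"])
print(metadata["languages"])
```

The same function handles user-supplied and detected codes, so both paths end up in one format before the `languages` field is written.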
### Test
You can test the `detect_languages` function on its own by importing it
and passing a text sample and, optionally, a language:
```
from unstructured.partition.lang import detect_languages

text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']
---------
Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
### Summary
Partial solution to #1185.
Related to #1222.
Creates a decorator from the `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element is over `new_after_n_chars` characters, which
could occur with a long NarrativeText element.
Combines sections under these conditions:
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.
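
The rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `Element`, `split_into_sections`, `combine_small_sections`, and `chunk_by_title_sketch` are hypothetical names, "metadata changes" is reduced to a page-number change, and a new section is started when adding the next element would push the section past `new_after_n_chars`.

```python
from dataclasses import dataclass


@dataclass
class Element:
    category: str  # e.g. "Title" or "NarrativeText"
    text: str
    page: int = 1


def section_length(section):
    return sum(len(el.text) for el in section)


def split_into_sections(elements, multipage_sections=True, new_after_n_chars=1500):
    sections, current = [], []
    for el in elements:
        if current:
            page_changed = el.page != current[-1].page
            too_long = section_length(current) + len(el.text) > new_after_n_chars
            if (
                el.category == "Title"
                or (page_changed and not multipage_sections)
                or too_long
            ):
                sections.append(current)
                current = []
        # Elements are never split, so one long element can exceed the limit.
        current.append(el)
    if current:
        sections.append(current)
    return sections


def combine_small_sections(sections, combine_under_n_chars=500):
    combined = []
    for section in sections:
        if (
            combined
            and section_length(combined[-1]) < combine_under_n_chars
            and section_length(section) < combine_under_n_chars
        ):
            combined[-1] = combined[-1] + section  # merge adjacent short sections
        else:
            combined.append(section)
    return combined


def chunk_by_title_sketch(
    elements, multipage_sections=True, combine_under_n_chars=500, new_after_n_chars=1500
):
    sections = split_into_sections(elements, multipage_sections, new_after_n_chars)
    return combine_small_sections(sections, combine_under_n_chars)
```

Splitting and combining are kept as two passes so that each rule can be checked in isolation.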
### Testing
```
from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")
for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
Update `test_json` so that it no longer uses auto partition, because of its dependencies. Previously, running `test_json` required a full installation of the libraries needed to read file types such as docx and pptx, among others, so the test raised an error with the base installation. This update also adds checks to other test files to verify that they are invariant under `elements_to_json`.
* add auto_paragraph_grouper. add line break pattern.
* combine group_broken_paragraph and blank_line_grouper function
* fix make check errors
* fix make check errors
* fix make check errors
* fix make check errors
* run make tidy to fix errors
* tidy core.py and text.py
* fix blank-line breaker to extend the result and replace new line with space
* fix function name typo
* call group_broken_paragraphs for blank_line_grouper
* edit function name from one_line_grouper to new_line_grouper for consistency
* edit threshold from 0.5 to 0.1
* edit threshold from 0.5 to 0.1
* Revert "call group_broken_paragraphs for blank_line_grouper"
This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9.
* revert to commit 8fb93b7 and change threshold from 0.5 to 0.1
* edit test_text assertion. remove all BULLETS_PATTERN.
* Update ingest test fixtures (#1052)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* edit test case in test_xml_partition
* update assertion on test_auto
---------
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local>
Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com>
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add min_partition
* functioning _split_content_to_fit_min_max
* create test and make tidy/check
* fix rebase issues
* fix type hinting, remove unused code, add tests
* various changes and refactoring of methods
* add test, refactor, change var names for debugging purposes
* update test
* make tidy/check
* give more descriptive var names and add comments
* update xml partition via partition_text and create test
* fix <pre> bug for test_partition_html_with_pre_tag
* make tidy
* refactor and fix tests
* make tidy/check
* ingest-test-fixtures-update
* change list comprehension to for loop
* fix error check
* fix conflicts
* add tests and clean metadata_filename in partitions
* fix test_email and remove comments
* make tidy/check
* update changelog and version
* fix tests
* make tidy again
* add include_metadata kwarg and tests to parsers
add exclude_metadata to docx
add test for doc to exclude metadata
add include_metadata kwarg to email
add include_metadata kwarg to epub
add include_metadata kwarg to json
add exclude_metadata tests to md
add include_metadata kwarg and tests for msg parse
add include_metadata kwarg and tests for odt parse
add include_metadata kwarg and tests for org parse
add include_metadata kwarg and tests for ppt and pptx parse
add include_metadata kwarg and tests for rst parse
add include_metadata kwarg and tests for rtf parse
add include_metadata tests for text parse
add include_metadata tests for tsv parse
add include_metadata tests for xlsx parse
add include_metadata tests for xml parse
* WIP add include_metadata to partition_pdf
* add include_metadata tests to partition_pdf
* make tidy/check
* update changelog and version
* change test asserts and move docstring logic to process_metadata
* make tidy
* fix tests asserts
* linting, linting, linting
* sync versions
* skip api call test not on main
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* add max partition size logic
* work splitting logic into split_by_paragraph
* pass through max_partition to other functions
* added test for splitting long document
* add type hint
* add documentation
* version and changelog
* ingest-test-fixtures-update
* Update ingest test fixtures (#819)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* retrigger ci
* ingest-test-fixtures-update
* ingest-test-fixtures-update
* Update ingest test fixtures (#821)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* update default for partition_xml
* update version for release
* update msg doc string
---------
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* first pass on regex metadata
* fix typing for regex metadata
* add dataclass back in
* add decorators
* fix tests
* update docs
* add tests for regex metadata
* add process metadata to tsv
* changelog and version
* docs typos
* consolidate to using a single kwarg
* fix test
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file into bytes, and the PARAGRAPH_PATTERN (a string) could then not be matched against the bytes content.
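
The failure mode described here is easy to reproduce with the standard `re` module: a `str` pattern cannot be applied to `bytes`. The sketch below shows the error and one possible guard. It is not necessarily the fix made in this commit; the pattern value and `split_paragraphs` are hypothetical stand-ins.

```python
import re

# Hypothetical stand-in for the real pattern (assumption for illustration).
PARAGRAPH_PATTERN = r"\s*\n\s*"

raw = b"first paragraph\n\nsecond paragraph"

# Applying a str pattern to bytes raises TypeError.
try:
    re.split(PARAGRAPH_PATTERN, raw)
    mixed_types_failed = False
except TypeError:
    mixed_types_failed = True


def split_paragraphs(text):
    """Split text on blank-ish line breaks, decoding bytes input first."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")  # assumption: the file is UTF-8 encoded
    return [part for part in re.split(PARAGRAPH_PATTERN, text) if part]


print(split_paragraphs(raw))
```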
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix