29 Commits

Author SHA1 Message Date
Steve Canny
74d089d942
rfctr: skip CheckBox elements during chunking (#2253)
`CheckBox` elements get special treatment during chunking. `CheckBox`
does not derive from `Text` and can contribute no text to a chunk. It is
considered "non-combinable" and so is emitted as-is as a chunk of its
own. A consequence of this is it breaks an otherwise contiguous chunk
into two wherever it occurs.

This is problematic, but becomes much more so when overlap is
introduced. Each chunk accepts a "tail" text fragment from its preceding
element and contributes its own tail fragment to the next chunk. These
tails represent the "overlap" between chunks. However, a non-text chunk
can neither accept nor provide a tail-fragment and so interrupts the
overlap. None of the possible solutions are terrific.

Give `Element` a `.text` attribute such that _all_ elements have a
`.text` attribute, even though its value is the empty-string for
element-types such as CheckBox and PageBreak which inherently have no
text. As a consequence, several `cast()` wrappers are no longer required
to satisfy strict type-checking.

This also allows a `CheckBox` element to be combined with `Text`
subtypes during chunking, essentially the same way `PageBreak` is,
contributing no text to the chunk.

Also, remove the `_NonTextSection` object which previously wrapped a
`CheckBox` element during pre-chunking as it is no longer required.
2023-12-13 20:22:25 +00:00
John
2f553333bd
refactor text.py (#1872)
### Summary
Closes #1520 
Partial solution to #1521 

- Adds an abstraction layer between the user API and the partitioner
implementation
- Adds comments explaining paragraph chunking
- Makes edits to pass strict type-checking for both text.py and
test_text.py
2023-11-01 17:44:55 -05:00
John
6d7fe3ab02
fix: default to None for the languages metadata field (#1743)
### Summary
Closes #1714
Changes the default value for `languages` to `None` for elements that
don't have text or the language can't be detected.

### Testing
```
from unstructured.partition.auto import partition
filename = "example-docs/handbook-1p.docx"
elements = partition(filename=filename, detect_language_per_element=True)

# PageBreak elements don't have text and will be collected here
none_langs = [element for element in elements if element.metadata.languages is None]
none_langs[0].text
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-14 22:46:24 +00:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
Benjamin Torres
e0201e9a11
feat/add sources from unstructured inference (#1538)
This PR adds support for `source` property from
`unstructured_inference`, allowing the user to be able to see the origin
of the data under `detection_origin`field environment variable
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true

In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
And will print 'yolox' as source for all the elements
2023-10-05 20:26:47 +00:00
shreyanid
32bfebccf7
feat: introduce language detection function for text partitioning function (#1453)
### Summary
Uses `langdetect` to detect all languages present in the input document.

### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization

### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']

---------

Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-09-26 18:09:27 +00:00
John
c58b261feb
chunk_by_title decorator (#1304)
### Summary

Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.

Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.

### Testing

from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
2023-09-11 21:00:14 +00:00
Klaijan
675a10ea69
fix: update test_json to not use auto partition (#1187)
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
2023-08-29 16:59:26 -04:00
Matt Robinson
fa5a3dbd81
feat: unique_element_ids kwarg for UUID elements (#1085)
* added kwarg for unique elements

* test for unique ids

* update docs

* changelog and version
2023-08-11 11:02:37 +00:00
Klaijan
ad386af8b5
Klaijan/auto paragraph grouper (#994)
* add auto_paragraph_grouper. add line break pattern.

* combine group_broken_paragraph and blank_line_grouper function

* fix make check errors

* fix make check errors

* fix make check errors

* fix make check errors

* run make tidy to fix errors

* tidy core.py and text.py

* fix blank-line breaker to extends the result and replace new line with space

* fix function name typo

* call group_broken_paragraphs for blank_line_grouper

* edit function name from one_line_grouper to new_line_grouper for consistency

* edit threshold from 0.5 to 0.1

* edit threshold from 0.5 to 0.1

* Revert "call group_broken_paragraphs for blank_line_grouper"

This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9.

* revert to commit 8fb93b7 and change threshold from 0.5 to 0.1

* edit test_text assertion. remove all BULLETS_PATTERN.

* Update ingest test fixtures (#1052)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* edit test case in test_xml_partition

* update assertion on test_auto

---------

Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local>
Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com>
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-07 18:37:18 -04:00
Matt Robinson
a1ef6248bf
fix: simplify min_partition logic for partition_text (#1032)
* min simplify first pass

* update tests

* better max partition default

* version and changelog
2023-08-04 13:32:42 +00:00
cragwolfe
13d3559fa4
chore: rename Element's "date" field to "last_modified" (#997)
Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.
2023-08-01 02:55:43 +00:00
Matt Robinson
d9aed66b65
feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-07-26 15:10:14 -04:00
John
676c50a6ec
feat: add min_partition kwarg to that combines elements below a specified threshold (#926)
* add min_partition

* functioning _split_content_to_fit_min_max

* create test and make tidy/check

* fix rebase issues

* fix type hinting, remove unused code, add tests

* various changes and refactoring of methods

* add test, refactor, change var names for debugging purposes

* update test

* make tidy/check

* give more descriptive var names and add comments

* update xml partition via partition_text and create test

* fix <pre> bug for test_partition_html_with_pre_tag

* make tidy

* refactor and fix tests

* make tidy/check

* ingest-test-fixtures-update

* change list comprehension to for loop

* fix error check
2023-07-24 15:57:24 +00:00
John
dc6d7d7268
feat: add metadata_filename parameter across all partition functions (#811)
* fix conflicts

* add tests and clean metadata_filename in partitions

* fix test_email and remove comments

* make tidy/check

* update changelog and version

* fix tests

* make tidy again
2023-07-05 16:02:22 -04:00
John
e9fdbb0943
feat: add include_metadata across all partition functions (#853)
* add include_metadata kwarg and tests to parsers

add exclude_metadata to docx

add test for doc to exclude metadata

add include_metadata kwarg to email

add include_metadata kwarg to epub

add include_metadata kwarg to json

add exclude_metadata tests to md

add include_metadata kwarg and tests for msg parse

add include_metadata kwarg and tests for odt parse

add include_metadata kwarg and tests for org parse

add include_metadata kwarg and tests for ppt and pptx parse

add include_metadata kwarg and tests for rst parse

add include_metadata kwarg and tests for rtf parse

add include_metadata tests for text parse

add include_metadata tests for tsv parse

add include_metadata tests for xlsx parse

add include_metadata tests for xml parse

* WIP add include_metadata to partition_pdf

* add include_metadata tests to partition_pdf

* make tidy/check

* update changelog and version

* change test asserts and move docstring logic to process_metadata

* make tidy

* fix tests asserts

* linting, linting, linting

* sync versions

* skip api call test not on main

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-30 10:44:46 -04:00
Matt Robinson
44411ecc59
enhancement: max_partition kwarg for limiting element size (#818)
* add max partition size logic

* work splitting logic into split_by_paragraph

* pass through max_partition to other functions

* added test for splitting long document

* add type hint

* add documentation

* version and changelog

* ingest-test-fixtures-update

* Update ingest test fixtures (#819)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* retrigger ci

* ingest-test-fixtures-update

* ingest-test-fixtures-update

* Update ingest test fixtures (#821)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* update default for partition_xml

* update version for release

* update msg doc string

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-28 15:26:01 -04:00
Matt Robinson
06077b09ee
fix: don't detect line breaks as list items (#831)
* add negative lookahead to bullet pattern

* version and changelog

* update paragraph pattern

* add list item assert
2023-06-28 12:49:12 -04:00
Matt Robinson
4ea716837d
feat: add ability to extract extra metadata with regex (#763)
* first pass on regex metadata

* fix typing for regex metadata

* add dataclass back in

* add decorators

* fix tests

* update docs

* add tests for regex metadata

* add process metadata to tsv

* changelog and version

* docs typos

* consolidate to using a single kwarg

* fix test
2023-06-16 10:10:56 -04:00
Christine Straub
a1fed6d4c6
Issue/unicode error (#608)
This PR adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
2023-05-23 13:35:38 -07:00
Matt Robinson
c99c099158
feat: enable grouping broken paragraphs in partition_text (#456)
* cleaning brick to group broken paragraphs

* docs for group_broken_paragraphs

* add docs for partition_text with grouper

* partition_text and auto with paragraph_grouper

* version and changelog

* typo in the docs

* linting, linting, linting

* switch to using regular expressions
2023-04-06 18:35:22 +00:00
Amanda Cameron
555b95b8f7
Fixing test for unstructured-api (#425)
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file.
2023-04-03 11:12:12 -07:00
Matt Robinson
09b52b4fc4
fix: text kwargs no longer fail with empty string (#413)
* fix: text kwargs no longer fail with empty string

* linting
2023-03-28 21:03:51 +00:00
Amanda Cameron
64efcc0e50
Adding optional encoding arg, and text_partition tests (#339) 2023-03-06 15:07:33 -08:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
339c133326
fix: cleanup from live .docx tests (#177)
* add env var for cap threshold; raise default threshold

* update docs and tests

* added check for ending in a comma

* update docs

* no caps check for all upper text

* capture Text in html and text

* check category in Text equality check

* lower case all caps before checking for verbs

* added check for us city/state/zip

* added address type

* add address to html

* add address to text

* fix for text tests; escape for large text segments

* refactor regex for readability

* update comment

* additional test for text with linebreaks

* update docs

* update changelog

* update elements docs

* remove old comment

* case -> cast

* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition (#150)
* added partition_text for auto

* rename partition_text tests

* bump version and update docs
2023-01-13 16:39:53 -05:00
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails (#111)
* partition_text function
2023-01-09 17:08:08 +00:00