unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-15 18:17:09 +00:00

Author	SHA1	Message	Date
Steve Canny	d9c2516364	fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)

Executive Summary. Introducing strict type-checking as preparation for adding the chunk-overlap feature revealed a type mismatch for regex-metadata between chunking tests and the (authoritative) ElementMetadata definition. The implementation of regex-metadata aspects of chunking passed the tests but did not produce the appropriate behaviors in production where the actual data-structure was different. This PR fixes these two bugs.

1. Over-chunking. The presence of `regex-metadata` in an element was incorrectly being interpreted as a semantic boundary, leading to such elements being isolated in their own chunks.
2. Discarded regex-metadata. regex-metadata present on the second or later elements in a section (chunk) was discarded.

Technical Summary

The type of `ElementMetadata.regex_metadata` is `Dict[str, List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text": "this matched", "start": 7, "end": 19}`. Multiple regexes can be specified, each with a name like "mail-stop", "version", etc. Each of those may produce its own set of matches, like:

```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```

Forensic analysis

* The regex-metadata feature was added by Matt Robinson on 06/16/2023 commit: 4ea71683. The regex_metadata data structure is the same as when it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023 commit: f6a745a7. The mistaken regex-metadata data structure in the tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data-structure and insufficient type-checking rigor (type-checker strictness level set too low) to warn of the mistake.

Over-chunking Behavior

The over-chunking looked like this:

Chunking three elements with regex metadata should combine them into a single chunk (`CompositeElement` object), subject to maximum size rules (default 500 chars).

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```

Observed behavior looked like this:

```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```

The fix changed the approach from breaking on any metadata field not in a specified group (`regex_metadata` was missing from this group) to only breaking on specified fields (whitelisting instead of blacklisting). This avoids overchunking every time we add a new metadata field and is also simpler and easier to understand. This change in approach is discussed in more detail here #1790.

Dropping regex-metadata Behavior

Chunking this section:

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```

...should produce this regex_metadata on the single produced chunk:

```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```

but instead produced this:

```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```

Which is the regex-metadata from the first element only.

The fix was to remove the consolidation+adjustment process from inside the "list-attribute-processing" loop (because regex-metadata is not a list) and process regex metadata separately.	2023-10-19 22:16:02 -05:00
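The consolidation and offset adjustment described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function name `consolidate_regex_metadata` and its parameters are hypothetical, and it assumes element texts are joined with a blank-line separator (`"\n\n"`), as the expected chunk text in the commit message suggests.

```python
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    """Local stand-in mirroring the shape described in the commit message."""

    text: str
    start: int
    end: int


def consolidate_regex_metadata(
    element_texts: List[str],
    element_regex_metas: List[Dict[str, List[RegexMetadata]]],
    separator: str = "\n\n",  # assumption: elements are joined with a blank line
) -> Dict[str, List[RegexMetadata]]:
    """Merge per-element regex matches, shifting offsets into chunk coordinates."""
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0  # position of the current element's first character in the chunk text
    for text, regex_meta in zip(element_texts, element_regex_metas):
        for name, matches in regex_meta.items():
            for match in matches:
                consolidated.setdefault(name, []).append(
                    RegexMetadata(
                        text=match["text"],
                        start=match["start"] + offset,
                        end=match["end"] + offset,
                    )
                )
        # The next element starts after this element's text plus the separator.
        offset += len(text) + len(separator)
    return consolidated
```

Applied to the three elements in the commit message, this sketch yields the adjusted offsets the commit lists as expected (for example, `dolor` moves from 12-17 to 25-30 because the title plus separator occupy 13 characters).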
unifyh	89bd2faaf7	fix: Fix various cases of HTML text missing after partition (#1587)

Fix 4 cases of text missing after partition:

1. Text immediately after `<body>`

```html
<body>
    missing1
    <div>hello</div>
</body>
```

2. Text inside container and immediately after `<br/>`

```html
<div>hello<br/>missing2</div>
```

3. Text immediately after a text opening tag, if said tag contains `<br/>`

```html
<p>missing3<br/>hello</p>
```

4. Text inside `<body>` if it is the only content (different cause from case 1)

```html
<body>missing4</body>
```

Also fix problem causing `test_unstructured/documents/test_html.py::test_exclude_tag_types` to not work as intended.

This will close GitHub Issue #1543	2023-10-03 04:17:51 +00:00
Austin Walker	f34c277bca	fix: add backwards compatibility to ElementMetadata (#1526)

Fixes https://github.com/Unstructured-IO/unstructured-api/issues/237

The problem: The `ElementMetadata` class was not able to ignore fields that it didn't know about. This surfaced in `partition_via_api`, when the hosted api schema is newer than the local `unstructured` version. In `ElementMetadata.from_json()` we get errors such as `TypeError: __init__() got an unexpected keyword argument 'parent_id'`.

The fix: The `from_json` methods for these dataclasses should drop any unexpected fields before calling `__init__`.

To verify: This shouldn't throw an error

```
from unstructured.staging.base import elements_from_json
import json

test_api_result = json.dumps([
    {
        "type": "Title",
        "element_id": "2f7cc75f6467bba468022c4c2875335e",
        "metadata": {
            "filename": "layout-parser-paper.pdf",
            "filetype": "application/pdf",
            "page_number": 1,
            "new_field": "foo",
        },
        "text": "LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis"
    }
])

elements = elements_from_json(text=test_api_result)
print(elements)
```
	2023-09-27 18:40:56 +00:00
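The fix described in the commit above (dropping unexpected fields before calling `__init__`) might look something like the sketch below. The `Metadata` dataclass and its `from_dict` method are hypothetical stand-ins used only for illustration; they are not the library's `ElementMetadata` or its actual method names.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for a metadata dataclass with a fixed set of fields."""

    filename: Optional[str] = None
    filetype: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_dict(cls, input_dict: Dict[str, Any]) -> "Metadata":
        # Keep only keys that correspond to declared dataclass fields, so a
        # payload from a newer schema does not raise TypeError in __init__.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in input_dict.items() if k in known})
```

With this approach, a payload containing an extra key such as `"new_field"` deserializes normally and the unknown key is silently ignored. Whether to ignore or log such keys is a design choice this sketch does not address.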
Benjamin Torres	5d193c8e5a	fix/bad formed formula (#1481)

@ron-unstructured reported that loading files with:

```
from unstructured.partition.pdf import partition_pdf

elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox")
print(elements_yolox)
```

throws an error. After debugging the execution, I found that an object of class Formula is being created, but this class doesn't define an `__init__` method. This PR fixes the issue by adding a constructor that initializes the element with an empty string.

The file can be found at: https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing

After this PR is merged, this file is processed correctly.	2023-09-23 02:36:22 +00:00
Amanda Cameron	a501d1d18f	Adding table extraction to partition_html (#1324)

Adding table extraction to HTML partitioning. This PR utilizes 'table' HTML elements to extract and parse HTML tables and return them in partitioning.

```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```

You should see the table in the elements list!	2023-09-11 11:14:11 -07:00
Matt Robinson	fa5a3dbd81	feat: `unique_element_ids` kwarg for UUID elements (#1085) * added kwarg for unique elements * test for unique ids * update docs * changelog and version	2023-08-11 11:02:37 +00:00
Chris Pappalardo	ef5091f276	feat: added UUID option for element_id arg in element constructor (#1076) * added UUID option for element_id arg in element constructor and updated unit tests * updated CHANGELOG and bumped to dev2	2023-08-09 18:32:20 -04:00
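As a rough illustration of the idea behind the two commits above (content-derived IDs versus UUIDs), consider the sketch below. The helper `make_element_id` and its `unique` flag are hypothetical, and the use of a truncated SHA-256 digest for the deterministic case is an assumption made for the example, not a description of how the library computes IDs.

```python
import hashlib
import uuid


def make_element_id(text: str, unique: bool = False) -> str:
    """Return an ID for an element (hypothetical helper, for illustration only).

    unique=False: deterministic ID derived from the text, so identical text
    always yields the same ID.
    unique=True: random UUID, so every element gets a distinct ID even when
    the text repeats.
    """
    if unique:
        return str(uuid.uuid4())
    # Assumption for this sketch: a truncated SHA-256 hex digest of the text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
```

The trade-off is that deterministic IDs are reproducible across runs but collide for repeated text (for example, a recurring page header), while UUIDs never collide but change on every run.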
Matt Robinson	f4ddf53590	feat: track emphasized text in `partition_html` (#1034)

* Feat/965 track emphasized text html (#1021)
* feat: add functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML
* feat: add `include_tail_text` parameter to `_construct_text`
* test: add test case for `_get_emphasized_texts_from_tag`
* test: add `emphasized_texts` to metadata
* chore: update changelog & version
* fix tests
* fix lint errors
* chore: update changelog
* chore: small comment updates
* feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag
* chore: update changelog
* Update ingest test fixtures (#1026)

Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* ingest-test-fixtures-update
* Update ingest test fixtures (#1035)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-08-03 16:24:25 +00:00
John	676c50a6ec	feat: add min_partition kwarg that combines elements below a specified threshold (#926)

* add min_partition
* functioning _split_content_to_fit_min_max
* create test and make tidy/check
* fix rebase issues
* fix type hinting, remove unused code, add tests
* various changes and refactoring of methods
* add test, refactor, change var names for debugging purposes
* update test
* make tidy/check
* give more descriptive var names and add comments
* update xml partition via partition_text and create test
* fix <pre> bug for test_partition_html_with_pre_tag
* make tidy
* refactor and fix tests
* make tidy/check
* ingest-test-fixtures-update
* change list comprehension to for loop
* fix error check	2023-07-24 15:57:24 +00:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827)	2023-07-05 11:25:11 -07:00
qued	db4c5dfdf7	feat: coordinate systems (#774) Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.	2023-06-20 11:19:55 -05:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596)

* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesnt need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00
qued	dc4147d7df	feat: extract tables (#503) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
jonvet	7f0f33ddb0	fix: encode xml string if document_tree is `None` in `_read_xml` (#477) * fix: encode xml string if document_tree is `None` in `_read_xml` * don't encode text in test	2023-04-13 09:09:58 -04:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Tom Aarsen	350c4230ee	fix: Remove JavaScript from HTML reader output (#313) * Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.	2023-02-28 14:24:24 -08:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282)

* Apply import sorting
  ruff . --select I --fix
* Remove unnecessary open mode parameter
  ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
  Also use "" instead of str()
* Resolve missing trailing commas
  ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
  ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
  ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
  ruff . --select SIM --fix
* Import without unnecessary alias
  ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
  Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match 0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
  Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177)

* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix	2023-01-26 15:52:25 +00:00
Mallori Harrell	e0a76effff	feat: Added `EmailElement` for email documents (#103) * new EmailElement data structure	2022-12-21 16:03:44 -06:00
Matt Robinson	4f6fc29b54	fix: `partition_html` should process container divs that include text (#110) * check for containers with text * added tests for containers with text * changelog and version bump	2022-12-21 21:51:04 +00:00
Matt Robinson	1d68bb2482	feat: `apply` method to apply cleaning bricks to elements (#102) * add apply method to apply cleaners to elements * bump version * add check for string output * documentations for the apply method * change interface to *cleaners	2022-12-15 22:19:02 +00:00
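The `apply` interface mentioned in the commit above (a variadic `*cleaners` argument plus a check that the output is a string) could be sketched as follows. `TextElement` is a hypothetical class defined only for this example, and the choice to validate after each cleaner rather than once at the end is an assumption of the sketch.

```python
from typing import Callable


class TextElement:
    """Hypothetical minimal element holding a text string."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner over the text in order and store the result."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
            # Guard against cleaners that do not return a string.
            if not isinstance(cleaned, str):
                raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned
```

A call such as `element.apply(str.strip, str.lower)` would then run the cleaners left to right, and the element's text is left unchanged if any cleaner fails the string check.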
Mallori Harrell	53fcf4e912	chore: Remove PDF parsing code and dependencies (#75) Remove PDF parsing code and dependencies.	2022-11-21 11:47:29 -06:00
qued	9906dd23a1	fix: move _read out of base Document class Changed where _read sits in the inheritance structure since PDFDocument doesn't really need lazy document processing	2022-11-14 13:34:42 -06:00
Matt Robinson	704d6e11d1	chore: Update PDFDocument to use from_file method (#35) * update PDFDocument to use from_file method * bump version	2022-10-13 16:04:30 +00:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

26 Commits