**Executive Summary.** Introducing strict type-checking in preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between the chunking tests and the (authoritative)
`ElementMetadata` definition. The regex-metadata handling in chunking
passed the tests but did not behave correctly in production, where the
actual data structure was different. This PR fixes the two resulting
bugs:
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```
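As a rough illustration of how a structure of this shape can arise, the sketch below runs a set of named patterns over a string with the standard `re` module. The helper name `collect_regex_metadata` and the sample text are hypothetical, and the local `RegexMetadata` class is a stand-in; this is not necessarily how the library computes the field.

```python
import re
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    """One regex match: the matched text and its character offsets."""

    text: str
    start: int
    end: int


def collect_regex_metadata(text: str, patterns: Dict[str, str]) -> Dict[str, List[RegexMetadata]]:
    """Hypothetical helper: map each pattern name to the list of its matches in `text`."""
    results: Dict[str, List[RegexMetadata]] = {}
    for name, pattern in patterns.items():
        matches = [
            RegexMetadata(text=m.group(), start=m.start(), end=m.end())
            for m in re.finditer(pattern, text)
        ]
        if matches:  # names with no matches are omitted in this sketch
            results[name] = matches
    return results


found = collect_regex_metadata(
    "Send to MS-107",
    {"mail-stop": r"MS-\d+", "version": r"v\d+\.\d+\.\d+"},
)
```

The point of the sketch is the shape: a dict keyed by pattern name whose values are lists of match records, not a flat list.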
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
(commit 4ea71683). The `regex_metadata` data structure is the same as
when it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
(commit f6a745a7). The mistaken regex-metadata data structure in the
tests is present in that commit.

This looks to me like a misremembering of the regex-metadata data
structure, combined with type-checking that was not strict enough
(type-checker strictness level set too low) to warn of the mistake.
**Over-chunking Behavior**
Expected behavior: chunking three elements with regex metadata should
combine them into a single chunk (`CompositeElement` object), subject to
maximum size rules (default 500 chars).
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
chunks = chunk_by_title(elements)
assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```
Observed behavior looked like this:
```python
chunks == [
    CompositeElement("Lorem Ipsum"),
    CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
    CompositeElement("In rhoncus ipsum sed lectus porta volutpat."),
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to
breaking only on specified fields (whitelisting instead of
blacklisting). This avoids over-chunking every time we add a new
metadata field and is also simpler and easier to understand. This change
in approach is discussed in more detail in #1790.
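The whitelisting idea can be sketched as follows. This is a simplified illustration using plain dicts, not the library's code: the function name `starts_new_chunk` and the choice of `section` as the only boundary field are assumptions made for the example.

```python
from typing import Any, Dict, Tuple

# Assumption: the fields that mark a semantic boundary; the real list may differ.
BOUNDARY_FIELDS: Tuple[str, ...] = ("section",)


def starts_new_chunk(prev_meta: Dict[str, Any], next_meta: Dict[str, Any]) -> bool:
    """Break only when one of the explicitly listed fields changes."""
    return any(prev_meta.get(f) != next_meta.get(f) for f in BOUNDARY_FIELDS)


title_meta = {
    "section": "Intro",
    "regex_metadata": {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]},
}
text_meta = {
    "section": "Intro",
    "regex_metadata": {"dolor": [{"text": "dolor", "start": 12, "end": 17}]},
}
other_meta = {"section": "Methods"}

same_chunk = not starts_new_chunk(title_meta, text_meta)
new_chunk = starts_new_chunk(text_meta, other_meta)
```

Because only listed fields are compared, a newly added metadata field (such as `regex_metadata` here) cannot trigger a break unless someone deliberately adds it to the list.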
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```
should produce this `regex_metadata` on the single resulting chunk:
```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
That is the regex-metadata from the first element only.
The fix was to remove the consolidation and offset-adjustment process
from inside the "list-attribute-processing" loop (because regex-metadata
is a dict, not a list) and process regex metadata separately.
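The consolidation with offset adjustment can be sketched like this. It is a minimal stand-alone illustration, assuming elements are joined with a `"\n\n"` separator as in the expected chunk text above; the function name `consolidate_regex_metadata` and the local `RegexMetadata` class are hypothetical stand-ins, not the library's implementation.

```python
from typing import Dict, List, Tuple, TypedDict


class RegexMetadata(TypedDict):
    text: str
    start: int
    end: int


def consolidate_regex_metadata(
    elements: List[Tuple[str, Dict[str, List[RegexMetadata]]]],
    separator: str = "\n\n",
) -> Dict[str, List[RegexMetadata]]:
    """Merge per-element matches, shifting offsets to positions in the joined chunk text."""
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0  # where the current element's text starts within the chunk text
    for text, regex_meta in elements:
        for name, matches in regex_meta.items():
            for m in matches:
                consolidated.setdefault(name, []).append(
                    RegexMetadata(
                        text=m["text"], start=m["start"] + offset, end=m["end"] + offset
                    )
                )
        offset += len(text) + len(separator)
    return consolidated


merged = consolidate_regex_metadata(
    [
        ("Lorem Ipsum", {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}),
        (
            "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
            {
                "dolor": [{"text": "dolor", "start": 12, "end": 17}],
                "ipsum": [{"text": "ipsum", "start": 6, "end": 11}],
            },
        ),
        (
            "In rhoncus ipsum sed lectus porta volutpat.",
            {"ipsum": [{"text": "ipsum", "start": 11, "end": 16}]},
        ),
    ]
)
```

Each element's matches are shifted by the combined length of the preceding element texts and separators, which is how the second element's `ipsum` match moves from 6-11 to 19-24.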
Fix 4 cases of text missing after partition:
1. Text immediately after `<body>`
```html
<body>
missing1
<div>hello</div>
</body>
```
2. Text inside container and immediately after `<br/>`
```html
<div>hello<br/>missing2</div>
```
3. Text immediately after a text opening tag, if said tag contains
`<br/>`
```html
<p>missing3<br/>hello</p>
```
4. Text inside `<body>` if it is the only content (different cause from
case 1)
```html
<body>missing4</body>
```
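One plausible reason text in these positions is easy to lose (an assumption, since the root cause is not spelled out here) is that ElementTree-style parsers store text that follows a child tag on that child's `tail`, not on the parent's `text`. A minimal sketch using the standard-library `xml.etree.ElementTree` on case 2:

```python
import xml.etree.ElementTree as ET

div = ET.fromstring("<div>hello<br/>missing2</div>")
br = div[0]

leading_text = div.text  # text before the first child
trailing_text = br.tail  # text after <br/>, stored on the <br/> element itself
all_text = "".join(div.itertext())  # itertext() walks both text and tail
```

Code that reads only `.text` on each element would see `hello` and never reach `missing2`.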
Also fix a problem that caused
`test_unstructured/documents/test_html.py::test_exclude_tag_types` to
not work as intended.
This will close GitHub issue #1543.
Fixes https://github.com/Unstructured-IO/unstructured-api/issues/237
The problem:
The `ElementMetadata` class was not able to ignore fields that it didn't
know about. This surfaced in `partition_via_api` when the hosted API
schema is newer than the local `unstructured` version. In
`ElementMetadata.from_json()` we get errors such as `TypeError:
__init__() got an unexpected keyword argument 'parent_id'`.
The fix:
The `from_json` methods for these dataclasses should drop any unexpected
fields before calling `__init__`.
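A minimal sketch of the idea, using a hypothetical `Metadata` dataclass and `from_dict` method in place of the real classes (the field names are taken from the example payload in this description, everything else is illustrative):

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for a metadata dataclass; not the real ElementMetadata."""

    filename: Optional[str] = None
    filetype: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_dict(cls, input_dict: Dict[str, Any]) -> "Metadata":
        known = {f.name for f in fields(cls)}
        # Drop keys the local class does not define, e.g. fields added by a newer API schema.
        return cls(**{k: v for k, v in input_dict.items() if k in known})


meta = Metadata.from_dict(
    {"filename": "layout-parser-paper.pdf", "page_number": 1, "new_field": "foo"}
)
```

Filtering against `dataclasses.fields` keeps the allowed set in sync with the class definition, so no separate list of field names has to be maintained.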
To verify:
This should not raise an error:
```
from unstructured.staging.base import elements_from_json
import json

test_api_result = json.dumps([
    {
        "type": "Title",
        "element_id": "2f7cc75f6467bba468022c4c2875335e",
        "metadata": {
            "filename": "layout-parser-paper.pdf",
            "filetype": "application/pdf",
            "page_number": 1,
            "new_field": "foo",
        },
        "text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
    }
])
elements = elements_from_json(text=test_api_result)
print(elements)
```
@ron-unstructured reported that loading files with:
```
from unstructured.partition.pdf import partition_pdf
elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox")
print(elements_yolox)
```
throws an error. After debugging the execution I found that an object of
class `Formula` is being created, but this class doesn't define an
`__init__` method. This PR solves the issue by adding a constructor that
defaults the element's text to an empty string.
The file can be found at:
https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing
After this PR is merged, this file is processed correctly.
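The constructor fix described above might look roughly like the following. The classes here are simplified, hypothetical stand-ins, assuming a base element whose constructor requires a text argument; the real class hierarchy may differ.

```python
class Text:
    """Hypothetical simplified base element that requires text."""

    def __init__(self, text: str) -> None:
        self.text = text


class Formula(Text):
    """Element whose text may be unknown at construction time."""

    def __init__(self, text: str = "") -> None:
        # Defaulting to an empty string lets callers write Formula() with no arguments.
        super().__init__(text)


formula = Formula()
explicit = Formula("E = mc^2")
```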
Add table extraction to HTML partitioning.
This PR uses `<table>` HTML elements to extract and parse HTML tables
and return them from partitioning.
```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
You should see the table in the elements list!
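To illustrate what extracting a table from `<table>` markup involves, here is a small sketch built on the standard-library `html.parser`. It is not the approach this PR takes inside `partition_html`; the `TableExtractor` class is hypothetical and does not handle nested tables.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableExtractor(HTMLParser):
    """Hypothetical helper: collect each <table> as a list of rows of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[List[str]]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:  # only keep text found inside a cell
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


extractor = TableExtractor()
extractor.feed("<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>")
tables = extractor.tables
```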
* add min_partition
* functioning _split_content_to_fit_min_max
* create test and make tidy/check
* fix rebase issues
* fix type hinting, remove unused code, add tests
* various changes and refactoring of methods
* add test, refactor, change var names for debugging purposes
* update test
* make tidy/check
* give more descriptive var names and add comments
* update xml partition via partition_text and create test
* fix `<pre>` bug for `test_partition_html_with_pre_tag`
* make tidy
* refactor and fix tests
* make tidy/check
* ingest-test-fixtures-update
* change list comprehension to for loop
* fix error check
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesn't need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
* typo fix
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a `noqa: SIM115`, but especially the one in `__init__` may need some attention. That said, that refactor is out of scope of this PR.
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
* add apply method to apply cleaners to elements
* bump version
* add check for string output
* documentations for the apply method
* change interface to *cleaners
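The last few commits above describe an `apply` method that takes `*cleaners` and checks for string output. A minimal sketch of that interface, using a hypothetical stripped-down `Text` class and standard `str` methods as cleaners (the error message is illustrative):

```python
from typing import Callable


class Text:
    """Hypothetical minimal element holding a string."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner over the text in order, then store the result."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned


element = Text("  Hello WORLD  ")
element.apply(str.strip, str.lower)
```

Accepting `*cleaners` lets callers chain several cleaning functions in one call without building a list first.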