mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-09-14 19:19:56 +00:00
feat: improved mapping for missing chipper elements (#1431)
This PR updates [TYPE_TO_TEXT_ELEMENT_MAP](bd33a52ee0/unstructured/documents/elements.py (L551)) with additional mappings for `chipper` elements:
```
"Threading": NarrativeText,
"Form": NarrativeText,
"Field-Name": Title,
"Value": NarrativeText,
"Link": NarrativeText,
```
parent
50db2abd9f
commit
fe11ab4235
10
CHANGELOG.md
@@ -2,7 +2,15 @@
 
 ### Enhancements
 
-* **Adds `chipper` element types to mapping: `Headline`, `Subheadline` and `Abstract`.** Mapped respectevely to `Title` (with `category_depth=1`), `Title` (with `category_depth=2`), and `NarrativeText`.
+* **Adds `chipper` element types to mapping:**
+  * "Threading": `NarrativeText`
+  * "Form": `NarrativeText`
+  * "Field-Name": `Title`
+  * "Value": `NarrativeText`
+  * "Link": `NarrativeText`
+  * "Headline": `Title` (with `category_depth=1`)
+  * "Subheadline": `Title` (with `category_depth=2`)
+  * "Abstract": `NarrativeText`
 * **Better ListItem grouping for PDF's (fast strategy).** The `partition_pdf` with `fast` strategy previously broke down some numbered list item lines as separate elements. This enhancement leverages the x,y coordinates and bbox sizes to help decide whether the following chunk of text is a continuation of the immediate previous detected ListItem element or not, and not detect it as its own non-ListItem element.
 * **Fall back to text-based classification for uncategorized Layout elements for Images and PDF's**. Improves element classification by running existing text-based rules on previously UncategorizedText elements
 * **Adds table partitioning for Partitioning for many doc types including: .html, .epub., .md, .rst, .odt, and .msg.** At the core of this change is the .html partition functionality, which is leveraged by the other effected doc types. This impacts many scenarios where `Table` Elements are now propery extracted.
5
unstructured/documents/elements.py
@@ -578,4 +578,9 @@ TYPE_TO_TEXT_ELEMENT_MAP: Dict[str, Any] = {
     "Headline": Title,
     "Subheadline": Title,
     "Abstract": NarrativeText,
+    "Threading": NarrativeText,
+    "Form": NarrativeText,
+    "Field-Name": Title,
+    "Value": NarrativeText,
+    "Link": NarrativeText,
 }
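To illustrate how a label-to-class table like the one in this diff might be consulted, here is a minimal, self-contained sketch. It does not import `unstructured`: `Text`, `Title`, and `NarrativeText` below are stand-in classes, and `CHIPPER_TYPE_MAP`, `CHIPPER_DEPTHS`, and `element_from_label` are hypothetical names introduced only for this example. The fallback to a plain `Text` element for unknown labels is an assumption, not behavior confirmed by this commit. The depth values are taken from the CHANGELOG entry.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Type


@dataclass
class Text:
    """Stand-in base element; NOT the real unstructured class."""

    text: str
    category_depth: Optional[int] = None


class Title(Text):
    """Stand-in for a title-like element."""


class NarrativeText(Text):
    """Stand-in for a narrative-text element."""


# The chipper-related entries shown in the diff (all other entries omitted).
CHIPPER_TYPE_MAP: Dict[str, Type[Text]] = {
    "Headline": Title,
    "Subheadline": Title,
    "Abstract": NarrativeText,
    "Threading": NarrativeText,
    "Form": NarrativeText,
    "Field-Name": Title,
    "Value": NarrativeText,
    "Link": NarrativeText,
}

# Hypothetical helper table: depths as listed in the CHANGELOG entry.
CHIPPER_DEPTHS: Dict[str, int] = {"Headline": 1, "Subheadline": 2}


def element_from_label(label: str, text: str) -> Text:
    """Build an element for a chipper label.

    Unknown labels fall back to the plain Text stand-in (an assumption
    made for this sketch).
    """
    cls = CHIPPER_TYPE_MAP.get(label, Text)
    return cls(text=text, category_depth=CHIPPER_DEPTHS.get(label))
```

With this sketch, `element_from_label("Field-Name", "Invoice number")` yields a `Title` stand-in with no depth set, while a `"Subheadline"` label yields a `Title` stand-in with `category_depth` of 2. Keeping the depths in a separate table mirrors the fact that the map in the diff stores only classes.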