feat: improved mapping for missing chipper elements (#1431)

This PR updates
[TYPE_TO_TEXT_ELEMENT_MAP](bd33a52ee0/unstructured/documents/elements.py (L551))
with additional mappings for `chipper` elements:

```
"Threading": NarrativeText,
"Form": NarrativeText,
"Field-Name": Title,
"Value": NarrativeText,
"Link": NarrativeText,
```
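
To illustrate how a type-to-element map of this kind can be consulted, here is a minimal, self-contained sketch. It does not import `unstructured`: the `Text`, `Title`, and `NarrativeText` classes below are simplified stand-ins, and `CHIPPER_ADDITIONS` and `to_element` are hypothetical names introduced only for this example. The keys are written as they appear in the diff to `elements.py`, and the fallback to a generic `Text` element for unknown types is an assumption, not the library's documented behavior.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class Text:
    """Simplified stand-in for a generic text element (not the library class)."""
    text: str


class Title(Text):
    """Stand-in for a title-like element."""


class NarrativeText(Text):
    """Stand-in for a narrative-text element."""


# The five entries added by this PR, keyed by the chipper type name.
CHIPPER_ADDITIONS: Dict[str, Type[Text]] = {
    "Threading": NarrativeText,
    "Form": NarrativeText,
    "Field-Name": Title,
    "Value": NarrativeText,
    "Link": NarrativeText,
}


def to_element(type_name: str, text: str) -> Text:
    """Hypothetical helper: pick the element class for a chipper type name.

    Unknown type names fall back to the generic Text stand-in (an assumption).
    """
    element_cls = CHIPPER_ADDITIONS.get(type_name, Text)
    return element_cls(text)
```

With this sketch, `to_element("Field-Name", "Invoice No.")` yields a `Title` instance, while the other four new types yield `NarrativeText`.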
Sebastian Laverde Alfonso 2023-09-15 13:05:40 -07:00 committed by GitHub
parent 50db2abd9f
commit fe11ab4235
2 changed files with 14 additions and 1 deletions


@@ -2,7 +2,15 @@
### Enhancements
* **Adds `chipper` element types to mapping: `Headline`, `Subheadline` and `Abstract`.** Mapped respectively to `Title` (with `category_depth=1`), `Title` (with `category_depth=2`), and `NarrativeText`.
* **Adds `chipper` element types to mapping:**
  * "Threading": `NarrativeText`
  * "Form": `NarrativeText`
  * "Field-Name": `Title`
  * "Value": `NarrativeText`
  * "Link": `NarrativeText`
  * "Headline": `Title` (with `category_depth=1`)
  * "Subheadline": `Title` (with `category_depth=2`)
  * "Abstract": `NarrativeText`
* **Better ListItem grouping for PDFs (fast strategy).** `partition_pdf` with the `fast` strategy previously split some numbered list item lines into separate elements. This enhancement uses the x,y coordinates and bbox sizes to decide whether the following chunk of text continues the immediately preceding ListItem element, rather than detecting it as its own non-ListItem element.
* **Fall back to text-based classification for uncategorized Layout elements for Images and PDFs.** Improves element classification by running the existing text-based rules on elements previously classified as UncategorizedText.
* **Adds table partitioning for many doc types including: .html, .epub, .md, .rst, .odt, and .msg.** At the core of this change is the .html partition functionality, which is leveraged by the other affected doc types. This impacts many scenarios where `Table` Elements are now properly extracted.


@@ -578,4 +578,9 @@ TYPE_TO_TEXT_ELEMENT_MAP: Dict[str, Any] = {
    "Headline": Title,
    "Subheadline": Title,
    "Abstract": NarrativeText,
    "Threading": NarrativeText,
    "Form": NarrativeText,
    "Field-Name": Title,
    "Value": NarrativeText,
    "Link": NarrativeText,
}