feat: improved mapping for missing chipper elements (#1431)

This PR updates [TYPE_TO_TEXT_ELEMENT_MAP](bd33a52ee0/unstructured/documents/elements.py (L551)) with additional mappings for `chipper` elements:
```
"Threading": NarrativeText,
"Form": NarrativeText,
"Field-Name": Title,
"Value": NarrativeText,
"Link": NarrativeText,
```
2025-12-18 02:34:13 +00:00 · 2023-09-15 13:05:40 -07:00 · 2023-09-15 13:05:40 -07:00 · fe11ab4235
commit fe11ab4235
parent 50db2abd9f
2 changed files with 14 additions and 1 deletion
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,7 +2,15 @@

 ### Enhancements

-* **Adds `chipper` element types to mapping: `Headline`, `Subheadline` and `Abstract`.** Mapped respectevely to `Title` (with `category_depth=1`), `Title` (with `category_depth=2`), and `NarrativeText`.
+* **Adds `chipper` element types to mapping:**
+  * "Threading": `NarrativeText`
+  * "Form": `NarrativeText`
+  * "Field-Name": `Title`
+  * "Value": `NarrativeText`
+  * "Link": `NarrativeText`
+  * "Headline": `Title` (with `category_depth=1`)
+  * "Subheadline": `Title` (with `category_depth=2`)
+  * "Abstract": `NarrativeText`
 * **Better ListItem grouping for PDF's (fast strategy).** The `partition_pdf` with `fast` strategy previously broke down some numbered list item lines as separate elements. This enhancement leverages the x,y coordinates and bbox sizes to help decide whether the following chunk of text is a continuation of the immediate previous detected ListItem element or not, and not detect it as its own non-ListItem element.
 * **Fall back to text-based classification for uncategorized Layout elements for Images and PDF's**. Improves element classification by running existing text-based rules on previously UncategorizedText elements
 * **Adds table partitioning for Partitioning for many doc types including: .html, .epub., .md, .rst, .odt, and .msg.** At the core of this change is the .html partition functionality, which is leveraged by the other effected doc types. This impacts many scenarios where `Table` Elements are now propery extracted.
--- a/unstructured/documents/elements.py
+++ b/unstructured/documents/elements.py
@@ -578,4 +578,9 @@ TYPE_TO_TEXT_ELEMENT_MAP: Dict[str, Any] = {
    "Headline": Title,
    "Subheadline": Title,
    "Abstract": NarrativeText,
+    "Threading": NarrativeText,
+    "Form": NarrativeText,
+    "Field-Name": Title,
+    "Value": NarrativeText,
+    "Link": NarrativeText,
 }
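
As a rough illustration of how a type-to-class mapping like this one might be consumed, here is a minimal, self-contained sketch. The `Text`, `Title`, and `NarrativeText` classes below are simplified stand-ins rather than the library's real element classes. `CHIPPER_TYPE_MAP` and `to_element` are hypothetical names introduced for this example. Falling back to a plain `Text` for unknown labels is an assumption of the sketch and is not confirmed by this diff.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class Text:
    """Simplified stand-in for a text element (not the library's class)."""
    text: str


class Title(Text):
    """Stand-in for a heading-like element."""


class NarrativeText(Text):
    """Stand-in for a body-text element."""


# Hypothetical copy containing only the entries visible in this diff.
CHIPPER_TYPE_MAP: Dict[str, Type[Text]] = {
    "Headline": Title,
    "Subheadline": Title,
    "Abstract": NarrativeText,
    "Threading": NarrativeText,
    "Form": NarrativeText,
    "Field-Name": Title,
    "Value": NarrativeText,
    "Link": NarrativeText,
}


def to_element(chipper_type: str, text: str) -> Text:
    """Build an element for a chipper label.

    Assumption: labels missing from the map fall back to plain Text.
    """
    element_cls = CHIPPER_TYPE_MAP.get(chipper_type, Text)
    return element_cls(text)


if __name__ == "__main__":
    for label in ("Field-Name", "Value", "Link", "SomethingElse"):
        print(label, "->", type(to_element(label, "example")).__name__)
```

Note that the lookup is an exact string match, so the hyphenated key `"Field-Name"` matters: in this sketch an unhyphenated `"FieldName"` would not match and would take the fallback path.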