From fe11ab4235ad2b2bc8328a036b4da33b7392f8fb Mon Sep 17 00:00:00 2001
From: Sebastian Laverde Alfonso
Date: Fri, 15 Sep 2023 13:05:40 -0700
Subject: [PATCH] feat: improved mapping for missing chipper elements (#1431)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This PR updates [TYPE_TO_TEXT_ELEMENT_MAP](https://github.com/Unstructured-IO/unstructured/blob/bd33a52ee0fcec3928db171e6d717e70521c5aef/unstructured/documents/elements.py#L551) with additional mappings for `chipper` elements:
```
"Threading": NarrativeText,
"Form": NarrativeText,
"Field-Name": Title,
"Value": NarrativeText,
"Link": NarrativeText,
```
---
 CHANGELOG.md                       | 10 +++++++++-
 unstructured/documents/elements.py |  5 +++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7e7d86c17..474d97323 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,7 +2,15 @@
 
 ### Enhancements
 
-* **Adds `chipper` element types to mapping: `Headline`, `Subheadline` and `Abstract`.** Mapped respectevely to `Title` (with `category_depth=1`), `Title` (with `category_depth=2`), and `NarrativeText`.
+* **Adds `chipper` element types to mapping:**
+  * "Threading": `NarrativeText`
+  * "Form": `NarrativeText`
+  * "Field-Name": `Title`
+  * "Value": `NarrativeText`
+  * "Link": `NarrativeText`
+  * "Headline": `Title` (with `category_depth=1`)
+  * "Subheadline": `Title` (with `category_depth=2`)
+  * "Abstract": `NarrativeText`
 * **Better ListItem grouping for PDF's (fast strategy).** The `partition_pdf` with `fast` strategy previously broke down some numbered list item lines as separate elements. This enhancement leverages the x,y coordinates and bbox sizes to help decide whether the following chunk of text is a continuation of the immediate previous detected ListItem element or not, and not detect it as its own non-ListItem element.
 * **Fall back to text-based classification for uncategorized Layout elements for Images and PDF's**. Improves element classification by running existing text-based rules on previously UncategorizedText elements
 * **Adds table partitioning for Partitioning for many doc types including: .html, .epub., .md, .rst, .odt, and .msg.** At the core of this change is the .html partition functionality, which is leveraged by the other effected doc types. This impacts many scenarios where `Table` Elements are now propery extracted.
diff --git a/unstructured/documents/elements.py b/unstructured/documents/elements.py
index f4b45418f..98b7be743 100644
--- a/unstructured/documents/elements.py
+++ b/unstructured/documents/elements.py
@@ -578,4 +578,9 @@ TYPE_TO_TEXT_ELEMENT_MAP: Dict[str, Any] = {
     "Headline": Title,
     "Subheadline": Title,
     "Abstract": NarrativeText,
+    "Threading": NarrativeText,
+    "Form": NarrativeText,
+    "Field-Name": Title,
+    "Value": NarrativeText,
+    "Link": NarrativeText,
 }
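
Reviewer note: the sketch below illustrates how a type-to-class mapping like the one extended in this patch might be consumed. It is a minimal, self-contained illustration and not the library's code: the `Text`, `Title`, and `NarrativeText` classes are simplified stand-ins, and `CHIPPER_TYPE_MAP`, `to_element`, and the fallback to plain `Text` are hypothetical names and behavior assumed for this example. Only the eight string keys and their target class names come from the diff.

```python
from typing import Dict, Type


class Text:
    """Simplified stand-in for a base text element (not the library class)."""

    def __init__(self, text: str) -> None:
        self.text = text


class Title(Text):
    """Stand-in for the Title element."""


class NarrativeText(Text):
    """Stand-in for the NarrativeText element."""


# Hypothetical copy of the chipper-related entries shown in the diff.
CHIPPER_TYPE_MAP: Dict[str, Type[Text]] = {
    "Headline": Title,
    "Subheadline": Title,
    "Abstract": NarrativeText,
    "Threading": NarrativeText,
    "Form": NarrativeText,
    "Field-Name": Title,
    "Value": NarrativeText,
    "Link": NarrativeText,
}


def to_element(element_type: str, text: str) -> Text:
    """Instantiate the mapped class; unmapped types fall back to Text (assumed)."""
    element_cls = CHIPPER_TYPE_MAP.get(element_type, Text)
    return element_cls(text)


if __name__ == "__main__":
    for kind in ("Field-Name", "Value", "FieldName"):
        print(kind, "->", type(to_element(kind, "example")).__name__)
```

Because the lookup is an exact string match, the key must be spelled `Field-Name` with the hyphen; in this sketch an unhyphenated `FieldName` would miss the mapping and take the fallback.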