fix: cleanup from live .docx tests (#177)

* add env var for cap threshold; raise default threshold

* update docs and tests

* added check for ending in a comma

* update docs

* no caps check for all upper text

* capture Text in html and text

* check category in Text equality check

* lower case all caps before checking for verbs

* added check for us city/state/zip

* added address type

* add address to html

* add address to text

* fix for text tests; escape for large text segments

* refactor regex for readability

* update comment

* additional test for text with linebreaks

* update docs

* update changelog

* update elements docs

* remove old comment

* case -> cast

* type fix
Matt Robinson 2023-01-26 10:52:25 -05:00 committed by GitHub
parent 1ce8447ba7
commit 339c133326
16 changed files with 208 additions and 33 deletions

View File

@ -1,3 +1,13 @@
## 0.4.5-dev0
* Loosen the default cap threshold to `0.5`.
* Add a `NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
* Unknown text elements are identified as `Text` for HTML and plain text documents.
* `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information
is insufficient to determine that the text is narrative.
* Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
* Adds an `Address` element for capturing elements that only contain an address.
## 0.4.4
* Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects

View File

@ -246,8 +246,10 @@ for consideration as narrative text. The function performs the following checks
* Text that does not contain a verb cannot be narrative text
* Text that exceeds the specified caps ratio cannot be narrative text. The threshold
is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
``cap_threshold=1.0``. You may want to ignore this check when dealing with text
that is all caps.
``cap_threshold=1.0``. You can also set the threshold by using the
``NARRATIVE_TEXT_CAP_THRESHOLD`` environment variable. The environment variable
takes precedence over the kwarg.
* The cap ratio test does not apply to text that is all uppercase.
Examples:
@ -277,8 +279,8 @@ for consideration as a title. The function performs the following checks:
* Empty text cannot be a title
* Text that is all numeric cannot be a title
* If a title contains more than one sentence that exceeds a certain length, it cannot be a title.
Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
* If a title contains more than one sentence that exceeds a certain length, it cannot be a title. Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
* If a segment of text ends in a comma, it is not considered a potential title. This is to avoid salutations like "To My Dearest Friends," getting flagged as titles.
Examples:
@ -320,7 +322,9 @@ Examples:
Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
be used independently as well. The function identifies verbs using the NLTK part of speech
tagger. The following part of speech tags are identified as verbs:
tagger. Text that is all upper case is lower cased before part of speech detection. This is
because the upper case letters sometimes cause the part of speech tagger to miss verbs.
The following part of speech tags are identified as verbs:
* ``VB``
* ``VBG``
@ -374,6 +378,9 @@ Determines if the section of text exceeds the specified caps ratio. Used in
``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
defaults to ``0.5``. Only runs on sections of text that are a single sentence.
You can also set the threshold using the ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment
variable. The environment variable takes precedence over the kwarg. The caps ratio
check does not apply to text that is all capitalized.
Examples:
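A minimal sketch of the two behaviors described above: the environment variable overriding the kwarg, and the all-uppercase exemption. The names `resolve_cap_threshold` and `simple_exceeds_cap_ratio` are hypothetical, whitespace splitting stands in for the NLTK tokenizer the library uses, and the single-sentence check is omitted for brevity.

```python
import os


def resolve_cap_threshold(cap_threshold: float = 0.5) -> float:
    """Return the environment variable value when set, otherwise the kwarg."""
    # Environment values arrive as strings, so cast to float.
    return float(os.environ.get("NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold))


def simple_exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """Simplified caps ratio check (assumption: whitespace tokens, no sentence check)."""
    # Text that is entirely upper case is exempt from the check.
    if text.isupper():
        return False

    tokens = text.split()
    if not tokens:
        return False

    capitalized = sum(word.istitle() or word.isupper() for word in tokens)
    return capitalized / len(tokens) > threshold
```

With the variable unset, `resolve_cap_threshold(0.3)` returns the kwarg; once `NARRATIVE_TEXT_CAP_THRESHOLD` is set, its value wins regardless of what the caller passes.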

View File

@ -11,6 +11,8 @@ elements.
* ``NarrativeText`` - Sections of a document that include well-formed prose. Sub-class of ``Text``.
* ``Title`` - Headings and sub-headings within a document. Sub-class of ``Text``.
* ``ListItem`` - A text element that is part of an ordered or unordered list. Sub-class of ``Text``.
* ``Address`` - A text item that consists only of an address. Sub-class of ``Text``.
* ``CheckBox`` - An element representing a check box. Has a ``checked`` element, which is a boolean indicating whether or not that box is checked.
#########################################

View File

@ -1,5 +1,7 @@
This is a test document to use for unit tests.
Doylestown, PA 18901
Important points:
- Hamburgers are delicious

View File

@ -4,7 +4,7 @@ from lxml import etree
import pytest
from unstructured.documents.base import Page
from unstructured.documents.elements import ListItem, NarrativeText, Title
from unstructured.documents.elements import Address, ListItem, NarrativeText, Text, Title
from unstructured.documents.html import (
LIST_ITEM_TAGS,
HTMLDocument,
@ -153,7 +153,7 @@ def test_parse_not_anything(monkeypatch):
document_tree = etree.fromstring(doc, etree.HTMLParser())
el = document_tree.find(".//p")
parsed_el = html._parse_tag(el)
assert parsed_el is None
assert parsed_el == Text(text="This is nothing")
def test_parse_bullets(monkeypatch):
@ -484,6 +484,7 @@ def test_containers_with_text_are_processed():
<div dir=3D"ltr">
<div dir=3D"ltr">Dino the Datasaur<div>Unstructured Technologies<br><div>Data Scientist
</div>
<div>Doylestown, PA 18901</div>
<div><br></div>
</div>
</div>
@ -494,12 +495,13 @@ def test_containers_with_text_are_processed():
html_document._read()
assert html_document.elements == [
Title(text="Hi All,"),
Text(text="Hi All,"),
NarrativeText(text="Get excited for our first annual family day!"),
Title(text="Best."),
Title(text="Dino the Datasaur"),
Title(text="Unstructured Technologies"),
Title(text="Data Scientist"),
Address(text="Doylestown, PA 18901"),
]

View File

@ -4,7 +4,7 @@ import pytest
import docx
from unstructured.documents.elements import NarrativeText, Title, Text, ListItem
from unstructured.documents.elements import Address, NarrativeText, Title, Text, ListItem
from unstructured.partition.auto import partition
import unstructured.partition.auto as auto
@ -115,6 +115,7 @@ def test_auto_partition_html_from_file_rb():
EXPECTED_TEXT_OUTPUT = [
NarrativeText(text="This is a test document to use for unit tests."),
Address(text="Doylestown, PA 18901"),
Title(text="Important points:"),
ListItem(text="Hamburgers are delicious"),
ListItem(text="Dogs are the best"),

View File

@ -3,7 +3,7 @@ import pytest
import docx
from unstructured.documents.elements import ListItem, NarrativeText, Title, Text
from unstructured.documents.elements import Address, ListItem, NarrativeText, Title, Text
from unstructured.partition.docx import partition_docx
@ -14,7 +14,11 @@ def mock_document():
document.add_paragraph("These are a few of my favorite things:", style="Heading 1")
# NOTE(robinson) - this should get picked up as a list item due to the •
document.add_paragraph("• Parrots", style="Normal")
# NOTE(robinson) - this should get dropped because it's empty
document.add_paragraph("", style="Normal")
document.add_paragraph("Hockey", style="List Bullet")
# NOTE(robinson) - this should get dropped because it's empty
document.add_paragraph("", style="List Bullet")
# NOTE(robinson) - this should get picked up as a title
document.add_paragraph("Analysis", style="Normal")
# NOTE(robinson) - this should get dropped because it is empty
@ -24,6 +28,8 @@ def mock_document():
document.add_paragraph("This is my third thought.", style="Body Text")
# NOTE(robinson) - this should just be regular text
document.add_paragraph("2023")
# NOTE(robinson) - this should be an address
document.add_paragraph("DOYLESTOWN, PA 18901")
return document
@ -38,6 +44,7 @@ def expected_elements():
NarrativeText("This is my first thought. This is my second thought."),
NarrativeText("This is my third thought."),
Text("2023"),
Address("DOYLESTOWN, PA 18901"),
]

View File

@ -2,13 +2,14 @@ import os
import pathlib
import pytest
from unstructured.documents.elements import NarrativeText, Title, ListItem
from unstructured.documents.elements import Address, NarrativeText, Title, ListItem
from unstructured.partition.text import partition_text
DIRECTORY = pathlib.Path(__file__).parent.resolve()
EXPECTED_OUTPUT = [
NarrativeText(text="This is a test document to use for unit tests."),
Address(text="Doylestown, PA 18901"),
Title(text="Important points:"),
ListItem(text="Hamburgers are delicious"),
ListItem(text="Dogs are the best"),
@ -52,3 +53,15 @@ def test_partition_text_raises_with_too_many_specified():
with pytest.raises(ValueError):
partition_text(filename=filename, text=text)
def test_partition_text_captures_everything_even_with_linebreaks():
text = """
VERY IMPORTANT MEMO
DOYLESTOWN, PA 18901
"""
elements = partition_text(text=text)
assert elements == [
Title(text="VERY IMPORTANT MEMO"),
Address(text="DOYLESTOWN, PA 18901"),
]

View File

@ -1,4 +1,5 @@
import pytest
from unittest.mock import patch
import unstructured.partition.text_type as text_type
@ -58,6 +59,7 @@ def test_is_possible_narrative_text(text, expected, monkeypatch):
("7", False), # Fails because it is numeric
("", False), # Fails because it is empty
("ITEM 1A. RISK FACTORS", True), # Two "sentences", but both are short
("To My Dearest Friends,", False), # Ends with a comma
],
)
def test_is_possible_title(text, expected, monkeypatch):
@ -120,11 +122,10 @@ def test_is_bulletized_text(text, expected):
[
("Ask the teacher for an apple", True),
("Intellectual property", False),
("THIS MESSAGE WAS APPROVED", True),
],
)
def test_contains_verb(text, expected, monkeypatch):
monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
monkeypatch.setattr(text_type, "pos_tag", mock_pos_tag)
has_verb = text_type.contains_verb(text)
assert has_verb is expected
@ -135,13 +136,26 @@ def test_contains_verb(text, expected, monkeypatch):
("Intellectual Property in the United States", True),
("Intellectual property helps incentivize innovation.", False),
("THIS IS ALL CAPS. BUT IT IS TWO SENTENCES.", False),
("LOOK AT THIS IT IS CAPS BUT NOT A TITLE.", False),
("This Has All Caps. It's Weird But Two Sentences", False),
("The Business Report is expected within 6 hours of closing", False),
("", False),
],
)
def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):
assert text_type.exceeds_cap_ratio(text) is expected
def test_set_caps_ratio_with_environment_variable(monkeypatch):
monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
assert text_type.exceeds_cap_ratio(text, threshold=0.3) is expected
monkeypatch.setenv("NARRATIVE_TEXT_CAP_THRESHOLD", "0.8")
text = "All The King's Horses. And All The King's Men."
with patch.object(text_type, "exceeds_cap_ratio", return_value=False) as mock_exceeds:
text_type.is_possible_narrative_text(text)
mock_exceeds.assert_called_once_with(text, threshold=0.8)
def test_sentence_count(monkeypatch):
@ -153,3 +167,19 @@ def test_sentence_count(monkeypatch):
def test_item_titles():
text = "ITEM 1(A). THIS IS A TITLE"
assert text_type.sentence_count(text, 3) < 2
@pytest.mark.parametrize(
"text, expected",
[
("Doylestown, PA 18901", True),
("DOYLESTOWN, PENNSYLVANIA, 18901", True),
("DOYLESTOWN, PENNSYLVANIA 18901", True),
("Doylestown, Pennsylvania 18901", True),
(" Doylestown, Pennsylvania 18901", True),
("The Business Report is expected within 6 hours of closing", False),
("", False),
],
)
def test_is_us_city_state_zip(text, expected):
assert text_type.is_us_city_state_zip(text) is expected

View File

@ -1 +1 @@
__version__ = "0.4.4" # pragma: no cover
__version__ = "0.4.5-dev0" # pragma: no cover

View File

@ -60,7 +60,13 @@ class Text(Element):
return self.text
def __eq__(self, other):
return (self.text == other.text) and (self.coordinates == other.coordinates)
return all(
[
(self.text == other.text),
(self.coordinates == other.coordinates),
(self.category == other.category),
]
)
def apply(self, *cleaners: Callable):
"""Applies a cleaning brick to the text element. The function that's passed in
@ -108,6 +114,14 @@ class Title(Text):
pass
class Address(Text):
"""A text element for capturing addresses."""
category = "Address"
pass
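The effect of adding `category` to the equality check can be sketched with simplified stand-in classes. These are not the library's classes: the constructor signature and the `"Uncategorized"` default are assumptions made for illustration.

```python
from typing import Optional, Tuple


class Text:
    """Simplified stand-in for a text element (assumed shape, not the real class)."""

    category = "Uncategorized"  # placeholder default, an assumption

    def __init__(self, text: str, coordinates: Optional[Tuple[float, ...]] = None):
        self.text = text
        self.coordinates = coordinates

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Text):
            return NotImplemented
        # Elements are equal only when text, coordinates, and category all match.
        return (
            self.text == other.text
            and self.coordinates == other.coordinates
            and self.category == other.category
        )


class Title(Text):
    category = "Title"


class Address(Text):
    category = "Address"
```

Under this check, `Title("Hi All,")` and `Text("Hi All,")` no longer compare equal, which is why test expectations have to name the exact element type.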
class Image(Text):
"""A text element for capturing image metadata."""

View File

@ -13,12 +13,13 @@ from unstructured.logger import logger
from unstructured.cleaners.core import clean_bullets, replace_unicode_quotes
from unstructured.documents.base import Page
from unstructured.documents.elements import ListItem, Element, NarrativeText, Title
from unstructured.documents.elements import Address, ListItem, Element, NarrativeText, Text, Title
from unstructured.documents.xml import XMLDocument
from unstructured.partition.text_type import (
is_bulleted_text,
is_possible_narrative_text,
is_possible_title,
is_us_city_state_zip,
)
TEXT_TAGS: Final[List[str]] = ["p", "a", "td", "span", "font"]
@ -47,6 +48,18 @@ class TagsMixin:
super().__init__(*args, **kwargs)
class HTMLText(TagsMixin, Text):
"""Text with tag information."""
pass
class HTMLAddress(TagsMixin, Address):
"""Address with tag information."""
pass
class HTMLTitle(TagsMixin, Title):
"""Title with tag information."""
@ -203,6 +216,8 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
if not clean_bullets(text):
return None
return HTMLListItem(text=clean_bullets(text), tag=tag, ancestortags=ancestortags)
elif is_us_city_state_zip(text):
return HTMLAddress(text=text, tag=tag, ancestortags=ancestortags)
if len(text) < 2:
return None
@ -211,8 +226,7 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
elif is_possible_title(text):
return HTMLTitle(text, tag=tag, ancestortags=ancestortags)
else:
# Something that might end up here is text that's just a number.
return None
return HTMLText(text, tag=tag, ancestortags=ancestortags)
def _is_container_with_text(tag_elem: etree.Element) -> bool:

View File

@ -16,6 +16,23 @@ US_PHONE_NUMBERS_PATTERN = (
)
US_PHONE_NUMBERS_RE = re.compile(US_PHONE_NUMBERS_PATTERN)
# NOTE(robinson) - Based on this regex from regex101. Regex was updated to run fast
# and avoid catastrophic backtracking
# ref: https://regex101.com/library/oR3jU1?page=673
US_CITY_STATE_ZIP_PATTERN = (
r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"
r"(?:Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida"
r"|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland"
r"|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|"
r"New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico|New[ ]York|North[ ]Carolina|North[ ]Dakota"
r"|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island|South[ ]Carolina|South[ ]Dakota"
r"|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia|Wisconsin|Wyoming"
r"|AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN"
r"|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|"
r"WA|WV|WI|WY)(, |\s)?(?:\b\d{5}(?:-\d{4})?\b)"
)
US_CITY_STATE_ZIP_RE = re.compile(US_CITY_STATE_ZIP_PATTERN)
UNICODE_BULLETS: Final[List[str]] = [
"\u0095",
"\u2022",

View File

@ -3,20 +3,18 @@ from typing import IO, List, Optional
import docx
from unstructured.cleaners.core import clean_bullets
from unstructured.documents.elements import Element, ListItem, NarrativeText, Text, Title
from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
from unstructured.partition.text_type import (
is_bulleted_text,
is_possible_narrative_text,
is_possible_title,
is_us_city_state_zip,
)
# NOTE(robinson) - documentation on built in styles can be found at the link below
# ref: https://python-docx.readthedocs.io/en/latest/user/
# styles-understanding.html#paragraph-styles-in-default-template
STYLE_TO_ELEMENT_MAPPING = {
"Body Text": NarrativeText,
"Body Text 2": NarrativeText,
"Body Text 3": NarrativeText,
"Caption": Text, # TODO(robinson) - add caption element type
"Heading 1": Title,
"Heading 2": Title,
@ -87,6 +85,9 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
text = paragraph.text
style_name = paragraph.style.name
if len(text.strip()) == 0:
return None
element_class = STYLE_TO_ELEMENT_MAPPING.get(style_name)
# NOTE(robinson) - The "Normal" style name will return None since it's in the mapping.
@ -100,7 +101,11 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
def _text_to_element(text: str) -> Optional[Text]:
"""Converts raw text into an unstructured Text element."""
if is_bulleted_text(text):
return ListItem(text=clean_bullets(text))
clean_text = clean_bullets(text).strip()
return ListItem(text=clean_bullets(text)) if clean_text else None
elif is_us_city_state_zip(text):
return Address(text=text)
if len(text) < 2:
return None

View File

@ -1,7 +1,7 @@
import re
from typing import IO, List, Optional
from unstructured.documents.elements import Element, ListItem, NarrativeText, Title
from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
from unstructured.cleaners.core import clean_bullets
from unstructured.nlp.patterns import PARAGRAPH_PATTERN
@ -9,6 +9,7 @@ from unstructured.partition.text_type import (
is_possible_narrative_text,
is_possible_title,
is_bulleted_text,
is_us_city_state_zip,
)
@ -56,11 +57,16 @@ def partition_text(
ctext = ctext.strip()
if ctext == "":
break
continue
if is_bulleted_text(ctext):
elements.append(ListItem(text=clean_bullets(ctext)))
elif is_us_city_state_zip(ctext):
elements.append(Address(text=ctext))
elif is_possible_narrative_text(ctext):
elements.append(NarrativeText(text=ctext))
elif is_possible_title(ctext):
elements.append(Title(text=ctext))
else:
elements.append(Text(text=ctext))
return elements
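The order of the checks matters: the address check runs before the narrative and title checks, unmatched segments fall through to plain `Text`, and blank segments are skipped rather than ending the loop. A rough sketch of that flow follows; the four `looks_*` predicates are hypothetical stand-ins for the real `text_type` functions, and labels are returned as strings instead of element objects.

```python
import re
from typing import Callable, List, Tuple


# Hypothetical, deliberately crude stand-ins for the real predicates.
def looks_bulleted(text: str) -> bool:
    return text.startswith(("-", "•"))


def looks_like_address(text: str) -> bool:
    return re.match(r"[A-Za-z .-]+, [A-Z]{2} \d{5}$", text) is not None


def looks_narrative(text: str) -> bool:
    return text.endswith(".") and len(text.split()) > 3


def looks_like_title(text: str) -> bool:
    return not text.endswith(",") and not text.isnumeric()


# Checks run top to bottom; the first match wins.
CHECKS: List[Tuple[str, Callable[[str], bool]]] = [
    ("ListItem", looks_bulleted),
    ("Address", looks_like_address),
    ("NarrativeText", looks_narrative),
    ("Title", looks_like_title),
]


def classify(text: str) -> str:
    for label, check in CHECKS:
        if check(text):
            return label
    return "Text"  # fallback instead of dropping the segment


def partition(text: str) -> List[Tuple[str, str]]:
    elements = []
    for segment in text.split("\n"):
        segment = segment.strip()
        if segment == "":
            continue  # skip blank segments and keep going
        elements.append((classify(segment), segment))
    return elements
```

Because blank segments are skipped with `continue`, input that begins with a line break still yields every non-empty segment.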

View File

@ -1,15 +1,16 @@
"""partition.py implements logic for partitioning plain text documents into sections."""
import os
import sys
from typing import List, Optional
if sys.version_info < (3, 8):
from typing_extensions import Final
from typing_extensions import Final # pragma: nocover
else:
from typing import Final
from unstructured.cleaners.core import remove_punctuation
from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE
from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE, US_CITY_STATE_ZIP_RE
from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
from unstructured.logger import logger
@ -17,8 +18,19 @@ from unstructured.logger import logger
POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]
def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
"""Checks to see if the text passes all of the checks for a narrative text section."""
def is_possible_narrative_text(text: str, cap_threshold: float = 0.5) -> bool:
"""Checks to see if the text passes all of the checks for a narrative text section.
You can change the cap threshold using the cap_threshold kwarg or the
NARRATIVE_TEXT_CAP_THRESHOLD environment variable. The environment variable takes
precedence over the kwarg.
Parameters
----------
text
the input text
cap_threshold
the percentage of capitalized words necessary to disqualify the segment as narrative
"""
if len(text) == 0:
logger.debug("Not narrative. Text is empty.")
return False
@ -27,6 +39,9 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
logger.debug(f"Not narrative. Text is all numeric:\n\n{text}")
return False
# NOTE(robinson): it gets read in from the environment as a string so we need to
# cast it to a float
cap_threshold = float(os.environ.get("NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold))
if exceeds_cap_ratio(text, threshold=cap_threshold):
logger.debug(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}")
return False
@ -39,11 +54,23 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
def is_possible_title(text: str, sentence_min_length: int = 5) -> bool:
"""Checks to see if the text passes all of the checks for a valid title."""
"""Checks to see if the text passes all of the checks for a valid title.
Parameters
----------
text
the input text
sentence_min_length
the minimum number of words required to consider a section of text a sentence
"""
if len(text) == 0:
logger.debug("Not a title. Text is empty.")
return False
# NOTE(robinson) - Prevent flagging salutations like "To My Dearest Friends," as titles
if text.endswith(","):
return False
if text.isnumeric():
logger.debug(f"Not a title. Text is all numeric:\n\n{text}")
return False
@ -76,6 +103,9 @@ def contains_us_phone_number(text: str) -> bool:
def contains_verb(text: str) -> bool:
"""Use a POS tagger to check if a segment contains verbs. If the section does not have verbs,
that indicates that it is not narrative text."""
if text.isupper():
text = text.lower()
pos_tags = pos_tag(text)
for _, tag in pos_tags:
if tag in POS_VERB_TAGS:
@ -109,7 +139,7 @@ def sentence_count(text: str, min_length: Optional[int] = None) -> int:
return count
def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
"""Checks the title ratio in a section of text. If a sufficient proportion of the text is
capitalized."""
# NOTE(robinson) - Currently limiting this to only sections of text with one sentence.
@ -118,9 +148,24 @@ def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
logger.debug(f"Text does not contain multiple sentences:\n\n{text}")
return False
if text.isupper():
return False
tokens = word_tokenize(text)
if len(tokens) == 0:
return False
capitalized = sum([word.istitle() or word.isupper() for word in tokens])
ratio = capitalized / len(tokens)
return ratio > threshold
def is_us_city_state_zip(text: str) -> bool:
"""Checks if the given text is in the format of US city/state/zip code.
Examples
--------
Doylestown, PA 18901
Doylestown, Pennsylvania, 18901
DOYLESTOWN, PENNSYLVANIA 18901
"""
return US_CITY_STATE_ZIP_RE.match(text.strip()) is not None
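For illustration, the same city/state/zip shape can be reproduced with a much smaller pattern. This is a sketch under stated assumptions: only three states are listed, and `STATE_NAMES`, `STATE_CODES`, and `is_city_state_zip` are hypothetical names rather than part of the library.

```python
import re

# Assumption: a three-state subset keeps the example short.
STATE_NAMES = ["New York", "Pennsylvania", "Texas"]
STATE_CODES = ["NY", "PA", "TX"]

# Full names come first so they are tried before the two-letter codes.
_STATES = "|".join(re.escape(state) for state in STATE_NAMES + STATE_CODES)

CITY_STATE_ZIP_RE = re.compile(
    # One to five city words, case-insensitive, followed by a comma.
    r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"
    # State name or code, optional separator, then a 5 digit or ZIP+4 code.
    rf"(?:{_STATES})(?:, |\s)?\b\d{{5}}(?:-\d{{4}})?\b"
)


def is_city_state_zip(text: str) -> bool:
    """Return True when the stripped text starts with a city/state/zip pattern."""
    return CITY_STATE_ZIP_RE.match(text.strip()) is not None
```

Building the alternation from lists avoids hand-editing one long string, at the cost of an extra step when reading the compiled pattern.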