fix: cleanup from live .docx tests (#177)

* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
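
The threshold precedence introduced by this change can be sketched with the standard library alone. `resolve_cap_threshold` is a hypothetical helper written for illustration and is not a function in the library; it assumes, as the diff describes, that the `NARRATIVE_TEXT_CAP_THRESHOLD` environment variable overrides the kwarg and that the value arrives as a string and must be cast to a float.

```python
import os

ENV_VAR = "NARRATIVE_TEXT_CAP_THRESHOLD"


def resolve_cap_threshold(cap_threshold: float = 0.5) -> float:
    """Hypothetical helper: return the environment value if set, else the kwarg."""
    # Environment values are strings, so cast the result to a float either way.
    return float(os.environ.get(ENV_VAR, cap_threshold))
```

With the variable unset, the kwarg (or the `0.5` default) is used; with the variable set to `"0.8"`, a call such as `resolve_cap_threshold(0.3)` resolves to `0.8`.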
2025-12-13 16:11:05 +00:00 · 2023-01-26 10:52:25 -05:00 · 2023-01-26 10:52:25 -05:00 · 339c133326
commit 339c133326
parent 1ce8447ba7
16 changed files with 208 additions and 33 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,3 +1,13 @@
+## 0.4.5-dev0
+
+* Loosen the default cap threshold to `0.5`.
+* Add a `NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
+* Unknown text elements are identified as `Text` for HTML and plain text documents.
+* `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information
+  is insufficient to determine that the text is narrative.
+* Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
+* Adds an `Address` element for capturing elements that only contain an address.
+
 ## 0.4.4

 * Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects
--- a/docs/source/bricks.rst
+++ b/docs/source/bricks.rst
@ -246,8 +246,10 @@ for consideration as narrative text. The function performs the following checks
 * Text that does not contain a verb cannot be narrative text
 * Text that exceeds the specified caps ratio cannot be narrative text. The threshold
  is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
-  ``cap_threshold=1.0``. You may want to ignore this check when dealing with text
-  that is all caps.
+  ``cap_threshold=1.0``. You can also set the threshold by using the
+  ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment variable. The environment variable
+  takes precedence over the kwarg.
+* The cap ratio test does not apply to text that is all uppercase.


 Examples:
@ -277,8 +279,8 @@ for consideration as a title. The function performs the following checks:

 * Empty text cannot be a title
 * Text that is all numeric cannot be a title
-* If a title contains more than one sentence that exceeds a certain length, it cannot be a title.
-  Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
+* If a title contains more than one sentence that exceeds a certain length, it cannot be a title. Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
+* If a segment of text ends in a comma, it is not considered a potential title. This is to avoid salutations like "To My Dearest Friends," getting flagged as titles.


 Examples:
@ -320,7 +322,9 @@ Examples:

 Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
 be used independently as well. The function identifies verbs using the NLTK part of speech
-tagger. The following part of speech tags are identified as verbs:
+tagger. Text that is all upper case is lower cased before part of speech detection. This is
+because the upper case letters sometimes cause the part of speech tagger to miss verbs.
+The following part of speech tags are identified as verbs:

 * ``VB``
 * ``VBG``
@ -374,6 +378,9 @@ Determines if the section of text exceeds the specified caps ratio. Used in
 ``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
 as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
 defaults to ``0.3``. Only runs on sections of text that are a single sentence.
+You can also set the threshold using the ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment
+variable. The environment variable takes precedence over the kwarg. The caps ratio
+check does not apply to text that is all capitalized.

 Examples:

--- a/docs/source/elements.rst
+++ b/docs/source/elements.rst
@ -11,6 +11,8 @@ elements.
 * ``NarrativeText`` - Sections of a document that include well-formed prose. Sub-class of ``Text``.
 * ``Title`` - Headings and sub-headings wtihin a document. Sub-class of ``Text``.
 * ``ListItem`` - A text element that is part of an ordered or unordered list. Sub-class of ``Text``.
+* ``Address`` - A text item that consists only of an address. Sub-class of ``Text``.
+* ``CheckBox`` - An element representing a check box. Has a ``checked`` element, which is a boolean indicating whether or not that box is checked.


 #########################################
--- a/example-docs/fake-text.txt
+++ b/example-docs/fake-text.txt
@ -1,7 +1,9 @@
 This is a test document to use for unit tests.

+Doylestown, PA 18901
+
 Important points:

   - Hamburgers are delicious
   - Dogs are the best
-   - I love fuzzy blankets
+   - I love fuzzy blankets
--- a/test_unstructured/documents/test_html.py
+++ b/test_unstructured/documents/test_html.py
@ -4,7 +4,7 @@ from lxml import etree
 import pytest

 from unstructured.documents.base import Page
-from unstructured.documents.elements import ListItem, NarrativeText, Title
+from unstructured.documents.elements import Address, ListItem, NarrativeText, Text, Title
 from unstructured.documents.html import (
    LIST_ITEM_TAGS,
    HTMLDocument,
@ -153,7 +153,7 @@ def test_parse_not_anything(monkeypatch):
    document_tree = etree.fromstring(doc, etree.HTMLParser())
    el = document_tree.find(".//p")
    parsed_el = html._parse_tag(el)
-    assert parsed_el is None
+    assert parsed_el == Text(text="This is nothing")


 def test_parse_bullets(monkeypatch):
@ -484,6 +484,7 @@ def test_containers_with_text_are_processed():
      <div dir=3D"ltr">
         <div dir=3D"ltr">Dino the Datasaur<div>Unstructured Technologies<br><div>Data Scientist
                </div>
+                <div>Doylestown, PA 18901</div>
               <div><br></div>
            </div>
         </div>
@ -494,12 +495,13 @@ def test_containers_with_text_are_processed():
    html_document._read()

    assert html_document.elements == [
-        Title(text="Hi All,"),
+        Text(text="Hi All,"),
        NarrativeText(text="Get excited for our first annual family day!"),
        Title(text="Best."),
        Title(text="Dino the Datasaur"),
        Title(text="Unstructured Technologies"),
        Title(text="Data Scientist"),
+        Address(text="Doylestown, PA 18901"),
    ]


--- a/test_unstructured/partition/test_auto.py
+++ b/test_unstructured/partition/test_auto.py
@ -4,7 +4,7 @@ import pytest

 import docx

-from unstructured.documents.elements import NarrativeText, Title, Text, ListItem
+from unstructured.documents.elements import Address, NarrativeText, Title, Text, ListItem
 from unstructured.partition.auto import partition
 import unstructured.partition.auto as auto

@ -115,6 +115,7 @@ def test_auto_partition_html_from_file_rb():

 EXPECTED_TEXT_OUTPUT = [
    NarrativeText(text="This is a test document to use for unit tests."),
+    Address(text="Doylestown, PA 18901"),
    Title(text="Important points:"),
    ListItem(text="Hamburgers are delicious"),
    ListItem(text="Dogs are the best"),
--- a/test_unstructured/partition/test_docx.py
+++ b/test_unstructured/partition/test_docx.py
@ -3,7 +3,7 @@ import pytest

 import docx

-from unstructured.documents.elements import ListItem, NarrativeText, Title, Text
+from unstructured.documents.elements import Address, ListItem, NarrativeText, Title, Text
 from unstructured.partition.docx import partition_docx


@ -14,7 +14,11 @@ def mock_document():
    document.add_paragraph("These are a few of my favorite things:", style="Heading 1")
    # NOTE(robinson) - this should get picked up as a list item due to the •
    document.add_paragraph("• Parrots", style="Normal")
+    # NOTE(robinson) - this should get dropped because it's empty
+    document.add_paragraph("• ", style="Normal")
    document.add_paragraph("Hockey", style="List Bullet")
+    # NOTE(robinson) - this should get dropped because it's empty
+    document.add_paragraph("", style="List Bullet")
    # NOTE(robinson) - this should get picked up as a title
    document.add_paragraph("Analysis", style="Normal")
    # NOTE(robinson) - this should get dropped because it is empty
@ -24,6 +28,8 @@ def mock_document():
    document.add_paragraph("This is my third thought.", style="Body Text")
    # NOTE(robinson) - this should just be regular text
    document.add_paragraph("2023")
+    # NOTE(robinson) - this should be an address
+    document.add_paragraph("DOYLESTOWN, PA 18901")

    return document

@ -38,6 +44,7 @@ def expected_elements():
        NarrativeText("This is my first thought. This is my second thought."),
        NarrativeText("This is my third thought."),
        Text("2023"),
+        Address("DOYLESTOWN, PA 18901"),
    ]


--- a/test_unstructured/partition/test_text.py
+++ b/test_unstructured/partition/test_text.py
@ -2,13 +2,14 @@ import os
 import pathlib
 import pytest

-from unstructured.documents.elements import NarrativeText, Title, ListItem
+from unstructured.documents.elements import Address, NarrativeText, Title, ListItem
 from unstructured.partition.text import partition_text

 DIRECTORY = pathlib.Path(__file__).parent.resolve()

 EXPECTED_OUTPUT = [
    NarrativeText(text="This is a test document to use for unit tests."),
+    Address(text="Doylestown, PA 18901"),
    Title(text="Important points:"),
    ListItem(text="Hamburgers are delicious"),
    ListItem(text="Dogs are the best"),
@ -52,3 +53,15 @@ def test_partition_text_raises_with_too_many_specified():

    with pytest.raises(ValueError):
        partition_text(filename=filename, text=text)
+
+
+def test_partition_text_captures_everything_even_with_linebreaks():
+    text = """
+    VERY IMPORTANT MEMO
+    DOYLESTOWN, PA 18901
+    """
+    elements = partition_text(text=text)
+    assert elements == [
+        Title(text="VERY IMPORTANT MEMO"),
+        Address(text="DOYLESTOWN, PA 18901"),
+    ]
--- a/test_unstructured/partition/test_text_type.py
+++ b/test_unstructured/partition/test_text_type.py
@ -1,4 +1,5 @@
 import pytest
+from unittest.mock import patch

 import unstructured.partition.text_type as text_type

@ -58,6 +59,7 @@ def test_is_possible_narrative_text(text, expected, monkeypatch):
        ("7", False),  # Fails because it is numeric
        ("", False),  # Fails because it is empty
        ("ITEM 1A. RISK FACTORS", True),  # Two "sentences", but both are short
+        ("To My Dearest Friends,", False),  # Ends with a comma
    ],
 )
 def test_is_possible_title(text, expected, monkeypatch):
@ -120,11 +122,10 @@ def test_is_bulletized_text(text, expected):
    [
        ("Ask the teacher for an apple", True),
        ("Intellectual property", False),
+        ("THIS MESSAGE WAS APPROVED", True),
    ],
 )
 def test_contains_verb(text, expected, monkeypatch):
-    monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
-    monkeypatch.setattr(text_type, "pos_tag", mock_pos_tag)
    has_verb = text_type.contains_verb(text)
    assert has_verb is expected

@ -135,13 +136,26 @@ def test_contains_verb(text, expected, monkeypatch):
        ("Intellectual Property in the United States", True),
        ("Intellectual property helps incentivize innovation.", False),
        ("THIS IS ALL CAPS. BUT IT IS TWO SENTENCES.", False),
+        ("LOOK AT THIS IT IS CAPS BUT NOT A TITLE.", False),
+        ("This Has All Caps. It's Weird But Two Sentences", False),
+        ("The Business Report is expected within 6 hours of closing", False),
        ("", False),
    ],
 )
 def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):
+    assert text_type.exceeds_cap_ratio(text) is expected
+
+
+def test_set_caps_ratio_with_environment_variable(monkeypatch):
    monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
    monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
-    assert text_type.exceeds_cap_ratio(text, threshold=0.3) is expected
+    monkeypatch.setenv("NARRATIVE_TEXT_CAP_THRESHOLD", "0.8")
+
+    text = "All The King's Horses. And All The King's Men."
+    with patch.object(text_type, "exceeds_cap_ratio", return_value=False) as mock_exceeds:
+        text_type.is_possible_narrative_text(text)
+
+    mock_exceeds.assert_called_once_with(text, threshold=0.8)


 def test_sentence_count(monkeypatch):
@ -153,3 +167,19 @@ def test_sentence_count(monkeypatch):
 def test_item_titles():
    text = "ITEM 1(A). THIS IS A TITLE"
    assert text_type.sentence_count(text, 3) < 2
+
+
+@pytest.mark.parametrize(
+    "text, expected",
+    [
+        ("Doylestown, PA 18901", True),
+        ("DOYLESTOWN, PENNSYLVANIA, 18901", True),
+        ("DOYLESTOWN, PENNSYLVANIA 18901", True),
+        ("Doylestown, Pennsylvania 18901", True),
+        ("     Doylestown, Pennsylvania 18901", True),
+        ("The Business Report is expected within 6 hours of closing", False),
+        ("", False),
+    ],
+)
+def test_is_us_city_state_zip(text, expected):
+    assert text_type.is_us_city_state_zip(text) is expected
--- a/unstructured/version.py
+++ b/unstructured/version.py
@ -1 +1 @@
-__version__ = "0.4.4"  # pragma: no cover
+__version__ = "0.4.5-dev0"  # pragma: no cover
--- a/unstructured/documents/elements.py
+++ b/unstructured/documents/elements.py
@ -60,7 +60,13 @@ class Text(Element):
        return self.text

    def __eq__(self, other):
-        return (self.text == other.text) and (self.coordinates == other.coordinates)
+        return all(
+            [
+                (self.text == other.text),
+                (self.coordinates == other.coordinates),
+                (self.category == other.category),
+            ]
+        )

    def apply(self, *cleaners: Callable):
        """Applies a cleaning brick to the text element. The function that's passed in
@ -108,6 +114,14 @@ class Title(Text):
    pass


+class Address(Text):
+    """A text element for capturing addresses."""
+
+    category = "Address"
+
+    pass
+
+
 class Image(Text):
    """A text element for capturing image metadata."""

--- a/unstructured/documents/html.py
+++ b/unstructured/documents/html.py
@ -13,12 +13,13 @@ from unstructured.logger import logger

 from unstructured.cleaners.core import clean_bullets, replace_unicode_quotes
 from unstructured.documents.base import Page
-from unstructured.documents.elements import ListItem, Element, NarrativeText, Title
+from unstructured.documents.elements import Address, ListItem, Element, NarrativeText, Text, Title
 from unstructured.documents.xml import XMLDocument
 from unstructured.partition.text_type import (
    is_bulleted_text,
    is_possible_narrative_text,
    is_possible_title,
+    is_us_city_state_zip,
 )

 TEXT_TAGS: Final[List[str]] = ["p", "a", "td", "span", "font"]
@ -47,6 +48,18 @@ class TagsMixin:
        super().__init__(*args, **kwargs)


+class HTMLText(TagsMixin, Text):
+    """Text with tag information."""
+
+    pass
+
+
+class HTMLAddress(TagsMixin, Address):
+    """Address with tag information."""
+
+    pass
+
+
 class HTMLTitle(TagsMixin, Title):
    """Title with tag information."""

@ -203,6 +216,8 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
        if not clean_bullets(text):
            return None
        return HTMLListItem(text=clean_bullets(text), tag=tag, ancestortags=ancestortags)
+    elif is_us_city_state_zip(text):
+        return HTMLAddress(text=text, tag=tag, ancestortags=ancestortags)

    if len(text) < 2:
        return None
@ -211,8 +226,7 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
    elif is_possible_title(text):
        return HTMLTitle(text, tag=tag, ancestortags=ancestortags)
    else:
-        # Something that might end up here is text that's just a number.
-        return None
+        return HTMLText(text, tag=tag, ancestortags=ancestortags)


 def _is_container_with_text(tag_elem: etree.Element) -> bool:
--- a/unstructured/nlp/patterns.py
+++ b/unstructured/nlp/patterns.py
@ -16,6 +16,23 @@ US_PHONE_NUMBERS_PATTERN = (
 )
 US_PHONE_NUMBERS_RE = re.compile(US_PHONE_NUMBERS_PATTERN)

+# NOTE(robinson) - Based on this regex from regex101. Regex was updated to run fast
+# and avoid catastrophic backtracking
+# ref: https://regex101.com/library/oR3jU1?page=673
+US_CITY_STATE_ZIP_PATTERN = (
+    r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"
+    r"(?:Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida"
+    r"|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland"
+    r"|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|"
+    r"New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico|New[ ]York|North[ ]Carolina|North[ ]Dakota"
+    r"|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island|South[ ]Carolina|South[ ]Dakota"
+    r"|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia|Wisconsin|Wyoming"
+    r"|AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN"
+    r"|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|"
+    r"WA|WV|WI|WY)(, |\s)?(?:\b\d{5}(?:-\d{4})?\b)"
+)
+US_CITY_STATE_ZIP_RE = re.compile(US_CITY_STATE_ZIP_PATTERN)
+
 UNICODE_BULLETS: Final[List[str]] = [
    "\u0095",
    "\u2022",
--- a/unstructured/partition/docx.py
+++ b/unstructured/partition/docx.py
@ -3,20 +3,18 @@ from typing import IO, List, Optional
 import docx

 from unstructured.cleaners.core import clean_bullets
-from unstructured.documents.elements import Element, ListItem, NarrativeText, Text, Title
+from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
 from unstructured.partition.text_type import (
    is_bulleted_text,
    is_possible_narrative_text,
    is_possible_title,
+    is_us_city_state_zip,
 )

 # NOTE(robinson) - documentation on built in styles can be found at the link below
 # ref: https://python-docx.readthedocs.io/en/latest/user/
 #   styles-understanding.html#paragraph-styles-in-default-template
 STYLE_TO_ELEMENT_MAPPING = {
-    "Body Text": NarrativeText,
-    "Body Text 2": NarrativeText,
-    "Body Text 3": NarrativeText,
    "Caption": Text,  # TODO(robinson) - add caption element type
    "Heading 1": Title,
    "Heading 2": Title,
@ -87,6 +85,9 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
    text = paragraph.text
    style_name = paragraph.style.name

+    if len(text.strip()) == 0:
+        return None
+
    element_class = STYLE_TO_ELEMENT_MAPPING.get(style_name)

    # NOTE(robinson) - The "Normal" style name will return None since it's in the mapping.
@ -100,7 +101,11 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
 def _text_to_element(text: str) -> Optional[Text]:
    """Converts raw text into an unstructured Text element."""
    if is_bulleted_text(text):
-        return ListItem(text=clean_bullets(text))
+        clean_text = clean_bullets(text).strip()
+        return ListItem(text=clean_bullets(text)) if clean_text else None
+
+    elif is_us_city_state_zip(text):
+        return Address(text=text)

    if len(text) < 2:
        return None
--- a/unstructured/partition/text.py
+++ b/unstructured/partition/text.py
@ -1,7 +1,7 @@
 import re
 from typing import IO, List, Optional

-from unstructured.documents.elements import Element, ListItem, NarrativeText, Title
+from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title

 from unstructured.cleaners.core import clean_bullets
 from unstructured.nlp.patterns import PARAGRAPH_PATTERN
@ -9,6 +9,7 @@ from unstructured.partition.text_type import (
    is_possible_narrative_text,
    is_possible_title,
    is_bulleted_text,
+    is_us_city_state_zip,
 )


@ -56,11 +57,16 @@ def partition_text(
        ctext = ctext.strip()

        if ctext == "":
-            break
+            continue
        if is_bulleted_text(ctext):
            elements.append(ListItem(text=clean_bullets(ctext)))
+        elif is_us_city_state_zip(ctext):
+            elements.append(Address(text=ctext))
        elif is_possible_narrative_text(ctext):
            elements.append(NarrativeText(text=ctext))
        elif is_possible_title(ctext):
            elements.append(Title(text=ctext))
+        else:
+            elements.append(Text(text=ctext))
+
    return elements
--- a/unstructured/partition/text_type.py
+++ b/unstructured/partition/text_type.py
@ -1,15 +1,16 @@
 """partition.py implements logic for partitioning plain text documents into sections."""
+import os
 import sys

 from typing import List, Optional

 if sys.version_info < (3, 8):
-    from typing_extensions import Final
+    from typing_extensions import Final  # pragma: nocover
 else:
    from typing import Final

 from unstructured.cleaners.core import remove_punctuation
-from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE
+from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE, US_CITY_STATE_ZIP_RE
 from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
 from unstructured.logger import logger

@ -17,8 +18,19 @@ from unstructured.logger import logger
 POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]


-def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
-    """Checks to see if the text passes all of the checks for a narrative text section."""
+def is_possible_narrative_text(text: str, cap_threshold: float = 0.5) -> bool:
+    """Checks to see if the text passes all of the checks for a narrative text section.
+    You can change the cap threshold using the cap_threshold kwarg or the
+    NARRATIVE_TEXT_CAP_THRESHOLD environment variable. The environment variable takes
+    precedence over the kwarg.
+
+    Parameters
+    ----------
+    text
+        the input text
+    cap_threshold
+        the percentage of capitalized words necessary to disqualify the segment as narrative
+    """
    if len(text) == 0:
        logger.debug("Not narrative. Text is empty.")
        return False
@ -27,6 +39,9 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
        logger.debug(f"Not narrative. Text is all numeric:\n\n{text}")
        return False

+    # NOTE(robinson): it gets read in from the environment as a string so we need to
+    # cast it to a float
+    cap_threshold = float(os.environ.get("NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold))
    if exceeds_cap_ratio(text, threshold=cap_threshold):
        logger.debug(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}")
        return False
@ -39,11 +54,23 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:


 def is_possible_title(text: str, sentence_min_length: int = 5) -> bool:
-    """Checks to see if the text passes all of the checks for a valid title."""
+    """Checks to see if the text passes all of the checks for a valid title.
+
+    Parameters
+    ----------
+    text
+        the input text
+    sentence_min_length
+        the minimum number of words required to consider a section of text a sentence
+    """
    if len(text) == 0:
        logger.debug("Not a title. Text is empty.")
        return False

+    # NOTE(robinson) - Prevent flagging salutations like "To My Dearest Friends," as titles
+    if text.endswith(","):
+        return False
+
    if text.isnumeric():
        logger.debug(f"Not a title. Text is all numeric:\n\n{text}")
        return False
@ -76,6 +103,9 @@ def contains_us_phone_number(text: str) -> bool:
 def contains_verb(text: str) -> bool:
    """Use a POS tagger to check if a segment contains verbs. If the section does not have verbs,
    that indicates that it is not narrative text."""
+    if text.isupper():
+        text = text.lower()
+
    pos_tags = pos_tag(text)
    for _, tag in pos_tags:
        if tag in POS_VERB_TAGS:
@ -109,7 +139,7 @@ def sentence_count(text: str, min_length: Optional[int] = None) -> int:
    return count


-def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
+def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """Checks the title ratio in a section of text. If a sufficient proportion of the text is
    capitalized."""
    # NOTE(robinson) - Currently limiting this to only sections of text with one sentence.
@ -118,9 +148,24 @@ def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
        logger.debug(f"Text does not contain multiple sentences:\n\n{text}")
        return False

+    if text.isupper():
+        return False
+
    tokens = word_tokenize(text)
    if len(tokens) == 0:
        return False
    capitalized = sum([word.istitle() or word.isupper() for word in tokens])
    ratio = capitalized / len(tokens)
    return ratio > threshold
+
+
+def is_us_city_state_zip(text) -> bool:
+    """Checks if the given text is in the format of US city/state/zip code.
+
+    Examples
+    --------
+    Doylestown, PA 18901
+    Doylestown, Pennsylvania, 18901
+    DOYLESTOWN, PENNSYLVANIA 18901
+    """
+    return US_CITY_STATE_ZIP_RE.match(text.strip()) is not None
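
A minimal, standard-library sketch of the city/state/zip check added in this diff. The state list is abbreviated for illustration, and `STATES`, `CITY_STATE_ZIP_RE`, and `looks_like_city_state_zip` are hypothetical names, not the library's. The sketch groups the state alternatives in a plain non-capturing group and keeps the city and ZIP portions of the pattern as they appear in the diff.

```python
import re

# Abbreviated for illustration; the pattern in the diff lists every state and territory.
STATES = ["Pennsylvania", "New[ ]York", "Ohio", "PA", "NY", "OH"]

CITY_STATE_ZIP_RE = re.compile(
    r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"  # one to five city words, then a comma
    r"(?:" + "|".join(STATES) + r")"  # full state name or two-letter code
    r"(?:, |\s)?(?:\b\d{5}(?:-\d{4})?\b)"  # optional separator, then ZIP or ZIP+4
)


def looks_like_city_state_zip(text: str) -> bool:
    """Return True if the stripped text starts with a city, state, and ZIP code."""
    return CITY_STATE_ZIP_RE.match(text.strip()) is not None
```

Because the check uses `match` on stripped text, leading whitespace is ignored and the address must begin the string. Text without a comma after the city words does not match.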