fix: cleanup from live .docx tests (#177)

* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
2025-12-14 08:44:29 +00:00 · 2023-01-26 10:52:25 -05:00 · 2023-01-26 10:52:25 -05:00 · 339c133326
commit 339c133326
parent 1ce8447ba7
16 changed files with 208 additions and 33 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,3 +1,13 @@
 ## 0.4.5-dev0
 * Loosen the default cap threshold to `0.5`.
 * Add a `NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
 * Unknown text elements are identified as `Text` for HTML and plain text documents.
 * `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information
  is insufficient to determine that the text is narrative.
 * Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
 * Adds an `Address` element for capturing elements that only contain an address.
 ## 0.4.4
 * Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects
--- a/docs/source/bricks.rst
+++ b/docs/source/bricks.rst
@ -246,8 +246,10 @@ for consideration as narrative text. The function performs the following checks
 * Text that does not contain a verb cannot be narrative text
 * Text that exceeds the specified caps ratio cannot be narrative text. The threshold
  is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
-  ``cap_threshold=1.0``. You may want to ignore this check when dealing with text
+  ``cap_threshold=1.0``. You can also set the threshold by using the
-  that is all caps.
+  ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment variable. The environment variable
  takes precedence over the kwarg.
 * The cap ratio test does not apply to text that is all uppercase.
 Examples:
@ -277,8 +279,8 @@ for consideration as a title. The function performs the following checks:
 * Empty text cannot be a title
 * Text that is all numeric cannot be a title
-* If a title contains more than one sentence that exceeds a certain length, it cannot be a title.
+* If a title contains more than one sentence that exceeds a certain length, it cannot be a title. Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
-  Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
+* If a segment of text ends in a comma, it is not considered a potential title. This is to avoid salutations like "To My Dearest Friends," getting flagged as titles.
 Examples:
@ -320,7 +322,9 @@ Examples:
 Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
 be used independently as well. The function identifies verbs using the NLTK part of speech
-tagger. The following part of speech tags are identified as verbs:
+tagger. Text that is all upper case is lower cased before part of speech detection. This is
 because the upper case letters sometimes cause the part of speech tagger to miss verbs.
 The following part of speech tags are identified as verbs:
 * ``VB``
 * ``VBG``
@ -374,6 +378,9 @@ Determines if the section of text exceeds the specified caps ratio. Used in
 ``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
 as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
 defaults to ``0.3``. Only runs on sections of text that are a single sentence.
 You can also set the threshold using the ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment
 variable. The environment variable takes precedence over the kwarg. The caps ratio
 check does not apply to text that is all capitalized.
 Examples:
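
The precedence rule described in these docs (the environment variable wins over the kwarg) can be sketched as a single `os.environ.get` lookup. This is a minimal illustration, not the library's code: the helper name `resolve_cap_threshold` is hypothetical, and only the variable name and the `0.5` default are taken from this commit.

```python
import os

ENV_VAR = "NARRATIVE_TEXT_CAP_THRESHOLD"


def resolve_cap_threshold(cap_threshold: float = 0.5) -> float:
    """Hypothetical helper: prefer the environment variable, fall back to the kwarg."""
    # Environment values are always strings, so the result is cast to float.
    return float(os.environ.get(ENV_VAR, cap_threshold))


os.environ.pop(ENV_VAR, None)
default_value = resolve_cap_threshold()                 # no env var, no kwarg
kwarg_value = resolve_cap_threshold(cap_threshold=0.3)  # kwarg only

os.environ[ENV_VAR] = "0.8"
env_value = resolve_cap_threshold(cap_threshold=0.3)    # env var overrides the kwarg
os.environ.pop(ENV_VAR, None)
```

One consequence of this design is that a caller cannot override the environment setting from code; unsetting the variable is the only way to make the kwarg effective again.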
--- a/docs/source/elements.rst
+++ b/docs/source/elements.rst
@ -11,6 +11,8 @@ elements.
 * ``NarrativeText`` - Sections of a document that include well-formed prose. Sub-class of ``Text``.
 * ``Title`` - Headings and sub-headings within a document. Sub-class of ``Text``.
 * ``ListItem`` - A text element that is part of an ordered or unordered list. Sub-class of ``Text``.
 * ``Address`` - A text item that consists only of an address. Sub-class of ``Text``.
 * ``CheckBox`` - An element representing a check box. Has a ``checked`` element, which is a boolean indicating whether or not that box is checked.
 #########################################
--- a/example-docs/fake-text.txt
+++ b/example-docs/fake-text.txt
@ -1,7 +1,9 @@
 This is a test document to use for unit tests.
 Doylestown, PA 18901
 Important points:
   - Hamburgers are delicious
   - Dogs are the best
-   - I love fuzzy blankets
+   - I love fuzzy blankets
--- a/test_unstructured/documents/test_html.py
+++ b/test_unstructured/documents/test_html.py
@ -4,7 +4,7 @@ from lxml import etree
 import pytest
 from unstructured.documents.base import Page
-from unstructured.documents.elements import ListItem, NarrativeText, Title
+from unstructured.documents.elements import Address, ListItem, NarrativeText, Text, Title
 from unstructured.documents.html import (
    LIST_ITEM_TAGS,
    HTMLDocument,
@ -153,7 +153,7 @@ def test_parse_not_anything(monkeypatch):
    document_tree = etree.fromstring(doc, etree.HTMLParser())
    el = document_tree.find(".//p")
    parsed_el = html._parse_tag(el)
-    assert parsed_el is None
+    assert parsed_el == Text(text="This is nothing")
 def test_parse_bullets(monkeypatch):
@ -484,6 +484,7 @@ def test_containers_with_text_are_processed():
      <div dir=3D"ltr">
         <div dir=3D"ltr">Dino the Datasaur<div>Unstructured Technologies<br><div>Data Scientist
                </div>
                <div>Doylestown, PA 18901</div>
               <div><br></div>
            </div>
         </div>
@ -494,12 +495,13 @@ def test_containers_with_text_are_processed():
    html_document._read()
    assert html_document.elements == [
-        Title(text="Hi All,"),
+        Text(text="Hi All,"),
        NarrativeText(text="Get excited for our first annual family day!"),
        Title(text="Best."),
        Title(text="Dino the Datasaur"),
        Title(text="Unstructured Technologies"),
        Title(text="Data Scientist"),
        Address(text="Doylestown, PA 18901"),
    ]
--- a/test_unstructured/partition/test_auto.py
+++ b/test_unstructured/partition/test_auto.py
@ -4,7 +4,7 @@ import pytest
 import docx
-from unstructured.documents.elements import NarrativeText, Title, Text, ListItem
+from unstructured.documents.elements import Address, NarrativeText, Title, Text, ListItem
 from unstructured.partition.auto import partition
 import unstructured.partition.auto as auto
@ -115,6 +115,7 @@ def test_auto_partition_html_from_file_rb():
 EXPECTED_TEXT_OUTPUT = [
    NarrativeText(text="This is a test document to use for unit tests."),
    Address(text="Doylestown, PA 18901"),
    Title(text="Important points:"),
    ListItem(text="Hamburgers are delicious"),
    ListItem(text="Dogs are the best"),
--- a/test_unstructured/partition/test_docx.py
+++ b/test_unstructured/partition/test_docx.py
@ -3,7 +3,7 @@ import pytest
 import docx
-from unstructured.documents.elements import ListItem, NarrativeText, Title, Text
+from unstructured.documents.elements import Address, ListItem, NarrativeText, Title, Text
 from unstructured.partition.docx import partition_docx
@ -14,7 +14,11 @@ def mock_document():
    document.add_paragraph("These are a few of my favorite things:", style="Heading 1")
    # NOTE(robinson) - this should get picked up as a list item due to the •
    document.add_paragraph("• Parrots", style="Normal")
    # NOTE(robinson) - this should get dropped because it's empty
    document.add_paragraph("• ", style="Normal")
    document.add_paragraph("Hockey", style="List Bullet")
    # NOTE(robinson) - this should get dropped because it's empty
    document.add_paragraph("", style="List Bullet")
    # NOTE(robinson) - this should get picked up as a title
    document.add_paragraph("Analysis", style="Normal")
    # NOTE(robinson) - this should get dropped because it is empty
@ -24,6 +28,8 @@ def mock_document():
    document.add_paragraph("This is my third thought.", style="Body Text")
    # NOTE(robinson) - this should just be regular text
    document.add_paragraph("2023")
    # NOTE(robinson) - this should be an address
    document.add_paragraph("DOYLESTOWN, PA 18901")
    return document
@ -38,6 +44,7 @@ def expected_elements():
        NarrativeText("This is my first thought. This is my second thought."),
        NarrativeText("This is my third thought."),
        Text("2023"),
        Address("DOYLESTOWN, PA 18901"),
    ]
--- a/test_unstructured/partition/test_text.py
+++ b/test_unstructured/partition/test_text.py
@ -2,13 +2,14 @@ import os
 import pathlib
 import pytest
-from unstructured.documents.elements import NarrativeText, Title, ListItem
+from unstructured.documents.elements import Address, NarrativeText, Title, ListItem
 from unstructured.partition.text import partition_text
 DIRECTORY = pathlib.Path(__file__).parent.resolve()
 EXPECTED_OUTPUT = [
    NarrativeText(text="This is a test document to use for unit tests."),
    Address(text="Doylestown, PA 18901"),
    Title(text="Important points:"),
    ListItem(text="Hamburgers are delicious"),
    ListItem(text="Dogs are the best"),
@ -52,3 +53,15 @@ def test_partition_text_raises_with_too_many_specified():
    with pytest.raises(ValueError):
        partition_text(filename=filename, text=text)
 def test_partition_text_captures_everything_even_with_linebreaks():
    text = """
    VERY IMPORTANT MEMO
    DOYLESTOWN, PA 18901
    """
    elements = partition_text(text=text)
    assert elements == [
        Title(text="VERY IMPORTANT MEMO"),
        Address(text="DOYLESTOWN, PA 18901"),
    ]
--- a/test_unstructured/partition/test_text_type.py
+++ b/test_unstructured/partition/test_text_type.py
@ -1,4 +1,5 @@
 import pytest
 from unittest.mock import patch
 import unstructured.partition.text_type as text_type
@ -58,6 +59,7 @@ def test_is_possible_narrative_text(text, expected, monkeypatch):
        ("7", False),  # Fails because it is numeric
        ("", False),  # Fails because it is empty
        ("ITEM 1A. RISK FACTORS", True),  # Two "sentences", but both are short
        ("To My Dearest Friends,", False),  # Ends with a comma
    ],
 )
 def test_is_possible_title(text, expected, monkeypatch):
@ -120,11 +122,10 @@ def test_is_bulletized_text(text, expected):
    [
        ("Ask the teacher for an apple", True),
        ("Intellectual property", False),
        ("THIS MESSAGE WAS APPROVED", True),
    ],
 )
 def test_contains_verb(text, expected, monkeypatch):
    monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
    monkeypatch.setattr(text_type, "pos_tag", mock_pos_tag)
    has_verb = text_type.contains_verb(text)
    assert has_verb is expected
@ -135,13 +136,26 @@ def test_contains_verb(text, expected, monkeypatch):
        ("Intellectual Property in the United States", True),
        ("Intellectual property helps incentivize innovation.", False),
        ("THIS IS ALL CAPS. BUT IT IS TWO SENTENCES.", False),
        ("LOOK AT THIS IT IS CAPS BUT NOT A TITLE.", False),
        ("This Has All Caps. It's Weird But Two Sentences", False),
        ("The Business Report is expected within 6 hours of closing", False),
        ("", False),
    ],
 )
 def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):
    assert text_type.exceeds_cap_ratio(text) is expected
 def test_set_caps_ratio_with_environment_variable(monkeypatch):
    monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
    monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
-    assert text_type.exceeds_cap_ratio(text, threshold=0.3) is expected
+    monkeypatch.setenv("NARRATIVE_TEXT_CAP_THRESHOLD", "0.8")
    text = "All The King's Horses. And All The King's Men."
    with patch.object(text_type, "exceeds_cap_ratio", return_value=False) as mock_exceeds:
        text_type.is_possible_narrative_text(text)
    mock_exceeds.assert_called_once_with(text, threshold=0.8)
 def test_sentence_count(monkeypatch):
@ -153,3 +167,19 @@ def test_sentence_count(monkeypatch):
 def test_item_titles():
    text = "ITEM 1(A). THIS IS A TITLE"
    assert text_type.sentence_count(text, 3) < 2
@pytest.mark.parametrize(
    "text, expected",
    [
        ("Doylestown, PA 18901", True),
        ("DOYLESTOWN, PENNSYLVANIA, 18901", True),
        ("DOYLESTOWN, PENNSYLVANIA 18901", True),
        ("Doylestown, Pennsylvania 18901", True),
        ("     Doylestown, Pennsylvania 18901", True),
        ("The Business Report is expected within 6 hours of closing", False),
        ("", False),
    ],
 )
 def test_is_us_city_state_zip(text, expected):
    assert text_type.is_us_city_state_zip(text) is expected
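
Several of the single-sentence cap-ratio cases in the tests above can be reproduced with a simplified version of the check. The sketch below makes two assumptions that differ from the commit: it tokenizes on whitespace instead of using NLTK's `word_tokenize`, and it omits the multi-sentence guard. The function name is hypothetical.

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """Return True when the share of capitalized tokens is above the threshold."""
    # Text that is entirely upper case is exempt from the check.
    if text.isupper():
        return False

    # Assumption: plain whitespace tokenization rather than NLTK's word_tokenize.
    tokens = text.split()
    if not tokens:
        return False

    capitalized = sum(word.istitle() or word.isupper() for word in tokens)
    return capitalized / len(tokens) > threshold


title_like = exceeds_cap_ratio_sketch("Intellectual Property in the United States")
prose = exceeds_cap_ratio_sketch("Intellectual property helps incentivize innovation.")
all_caps = exceeds_cap_ratio_sketch("LOOK AT THIS IT IS CAPS BUT NOT A TITLE.")
```

With the `0.5` default, the first string has four capitalized tokens out of six and exceeds the ratio; the second has one out of five and does not; the third is skipped by the all-uppercase exemption.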
--- a/unstructured/version.py
+++ b/unstructured/version.py
@ -1 +1 @@
-__version__ = "0.4.4"  # pragma: no cover
+__version__ = "0.4.5-dev0"  # pragma: no cover
--- a/unstructured/documents/elements.py
+++ b/unstructured/documents/elements.py
@ -60,7 +60,13 @@ class Text(Element):
        return self.text
    def __eq__(self, other):
-        return (self.text == other.text) and (self.coordinates == other.coordinates)
+        return all(
            [
                (self.text == other.text),
                (self.coordinates == other.coordinates),
                (self.category == other.category),
            ]
        )
    def apply(self, *cleaners: Callable):
        """Applies a cleaning brick to the text element. The function that's passed in
@ -108,6 +114,14 @@ class Title(Text):
    pass
 class Address(Text):
    """A text element for capturing addresses."""
    category = "Address"
    pass
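
The effect of adding `category` to the equality check can be seen with stand-in classes. This is a sketch under stated assumptions: the classes below are stripped-down stand-ins rather than the library's own, the `"Uncategorized"` value on `Text` is assumed, and the `isinstance` guard is an addition that the commit does not contain.

```python
from typing import Any, Optional, Tuple


class Text:
    category = "Uncategorized"  # assumption: placeholder category for the base class

    def __init__(self, text: str, coordinates: Optional[Tuple[Any, ...]] = None):
        self.text = text
        self.coordinates = coordinates

    def __eq__(self, other: object) -> bool:
        # Guard added for this sketch so comparisons with other types stay safe.
        if not isinstance(other, Text):
            return NotImplemented
        return all(
            [
                self.text == other.text,
                self.coordinates == other.coordinates,
                self.category == other.category,
            ]
        )


class Address(Text):
    category = "Address"


same_kind = Address("Doylestown, PA 18901") == Address("Doylestown, PA 18901")
different_kind = Text("Doylestown, PA 18901") == Address("Doylestown, PA 18901")
```

Without the category comparison, the second check would pass on text alone, so a test expecting `Address` could not tell it apart from a plain `Text`.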
 class Image(Text):
    """A text element for capturing image metadata."""
--- a/unstructured/documents/html.py
+++ b/unstructured/documents/html.py
@ -13,12 +13,13 @@ from unstructured.logger import logger
 from unstructured.cleaners.core import clean_bullets, replace_unicode_quotes
 from unstructured.documents.base import Page
-from unstructured.documents.elements import ListItem, Element, NarrativeText, Title
+from unstructured.documents.elements import Address, ListItem, Element, NarrativeText, Text, Title
 from unstructured.documents.xml import XMLDocument
 from unstructured.partition.text_type import (
    is_bulleted_text,
    is_possible_narrative_text,
    is_possible_title,
    is_us_city_state_zip,
 )
 TEXT_TAGS: Final[List[str]] = ["p", "a", "td", "span", "font"]
@ -47,6 +48,18 @@ class TagsMixin:
        super().__init__(*args, **kwargs)
 class HTMLText(TagsMixin, Text):
    """Text with tag information."""
    pass
 class HTMLAddress(TagsMixin, Address):
    """Address with tag information."""
    pass
 class HTMLTitle(TagsMixin, Title):
    """Title with tag information."""
@ -203,6 +216,8 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
        if not clean_bullets(text):
            return None
        return HTMLListItem(text=clean_bullets(text), tag=tag, ancestortags=ancestortags)
    elif is_us_city_state_zip(text):
        return HTMLAddress(text=text, tag=tag, ancestortags=ancestortags)
    if len(text) < 2:
        return None
@ -211,8 +226,7 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
    elif is_possible_title(text):
        return HTMLTitle(text, tag=tag, ancestortags=ancestortags)
    else:
-        # Something that might end up here is text that's just a number.
+        return HTMLText(text, tag=tag, ancestortags=ancestortags)
        return None
 def _is_container_with_text(tag_elem: etree.Element) -> bool:
--- a/unstructured/nlp/patterns.py
+++ b/unstructured/nlp/patterns.py
@ -16,6 +16,23 @@ US_PHONE_NUMBERS_PATTERN = (
 )
 US_PHONE_NUMBERS_RE = re.compile(US_PHONE_NUMBERS_PATTERN)
 # NOTE(robinson) - Based on this regex from regex101. Regex was updated to run fast
 # and avoid catastrophic backtracking
 # ref: https://regex101.com/library/oR3jU1?page=673
 US_CITY_STATE_ZIP_PATTERN = (
    r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"
    r"(?:Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida"
    r"|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland"
    r"|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|"
    r"New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico|New[ ]York|North[ ]Carolina|North[ ]Dakota"
    r"|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island|South[ ]Carolina|South[ ]Dakota"
    r"|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia|Wisconsin|Wyoming"
    r"|AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN"
    r"|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|"
    r"WA|WV|WI|WY)(, |\s)?(?:\b\d{5}(?:-\d{4})?\b)"
 )
 )
 US_CITY_STATE_ZIP_RE = re.compile(US_CITY_STATE_ZIP_PATTERN)
 UNICODE_BULLETS: Final[List[str]] = [
    "\u0095",
    "\u2022",
--- a/unstructured/partition/docx.py
+++ b/unstructured/partition/docx.py
@ -3,20 +3,18 @@ from typing import IO, List, Optional
 import docx
 from unstructured.cleaners.core import clean_bullets
-from unstructured.documents.elements import Element, ListItem, NarrativeText, Text, Title
+from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
 from unstructured.partition.text_type import (
    is_bulleted_text,
    is_possible_narrative_text,
    is_possible_title,
    is_us_city_state_zip,
 )
 # NOTE(robinson) - documentation on built in styles can be found at the link below
 # ref: https://python-docx.readthedocs.io/en/latest/user/
 #   styles-understanding.html#paragraph-styles-in-default-template
 STYLE_TO_ELEMENT_MAPPING = {
    "Body Text": NarrativeText,
    "Body Text 2": NarrativeText,
    "Body Text 3": NarrativeText,
    "Caption": Text,  # TODO(robinson) - add caption element type
    "Heading 1": Title,
    "Heading 2": Title,
@ -87,6 +85,9 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
    text = paragraph.text
    style_name = paragraph.style.name
    if len(text.strip()) == 0:
        return None
    element_class = STYLE_TO_ELEMENT_MAPPING.get(style_name)
    # NOTE(robinson) - The "Normal" style name will return None since it's in the mapping.
@ -100,7 +101,11 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
 def _text_to_element(text: str) -> Optional[Text]:
    """Converts raw text into an unstructured Text element."""
    if is_bulleted_text(text):
-        return ListItem(text=clean_bullets(text))
+        clean_text = clean_bullets(text).strip()
        return ListItem(text=clean_bullets(text)) if clean_text else None
    elif is_us_city_state_zip(text):
        return Address(text=text)
    if len(text) < 2:
        return None
--- a/unstructured/partition/text.py
+++ b/unstructured/partition/text.py
@ -1,7 +1,7 @@
 import re
 from typing import IO, List, Optional
-from unstructured.documents.elements import Element, ListItem, NarrativeText, Title
+from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
 from unstructured.cleaners.core import clean_bullets
 from unstructured.nlp.patterns import PARAGRAPH_PATTERN
@ -9,6 +9,7 @@ from unstructured.partition.text_type import (
    is_possible_narrative_text,
    is_possible_title,
    is_bulleted_text,
    is_us_city_state_zip,
 )
@ -56,11 +57,16 @@ def partition_text(
        ctext = ctext.strip()
        if ctext == "":
-            break
+            continue
        if is_bulleted_text(ctext):
            elements.append(ListItem(text=clean_bullets(ctext)))
        elif is_us_city_state_zip(ctext):
            elements.append(Address(text=ctext))
        elif is_possible_narrative_text(ctext):
            elements.append(NarrativeText(text=ctext))
        elif is_possible_title(ctext):
            elements.append(Title(text=ctext))
        else:
            elements.append(Text(text=ctext))
    return elements
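
The switch from `break` to `continue` matters for input that starts with, or contains, an empty segment. The sketch below isolates that loop behavior. It assumes a plain newline split instead of the library's `PARAGRAPH_PATTERN`, and the helper name `nonempty_segments` is hypothetical.

```python
from typing import List


def nonempty_segments(text: str) -> List[str]:
    """Collect stripped, non-empty segments without stopping at blank ones."""
    segments = []
    # Assumption: split on newlines; the commit splits with PARAGRAPH_PATTERN.
    for chunk in text.split("\n"):
        chunk = chunk.strip()
        if chunk == "":
            # `break` here would discard everything after the first blank segment.
            continue
        segments.append(chunk)
    return segments


memo = """
    VERY IMPORTANT MEMO
    DOYLESTOWN, PA 18901
    """
segments = nonempty_segments(memo)
```

Because the triple-quoted string begins with a newline, the first segment is empty; with `break` in place of `continue` the loop would exit before reaching either line of the memo.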
--- a/unstructured/partition/text_type.py
+++ b/unstructured/partition/text_type.py
@ -1,15 +1,16 @@
 """partition.py implements logic for partitioning plain text documents into sections."""
 import os
 import sys
 from typing import List, Optional
 if sys.version_info < (3, 8):
-    from typing_extensions import Final
+    from typing_extensions import Final  # pragma: nocover
 else:
    from typing import Final
 from unstructured.cleaners.core import remove_punctuation
-from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE
+from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE, US_CITY_STATE_ZIP_RE
 from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
 from unstructured.logger import logger
@ -17,8 +18,19 @@ from unstructured.logger import logger
 POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]
-def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
+def is_possible_narrative_text(text: str, cap_threshold: float = 0.5) -> bool:
-    """Checks to see if the text passes all of the checks for a narrative text section."""
+    """Checks to see if the text passes all of the checks for a narrative text section.
    You can change the cap threshold using the cap_threshold kwarg or the
    NARRATIVE_TEXT_CAP_THRESHOLD environment variable. The environment variable takes
    precedence over the kwarg.
    Parameters
    ----------
    text
        the input text
    cap_threshold
        the percentage of capitalized words necessary to disqualify the segment as narrative
    """
    if len(text) == 0:
        logger.debug("Not narrative. Text is empty.")
        return False
@ -27,6 +39,9 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
        logger.debug(f"Not narrative. Text is all numeric:\n\n{text}")
        return False
    # NOTE(robinson): it gets read in from the environment as a string so we need to
    # cast it to a float
    cap_threshold = float(os.environ.get("NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold))
    if exceeds_cap_ratio(text, threshold=cap_threshold):
        logger.debug(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}")
        return False
@ -39,11 +54,23 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
 def is_possible_title(text: str, sentence_min_length: int = 5) -> bool:
-    """Checks to see if the text passes all of the checks for a valid title."""
+    """Checks to see if the text passes all of the checks for a valid title.
    Parameters
    ----------
    text
        the input text
    sentence_min_length
        the minimum number of words required to consider a section of text a sentence
    """
    if len(text) == 0:
        logger.debug("Not a title. Text is empty.")
        return False
    # NOTE(robinson) - Prevent flagging salutations like "To My Dearest Friends," as titles
    if text.endswith(","):
        return False
    if text.isnumeric():
        logger.debug(f"Not a title. Text is all numeric:\n\n{text}")
        return False
@ -76,6 +103,9 @@ def contains_us_phone_number(text: str) -> bool:
 def contains_verb(text: str) -> bool:
    """Use a POS tagger to check if a segment contains verbs. If the section does not have verbs,
    that indicates that it is not narrative text."""
    if text.isupper():
        text = text.lower()
    pos_tags = pos_tag(text)
    for _, tag in pos_tags:
        if tag in POS_VERB_TAGS:
@ -109,7 +139,7 @@ def sentence_count(text: str, min_length: Optional[int] = None) -> int:
    return count
-def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
+def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """Checks the title ratio in a section of text. If a sufficient proportion of the text is
    capitalized."""
    # NOTE(robinson) - Currently limiting this to only sections of text with one sentence.
@ -118,9 +148,24 @@ def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
        logger.debug(f"Text does not contain multiple sentences:\n\n{text}")
        return False
    if text.isupper():
        return False
    tokens = word_tokenize(text)
    if len(tokens) == 0:
        return False
    capitalized = sum([word.istitle() or word.isupper() for word in tokens])
    ratio = capitalized / len(tokens)
    return ratio > threshold
 def is_us_city_state_zip(text) -> bool:
    """Checks if the given text is in the format of US city/state/zip code.
    Examples
    --------
    Doylestown, PA 18901
    Doylestown, Pennsylvania, 18901
    DOYLESTOWN, PENNSYLVANIA 18901
    """
    return US_CITY_STATE_ZIP_RE.match(text.strip()) is not None
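
For experimenting with the city/state/ZIP check outside the library, a reduced pattern is enough. The sketch below keeps the overall shape of the commit's pattern (city words, comma, state, ZIP) but, as an assumption for brevity, lists only three states; the names `STATES`, `CITY_STATE_ZIP_SKETCH_RE`, and `is_us_city_state_zip_sketch` are hypothetical.

```python
import re

# Assumption: a reduced state list for illustration. The pattern in this commit
# enumerates every state name and postal abbreviation.
STATES = ["Pennsylvania", "New York", "Ohio", "PA", "NY", "OH"]
STATE_ALTERNATION = "|".join(re.escape(state) for state in STATES)

CITY_STATE_ZIP_SKETCH_RE = re.compile(
    r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"  # city: one to five words, then a comma
    rf"(?:{STATE_ALTERNATION})"                    # state name or postal code
    r"(?:, |\s)?\b\d{5}(?:-\d{4})?\b"              # five-digit ZIP, optional +4 suffix
)


def is_us_city_state_zip_sketch(text: str) -> bool:
    """Return True if the stripped text starts with a city, state, ZIP sequence."""
    return CITY_STATE_ZIP_SKETCH_RE.match(text.strip()) is not None


examples = {
    "Doylestown, PA 18901": is_us_city_state_zip_sketch("Doylestown, PA 18901"),
    "DOYLESTOWN, PENNSYLVANIA, 18901": is_us_city_state_zip_sketch(
        "DOYLESTOWN, PENNSYLVANIA, 18901"
    ),
    "no address": is_us_city_state_zip_sketch(
        "The Business Report is expected within 6 hours of closing"
    ),
}
```

Note that `match` only anchors at the start of the string, so text that continues after the ZIP code still counts as a match in this sketch.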