mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00
fix: cleanup from live .docx tests (#177)
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
parent 1ce8447ba7
commit 339c133326

10 CHANGELOG.md
@@ -1,3 +1,13 @@
+## 0.4.5-dev0
+
+* Loosen the default cap threshold to `0.5`.
+* Add a `NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
+* Unknown text elements are identified as `Text` for HTML and plain text documents.
+* `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information
+  is insufficient to determine that the text is narrative.
+* Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
+* Adds an `Address` element for capturing elements that only contain an address.
+
 ## 0.4.4
 
 * Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects

@@ -246,8 +246,10 @@ for consideration as narrative text. The function performs the following checks
 * Text that does not contain a verb cannot be narrative text
 * Text that exceeds the specified caps ratio cannot be narrative text. The threshold
   is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
-  ``cap_threshold=1.0``. You may want to ignore this check when dealing with text
-  that is all caps.
+  ``cap_threshold=1.0``. You can also set the threshold by using the
+  ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment variable. The environment variable
+  takes precedence over the kwarg.
+* The cap ratio test does not apply to text that is all uppercase.
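
As a rough illustration of the precedence rule in the list above, the sketch below resolves the threshold so that the environment variable wins whenever it is set and the kwarg is used otherwise. The helper name `resolve_cap_threshold` is hypothetical and exists only for this example; it is not part of the library's API.

```python
import os


def resolve_cap_threshold(cap_threshold: float = 0.5) -> float:
    """Hypothetical helper: prefer the environment variable over the kwarg."""
    # Values read from the environment are strings, so cast to float.
    return float(os.environ.get("NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold))


# With the variable unset, the kwarg (or its default) is used.
os.environ.pop("NARRATIVE_TEXT_CAP_THRESHOLD", None)
default_value = resolve_cap_threshold()
kwarg_value = resolve_cap_threshold(cap_threshold=1.0)

# With the variable set, it overrides whatever the caller passed.
os.environ["NARRATIVE_TEXT_CAP_THRESHOLD"] = "0.8"
env_value = resolve_cap_threshold(cap_threshold=1.0)
os.environ.pop("NARRATIVE_TEXT_CAP_THRESHOLD", None)
```

One consequence of this ordering is that a caller passing ``cap_threshold=1.0`` to disable the check can still be overridden by the environment.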
 
 
 Examples:
@@ -277,8 +279,8 @@ for consideration as a title. The function performs the following checks:
 
 * Empty text cannot be a title
 * Text that is all numeric cannot be a title
-* If a title contains more than one sentence that exceeds a certain length, it cannot be a title.
-  Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
+* If a title contains more than one sentence that exceeds a certain length, it cannot be a title. Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
+* If a segment of text ends in a comma, it is not considered a potential title. This is to avoid salutations like "To My Dearest Friends," getting flagged as titles.
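
The simpler checks in the list above can be sketched without any NLP dependencies. The function name `passes_basic_title_checks` is hypothetical, and the sentence-length check is deliberately left out because it depends on a sentence tokenizer.

```python
def passes_basic_title_checks(text: str) -> bool:
    """Hypothetical sketch of the empty, trailing-comma, and numeric checks."""
    if len(text) == 0:
        return False
    # Salutations such as "To My Dearest Friends," end in a comma.
    if text.endswith(","):
        return False
    if text.isnumeric():
        return False
    return True


salutation = passes_basic_title_checks("To My Dearest Friends,")
numeric = passes_basic_title_checks("7")
heading = passes_basic_title_checks("ITEM 1A. RISK FACTORS")
```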
 
 
 Examples:
@@ -320,7 +322,9 @@ Examples:
 
 Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
 be used independently as well. The function identifies verbs using the NLTK part of speech
-tagger. The following part of speech tags are identified as verbs:
+tagger. Text that is all upper case is lower cased before part of speech detection. This is
+because the upper case letters sometimes cause the part of speech tagger to miss verbs.
+The following part of speech tags are identified as verbs:
 
 * ``VB``
 * ``VBG``
@@ -374,6 +378,9 @@ Determines if the section of text exceeds the specified caps ratio. Used in
 ``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
 as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
 defaults to ``0.3``. Only runs on sections of text that are a single sentence.
+You can also set the threshold using the ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment
+variable. The environment variable takes precedence over the kwarg. The caps ratio
+check does not apply to text that is all capitalized.
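
A minimal sketch of the ratio computation described above, under two stated simplifications: whitespace splitting stands in for the NLTK word tokenizer, and the single-sentence restriction is omitted. The function name `cap_ratio_exceeded` is hypothetical.

```python
def cap_ratio_exceeded(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical sketch: is the share of capitalized words above the threshold?"""
    # Text that is entirely upper case is exempt from the check.
    if text.isupper():
        return False
    tokens = text.split()  # assumption: stands in for a real word tokenizer
    if len(tokens) == 0:
        return False
    capitalized = sum(word.istitle() or word.isupper() for word in tokens)
    return capitalized / len(tokens) > threshold


title_like = cap_ratio_exceeded("Intellectual Property in the United States")
prose_like = cap_ratio_exceeded("Intellectual property helps incentivize innovation.")
all_caps = cap_ratio_exceeded("LOOK AT THIS IT IS CAPS BUT NOT A TITLE.")
```

In the first string four of six words are capitalized, so the ratio is above ``0.5``; in the second only one of five is.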
 
 Examples:
 

@@ -11,6 +11,8 @@ elements.
 * ``NarrativeText`` - Sections of a document that include well-formed prose. Sub-class of ``Text``.
 * ``Title`` - Headings and sub-headings within a document. Sub-class of ``Text``.
 * ``ListItem`` - A text element that is part of an ordered or unordered list. Sub-class of ``Text``.
+* ``Address`` - A text item that consists only of an address. Sub-class of ``Text``.
 * ``CheckBox`` - An element representing a check box. Has a ``checked`` element, which is a boolean indicating whether or not that box is checked.
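
Because ``Address``, ``Title``, and the other text elements are sub-classes of ``Text``, two elements can hold identical text and still differ by category. The sketch below shows category-aware equality with simplified stand-in classes; the real classes carry more state, and the ``"Uncategorized"`` value is an assumption made for this example.

```python
class Text:
    category = "Uncategorized"  # assumed placeholder value for this sketch

    def __init__(self, text, coordinates=None):
        self.text = text
        self.coordinates = coordinates

    def __eq__(self, other):
        # Elements are equal only if text, coordinates, and category all match.
        return all(
            [
                self.text == other.text,
                self.coordinates == other.coordinates,
                self.category == other.category,
            ]
        )


class Title(Text):
    category = "Title"


class Address(Text):
    category = "Address"


same = Address("Doylestown, PA 18901") == Address("Doylestown, PA 18901")
different = Title("Hi All,") == Text("Hi All,")
```

With the category included in the comparison, a test that expects ``Text`` no longer passes when the partitioner returns ``Title`` for the same string.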
 
+
 #########################################

@@ -1,7 +1,9 @@
 This is a test document to use for unit tests.
 
+Doylestown, PA 18901
+
 Important points:
 
 - Hamburgers are delicious
 - Dogs are the best
-- I love fuzzy blankets
+- I love fuzzy blankets

@@ -4,7 +4,7 @@ from lxml import etree
 import pytest
 
 from unstructured.documents.base import Page
-from unstructured.documents.elements import ListItem, NarrativeText, Title
+from unstructured.documents.elements import Address, ListItem, NarrativeText, Text, Title
 from unstructured.documents.html import (
     LIST_ITEM_TAGS,
     HTMLDocument,
@@ -153,7 +153,7 @@ def test_parse_not_anything(monkeypatch):
     document_tree = etree.fromstring(doc, etree.HTMLParser())
     el = document_tree.find(".//p")
     parsed_el = html._parse_tag(el)
-    assert parsed_el is None
+    assert parsed_el == Text(text="This is nothing")
 
 
 def test_parse_bullets(monkeypatch):
@@ -484,6 +484,7 @@ def test_containers_with_text_are_processed():
 <div dir=3D"ltr">
 <div dir=3D"ltr">Dino the Datasaur<div>Unstructured Technologies<br><div>Data Scientist
 </div>
+<div>Doylestown, PA 18901</div>
 <div><br></div>
 </div>
 </div>
@@ -494,12 +495,13 @@ def test_containers_with_text_are_processed():
     html_document._read()
 
     assert html_document.elements == [
-        Title(text="Hi All,"),
+        Text(text="Hi All,"),
         NarrativeText(text="Get excited for our first annual family day!"),
         Title(text="Best."),
         Title(text="Dino the Datasaur"),
         Title(text="Unstructured Technologies"),
         Title(text="Data Scientist"),
+        Address(text="Doylestown, PA 18901"),
     ]
 
 

@@ -4,7 +4,7 @@ import pytest
 
 import docx
 
-from unstructured.documents.elements import NarrativeText, Title, Text, ListItem
+from unstructured.documents.elements import Address, NarrativeText, Title, Text, ListItem
 from unstructured.partition.auto import partition
 import unstructured.partition.auto as auto
 
@@ -115,6 +115,7 @@ def test_auto_partition_html_from_file_rb():
 
 EXPECTED_TEXT_OUTPUT = [
     NarrativeText(text="This is a test document to use for unit tests."),
+    Address(text="Doylestown, PA 18901"),
     Title(text="Important points:"),
     ListItem(text="Hamburgers are delicious"),
     ListItem(text="Dogs are the best"),

@@ -3,7 +3,7 @@ import pytest
 
 import docx
 
-from unstructured.documents.elements import ListItem, NarrativeText, Title, Text
+from unstructured.documents.elements import Address, ListItem, NarrativeText, Title, Text
 from unstructured.partition.docx import partition_docx
 
 
@@ -14,7 +14,11 @@ def mock_document():
     document.add_paragraph("These are a few of my favorite things:", style="Heading 1")
     # NOTE(robinson) - this should get picked up as a list item due to the •
     document.add_paragraph("• Parrots", style="Normal")
+    # NOTE(robinson) - this should get dropped because it's empty
+    document.add_paragraph("• ", style="Normal")
     document.add_paragraph("Hockey", style="List Bullet")
+    # NOTE(robinson) - this should get dropped because it's empty
+    document.add_paragraph("", style="List Bullet")
     # NOTE(robinson) - this should get picked up as a title
     document.add_paragraph("Analysis", style="Normal")
     # NOTE(robinson) - this should get dropped because it is empty
@@ -24,6 +28,8 @@ def mock_document():
     document.add_paragraph("This is my third thought.", style="Body Text")
     # NOTE(robinson) - this should just be regular text
     document.add_paragraph("2023")
+    # NOTE(robinson) - this should be an address
+    document.add_paragraph("DOYLESTOWN, PA 18901")
 
     return document
 
@@ -38,6 +44,7 @@ def expected_elements():
         NarrativeText("This is my first thought. This is my second thought."),
         NarrativeText("This is my third thought."),
         Text("2023"),
+        Address("DOYLESTOWN, PA 18901"),
     ]
 
 

@@ -2,13 +2,14 @@ import os
 import pathlib
 import pytest
 
-from unstructured.documents.elements import NarrativeText, Title, ListItem
+from unstructured.documents.elements import Address, NarrativeText, Title, ListItem
 from unstructured.partition.text import partition_text
 
 DIRECTORY = pathlib.Path(__file__).parent.resolve()
 
 EXPECTED_OUTPUT = [
     NarrativeText(text="This is a test document to use for unit tests."),
+    Address(text="Doylestown, PA 18901"),
     Title(text="Important points:"),
     ListItem(text="Hamburgers are delicious"),
     ListItem(text="Dogs are the best"),
@@ -52,3 +53,15 @@ def test_partition_text_raises_with_too_many_specified():
 
     with pytest.raises(ValueError):
         partition_text(filename=filename, text=text)
+
+
+def test_partition_text_captures_everything_even_with_linebreaks():
+    text = """
+    VERY IMPORTANT MEMO
+    DOYLESTOWN, PA 18901
+    """
+    elements = partition_text(text=text)
+    assert elements == [
+        Title(text="VERY IMPORTANT MEMO"),
+        Address(text="DOYLESTOWN, PA 18901"),
+    ]

@@ -1,4 +1,5 @@
 import pytest
+from unittest.mock import patch
 
 import unstructured.partition.text_type as text_type
 
@@ -58,6 +59,7 @@ def test_is_possible_narrative_text(text, expected, monkeypatch):
         ("7", False), # Fails because it is numeric
         ("", False), # Fails because it is empty
         ("ITEM 1A. RISK FACTORS", True), # Two "sentences", but both are short
+        ("To My Dearest Friends,", False), # Ends with a comma
     ],
 )
 def test_is_possible_title(text, expected, monkeypatch):
@@ -120,11 +122,10 @@ def test_is_bulletized_text(text, expected):
     [
         ("Ask the teacher for an apple", True),
         ("Intellectual property", False),
+        ("THIS MESSAGE WAS APPROVED", True),
     ],
 )
 def test_contains_verb(text, expected, monkeypatch):
-    monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
-    monkeypatch.setattr(text_type, "pos_tag", mock_pos_tag)
     has_verb = text_type.contains_verb(text)
     assert has_verb is expected
 
@ -135,13 +136,26 @@ def test_contains_verb(text, expected, monkeypatch):
|
||||
("Intellectual Property in the United States", True),
|
||||
("Intellectual property helps incentivize innovation.", False),
|
||||
("THIS IS ALL CAPS. BUT IT IS TWO SENTENCES.", False),
|
||||
("LOOK AT THIS IT IS CAPS BUT NOT A TITLE.", False),
|
||||
("This Has All Caps. It's Weird But Two Sentences", False),
|
||||
("The Business Report is expected within 6 hours of closing", False),
|
||||
("", False),
|
||||
],
|
||||
)
|
||||
def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):
|
||||
assert text_type.exceeds_cap_ratio(text) is expected
|
||||
|
||||
|
||||
def test_set_caps_ratio_with_environment_variable(monkeypatch):
|
||||
monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
|
||||
monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
|
||||
assert text_type.exceeds_cap_ratio(text, threshold=0.3) is expected
|
||||
monkeypatch.setenv("NARRATIVE_TEXT_CAP_THRESHOLD", 0.8)
|
||||
|
||||
text = "All The King's Horses. And All The King's Men."
|
||||
with patch.object(text_type, "exceeds_cap_ratio", return_value=False) as mock_exceeds:
|
||||
text_type.is_possible_narrative_text(text)
|
||||
|
||||
mock_exceeds.assert_called_once_with(text, threshold=0.8)
|
||||
|
||||
|
||||
def test_sentence_count(monkeypatch):
@@ -153,3 +167,19 @@ def test_sentence_count(monkeypatch):
 def test_item_titles():
     text = "ITEM 1(A). THIS IS A TITLE"
     assert text_type.sentence_count(text, 3) < 2
+
+
+@pytest.mark.parametrize(
+    "text, expected",
+    [
+        ("Doylestown, PA 18901", True),
+        ("DOYLESTOWN, PENNSYLVANIA, 18901", True),
+        ("DOYLESTOWN, PENNSYLVANIA 18901", True),
+        ("Doylestown, Pennsylvania 18901", True),
+        (" Doylestown, Pennsylvania 18901", True),
+        ("The Business Report is expected within 6 hours of closing", False),
+        ("", False),
+    ],
+)
+def test_is_us_city_state_zip(text, expected):
+    assert text_type.is_us_city_state_zip(text) is expected

@@ -1 +1 @@
-__version__ = "0.4.4" # pragma: no cover
+__version__ = "0.4.5-dev0" # pragma: no cover

@@ -60,7 +60,13 @@ class Text(Element):
         return self.text
 
     def __eq__(self, other):
-        return (self.text == other.text) and (self.coordinates == other.coordinates)
+        return all(
+            [
+                (self.text == other.text),
+                (self.coordinates == other.coordinates),
+                (self.category == other.category),
+            ]
+        )
 
     def apply(self, *cleaners: Callable):
         """Applies a cleaning brick to the text element. The function that's passed in
@@ -108,6 +114,14 @@ class Title(Text):
     pass
 
 
+class Address(Text):
+    """A text element for capturing addresses."""
+
+    category = "Address"
+
+    pass
+
+
 class Image(Text):
     """A text element for capturing image metadata."""
 

@@ -13,12 +13,13 @@ from unstructured.logger import logger
 
 from unstructured.cleaners.core import clean_bullets, replace_unicode_quotes
 from unstructured.documents.base import Page
-from unstructured.documents.elements import ListItem, Element, NarrativeText, Title
+from unstructured.documents.elements import Address, ListItem, Element, NarrativeText, Text, Title
 from unstructured.documents.xml import XMLDocument
 from unstructured.partition.text_type import (
     is_bulleted_text,
     is_possible_narrative_text,
     is_possible_title,
+    is_us_city_state_zip,
 )
 
 TEXT_TAGS: Final[List[str]] = ["p", "a", "td", "span", "font"]
@@ -47,6 +48,18 @@ class TagsMixin:
         super().__init__(*args, **kwargs)
 
 
+class HTMLText(TagsMixin, Text):
+    """Text with tag information."""
+
+    pass
+
+
+class HTMLAddress(TagsMixin, Address):
+    """Address with tag information."""
+
+    pass
+
+
 class HTMLTitle(TagsMixin, Title):
     """Title with tag information."""
 
@@ -203,6 +216,8 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
         if not clean_bullets(text):
             return None
         return HTMLListItem(text=clean_bullets(text), tag=tag, ancestortags=ancestortags)
+    elif is_us_city_state_zip(text):
+        return HTMLAddress(text=text, tag=tag, ancestortags=ancestortags)
 
     if len(text) < 2:
         return None
@@ -211,8 +226,7 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
     elif is_possible_title(text):
         return HTMLTitle(text, tag=tag, ancestortags=ancestortags)
     else:
-        # Something that might end up here is text that's just a number.
-        return None
+        return HTMLText(text, tag=tag, ancestortags=ancestortags)
 
 
 def _is_container_with_text(tag_elem: etree.Element) -> bool:

@@ -16,6 +16,23 @@ US_PHONE_NUMBERS_PATTERN = (
 )
 US_PHONE_NUMBERS_RE = re.compile(US_PHONE_NUMBERS_PATTERN)
 
+# NOTE(robinson) - Based on this regex from regex101. Regex was updated to run fast
+# and avoid catastrophic backtracking
+# ref: https://regex101.com/library/oR3jU1?page=673
+US_CITY_STATE_ZIP_PATTERN = (
+    r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"
+    r"(?:{Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida"
+    r"|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland"
+    r"|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|"
+    r"New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico|New[ ]York|North[ ]Carolina|North[ ]Dakota"
+    r"|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island|South[ ]Carolina|South[ ]Dakota"
+    r"|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia|Wisconsin|Wyoming}"
+    r"|{AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN"
+    r"|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|"
+    r"WA|WV|WI|WY})(, |\s)?(?:\b\d{5}(?:-\d{4})?\b)"
+)
+US_CITY_STATE_ZIP_RE = re.compile(US_CITY_STATE_ZIP_PATTERN)
+
 UNICODE_BULLETS: Final[List[str]] = [
     "\u0095",
     "\u2022",

@@ -3,20 +3,18 @@ from typing import IO, List, Optional
 import docx
 
 from unstructured.cleaners.core import clean_bullets
-from unstructured.documents.elements import Element, ListItem, NarrativeText, Text, Title
+from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
 from unstructured.partition.text_type import (
     is_bulleted_text,
     is_possible_narrative_text,
     is_possible_title,
+    is_us_city_state_zip,
 )
 
 # NOTE(robinson) - documentation on built in styles can be found at the link below
 # ref: https://python-docx.readthedocs.io/en/latest/user/
 # styles-understanding.html#paragraph-styles-in-default-template
 STYLE_TO_ELEMENT_MAPPING = {
-    "Body Text": NarrativeText,
-    "Body Text 2": NarrativeText,
-    "Body Text 3": NarrativeText,
     "Caption": Text, # TODO(robinson) - add caption element type
     "Heading 1": Title,
     "Heading 2": Title,
@@ -87,6 +85,9 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
     text = paragraph.text
     style_name = paragraph.style.name
 
+    if len(text.strip()) == 0:
+        return None
+
     element_class = STYLE_TO_ELEMENT_MAPPING.get(style_name)
 
     # NOTE(robinson) - The "Normal" style name will return None since it's in the mapping.
@@ -100,7 +101,11 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
 def _text_to_element(text: str) -> Optional[Text]:
     """Converts raw text into an unstructured Text element."""
     if is_bulleted_text(text):
-        return ListItem(text=clean_bullets(text))
+        clean_text = clean_bullets(text).strip()
+        return ListItem(text=clean_bullets(text)) if clean_text else None
+
+    elif is_us_city_state_zip(text):
+        return Address(text=text)
 
     if len(text) < 2:
         return None

@@ -1,7 +1,7 @@
 import re
 from typing import IO, List, Optional
 
-from unstructured.documents.elements import Element, ListItem, NarrativeText, Title
+from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
 
 from unstructured.cleaners.core import clean_bullets
 from unstructured.nlp.patterns import PARAGRAPH_PATTERN
@@ -9,6 +9,7 @@ from unstructured.partition.text_type import (
     is_possible_narrative_text,
     is_possible_title,
     is_bulleted_text,
+    is_us_city_state_zip,
 )
 
 
@@ -56,11 +57,16 @@ def partition_text(
         ctext = ctext.strip()
 
         if ctext == "":
-            break
+            continue
         if is_bulleted_text(ctext):
             elements.append(ListItem(text=clean_bullets(ctext)))
+        elif is_us_city_state_zip(ctext):
+            elements.append(Address(text=ctext))
         elif is_possible_narrative_text(ctext):
             elements.append(NarrativeText(text=ctext))
         elif is_possible_title(ctext):
             elements.append(Title(text=ctext))
+        else:
+            elements.append(Text(text=ctext))
+
     return elements

@@ -1,15 +1,16 @@
 """partition.py implements logic for partitioning plain text documents into sections."""
+import os
 import sys
 
 from typing import List, Optional
 
 if sys.version_info < (3, 8):
-    from typing_extensions import Final
+    from typing_extensions import Final # pragma: nocover
 else:
     from typing import Final
 
 from unstructured.cleaners.core import remove_punctuation
-from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE
+from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE, US_CITY_STATE_ZIP_RE
 from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
 from unstructured.logger import logger
 
@@ -17,8 +18,19 @@ from unstructured.logger import logger
 POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]
 
 
-def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
-    """Checks to see if the text passes all of the checks for a narrative text section."""
+def is_possible_narrative_text(text: str, cap_threshold: float = 0.5) -> bool:
+    """Checks to see if the text passes all of the checks for a narrative text section.
+    You can change the cap threshold using the cap_threshold kwarg or the
+    NARRATIVE_TEXT_CAP_THRESHOLD environment variable. The environment variable takes
+    precedence over the kwarg.
+
+    Parameters
+    ----------
+    text
+        the input text
+    cap_threshold
+        the percentage of capitalized words necessary to disqualify the segment as narrative
+    """
     if len(text) == 0:
         logger.debug("Not narrative. Text is empty.")
         return False
@@ -27,6 +39,9 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
         logger.debug(f"Not narrative. Text is all numeric:\n\n{text}")
         return False
 
+    # NOTE(robinson): it gets read in from the environment as a string so we need to
+    # cast it to a float
+    cap_threshold = float(os.environ.get("NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold))
     if exceeds_cap_ratio(text, threshold=cap_threshold):
         logger.debug(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}")
         return False
@@ -39,11 +54,23 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
 
 
 def is_possible_title(text: str, sentence_min_length: int = 5) -> bool:
-    """Checks to see if the text passes all of the checks for a valid title."""
+    """Checks to see if the text passes all of the checks for a valid title.
+
+    Parameters
+    ----------
+    text
+        the input text
+    sentence_min_length
+        the minimum number of words required to consider a section of text a sentence
+    """
     if len(text) == 0:
         logger.debug("Not a title. Text is empty.")
         return False
 
+    # NOTE(robinson) - Prevent flagging salutations like "To My Dearest Friends," as titles
+    if text.endswith(","):
+        return False
+
     if text.isnumeric():
         logger.debug(f"Not a title. Text is all numeric:\n\n{text}")
         return False
@@ -76,6 +103,9 @@ def contains_us_phone_number(text: str) -> bool:
 def contains_verb(text: str) -> bool:
     """Use a POS tagger to check if a segment contains verbs. If the section does not have verbs,
     that indicates that it is not narrative text."""
+    if text.isupper():
+        text = text.lower()
+
     pos_tags = pos_tag(text)
     for _, tag in pos_tags:
         if tag in POS_VERB_TAGS:
@@ -109,7 +139,7 @@ def sentence_count(text: str, min_length: Optional[int] = None) -> int:
     return count
 
 
-def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
+def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
     """Checks the title ratio in a section of text. If a sufficient proportion of the text is
     capitalized."""
     # NOTE(robinson) - Currently limiting this to only sections of text with one sentence.
@@ -118,9 +148,24 @@ def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
         logger.debug(f"Text does not contain multiple sentences:\n\n{text}")
         return False
 
+    if text.isupper():
+        return False
+
     tokens = word_tokenize(text)
     if len(tokens) == 0:
         return False
     capitalized = sum([word.istitle() or word.isupper() for word in tokens])
     ratio = capitalized / len(tokens)
     return ratio > threshold
+
+
+def is_us_city_state_zip(text) -> bool:
+    """Checks if the given text is in the format of US city/state/zip code.
+
+    Examples
+    --------
+    Doylestown, PA 18901
+    Doylestown, Pennsylvania, 18901
+    DOYLESTOWN, PENNSYLVANIA 18901
+    """
+    return US_CITY_STATE_ZIP_RE.match(text.strip()) is not None
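
For readers who want to experiment with the city/state/zip matching idea on its own, here is a small self-contained sketch. It is not the pattern from this commit: the state list is abbreviated to four entries for illustration, the alternation uses a plain non-capturing group, and the names `STATE_NAMES`, `CITY_STATE_ZIP_RE`, and `looks_like_city_state_zip` are hypothetical.

```python
import re

# Abbreviated, illustrative state list (assumption); a full pattern would list every state.
STATE_NAMES = ["Pennsylvania", "New York", "PA", "NY"]

CITY_STATE_ZIP_RE = re.compile(
    r"(?i)\b(?:[a-z][a-z.-]{1,15}[ ]?){1,5},\s?"  # city: one to five words, then a comma
    r"(?:" + "|".join(STATE_NAMES) + r")"  # state name or abbreviation
    r"(?:,\s?|\s)?"  # optional comma or whitespace before the zip
    r"\d{5}(?:-\d{4})?\b"  # five-digit zip with optional +4 suffix
)


def looks_like_city_state_zip(text: str) -> bool:
    """Hypothetical helper: does the text start with a city, state, zip sequence?"""
    return CITY_STATE_ZIP_RE.match(text.strip()) is not None


short_form = looks_like_city_state_zip("Doylestown, PA 18901")
long_form = looks_like_city_state_zip("DOYLESTOWN, PENNSYLVANIA, 18901")
prose = looks_like_city_state_zip("The Business Report is expected within 6 hours of closing")
```

Bounding the city to five words of at most sixteen characters each keeps backtracking limited on long inputs that contain no comma.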