mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00
fix: cleanup from live .docx tests (#177)
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
parent 1ce8447ba7
commit 339c133326
10
CHANGELOG.md
@ -1,3 +1,13 @@
|
|||||||
|
## 0.4.5-dev0
|
||||||
|
|
||||||
|
* Loosen the default cap threshold to `0.5`.
|
||||||
|
* Add a `NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
|
||||||
|
* Unknown text elements are identified as `Text` for HTML and plain text documents.
|
||||||
|
* `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information
|
||||||
|
is insufficient to determine that the text is narrative.
|
||||||
|
* Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
|
||||||
|
* Adds an `Address` element for capturing elements that only contain an address.
|
||||||
|
|
||||||
## 0.4.4
|
## 0.4.4
|
||||||
|
|
||||||
* Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects
|
* Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects
|
||||||
|
@ -246,8 +246,10 @@ for consideration as narrative text. The function performs the following checks
|
|||||||
* Text that does not contain a verb cannot be narrative text
|
* Text that does not contain a verb cannot be narrative text
|
||||||
* Text that exceeds the specified caps ratio cannot be narrative text. The threshold
|
* Text that exceeds the specified caps ratio cannot be narrative text. The threshold
|
||||||
is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
|
is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
|
||||||
``cap_threshold=1.0``. You may want to ignore this check when dealing with text
|
``cap_threshold=1.0``. You can also set the threshold by using the
|
||||||
that is all caps.
|
``NARRATIVE_TEXT_CAP_THRESHOLD`` environment variable. The environment variable
|
||||||
|
takes precedence over the kwarg.
|
||||||
|
* The cap ratio test does not apply to text that is all uppercase.
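A minimal sketch of the precedence rule described above, assuming a hypothetical helper named `resolve_cap_threshold` (this name is not part of the library). It mirrors the `os.environ.get` lookup this commit adds to `is_possible_narrative_text`: the environment variable, when set, overrides whatever is passed as the kwarg.

```python
import os

ENV_VAR = "NARRATIVE_TEXT_CAP_THRESHOLD"


def resolve_cap_threshold(cap_threshold: float = 0.5) -> float:
    """Hypothetical helper: returns the threshold that would take effect."""
    # Environment values are always strings, so the result is cast to float.
    return float(os.environ.get(ENV_VAR, cap_threshold))


# With the variable unset, the kwarg is used as-is.
os.environ.pop(ENV_VAR, None)
kwarg_only = resolve_cap_threshold(cap_threshold=1.0)

# With the variable set, it wins over the kwarg.
os.environ[ENV_VAR] = "0.8"
env_wins = resolve_cap_threshold(cap_threshold=1.0)
os.environ.pop(ENV_VAR, None)

print(kwarg_only, env_wins)
```

One consequence of this ordering is that passing ``cap_threshold=1.0`` does not disable the check while the environment variable is set to a lower value.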
|
||||||
|
|
||||||
|
|
||||||
Examples:
|
Examples:
|
||||||
@ -277,8 +279,8 @@ for consideration as a title. The function performs the following checks:
|
|||||||
|
|
||||||
* Empty text cannot be a title
|
* Empty text cannot be a title
|
||||||
* Text that is all numeric cannot be a title
|
* Text that is all numeric cannot be a title
|
||||||
* If a title contains more than one sentence that exceeds a certain length, it cannot be a title.
|
* If a title contains more than one sentence that exceeds a certain length, it cannot be a title. Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
|
||||||
Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
|
* If a segment of text ends in a comma, it is not considered a potential title. This is to avoid salutations like "To My Dearest Friends," getting flagged as titles.
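The trailing-comma rule can be illustrated with a short sketch. The helper name `ends_in_comma` is an assumption made for this example; the diff does not show how the library implements the check, only the behavior exercised by the new test case.

```python
def ends_in_comma(text: str) -> bool:
    """Hypothetical helper: True when the text ends in a comma, ignoring surrounding whitespace."""
    return text.strip().endswith(",")


salutation = ends_in_comma("To My Dearest Friends,")
heading = ends_in_comma("ITEM 1A. RISK FACTORS")
print(salutation, heading)
```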
|
||||||
|
|
||||||
|
|
||||||
Examples:
|
Examples:
|
||||||
@ -320,7 +322,9 @@ Examples:
|
|||||||
|
|
||||||
Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
|
Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
|
||||||
be used independently as well. The function identifies verbs using the NLTK part of speech
|
be used independently as well. The function identifies verbs using the NLTK part of speech
|
||||||
tagger. The following part of speech tags are identified as verbs:
|
tagger. Text that is all upper case is lower cased before part of speech detection. This is
|
||||||
|
because the upper case letters sometimes cause the part of speech tagger to miss verbs.
|
||||||
|
The following part of speech tags are identified as verbs:
|
||||||
|
|
||||||
* ``VB``
|
* ``VB``
|
||||||
* ``VBG``
|
* ``VBG``
|
||||||
@ -374,6 +378,9 @@ Determines if the section of text exceeds the specified caps ratio. Used in
|
|||||||
``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
|
``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
|
||||||
as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
|
as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
|
||||||
defaults to ``0.3``. Only runs on sections of text that are a single sentence.
|
defaults to ``0.3``. Only runs on sections of text that are a single sentence.
|
||||||
|
You can also set the threshold using the ``NARRATIVE_TEXT_CAP_THRESHOLD`` environment
|
||||||
|
variable. The environment variable takes precedence over the kwarg. The caps ratio
|
||||||
|
check does not apply to text that is all capitalized.
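A rough sketch of the caps ratio check, under stated assumptions: whitespace splitting stands in for the library's word tokenizer, a word counts as capitalized when its first character is uppercase, and the single-sentence restriction is omitted. The function name `exceeds_cap_ratio_sketch` is hypothetical and the library's own implementation may differ.

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float = 0.3) -> bool:
    """Illustrative only: share of capitalized words compared to a threshold."""
    # Text that is entirely uppercase is exempt from the check.
    if text.isupper():
        return False

    # Assumption: a plain whitespace split replaces the real tokenizer.
    tokens = text.split()
    if not tokens:
        return False

    capitalized = sum(1 for token in tokens if token[0].isupper())
    return capitalized / len(tokens) > threshold


for sample in [
    "Intellectual Property in the United States",
    "Intellectual property helps incentivize innovation.",
    "THIS MESSAGE WAS APPROVED",
]:
    print(sample, exceeds_cap_ratio_sketch(sample))
```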
|
||||||
|
|
||||||
Examples:
|
Examples:
|
||||||
|
|
||||||
|
@ -11,6 +11,8 @@ elements.
|
|||||||
* ``NarrativeText`` - Sections of a document that include well-formed prose. Sub-class of ``Text``.
|
* ``NarrativeText`` - Sections of a document that include well-formed prose. Sub-class of ``Text``.
|
||||||
* ``Title`` - Headings and sub-headings within a document. Sub-class of ``Text``.
|
* ``Title`` - Headings and sub-headings within a document. Sub-class of ``Text``.
|
||||||
* ``ListItem`` - A text element that is part of an ordered or unordered list. Sub-class of ``Text``.
|
* ``ListItem`` - A text element that is part of an ordered or unordered list. Sub-class of ``Text``.
|
||||||
|
* ``Address`` - A text item that consists only of an address. Sub-class of ``Text``.
|
||||||
|
* ``CheckBox`` - An element representing a check box. Has a ``checked`` element, which is a boolean indicating whether or not that box is checked.
|
||||||
|
|
||||||
|
|
||||||
#########################################
|
#########################################
|
||||||
|
@ -1,7 +1,9 @@
|
|||||||
This is a test document to use for unit tests.
|
This is a test document to use for unit tests.
|
||||||
|
|
||||||
|
Doylestown, PA 18901
|
||||||
|
|
||||||
Important points:
|
Important points:
|
||||||
|
|
||||||
- Hamburgers are delicious
|
- Hamburgers are delicious
|
||||||
- Dogs are the best
|
- Dogs are the best
|
||||||
- I love fuzzy blankets
|
- I love fuzzy blankets
|
||||||
|
@ -4,7 +4,7 @@ from lxml import etree
|
|||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
from unstructured.documents.base import Page
|
from unstructured.documents.base import Page
|
||||||
from unstructured.documents.elements import ListItem, NarrativeText, Title
|
from unstructured.documents.elements import Address, ListItem, NarrativeText, Text, Title
|
||||||
from unstructured.documents.html import (
|
from unstructured.documents.html import (
|
||||||
LIST_ITEM_TAGS,
|
LIST_ITEM_TAGS,
|
||||||
HTMLDocument,
|
HTMLDocument,
|
||||||
@ -153,7 +153,7 @@ def test_parse_not_anything(monkeypatch):
|
|||||||
document_tree = etree.fromstring(doc, etree.HTMLParser())
|
document_tree = etree.fromstring(doc, etree.HTMLParser())
|
||||||
el = document_tree.find(".//p")
|
el = document_tree.find(".//p")
|
||||||
parsed_el = html._parse_tag(el)
|
parsed_el = html._parse_tag(el)
|
||||||
assert parsed_el is None
|
assert parsed_el == Text(text="This is nothing")
|
||||||
|
|
||||||
|
|
||||||
def test_parse_bullets(monkeypatch):
|
def test_parse_bullets(monkeypatch):
|
||||||
@ -484,6 +484,7 @@ def test_containers_with_text_are_processed():
|
|||||||
<div dir=3D"ltr">
|
<div dir=3D"ltr">
|
||||||
<div dir=3D"ltr">Dino the Datasaur<div>Unstructured Technologies<br><div>Data Scientist
|
<div dir=3D"ltr">Dino the Datasaur<div>Unstructured Technologies<br><div>Data Scientist
|
||||||
</div>
|
</div>
|
||||||
|
<div>Doylestown, PA 18901</div>
|
||||||
<div><br></div>
|
<div><br></div>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
@ -494,12 +495,13 @@ def test_containers_with_text_are_processed():
|
|||||||
html_document._read()
|
html_document._read()
|
||||||
|
|
||||||
assert html_document.elements == [
|
assert html_document.elements == [
|
||||||
Title(text="Hi All,"),
|
Text(text="Hi All,"),
|
||||||
NarrativeText(text="Get excited for our first annual family day!"),
|
NarrativeText(text="Get excited for our first annual family day!"),
|
||||||
Title(text="Best."),
|
Title(text="Best."),
|
||||||
Title(text="Dino the Datasaur"),
|
Title(text="Dino the Datasaur"),
|
||||||
Title(text="Unstructured Technologies"),
|
Title(text="Unstructured Technologies"),
|
||||||
Title(text="Data Scientist"),
|
Title(text="Data Scientist"),
|
||||||
|
Address(text="Doylestown, PA 18901"),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@ -4,7 +4,7 @@ import pytest
|
|||||||
|
|
||||||
import docx
|
import docx
|
||||||
|
|
||||||
from unstructured.documents.elements import NarrativeText, Title, Text, ListItem
|
from unstructured.documents.elements import Address, NarrativeText, Title, Text, ListItem
|
||||||
from unstructured.partition.auto import partition
|
from unstructured.partition.auto import partition
|
||||||
import unstructured.partition.auto as auto
|
import unstructured.partition.auto as auto
|
||||||
|
|
||||||
@ -115,6 +115,7 @@ def test_auto_partition_html_from_file_rb():
|
|||||||
|
|
||||||
EXPECTED_TEXT_OUTPUT = [
|
EXPECTED_TEXT_OUTPUT = [
|
||||||
NarrativeText(text="This is a test document to use for unit tests."),
|
NarrativeText(text="This is a test document to use for unit tests."),
|
||||||
|
Address(text="Doylestown, PA 18901"),
|
||||||
Title(text="Important points:"),
|
Title(text="Important points:"),
|
||||||
ListItem(text="Hamburgers are delicious"),
|
ListItem(text="Hamburgers are delicious"),
|
||||||
ListItem(text="Dogs are the best"),
|
ListItem(text="Dogs are the best"),
|
||||||
|
@ -3,7 +3,7 @@ import pytest
|
|||||||
|
|
||||||
import docx
|
import docx
|
||||||
|
|
||||||
from unstructured.documents.elements import ListItem, NarrativeText, Title, Text
|
from unstructured.documents.elements import Address, ListItem, NarrativeText, Title, Text
|
||||||
from unstructured.partition.docx import partition_docx
|
from unstructured.partition.docx import partition_docx
|
||||||
|
|
||||||
|
|
||||||
@ -14,7 +14,11 @@ def mock_document():
|
|||||||
document.add_paragraph("These are a few of my favorite things:", style="Heading 1")
|
document.add_paragraph("These are a few of my favorite things:", style="Heading 1")
|
||||||
# NOTE(robinson) - this should get picked up as a list item due to the •
|
# NOTE(robinson) - this should get picked up as a list item due to the •
|
||||||
document.add_paragraph("• Parrots", style="Normal")
|
document.add_paragraph("• Parrots", style="Normal")
|
||||||
|
# NOTE(robinson) - this should get dropped because it's empty
|
||||||
|
document.add_paragraph("• ", style="Normal")
|
||||||
document.add_paragraph("Hockey", style="List Bullet")
|
document.add_paragraph("Hockey", style="List Bullet")
|
||||||
|
# NOTE(robinson) - this should get dropped because it's empty
|
||||||
|
document.add_paragraph("", style="List Bullet")
|
||||||
# NOTE(robinson) - this should get picked up as a title
|
# NOTE(robinson) - this should get picked up as a title
|
||||||
document.add_paragraph("Analysis", style="Normal")
|
document.add_paragraph("Analysis", style="Normal")
|
||||||
# NOTE(robinson) - this should get dropped because it is empty
|
# NOTE(robinson) - this should get dropped because it is empty
|
||||||
@ -24,6 +28,8 @@ def mock_document():
|
|||||||
document.add_paragraph("This is my third thought.", style="Body Text")
|
document.add_paragraph("This is my third thought.", style="Body Text")
|
||||||
# NOTE(robinson) - this should just be regular text
|
# NOTE(robinson) - this should just be regular text
|
||||||
document.add_paragraph("2023")
|
document.add_paragraph("2023")
|
||||||
|
# NOTE(robinson) - this should be an address
|
||||||
|
document.add_paragraph("DOYLESTOWN, PA 18901")
|
||||||
|
|
||||||
return document
|
return document
|
||||||
|
|
||||||
@ -38,6 +44,7 @@ def expected_elements():
|
|||||||
NarrativeText("This is my first thought. This is my second thought."),
|
NarrativeText("This is my first thought. This is my second thought."),
|
||||||
NarrativeText("This is my third thought."),
|
NarrativeText("This is my third thought."),
|
||||||
Text("2023"),
|
Text("2023"),
|
||||||
|
Address("DOYLESTOWN, PA 18901"),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@ -2,13 +2,14 @@ import os
|
|||||||
import pathlib
|
import pathlib
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
from unstructured.documents.elements import NarrativeText, Title, ListItem
|
from unstructured.documents.elements import Address, NarrativeText, Title, ListItem
|
||||||
from unstructured.partition.text import partition_text
|
from unstructured.partition.text import partition_text
|
||||||
|
|
||||||
DIRECTORY = pathlib.Path(__file__).parent.resolve()
|
DIRECTORY = pathlib.Path(__file__).parent.resolve()
|
||||||
|
|
||||||
EXPECTED_OUTPUT = [
|
EXPECTED_OUTPUT = [
|
||||||
NarrativeText(text="This is a test document to use for unit tests."),
|
NarrativeText(text="This is a test document to use for unit tests."),
|
||||||
|
Address(text="Doylestown, PA 18901"),
|
||||||
Title(text="Important points:"),
|
Title(text="Important points:"),
|
||||||
ListItem(text="Hamburgers are delicious"),
|
ListItem(text="Hamburgers are delicious"),
|
||||||
ListItem(text="Dogs are the best"),
|
ListItem(text="Dogs are the best"),
|
||||||
@ -52,3 +53,15 @@ def test_partition_text_raises_with_too_many_specified():
|
|||||||
|
|
||||||
with pytest.raises(ValueError):
|
with pytest.raises(ValueError):
|
||||||
partition_text(filename=filename, text=text)
|
partition_text(filename=filename, text=text)
|
||||||
|
|
||||||
|
|
||||||
|
def test_partition_text_captures_everything_even_with_linebreaks():
|
||||||
|
text = """
|
||||||
|
VERY IMPORTANT MEMO
|
||||||
|
DOYLESTOWN, PA 18901
|
||||||
|
"""
|
||||||
|
elements = partition_text(text=text)
|
||||||
|
assert elements == [
|
||||||
|
Title(text="VERY IMPORTANT MEMO"),
|
||||||
|
Address(text="DOYLESTOWN, PA 18901"),
|
||||||
|
]
|
||||||
|
@ -1,4 +1,5 @@
|
|||||||
import pytest
|
import pytest
|
||||||
|
from unittest.mock import patch
|
||||||
|
|
||||||
import unstructured.partition.text_type as text_type
|
import unstructured.partition.text_type as text_type
|
||||||
|
|
||||||
@ -58,6 +59,7 @@ def test_is_possible_narrative_text(text, expected, monkeypatch):
|
|||||||
("7", False), # Fails because it is numeric
|
("7", False), # Fails because it is numeric
|
||||||
("", False), # Fails because it is empty
|
("", False), # Fails because it is empty
|
||||||
("ITEM 1A. RISK FACTORS", True), # Two "sentences", but both are short
|
("ITEM 1A. RISK FACTORS", True), # Two "sentences", but both are short
|
||||||
|
("To My Dearest Friends,", False), # Ends with a comma
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
def test_is_possible_title(text, expected, monkeypatch):
|
def test_is_possible_title(text, expected, monkeypatch):
|
||||||
@ -120,11 +122,10 @@ def test_is_bulletized_text(text, expected):
|
|||||||
[
|
[
|
||||||
("Ask the teacher for an apple", True),
|
("Ask the teacher for an apple", True),
|
||||||
("Intellectual property", False),
|
("Intellectual property", False),
|
||||||
|
("THIS MESSAGE WAS APPROVED", True),
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
def test_contains_verb(text, expected, monkeypatch):
|
def test_contains_verb(text, expected, monkeypatch):
|
||||||
monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
|
|
||||||
monkeypatch.setattr(text_type, "pos_tag", mock_pos_tag)
|
|
||||||
has_verb = text_type.contains_verb(text)
|
has_verb = text_type.contains_verb(text)
|
||||||
assert has_verb is expected
|
assert has_verb is expected
|
||||||
|
|
||||||
@ -135,13 +136,26 @@ def test_contains_verb(text, expected, monkeypatch):
|
|||||||
("Intellectual Property in the United States", True),
|
("Intellectual Property in the United States", True),
|
||||||
("Intellectual property helps incentivize innovation.", False),
|
("Intellectual property helps incentivize innovation.", False),
|
||||||
("THIS IS ALL CAPS. BUT IT IS TWO SENTENCES.", False),
|
("THIS IS ALL CAPS. BUT IT IS TWO SENTENCES.", False),
|
||||||
|
("LOOK AT THIS IT IS CAPS BUT NOT A TITLE.", False),
|
||||||
|
("This Has All Caps. It's Weird But Two Sentences", False),
|
||||||
|
("The Business Report is expected within 6 hours of closing", False),
|
||||||
("", False),
|
("", False),
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):
|
def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):
|
||||||
|
assert text_type.exceeds_cap_ratio(text) is expected
|
||||||
|
|
||||||
|
|
||||||
|
def test_set_caps_ratio_with_environment_variable(monkeypatch):
|
||||||
monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
|
monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
|
||||||
monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
|
monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
|
||||||
assert text_type.exceeds_cap_ratio(text, threshold=0.3) is expected
|
monkeypatch.setenv("NARRATIVE_TEXT_CAP_THRESHOLD", 0.8)
|
||||||
|
|
||||||
|
text = "All The King's Horses. And All The King's Men."
|
||||||
|
with patch.object(text_type, "exceeds_cap_ratio", return_value=False) as mock_exceeds:
|
||||||
|
text_type.is_possible_narrative_text(text)
|
||||||
|
|
||||||
|
mock_exceeds.assert_called_once_with(text, threshold=0.8)
|
||||||
|
|
||||||
|
|
||||||
def test_sentence_count(monkeypatch):
|
def test_sentence_count(monkeypatch):
|
||||||
@ -153,3 +167,19 @@ def test_sentence_count(monkeypatch):
|
|||||||
def test_item_titles():
|
def test_item_titles():
|
||||||
text = "ITEM 1(A). THIS IS A TITLE"
|
text = "ITEM 1(A). THIS IS A TITLE"
|
||||||
assert text_type.sentence_count(text, 3) < 2
|
assert text_type.sentence_count(text, 3) < 2
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"text, expected",
|
||||||
|
[
|
||||||
|
("Doylestown, PA 18901", True),
|
||||||
|
("DOYLESTOWN, PENNSYLVANIA, 18901", True),
|
||||||
|
("DOYLESTOWN, PENNSYLVANIA 18901", True),
|
||||||
|
("Doylestown, Pennsylvania 18901", True),
|
||||||
|
(" Doylestown, Pennsylvania 18901", True),
|
||||||
|
("The Business Report is expected within 6 hours of closing", False),
|
||||||
|
("", False),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_is_us_city_state_zip(text, expected):
|
||||||
|
assert text_type.is_us_city_state_zip(text) is expected
|
||||||
|
@ -1 +1 @@
|
|||||||
__version__ = "0.4.4" # pragma: no cover
|
__version__ = "0.4.5-dev0" # pragma: no cover
|
||||||
|
@ -60,7 +60,13 @@ class Text(Element):
|
|||||||
return self.text
|
return self.text
|
||||||
|
|
||||||
def __eq__(self, other):
|
def __eq__(self, other):
|
||||||
return (self.text == other.text) and (self.coordinates == other.coordinates)
|
return all(
|
||||||
|
[
|
||||||
|
(self.text == other.text),
|
||||||
|
(self.coordinates == other.coordinates),
|
||||||
|
(self.category == other.category),
|
||||||
|
]
|
||||||
|
)
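The effect of adding `category` to the equality check can be seen with stand-in classes. This is a simplified sketch, not the library's `Element` hierarchy: the constructor signature and the base `category` value are assumptions, while the three-way comparison follows the `__eq__` body in this diff.

```python
class Text:
    category = "Uncategorized"  # assumption: placeholder value for the base class

    def __init__(self, text, coordinates=None):
        self.text = text
        self.coordinates = coordinates

    def __eq__(self, other):
        # Same three comparisons as the updated method in this commit.
        return all(
            [
                (self.text == other.text),
                (self.coordinates == other.coordinates),
                (self.category == other.category),
            ]
        )


class Title(Text):
    category = "Title"


class Address(Text):
    category = "Address"


same_text_different_type = Title("Hi All,") == Text("Hi All,")
same_text_same_type = Address("Doylestown, PA 18901") == Address("Doylestown, PA 18901")
print(same_text_different_type, same_text_same_type)
```

Before this change, two elements with the same text and coordinates compared equal regardless of type, so a test expecting `Title` would also pass for a plain `Text` element.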
|
||||||
|
|
||||||
def apply(self, *cleaners: Callable):
|
def apply(self, *cleaners: Callable):
|
||||||
"""Applies a cleaning brick to the text element. The function that's passed in
|
"""Applies a cleaning brick to the text element. The function that's passed in
|
||||||
@ -108,6 +114,14 @@ class Title(Text):
|
|||||||
pass
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class Address(Text):
|
||||||
|
"""A text element for capturing addresses."""
|
||||||
|
|
||||||
|
category = "Address"
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
class Image(Text):
|
class Image(Text):
|
||||||
"""A text element for capturing image metadata."""
|
"""A text element for capturing image metadata."""
|
||||||
|
|
||||||
|
@ -13,12 +13,13 @@ from unstructured.logger import logger
|
|||||||
|
|
||||||
from unstructured.cleaners.core import clean_bullets, replace_unicode_quotes
|
from unstructured.cleaners.core import clean_bullets, replace_unicode_quotes
|
||||||
from unstructured.documents.base import Page
|
from unstructured.documents.base import Page
|
||||||
from unstructured.documents.elements import ListItem, Element, NarrativeText, Title
|
from unstructured.documents.elements import Address, ListItem, Element, NarrativeText, Text, Title
|
||||||
from unstructured.documents.xml import XMLDocument
|
from unstructured.documents.xml import XMLDocument
|
||||||
from unstructured.partition.text_type import (
|
from unstructured.partition.text_type import (
|
||||||
is_bulleted_text,
|
is_bulleted_text,
|
||||||
is_possible_narrative_text,
|
is_possible_narrative_text,
|
||||||
is_possible_title,
|
is_possible_title,
|
||||||
|
is_us_city_state_zip,
|
||||||
)
|
)
|
||||||
|
|
||||||
TEXT_TAGS: Final[List[str]] = ["p", "a", "td", "span", "font"]
|
TEXT_TAGS: Final[List[str]] = ["p", "a", "td", "span", "font"]
|
||||||
@ -47,6 +48,18 @@ class TagsMixin:
|
|||||||
super().__init__(*args, **kwargs)
|
super().__init__(*args, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
class HTMLText(TagsMixin, Text):
|
||||||
|
"""Text with tag information."""
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class HTMLAddress(TagsMixin, Address):
|
||||||
|
"""Address with tag information."""
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
class HTMLTitle(TagsMixin, Title):
|
class HTMLTitle(TagsMixin, Title):
|
||||||
"""Title with tag information."""
|
"""Title with tag information."""
|
||||||
|
|
||||||
@ -203,6 +216,8 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
|
|||||||
if not clean_bullets(text):
|
if not clean_bullets(text):
|
||||||
return None
|
return None
|
||||||
return HTMLListItem(text=clean_bullets(text), tag=tag, ancestortags=ancestortags)
|
return HTMLListItem(text=clean_bullets(text), tag=tag, ancestortags=ancestortags)
|
||||||
|
elif is_us_city_state_zip(text):
|
||||||
|
return HTMLAddress(text=text, tag=tag, ancestortags=ancestortags)
|
||||||
|
|
||||||
if len(text) < 2:
|
if len(text) < 2:
|
||||||
return None
|
return None
|
||||||
@ -211,8 +226,7 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
|
|||||||
elif is_possible_title(text):
|
elif is_possible_title(text):
|
||||||
return HTMLTitle(text, tag=tag, ancestortags=ancestortags)
|
return HTMLTitle(text, tag=tag, ancestortags=ancestortags)
|
||||||
else:
|
else:
|
||||||
# Something that might end up here is text that's just a number.
|
return HTMLText(text, tag=tag, ancestortags=ancestortags)
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def _is_container_with_text(tag_elem: etree.Element) -> bool:
|
def _is_container_with_text(tag_elem: etree.Element) -> bool:
|
||||||
|
@ -16,6 +16,23 @@ US_PHONE_NUMBERS_PATTERN = (
|
|||||||
)
|
)
|
||||||
US_PHONE_NUMBERS_RE = re.compile(US_PHONE_NUMBERS_PATTERN)
|
US_PHONE_NUMBERS_RE = re.compile(US_PHONE_NUMBERS_PATTERN)
|
||||||
|
|
||||||
|
# NOTE(robinson) - Based on this regex from regex101. Regex was updated to run fast
|
||||||
|
# and avoid catastrophic backtracking
|
||||||
|
# ref: https://regex101.com/library/oR3jU1?page=673
|
||||||
|
US_CITY_STATE_ZIP_PATTERN = (
|
||||||
|
r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"
|
||||||
|
r"(?:{Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida"
|
||||||
|
r"|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland"
|
||||||
|
r"|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|"
|
||||||
|
r"New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico|New[ ]York|North[ ]Carolina|North[ ]Dakota"
|
||||||
|
r"|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island|South[ ]Carolina|South[ ]Dakota"
|
||||||
|
r"|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia|Wisconsin|Wyoming}"
|
||||||
|
r"|{AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN"
|
||||||
|
r"|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|"
|
||||||
|
r"WA|WV|WI|WY})(, |\s)?(?:\b\d{5}(?:-\d{4})?\b)"
|
||||||
|
)
|
||||||
|
US_CITY_STATE_ZIP_RE = re.compile(US_CITY_STATE_ZIP_PATTERN)
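One detail worth checking in the pattern above: Python's `re` module treats `{` and `}` as literal characters when they do not form a valid quantifier, so the first and last alternative in each braced list (for example `{Alabama` and `WY}`) only match text that contains the brace. The sketch below compares the braced form with a parenthesized non-capturing group. The state lists are abbreviated for illustration, and the names `BRACED_RE` and `GROUPED_RE` are hypothetical.

```python
import re

CITY = r"(?i)\b(?:[A-Z][a-z.-]{1,15}[ ]?){1,5},\s?"
ZIP = r"(, |\s)?(?:\b\d{5}(?:-\d{4})?\b)"

# Abbreviated state lists, for illustration only.
BRACED_RE = re.compile(CITY + r"(?:{Alabama|Pennsylvania|Wyoming}|{AL|PA|WY})" + ZIP)
GROUPED_RE = re.compile(CITY + r"(?:Alabama|Pennsylvania|Wyoming|AL|PA|WY)" + ZIP)

for text in [
    "Doylestown, PA 18901",
    "Cheyenne, WY 82001",
    "Montgomery, Alabama 36104",
]:
    print(text, bool(BRACED_RE.match(text)), bool(GROUPED_RE.match(text)))
```

Entries in the middle of each list, such as `PA` and `Pennsylvania`, are unaffected, which is consistent with the Doylestown test cases in this commit passing.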
|
||||||
|
|
||||||
UNICODE_BULLETS: Final[List[str]] = [
|
UNICODE_BULLETS: Final[List[str]] = [
|
||||||
"\u0095",
|
"\u0095",
|
||||||
"\u2022",
|
"\u2022",
|
||||||
|
@ -3,20 +3,18 @@ from typing import IO, List, Optional
|
|||||||
import docx
|
import docx
|
||||||
|
|
||||||
from unstructured.cleaners.core import clean_bullets
|
from unstructured.cleaners.core import clean_bullets
|
||||||
from unstructured.documents.elements import Element, ListItem, NarrativeText, Text, Title
|
from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
|
||||||
from unstructured.partition.text_type import (
|
from unstructured.partition.text_type import (
|
||||||
is_bulleted_text,
|
is_bulleted_text,
|
||||||
is_possible_narrative_text,
|
is_possible_narrative_text,
|
||||||
is_possible_title,
|
is_possible_title,
|
||||||
|
is_us_city_state_zip,
|
||||||
)
|
)
|
||||||
|
|
||||||
# NOTE(robinson) - documentation on built in styles can be found at the link below
|
# NOTE(robinson) - documentation on built in styles can be found at the link below
|
||||||
# ref: https://python-docx.readthedocs.io/en/latest/user/
|
# ref: https://python-docx.readthedocs.io/en/latest/user/
|
||||||
# styles-understanding.html#paragraph-styles-in-default-template
|
# styles-understanding.html#paragraph-styles-in-default-template
|
||||||
STYLE_TO_ELEMENT_MAPPING = {
|
STYLE_TO_ELEMENT_MAPPING = {
|
||||||
"Body Text": NarrativeText,
|
|
||||||
"Body Text 2": NarrativeText,
|
|
||||||
"Body Text 3": NarrativeText,
|
|
||||||
"Caption": Text, # TODO(robinson) - add caption element type
|
"Caption": Text, # TODO(robinson) - add caption element type
|
||||||
"Heading 1": Title,
|
"Heading 1": Title,
|
||||||
"Heading 2": Title,
|
"Heading 2": Title,
|
||||||
@ -87,6 +85,9 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
|
|||||||
text = paragraph.text
|
text = paragraph.text
|
||||||
style_name = paragraph.style.name
|
style_name = paragraph.style.name
|
||||||
|
|
||||||
|
if len(text.strip()) == 0:
|
||||||
|
return None
|
||||||
|
|
||||||
element_class = STYLE_TO_ELEMENT_MAPPING.get(style_name)
|
element_class = STYLE_TO_ELEMENT_MAPPING.get(style_name)
|
||||||
|
|
||||||
# NOTE(robinson) - The "Normal" style name will return None since it's not in the mapping.
|
# NOTE(robinson) - The "Normal" style name will return None since it's not in the mapping.
|
||||||
@ -100,7 +101,11 @@ def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[
|
|||||||
def _text_to_element(text: str) -> Optional[Text]:
|
def _text_to_element(text: str) -> Optional[Text]:
|
||||||
"""Converts raw text into an unstructured Text element."""
|
"""Converts raw text into an unstructured Text element."""
|
||||||
if is_bulleted_text(text):
|
if is_bulleted_text(text):
|
||||||
return ListItem(text=clean_bullets(text))
|
clean_text = clean_bullets(text).strip()
|
||||||
|
return ListItem(text=clean_bullets(text)) if clean_text else None
|
||||||
|
|
||||||
|
elif is_us_city_state_zip(text):
|
||||||
|
return Address(text=text)
|
||||||
|
|
||||||
if len(text) < 2:
|
if len(text) < 2:
|
||||||
return None
|
return None
|
||||||
|
@ -1,7 +1,7 @@
|
|||||||
import re
|
import re
|
||||||
from typing import IO, List, Optional
|
from typing import IO, List, Optional
|
||||||
|
|
||||||
from unstructured.documents.elements import Element, ListItem, NarrativeText, Title
|
from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
|
||||||
|
|
||||||
from unstructured.cleaners.core import clean_bullets
|
from unstructured.cleaners.core import clean_bullets
|
||||||
from unstructured.nlp.patterns import PARAGRAPH_PATTERN
|
from unstructured.nlp.patterns import PARAGRAPH_PATTERN
|
||||||
@ -9,6 +9,7 @@ from unstructured.partition.text_type import (
|
|||||||
is_possible_narrative_text,
|
is_possible_narrative_text,
|
||||||
is_possible_title,
|
is_possible_title,
|
||||||
is_bulleted_text,
|
is_bulleted_text,
|
||||||
|
is_us_city_state_zip,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
@ -56,11 +57,16 @@ def partition_text(
|
|||||||
ctext = ctext.strip()
|
ctext = ctext.strip()
|
||||||
|
|
||||||
if ctext == "":
|
if ctext == "":
|
||||||
break
|
continue
|
||||||
if is_bulleted_text(ctext):
|
if is_bulleted_text(ctext):
|
||||||
elements.append(ListItem(text=clean_bullets(ctext)))
|
elements.append(ListItem(text=clean_bullets(ctext)))
|
||||||
|
elif is_us_city_state_zip(ctext):
|
||||||
|
elements.append(Address(text=ctext))
|
||||||
elif is_possible_narrative_text(ctext):
|
elif is_possible_narrative_text(ctext):
|
||||||
elements.append(NarrativeText(text=ctext))
|
elements.append(NarrativeText(text=ctext))
|
||||||
elif is_possible_title(ctext):
|
elif is_possible_title(ctext):
|
||||||
elements.append(Title(text=ctext))
|
elements.append(Title(text=ctext))
|
||||||
|
else:
|
||||||
|
elements.append(Text(text=ctext))
|
||||||
|
|
||||||
return elements
|
return elements
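The switch from `break` to `continue` matters when the input starts with, or contains, an empty paragraph. A small sketch of the difference, assuming the text has already been split into paragraphs (the library's splitting via `PARAGRAPH_PATTERN` is not reproduced here, and `collect_paragraphs` is a hypothetical name):

```python
from typing import List


def collect_paragraphs(paragraphs: List[str], stop_on_empty: bool) -> List[str]:
    """Illustrative loop: keeps non-empty paragraphs, with old or new handling of empties."""
    kept = []
    for paragraph in paragraphs:
        ctext = paragraph.strip()
        if ctext == "":
            if stop_on_empty:
                break  # old behavior: stop at the first empty paragraph
            continue  # new behavior: skip it and keep going
        kept.append(ctext)
    return kept


# Mirrors the new test input, which begins and ends with a line break.
paragraphs = ["", "VERY IMPORTANT MEMO", "DOYLESTOWN, PA 18901", ""]
old_result = collect_paragraphs(paragraphs, stop_on_empty=True)
new_result = collect_paragraphs(paragraphs, stop_on_empty=False)
print(old_result, new_result)
```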
|
||||||
|
@ -1,15 +1,16 @@
|
|||||||
"""partition.py implements logic for partitioning plain text documents into sections."""
|
"""partition.py implements logic for partitioning plain text documents into sections."""
|
||||||
|
import os
|
||||||
import sys
|
import sys
|
||||||
|
|
||||||
from typing import List, Optional
|
from typing import List, Optional
|
||||||
|
|
||||||
if sys.version_info < (3, 8):
|
if sys.version_info < (3, 8):
|
||||||
from typing_extensions import Final
|
from typing_extensions import Final # pragma: nocover
|
||||||
else:
|
else:
|
||||||
from typing import Final
|
from typing import Final
|
||||||
|
|
||||||
from unstructured.cleaners.core import remove_punctuation
|
from unstructured.cleaners.core import remove_punctuation
|
||||||
from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE
|
from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE, UNICODE_BULLETS_RE, US_CITY_STATE_ZIP_RE
|
||||||
from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
|
from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
|
||||||
from unstructured.logger import logger
|
from unstructured.logger import logger
|
||||||
|
|
||||||
@ -17,8 +18,19 @@ from unstructured.logger import logger
|
|||||||
POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]
|
POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]
|
||||||
|
|
||||||
|
|
||||||
def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
|
def is_possible_narrative_text(text: str, cap_threshold: float = 0.5) -> bool:
|
||||||
"""Checks to see if the text passes all of the checks for a narrative text section."""
|
"""Checks to see if the text passes all of the checks for a narrative text section.
|
||||||
|
You can change the cap threshold using the cap_threshold kwarg or the
|
||||||
|
NARRATIVE_TEXT_CAP_THRESHOLD environment variable. The environment variable takes
|
||||||
|
precedence over the kwarg.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
text
|
||||||
|
the input text
|
||||||
|
cap_threshold
|
||||||
|
the fraction of capitalized words above which the segment is disqualified as narrative
|
||||||
|
"""
|
||||||
if len(text) == 0:
|
if len(text) == 0:
|
||||||
logger.debug("Not narrative. Text is empty.")
|
logger.debug("Not narrative. Text is empty.")
|
||||||
return False
|
return False
|
||||||
@ -27,6 +39,9 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
|
|||||||
logger.debug(f"Not narrative. Text is all numeric:\n\n{text}")
|
logger.debug(f"Not narrative. Text is all numeric:\n\n{text}")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
# NOTE(robinson): it gets read in from the environment as a string so we need to
|
||||||
|
# cast it to a float
|
||||||
|
cap_threshold = float(os.environ.get("NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold))
|
||||||
if exceeds_cap_ratio(text, threshold=cap_threshold):
|
if exceeds_cap_ratio(text, threshold=cap_threshold):
|
||||||
logger.debug(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}")
|
logger.debug(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}")
|
||||||
return False
|
return False
|
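The precedence rule described in the docstring above (environment variable over kwarg) can be illustrated in isolation. This is a minimal sketch, not the library's code: `resolve_cap_threshold` and `ENV_VAR` are hypothetical names introduced here, and only the `float(os.environ.get(...))` lookup is taken from the diff.

```python
import os

# Hypothetical constant for this sketch; the diff uses the string literal directly.
ENV_VAR = "NARRATIVE_TEXT_CAP_THRESHOLD"


def resolve_cap_threshold(cap_threshold: float = 0.5) -> float:
    """Hypothetical helper that mirrors the lookup in the diff."""
    # Environment values arrive as strings, so the result is cast to float.
    # When the variable is unset, the kwarg is used as the fallback.
    return float(os.environ.get(ENV_VAR, cap_threshold))


os.environ.pop(ENV_VAR, None)
print(resolve_cap_threshold(0.3))  # variable unset: the kwarg value is used

os.environ[ENV_VAR] = "0.8"
print(resolve_cap_threshold(0.3))  # variable set: it overrides the kwarg
os.environ.pop(ENV_VAR, None)
```

One consequence of this design is that a value that cannot be parsed as a float (for example `"high"`) raises a `ValueError` at call time rather than falling back to the kwarg.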
||||||
@ -39,11 +54,23 @@ def is_possible_narrative_text(text: str, cap_threshold: float = 0.3) -> bool:
|
|||||||
|
|
||||||
|
|
||||||
def is_possible_title(text: str, sentence_min_length: int = 5) -> bool:
|
def is_possible_title(text: str, sentence_min_length: int = 5) -> bool:
|
||||||
"""Checks to see if the text passes all of the checks for a valid title."""
|
"""Checks to see if the text passes all of the checks for a valid title.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
text
|
||||||
|
the input text
|
||||||
|
sentence_min_length
|
||||||
|
the minimum number of words required to consider a section of text a sentence
|
||||||
|
"""
|
||||||
if len(text) == 0:
|
if len(text) == 0:
|
||||||
logger.debug("Not a title. Text is empty.")
|
logger.debug("Not a title. Text is empty.")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
# NOTE(robinson) - Prevent flagging salutations like "To My Dearest Friends," as titles
|
||||||
|
if text.endswith(","):
|
||||||
|
return False
|
||||||
|
|
||||||
if text.isnumeric():
|
if text.isnumeric():
|
||||||
logger.debug(f"Not a title. Text is all numeric:\n\n{text}")
|
logger.debug(f"Not a title. Text is all numeric:\n\n{text}")
|
||||||
return False
|
return False
|
||||||
@ -76,6 +103,9 @@ def contains_us_phone_number(text: str) -> bool:
|
|||||||
def contains_verb(text: str) -> bool:
|
def contains_verb(text: str) -> bool:
|
||||||
"""Use a POS tagger to check if a segment contains verbs. If the section does not have verbs,
|
"""Use a POS tagger to check if a segment contains verbs. If the section does not have verbs,
|
||||||
that indicates that it is not narrative text."""
|
that indicates that it is not narrative text."""
|
||||||
|
if text.isupper():
|
||||||
|
text = text.lower()
|
||||||
|
|
||||||
pos_tags = pos_tag(text)
|
pos_tags = pos_tag(text)
|
||||||
for _, tag in pos_tags:
|
for _, tag in pos_tags:
|
||||||
if tag in POS_VERB_TAGS:
|
if tag in POS_VERB_TAGS:
|
||||||
@ -109,7 +139,7 @@ def sentence_count(text: str, min_length: Optional[int] = None) -> int:
|
|||||||
return count
|
return count
|
||||||
|
|
||||||
|
|
||||||
def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
|
def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
|
||||||
"""Checks the title ratio in a section of text. If a sufficient proportion of the text is
|
"""Checks the title ratio in a section of text. If a sufficient proportion of the text is
|
||||||
capitalized."""
|
capitalized."""
|
||||||
# NOTE(robinson) - Currently limiting this to only sections of text with one sentence.
|
# NOTE(robinson) - Currently limiting this to only sections of text with one sentence.
|
||||||
@ -118,9 +148,24 @@ def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
|
|||||||
logger.debug(f"Text does not contain multiple sentences:\n\n{text}")
|
logger.debug(f"Text does not contain multiple sentences:\n\n{text}")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
if text.isupper():
|
||||||
|
return False
|
||||||
|
|
||||||
tokens = word_tokenize(text)
|
tokens = word_tokenize(text)
|
||||||
if len(tokens) == 0:
|
if len(tokens) == 0:
|
||||||
return False
|
return False
|
||||||
capitalized = sum([word.istitle() or word.isupper() for word in tokens])
|
capitalized = sum([word.istitle() or word.isupper() for word in tokens])
|
||||||
ratio = capitalized / len(tokens)
|
ratio = capitalized / len(tokens)
|
||||||
return ratio > threshold
|
return ratio > threshold
|
||||||
|
|
||||||
|
|
||||||
|
def is_us_city_state_zip(text) -> bool:
|
||||||
|
"""Checks if the given text is in the format of US city/state/zip code.
|
||||||
|
|
||||||
|
Examples
|
||||||
|
--------
|
||||||
|
Doylestown, PA 18901
|
||||||
|
Doylestown, Pennsylvania, 18901
|
||||||
|
DOYLESTOWN, PENNSYLVANIA 18901
|
||||||
|
"""
|
||||||
|
return US_CITY_STATE_ZIP_RE.match(text.strip()) is not None
|
||||||
|
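The capitalization check in `exceeds_cap_ratio` above can be approximated without NLTK. The sketch below is a simplification under stated assumptions: `cap_ratio_sketch` is a hypothetical name, whitespace splitting stands in for `word_tokenize`, and the sentence-count guard from the original function is omitted.

```python
def cap_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical, simplified stand-in for exceeds_cap_ratio."""
    # As in the diff, fully upper-cased text is exempt from the ratio check.
    if text.isupper():
        return False

    # Assumption: plain whitespace splitting instead of NLTK's word_tokenize.
    tokens = text.split()
    if len(tokens) == 0:
        return False

    # A token counts as capitalized if it is title-cased or fully upper-cased.
    capitalized = sum(word.istitle() or word.isupper() for word in tokens)
    return capitalized / len(tokens) > threshold


print(cap_ratio_sketch("Risk Factors and Other Disclosures"))      # 4 of 5 tokens capitalized
print(cap_ratio_sketch("The quick brown fox jumps over the dog"))  # 1 of 8 tokens capitalized
print(cap_ratio_sketch("ITEM 1A. RISK FACTORS"))                   # all caps, so exempt
```

With the default raised from `0.3` to `0.5`, a segment must have more than half of its tokens capitalized before it is ruled out as narrative text.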