mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-08-15 20:27:37 +00:00
feat: add partition_email cleaning brick (#104)

* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test
parent 1d68bb2482
commit 7a74cdda86
@@ -1,3 +1,9 @@
+## 0.3.3-dev1
+
+* Adds the `partition_email` partitioning brick
+* Adds the `replace_mime_encodings` cleaning brick
+* Small fix to HTML parsing related to processing list items with sub-tags
+
 ## 0.3.2
 
 * Added `translate_text` brick for translating text between languages
42
README.md
@@ -148,6 +148,48 @@ has an `element` attribute consisting of `Element` objects. Sub-types of the `El
 represent different components of a document, such as `NarrativeText` and `Title`. You can use
 these normalized elements to zero in on the components of a document you most care about.
 
+### E-mail Parsing
+
+The `partition_email` function within `unstructured` is helpful for parsing `.eml` files. Common
+e-mail clients such as Microsoft Outlook and Gmail support exporting e-mails as `.eml` files.
+`partition_email` accepts filenames, file-like objects, and raw text as input. The following
+three snippets for parsing `.eml` files are equivalent:
+
+```python
+from unstructured.partition.email import partition_email
+
+elements = partition_email(filename="example-docs/fake-email.eml")
+
+with open("example-docs/fake-email.eml", "r") as f:
+    elements = partition_email(file=f)
+
+with open("example-docs/fake-email.eml", "r") as f:
+    text = f.read()
+elements = partition_email(text=text)
+```
+
+The `elements` output will look like the following:
+
+```python
+[<unstructured.documents.html.HTMLNarrativeText at 0x13ab14370>,
+ <unstructured.documents.html.HTMLTitle at 0x106877970>,
+ <unstructured.documents.html.HTMLListItem at 0x1068776a0>,
+ <unstructured.documents.html.HTMLListItem at 0x13fe4b0a0>]
+```
+
+Run `print("\n\n".join([str(el) for el in elements]))` to get a string representation of the
+output, which looks like:
+
+```python
+This is a test email to use for unit tests.
+
+Important points:
+
+Roses are red
+
+Violets are blue
+```
+
 ## :guardsman: Security Policy
 
 See our [security policy](https://github.com/Unstructured-IO/unstructured/security/policy) for
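The README text above describes what `partition_email` accepts. As a rough illustration of the underlying idea, the sketch below uses only the standard-library `email` module to locate the `text/html` part of a multipart `.eml` message. It is not the library's implementation: `extract_html_part` and `SAMPLE_EML` are hypothetical names introduced here, and the sample is a shortened message modeled on `example-docs/fake-email.eml`.

```python
import email
from typing import Optional

# Hypothetical sample, shortened from example-docs/fake-email.eml.
SAMPLE_EML = (
    "MIME-Version: 1.0\n"
    "Subject: Test Email\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    'Content-Type: text/plain; charset="UTF-8"\n'
    "\n"
    "This is a test email to use for unit tests.\n"
    "\n"
    "--BOUNDARY\n"
    'Content-Type: text/html; charset="UTF-8"\n'
    "\n"
    "<div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div>\n"
    "\n"
    "--BOUNDARY--\n"
)


def extract_html_part(raw: str) -> Optional[str]:
    """Return the first text/html payload of an .eml string, or None if absent."""
    msg = email.message_from_string(raw)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True returns bytes with any transfer encoding undone.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset)
    return None


html = extract_html_part(SAMPLE_EML)
```

Selecting the part by content type inside the loop avoids building a mapping in which a multipart container's payload (a list of sub-messages) sits next to string payloads.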
@@ -54,6 +54,30 @@ Examples:
   elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf")
 
 
+``partition_email``
+---------------------
+
+The ``partition_email`` function partitions ``.eml`` documents and works with exports
+from email clients such as Microsoft Outlook and Gmail. The ``partition_email``
+function takes a filename, file-like object, or raw text as input and produces a list of
+document ``Element`` objects as output.
+
+Examples:
+
+.. code:: python
+
+  from unstructured.partition.email import partition_email
+
+  elements = partition_email(filename="example-docs/fake-email.eml")
+
+  with open("example-docs/fake-email.eml", "r") as f:
+      elements = partition_email(file=f)
+
+  with open("example-docs/fake-email.eml", "r") as f:
+      text = f.read()
+  elements = partition_email(text=text)
+
+
 ``is_bulleted_text``
 ----------------------
 
24
example-docs/fake-email.eml
Normal file
@@ -0,0 +1,24 @@
+MIME-Version: 1.0
+Date: Fri, 16 Dec 2022 17:04:16 -0500
+Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>
+Subject: Test Email
+From: Matthew Robinson <mrobinson@unstructured.io>
+To: Matthew Robinson <mrobinson@unstructured.io>
+Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"
+
+--00000000000095c9b205eff92630
+Content-Type: text/plain; charset="UTF-8"
+
+This is a test email to use for unit tests.
+
+Important points:
+
+- Roses are red
+- Violets are blue
+
+--00000000000095c9b205eff92630
+Content-Type: text/html; charset="UTF-8"
+
+<div dir="ltr"><div>This is a test email to use for unit tests.</div><div><br></div><div>Important points:</div><div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div></div>
+
+--00000000000095c9b205eff92630--
@@ -29,6 +29,14 @@ def test_replace_unicode_quotes(text, expected):
     assert core.replace_unicode_quotes(text=text) == expected
 
 
+@pytest.mark.parametrize(
+    "text, expected",
+    [("5 w=E2=80=99s", "5 w’s")],
+)
+def test_replace_mime_encodings(text, expected):
+    assert core.replace_mime_encodings(text=text) == expected
+
+
 @pytest.mark.parametrize(
     "text, expected",
     [
61
test_unstructured/partition/test_email.py
Normal file
@@ -0,0 +1,61 @@
+import os
+import pathlib
+import pytest
+
+from unstructured.documents.elements import NarrativeText, Title, ListItem
+from unstructured.partition.email import partition_email
+
+
+DIRECTORY = pathlib.Path(__file__).parent.resolve()
+
+
+EXPECTED_OUTPUT = [
+    NarrativeText(text="This is a test email to use for unit tests."),
+    Title(text="Important points:"),
+    ListItem(text="Roses are red"),
+    ListItem(text="Violets are blue"),
+]
+
+
+def test_partition_email_from_filename():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    elements = partition_email(filename=filename)
+    assert len(elements) > 0
+    assert elements == EXPECTED_OUTPUT
+
+
+def test_partition_email_from_file():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with open(filename, "r") as f:
+        elements = partition_email(file=f)
+    assert len(elements) > 0
+    assert elements == EXPECTED_OUTPUT
+
+
+def test_partition_email_from_text():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with open(filename, "r") as f:
+        text = f.read()
+    elements = partition_email(text=text)
+    assert len(elements) > 0
+    assert elements == EXPECTED_OUTPUT
+
+
+def test_partition_email_raises_with_none_specified():
+    with pytest.raises(ValueError):
+        partition_email()
+
+
+def test_partition_email_raises_with_too_many_specified():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with open(filename, "r") as f:
+        text = f.read()
+
+    with pytest.raises(ValueError):
+        partition_email(filename=filename, text=text)
+
+
+def test_partition_email_raises_with_invalid_content_type():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with pytest.raises(ValueError):
+        partition_email(filename=filename, content_source="application/json")
@@ -1 +1 @@
-__version__ = "0.3.2"  # pragma: no cover
+__version__ = "0.3.3-dev1"  # pragma: no cover
@@ -1,6 +1,8 @@
 import re
 import sys
 import unicodedata
+import quopri
+
 from unstructured.nlp.patterns import UNICODE_BULLETS_RE
 
 
@@ -81,6 +83,16 @@ def clean_trailing_punctuation(text: str) -> str:
     return text.strip().rstrip(".,:;")
 
 
+def replace_mime_encodings(text: str) -> str:
+    """Replaces MIME encodings with their UTF-8 equivalent characters.
+
+    Example
+    -------
+    5 w=E2=80=99s -> 5 w’s
+    """
+    return quopri.decodestring(text.encode()).decode("utf-8")
+
+
 def clean_prefix(text: str, pattern: str, ignore_case: bool = False, strip: bool = True) -> str:
     """Removes prefixes from a string according to the specified pattern. Strips leading
     whitespace if the strip parameter is set to True.
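The `replace_mime_encodings` brick added above relies on quoted-printable decoding from the standard-library `quopri` module. A minimal sketch of that behaviour follows; `decode_quoted_printable` is a hypothetical stand-in that mirrors the one-line body shown in the diff, so the example runs without `unstructured` installed.

```python
import quopri


def decode_quoted_printable(text: str) -> str:
    """Decode quoted-printable escapes in `text` and return the UTF-8 result."""
    return quopri.decodestring(text.encode()).decode("utf-8")


# "=E2=80=99" is the UTF-8 byte sequence for the right single quotation mark.
decoded = decode_quoted_printable("5 w=E2=80=99s")

# A trailing "=" before a newline is a soft line break and is dropped on decode.
joined = decode_quoted_printable("<li>Item 1</li>=\n<li>Item 2</li>")
```

The second call shows why `.eml` bodies can contain stray `=` characters at line ends: they are soft line breaks inserted by the quoted-printable encoding, not part of the HTML.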
@@ -225,8 +225,13 @@ def _construct_text(tag_elem: etree.Element) -> str:
     return text.strip()
 
 
-def _is_text_tag(tag_elem: etree.Element) -> bool:
+def _is_text_tag(tag_elem: etree.Element, max_predecessor_len: int = 5) -> bool:
     """Determines if a tag potentially contains narrative text."""
+    # NOTE(robinson) - Only consider elements with limited depth. Otherwise,
+    # it could be the text representation of a giant div
+    if len(tag_elem) > max_predecessor_len:
+        return False
+
     if tag_elem.tag in TEXT_TAGS + HEADING_TAGS:
         return True
 
@@ -250,7 +255,7 @@ def _process_list_item(
     we can skip processing if bullets are found in a div element."""
     if tag_elem.tag in LIST_ITEM_TAGS:
         text = _construct_text(tag_elem)
-        return HTMLListItem(text=text, tag=tag_elem.tag), None
+        return HTMLListItem(text=text, tag=tag_elem.tag), tag_elem
 
     elif tag_elem.tag == "div":
         text = _construct_text(tag_elem)
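The new guard in `_is_text_tag` compares `len(tag_elem)` with `max_predecessor_len`. In the ElementTree API, which `lxml.etree` also follows, `len()` of an element counts its direct children rather than its nesting depth. The sketch below illustrates the difference using the standard-library `xml.etree.ElementTree` as a stand-in (an assumption, since the diff itself uses `lxml`); `depth` is a hypothetical helper, not part of the commit.

```python
import xml.etree.ElementTree as ET

# One child per level, nested four levels deep.
nested = ET.fromstring("<div><div><div><p>deep</p></div></div></div>")
# Three direct children, only two levels deep.
wide = ET.fromstring("<div><p>a</p><p>b</p><p>c</p></div>")


def depth(elem: ET.Element) -> int:
    """Return the number of levels in the subtree rooted at `elem`."""
    return 1 + max((depth(child) for child in elem), default=0)


nested_children, nested_depth = len(nested), depth(nested)
wide_children, wide_depth = len(wide), depth(wide)
```

Under this reading, the guard filters out elements with many direct children, which is one plausible proxy for "a giant div"; a deeply nested element with a single child per level would still pass.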
74
unstructured/partition/email.py
Normal file
@@ -0,0 +1,74 @@
+import email
+from typing import Dict, Final, IO, List, Optional
+
+from unstructured.cleaners.core import replace_mime_encodings
+from unstructured.documents.elements import Element, Text
+from unstructured.partition.html import partition_html
+
+
+VALID_CONTENT_SOURCES: Final[List[str]] = ["text/html"]
+
+
+def partition_email(
+    filename: Optional[str] = None,
+    file: Optional[IO] = None,
+    text: Optional[str] = None,
+    content_source: str = "text/html",
+) -> List[Element]:
+    """Partitions an .eml document into its constituent elements.
+    Parameters
+    ----------
+    filename
+        A string defining the target filename path.
+    file
+        A file-like object using "r" mode --> open(filename, "r").
+    text
+        The string representation of the .eml document.
+    """
+    if content_source not in VALID_CONTENT_SOURCES:
+        raise ValueError(
+            f"{content_source} is not a valid value for content_source. "
+            f"Valid content sources are: {VALID_CONTENT_SOURCES}"
+        )
+
+    if not any([filename, file, text]):
+        raise ValueError("One of filename, file, or text must be specified.")
+
+    if filename is not None and not file and not text:
+        with open(filename, "r") as f:
+            msg = email.message_from_file(f)
+
+    elif file is not None and not filename and not text:
+        file_text = file.read()
+        msg = email.message_from_string(file_text)
+
+    elif text is not None and not filename and not file:
+        _text: str = str(text)
+        msg = email.message_from_string(_text)
+
+    else:
+        raise ValueError("Only one of filename, file, or text can be specified.")
+
+    content_map: Dict[str, str] = {
+        part.get_content_type(): part.get_payload() for part in msg.walk()
+    }
+
+    content = content_map.get(content_source, "")
+    if not content:
+        raise ValueError("text/html content not found in email")
+
+    # NOTE(robinson) - In the .eml files, the HTML content gets stored in a format that
+    # looks like the following, resulting in extraneous "=" characters in the output if
+    # you don't clean it up
+    # <ul> =
+    # <li>Item 1</li>=
+    # <li>Item 2</li>=
+    # </ul>
+    content = "".join(content.split("=\n"))
+
+    elements = partition_html(text=content)
+    for element in elements:
+        if isinstance(element, Text):
+            element.apply(replace_mime_encodings)
+
+    return elements
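`partition_email` requires exactly one of `filename`, `file`, or `text`, and raises `ValueError` otherwise. One way to express that rule in isolation is sketched below; `exactly_one` is a hypothetical helper that does not appear in the commit. It tests `is not None`, whereas the diff uses truthiness, so an empty string counts as specified here but as unspecified in the original.

```python
from typing import Any


def exactly_one(**kwargs: Any) -> str:
    """Return the name of the single argument that is not None.

    Raises ValueError when no argument or more than one argument is given.
    """
    provided = [name for name, value in kwargs.items() if value is not None]
    names = ", ".join(kwargs)
    if not provided:
        raise ValueError(f"One of {names} must be specified.")
    if len(provided) > 1:
        raise ValueError(f"Only one of {names} can be specified.")
    return provided[0]


source = exactly_one(filename=None, file=None, text="raw .eml text")
```

Returning the argument name lets a caller branch on the chosen input once, instead of repeating the mutual-exclusion conditions in every `if`/`elif` arm.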