feat: Add new functionality to parse text and header of emails (#111)

* partition_text function
2025-11-05 20:37:36 +00:00 · 2023-01-09 11:08:08 -06:00 · 2023-01-09 11:08:08 -06:00 · d7a00046a9
commit d7a00046a9
parent 7fb8713527
15 changed files with 670 additions and 42 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,10 +1,14 @@
-## 0.3.6-dev1
+## 0.3.6-dev2
 * Cleaning brick for removing ordered bullets `clean_ordered_bullets`.
 * Extract brick method for ordered bullets `extract_ordered_bullets`.
 * Test for `clean_ordered_bullets`.
 * Test for `extract_ordered_bullets`.
 * Added `partition_docx` for pre-processing Word Documents.
 * Added new regex patterns to extract email header information.
 * Added new functions to extract header information: `_parse_received_data` and `partition_email_header`.
 * Added new function to parse plain text files: `partition_text`.
 * Added new cleaner functions `extract_ip_address`, `extract_ip_address_name`, `extract_mapi_id`, and `extract_datetimetz`.
 ## 0.3.5
@ -18,6 +22,7 @@
 * Add new function `extract_attachment_info` that extracts and decodes the attachments
 of an email.
 * Staging brick to convert a list of `Element`s to a `pandas` dataframe.
 * Add plain text functionality to `partition_email`
 ## 0.3.4
--- a/README.md
+++ b/README.md
@ -190,6 +190,51 @@ Roses are red
 Violets are blue
 ```
 ### Text Document Parsing
 The `partition_text` function within `unstructured` can be used to parse simple
 text files into elements.
 `partition_text` accepts a filename, a file-like object, or raw text as input. The following three snippets show each option:
 ```python
 from unstructured.partition.text import partition_text
 elements = partition_text(filename="example-docs/fake-text.txt")
 with open("example-docs/fake-text.txt", "r") as f:
  elements = partition_text(file=f)
 with open("example-docs/fake-text.txt", "r") as f:
  text = f.read()
 elements = partition_text(text=text)
 ```
 The `elements` output will look like the following (memory addresses will vary):
 ```python
 [<unstructured.documents.elements.NarrativeText at 0x...>,
 <unstructured.documents.elements.Title at 0x...>,
 <unstructured.documents.elements.ListItem at 0x...>,
 <unstructured.documents.elements.ListItem at 0x...>,
 <unstructured.documents.elements.ListItem at 0x...>]
 ```
 Run `print("\n\n".join([str(el) for el in elements]))` to get a string representation of the
 output, which looks like:
 ```python
 This is a test document to use for unit tests.
 Important points:
 Hamburgers are delicious
 Dogs are the best
 I love fuzzy blankets
 ```
 ## :guardsman: Security Policy
 See our [security policy](https://github.com/Unstructured-IO/unstructured/security/policy) for
--- a/docs/source/bricks.rst
+++ b/docs/source/bricks.rst
@ -90,7 +90,11 @@ Examples:
 The ``partition_email`` function partitions ``.eml`` documents and works with exports
 from email clients such as Microsoft Outlook and Gmail. The ``partition_email`` 
 takes a filename, file-like object, or raw text as input and produces a list of
-document ``Element`` objects as output.
+document ``Element`` objects as output. ``content_source`` can be set to ``text/html``
 (the default) or ``text/plain`` to process the HTML or plain text version of the email, respectively.
 To also return the header information (e.g. sender, recipient, subject),
 set ``include_headers`` to ``True``. In that case the returned list contains the header
 elements first, followed by the body elements.
 Examples:
@ -107,6 +111,37 @@ Examples:
      text = f.read()
  elements = partition_email(text=text)
  with open("example-docs/fake-email.eml", "r") as f:
      text = f.read()
  elements = partition_email(text=text, content_source="text/plain")
  with open("example-docs/fake-email.eml", "r") as f:
      text = f.read()
  elements = partition_email(text=text, include_headers=True)
 ``partition_text``
 ---------------------
 The ``partition_text`` function partitions text files. It takes a filename,
 a file-like object, or raw text as input and produces a list of ``Element`` objects as output.
 Examples:
 .. code:: python
  from unstructured.partition.text import partition_text
  elements = partition_text(filename="example-docs/fake-text.txt")
  with open("example-docs/fake-text.txt", "r") as f:
    elements = partition_text(file=f)
  with open("example-docs/fake-text.txt", "r") as f:
    text = f.read()
  elements = partition_text(text=text)
 ``extract_attachment_info``
 ----------------------------
@ -550,6 +585,96 @@ Examples:
  # Returns "Look at me, I'm flying!"
  extract_text_after(text, r"SPEAKER \d{1}:")
 ``extract_email_address``
 --------------------------
 Extracts email addresses from a string input and returns a list of all the email
 addresses in the input string.
 .. code:: python
  from unstructured.cleaners.extract import extract_email_address
  text = """Me me@email.com and You <You@email.com> 
      ([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""
  # Returns "['me@email.com', 'you@email.com']"
  extract_email_address(text)
 ``extract_ip_address``
 ------------------------
 Extracts IPv4 and IPv6 addresses from the input string and
 returns a list of all IP addresses in the input string.
 .. code:: python
  from unstructured.cleaners.extract import extract_ip_address
  text = """Me me@email.com and You <You@email.com> 
    ([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""
  # Returns "['ba23::58b5:2236:45g2:88h2', '10.0.2.01']"
  extract_ip_address(text)
 ``extract_ip_address_name``
 ----------------------------
 Extracts the names of each IP address in the ``Received`` field(s) from an ``.eml``
 file. ``extract_ip_address_name`` takes in a string and returns a list of all
 IP address names in the input string.
 .. code:: python
  from unstructured.cleaners.extract import extract_ip_address_name
  text = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
    \n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
    n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
  # Returns "['ABC.DEF.local', 'ABC.DEF.local']" (the pattern stops before the trailing digit in "local2")
  extract_ip_address_name(text)
 ``extract_mapi_id``
 ----------------------
 Extracts the ``mapi id`` in the ``Received`` field(s) from an ``.eml``
 file. ``extract_mapi_id`` takes in a string and returns a list of the
 ``mapi id`` strings found in the input string.
 .. code:: python
  from unstructured.cleaners.extract import extract_mapi_id
  text = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
    \n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
    n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
  # Returns "['32.88.5467.123']"
  extract_mapi_id(text)
 ``extract_datetimetz``
 ----------------------
 Extracts the date, time, and timezone in the ``Received`` field(s) from an ``.eml`` 
 file. ``extract_datetimetz`` takes in a string and returns a datetime.datetime
 object from the input string.
 .. code:: python
  from unstructured.cleaners.extract import extract_datetimetz
  text = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
    \n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
    n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
  # Returns datetime.datetime(2021, 3, 26, 11, 4, 9, tzinfo=datetime.timezone(datetime.timedelta(seconds=43200)))
  extract_datetimetz(text)
 ``extract_us_phone_number``
 ---------------------------
--- a/example-docs/fake-email.txt
+++ b/example-docs/fake-email.txt
@ -0,0 +1,24 @@
 MIME-Version: 1.0
 Date: Fri, 16 Dec 2022 17:04:16 -0500
 Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>
 Subject: Test Email
 From: Matthew Robinson <mrobinson@unstructured.io>
 To: Matthew Robinson <mrobinson@unstructured.io>
 Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"
 --00000000000095c9b205eff92630
 Content-Type: text/plain; charset="UTF-8"
 This is a test email to use for unit tests.
 Important points:
   - Roses are red
   - Violets are blue
 --00000000000095c9b205eff92630
 Content-Type: text/html; charset="UTF-8"
 <div dir="ltr"><div>This is a test email to use for unit tests.</div><div><br></div><div>Important points:</div><div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div></div>
 --00000000000095c9b205eff92630--
--- a/example-docs/fake-text.txt
+++ b/example-docs/fake-text.txt
@ -0,0 +1,7 @@
 This is a test document to use for unit tests.
 Important points:
   - Hamburgers are delicious
   - Dogs are the best
   - I love fuzzy blankets
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1,66 @@
 #
 # This file is autogenerated by pip-compile with python 3.9
 # To update, run:
 #
 #    pip-compile
 #
 argilla==1.1.1
    # via unstructured (setup.py)
 backoff==2.2.1
    # via argilla
 certifi==2022.12.7
    # via httpx
 click==8.1.3
    # via nltk
 deprecated==1.2.13
    # via argilla
 h11==0.9.0
    # via httpcore
 httpcore==0.11.1
    # via httpx
 httpx==0.15.5
    # via argilla
 idna==3.4
    # via rfc3986
 joblib==1.2.0
    # via nltk
 lxml==4.9.2
    # via unstructured (setup.py)
 monotonic==1.6
    # via argilla
 nltk==3.8
    # via unstructured (setup.py)
 numpy==1.23.5
    # via
    #   argilla
    #   pandas
 packaging==22.0
    # via argilla
 pandas==1.5.2
    # via argilla
 pydantic==1.10.2
    # via argilla
 python-dateutil==2.8.2
    # via pandas
 pytz==2022.6
    # via pandas
 regex==2022.10.31
    # via nltk
 rfc3986[idna2008]==1.5.0
    # via httpx
 six==1.16.0
    # via python-dateutil
 sniffio==1.3.0
    # via
    #   httpcore
    #   httpx
 tqdm==4.64.1
    # via
    #   argilla
    #   nltk
 typing-extensions==4.4.0
    # via pydantic
 wrapt==1.13.3
    # via
    #   argilla
    #   deprecated
--- a/test_unstructured/cleaners/test_extract.py
+++ b/test_unstructured/cleaners/test_extract.py
@ -1,7 +1,12 @@
 import pytest
 import datetime
 import unstructured.cleaners.extract as extract
 EMAIL_META_DATA_INPUT = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
    \n ABC.DEF.local ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
    n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
 def test_get_indexed_match_raises_with_bad_index():
    with pytest.raises(ValueError):
@ -23,6 +28,35 @@ def test_extract_text_after():
    assert extract.extract_text_after(text, "BLAH;", 0) == "Student: BLAH BLAH BLAH!"
 def test_extract_email_address():
    text = "Im Rabn <Im.Rabn@npf.gov.nr>"
    assert extract.extract_email_address(text) == ["im.rabn@npf.gov.nr"]
 def test_extract_ip_address():
    assert extract.extract_ip_address(EMAIL_META_DATA_INPUT) == [
        "ba23::58b5:2236:45g2:88h2",
        "ba23::58b5:2236:45g2:88h2%25",
    ]
 def test_extract_ip_address_name():
    assert extract.extract_ip_address_name(EMAIL_META_DATA_INPUT) == [
        "ABC.DEF.local",
        "ABC.DEF.local",
    ]
 def test_extract_mapi_id():
    assert extract.extract_mapi_id(EMAIL_META_DATA_INPUT) == ["32.88.5467.123"]
 def test_extract_datetimetz():
    assert extract.extract_datetimetz(EMAIL_META_DATA_INPUT) == datetime.datetime(
        2021, 3, 26, 11, 4, 9, tzinfo=datetime.timezone(datetime.timedelta(seconds=43200))
    )
@pytest.mark.parametrize(
    "text, expected",
    [
--- a/test_unstructured/partition/test_email.py
+++ b/test_unstructured/partition/test_email.py
@ -4,7 +4,17 @@ import pathlib
 import pytest
 from unstructured.documents.elements import NarrativeText, Title, ListItem
-from unstructured.partition.email import partition_email, extract_attachment_info
+from unstructured.documents.email_elements import (
    MetaData,
    Recipient,
    Sender,
    Subject,
 )
 from unstructured.partition.email import (
    extract_attachment_info,
    partition_email,
    partition_email_header,
 )
 DIRECTORY = pathlib.Path(__file__).parent.resolve()
@ -17,6 +27,23 @@ EXPECTED_OUTPUT = [
    ListItem(text="Violets are blue"),
 ]
 HEADER_EXPECTED_OUTPUT = [
    MetaData(name="MIME-Version", text="1.0"),
    MetaData(name="Date", text="Fri, 16 Dec 2022 17:04:16 -0500"),
    MetaData(
        name="Message-ID",
        text="<CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>",
    ),
    Subject(text="Test Email"),
    Sender(name="Matthew Robinson", text="mrobinson@unstructured.io"),
    Recipient(name="Matthew Robinson", text="mrobinson@unstructured.io"),
    MetaData(
        name="Content-Type", text='multipart/alternative; boundary="00000000000095c9b205eff92630"'
    ),
 ]
 ALL_EXPECTED_OUTPUT = HEADER_EXPECTED_OUTPUT + EXPECTED_OUTPUT
 ATTACH_EXPECTED_OUTPUT = [
    {"filename": "fake-attachment.txt", "payload": b"Hey this is a fake attachment!"}
 ]
@ -37,6 +64,22 @@ def test_partition_email_from_file():
    assert elements == EXPECTED_OUTPUT
 def test_partition_email_from_text_file():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.txt")
    with open(filename, "r") as f:
        elements = partition_email(file=f, content_source="text/plain")
    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
 def test_partition_email_from_text_file_with_headers():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.txt")
    with open(filename, "r") as f:
        elements = partition_email(file=f, content_source="text/plain", include_headers=True)
    assert len(elements) > 0
    assert elements == ALL_EXPECTED_OUTPUT
 def test_partition_email_from_text():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
    with open(filename, "r") as f:
@ -46,6 +89,15 @@ def test_partition_email_from_text():
    assert elements == EXPECTED_OUTPUT
 def test_partition_email_header():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
    with open(filename, "r") as f:
        msg = email.message_from_file(f)
    elements = partition_email_header(msg)
    assert len(elements) > 0
    assert elements == HEADER_EXPECTED_OUTPUT
 def test_extract_attachment_info():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email-attachment.eml")
    with open(filename, "r") as f:
--- a/test_unstructured/partition/test_text.py
+++ b/test_unstructured/partition/test_text.py
@ -0,0 +1,54 @@
 import os
 import pathlib
 import pytest
 from unstructured.documents.elements import NarrativeText, Title, ListItem
 from unstructured.partition.text import partition_text
 DIRECTORY = pathlib.Path(__file__).parent.resolve()
 EXPECTED_OUTPUT = [
    NarrativeText(text="This is a test document to use for unit tests."),
    Title(text="Important points:"),
    ListItem(text="Hamburgers are delicious"),
    ListItem(text="Dogs are the best"),
    ListItem(text="I love fuzzy blankets"),
 ]
 def test_partition_text_from_filename():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
    elements = partition_text(filename=filename)
    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
 def test_partition_text_from_file():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
    with open(filename, "r") as f:
        elements = partition_text(file=f)
    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
 def test_partition_text_from_text():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
    with open(filename, "r") as f:
        text = f.read()
    elements = partition_text(text=text)
    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
 def test_partition_text_raises_with_none_specified():
    with pytest.raises(ValueError):
        partition_text()
 def test_partition_text_raises_with_too_many_specified():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
    with open(filename, "r") as f:
        text = f.read()
    with pytest.raises(ValueError):
        partition_text(filename=filename, text=text)
--- a/unstructured/version.py
+++ b/unstructured/version.py
@ -1 +1 @@
-__version__ = "0.3.6-dev1"  # pragma: no cover
+__version__ = "0.3.6-dev2"  # pragma: no cover
--- a/unstructured/cleaners/extract.py
+++ b/unstructured/cleaners/extract.py
@ -1,4 +1,13 @@
 import re
 import datetime
 from typing import List
 from unstructured.nlp.patterns import (
    IP_ADDRESS_PATTERN_RE,
    IP_ADDRESS_NAME_PATTERN,
    MAPI_ID_PATTERN,
    EMAIL_DATETIMETZ_PATTERN,
    EMAIL_ADDRESS_PATTERN,
 )
 from unstructured.nlp.patterns import US_PHONE_NUMBERS_RE
@ -48,6 +57,29 @@ def extract_text_after(text: str, pattern: str, index: int = 0, strip: bool = Tr
    return before_text.lstrip() if strip else before_text
 def extract_email_address(text: str) -> List[str]:
    return re.findall(EMAIL_ADDRESS_PATTERN, text.lower())
 def extract_ip_address(text: str) -> List[str]:
    return re.findall(IP_ADDRESS_PATTERN_RE, text)
 def extract_ip_address_name(text: str) -> List[str]:
    return re.findall(IP_ADDRESS_NAME_PATTERN, text)
 def extract_mapi_id(text: str) -> List[str]:
    mapi_ids = re.findall(MAPI_ID_PATTERN, text)
    mapi_ids = [mid.replace(";", "") for mid in mapi_ids]
    return mapi_ids
 def extract_datetimetz(text: str) -> datetime.datetime:
    date_string = re.findall(EMAIL_DATETIMETZ_PATTERN, text)
    return datetime.datetime.strptime(date_string[0], "%a, %d %b %Y %H:%M:%S %z")
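`extract_datetimetz` indexes `date_string[0]` directly, so input without a matching timestamp raises `IndexError`. A minimal standalone sketch of the same parse with an explicit no-match branch follows. `parse_received_datetime` is a hypothetical helper name, not part of this PR, and the pattern is a raw-string copy of `EMAIL_DATETIMETZ_PATTERN` with the `A-z` range written as `A-Z`.

```python
import datetime
import re
from typing import Optional

# Raw-string copy of EMAIL_DATETIMETZ_PATTERN from this diff (A-Z instead of A-z).
DATETIMETZ_PATTERN = (
    r"[a-zA-Z]{3},\s[0-9]{2}\s[a-zA-Z]{3}\s[0-9]{4}\s"
    r"[0-9]{2}:[0-9]{2}:[0-9]{2}\s[+0-9]{5}"
)


def parse_received_datetime(text: str) -> Optional[datetime.datetime]:
    """Return the first timestamp found in `text`, or None when there is none."""
    match = re.search(DATETIMETZ_PATTERN, text)
    if match is None:
        return None
    return datetime.datetime.strptime(match.group(0), "%a, %d %b %Y %H:%M:%S %z")


stamp = parse_received_datetime(
    "with mapi id 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"
)
missing = parse_received_datetime("no timestamp here")
```

Returning `None` is only one option; raising a descriptive `ValueError` would fit equally well if callers expect a timestamp to always be present.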
 def extract_us_phone_number(text: str):
    """Extracts a US phone number from a section of text that includes a phone number. If there
    is no phone number present, the result will be an empty string.
--- a/unstructured/documents/email_elements.py
+++ b/unstructured/documents/email_elements.py
@ -1,4 +1,5 @@
 from abc import ABC
 from datetime import datetime
 import hashlib
 from typing import Callable, List, Union
 from unstructured.documents.elements import Element, Text, NoID
@ -15,9 +16,16 @@ class Name(EmailElement):
    category = "Uncategorized"
-    def __init__(self, name: str, text: str, element_id: Union[str, NoID] = NoID()):
+    def __init__(
        self,
        name: str,
        text: str,
        element_id: Union[str, NoID] = NoID(),
    ):
        self.name: str = name
        self.text: str = text
        self.datestamp: datetime
        self.has_datestamp: bool = False
        if isinstance(element_id, NoID):
            # NOTE(robinson) - Cut the SHA256 hex in half to get the first 128 bits
@ -25,10 +33,20 @@ class Name(EmailElement):
        super().__init__(element_id=element_id)
    def set_datestamp(self, datestamp: datetime):
        self.datestamp = datestamp
        self.has_datestamp = True
    def __str__(self):
        return f"{self.name}: {self.text}"
    def __eq__(self, other):
        if self.has_datestamp:
            return (
                self.name == other.name
                and self.text == other.text
                and self.datestamp == other.datestamp
            )
        return self.name == other.name and self.text == other.text
    def apply(self, *cleaners: Callable):
@ -60,54 +78,50 @@ class BodyText(List[Text]):
    pass
-class Recipient(Text):
+class Recipient(Name):
-    """A text element for capturing the recipient information of an email (e.g. Subject,
+    """A text element for capturing the recipient information of an email"""
-    To, From, etc)."""
    category = "Recipient"
    pass
-class Sender(Text):
+class Sender(Name):
-    """A text element for capturing the sender information of an email (e.g. Subject,
+    """A text element for capturing the sender information of an email"""
-    To, From, etc)."""
    category = "Sender"
    pass
-class Subject(Text):
+class Subject(Text, EmailElement):
-    """A text element for capturing the subject information of an email (e.g. Subject,
+    """A text element for capturing the subject information of an email"""
-    To, From, etc)."""
    category = "Subject"
    pass
 class ReceivedInfo(List[Text]):
    """A text element for capturing header information of an email (e.g. Subject,
    To, From, etc)."""
    category = "ReceivedInfo"
    pass
 class MetaData(Name):
-    """A text element for capturing header meta data of an email (e.g. Subject,
+    """A text element for capturing header meta data of an email
-    To, From, etc)."""
+    (miscellaneous data in the email)."""
    category = "MetaData"
    pass
 class ReceivedInfo(Name):
    """A text element for capturing header information of an email (e.g. IP addresses, etc)."""
    category = "ReceivedInfo"
    pass
 class Attachment(Name):
-    """A text element for capturing the attachment name in an email (e.g. Subject,
+    """A text element for capturing the attachment name in an email (e.g. documents,
-    To, From, etc)."""
+    images, etc)."""
    category = "Attachment"
@ -117,11 +131,11 @@ class Attachment(Name):
 class Email(ABC):
     """An email class with its attributes"""
-    def __init__(self, recipient: Recipient, sender: Sender, subject: Subject, body: BodyText):
+    def __init__(self):
-        self.recipient = recipient
+        self.recipient = Recipient
-        self.sender = sender
+        self.sender = Sender
-        self.subject = subject
+        self.subject = Subject
-        self.body = body
+        self.body = BodyText
        self.received_info: ReceivedInfo
        self.meta_data: MetaData
        self.attachment: List[Attachment]
--- a/unstructured/nlp/patterns.py
+++ b/unstructured/nlp/patterns.py
@ -41,3 +41,28 @@ UNICODE_BULLETS: Final[List[str]] = [
    "·",
 ]
 UNICODE_BULLETS_RE = re.compile(f"({'|'.join(UNICODE_BULLETS)})")
 # Helps split text by paragraphs
 PARAGRAPH_PATTERN = "\n\n\n|\n\n|\r\n|\r|\n"  # noqa: W605 NOTE(harrell)
 # IP Address examples: ba23::58b5:2236:45g2:88h2 or 10.0.2.01
 IP_ADDRESS_PATTERN = (
    "[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}",  # noqa: W605 NOTE(harrell)
    # - skipping qa because we need the escape for the regex
    "[a-z0-9]{4}::[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}%?[0-9]*",
 )
 IP_ADDRESS_PATTERN_RE = re.compile(f"({'|'.join(IP_ADDRESS_PATTERN)})")
 IP_ADDRESS_NAME_PATTERN = "[a-zA-Z0-9-]*\.[a-zA-Z]*\.[a-zA-Z]*"  # noqa: W605 NOTE(harrell)
 # - skipping qa because we need the escape for the regex
 # Mapi ID example: 32.88.5467.123
 MAPI_ID_PATTERN = "[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*;"  # noqa: W605 NOTE(harrell)
 # - skipping qa because we need the escape for the regex
 # Date, time, timezone example: Fri, 26 Mar 2021 11:04:09 +1200
 EMAIL_DATETIMETZ_PATTERN = "[a-zA-Z]{3},\s[0-9]{2}\s[a-zA-Z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}\s[+0-9]{5}"  # noqa: W605,E501
 # NOTE(harrell) - skipping qa because we need the escape for the regex
 EMAIL_ADDRESS_PATTERN = "[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"  # noqa: W605 NOTE(harrell)
 # - skipping qa because we need the escape for the regex
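As a quick sanity check of the patterns above, the sketch below runs them against a sample `Received` value using only the standard library. The patterns are copied from this diff and written as raw strings; the variable names mirror the PR, but nothing here imports `unstructured`, and the sample string is an assumption modelled on the docs example.

```python
import re

# Patterns copied from unstructured/nlp/patterns.py in this diff, as raw strings.
IP_ADDRESS_PATTERN = (
    r"[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}",
    r"[a-z0-9]{4}::[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}%?[0-9]*",
)
IP_ADDRESS_PATTERN_RE = re.compile(f"({'|'.join(IP_ADDRESS_PATTERN)})")
IP_ADDRESS_NAME_PATTERN = r"[a-zA-Z0-9-]*\.[a-zA-Z]*\.[a-zA-Z]*"
MAPI_ID_PATTERN = r"[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*;"

received = (
    "from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by\n"
    " ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\n"
    " 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"
)

# findall returns the single capture group, i.e. the matched address itself.
ips = IP_ADDRESS_PATTERN_RE.findall(received)
# The last segment of the name pattern is letters only, so "local2" yields "local".
names = re.findall(IP_ADDRESS_NAME_PATTERN, received)
# The mapi id pattern includes the trailing ";", which is stripped afterwards.
mapi_ids = [m.replace(";", "") for m in re.findall(MAPI_ID_PATTERN, received)]
```

Raw strings keep the backslash escapes intact without relying on Python passing unknown escapes through unchanged.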
--- a/unstructured/partition/email.py
+++ b/unstructured/partition/email.py
@ -1,7 +1,8 @@
 import email
 import sys
 import re
 from email.message import Message
-from typing import Dict, IO, List, Optional
+from typing import Dict, IO, List, Optional, Tuple
 if sys.version_info < (3, 8):
    from typing_extensions import Final
@ -9,11 +10,77 @@ else:
    from typing import Final
 from unstructured.cleaners.core import replace_mime_encodings, clean_extra_whitespace
 from unstructured.cleaners.extract import (
    extract_ip_address,
    extract_ip_address_name,
    extract_mapi_id,
    extract_datetimetz,
    extract_email_address,
 )
 from unstructured.documents.email_elements import (
    Recipient,
    Sender,
    Subject,
    ReceivedInfo,
    MetaData,
 )
 from unstructured.documents.elements import Element, Text
 from unstructured.partition.html import partition_html
 from unstructured.partition.text import split_by_paragraph, partition_text
-VALID_CONTENT_SOURCES: Final[List[str]] = ["text/html"]
+VALID_CONTENT_SOURCES: Final[List[str]] = ["text/html", "text/plain"]
 def _parse_received_data(data: str) -> List[Element]:
    ip_address_names = extract_ip_address_name(data)
    ip_addresses = extract_ip_address(data)
    mapi_id = extract_mapi_id(data)
    datetimetz = extract_datetimetz(data)
    elements: List[Element] = list()
    if ip_address_names and ip_addresses:
        for name, ip in zip(ip_address_names, ip_addresses):
            elements.append(ReceivedInfo(name=name, text=ip))
    if mapi_id:
        elements.append(ReceivedInfo(name="mapi_id", text=mapi_id[0]))
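     if datetimetz:
         # set_datestamp returns None, so build the element first and append it afterwards
         received_datetimetz = ReceivedInfo(name="received_datetimetz", text=str(datetimetz))
         received_datetimetz.set_datestamp(datestamp=datetimetz)
         elements.append(received_datetimetz)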
    return elements
 def _parse_email_address(data: str) -> Tuple[str, str]:
    email_address = extract_email_address(data)
    PATTERN = "<[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+>"  # noqa: W605 Note(harrell)
    name = re.split(PATTERN, data.lower())[0].title().strip()
    return name, email_address[0]
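`_parse_email_address` lowercases the header before splitting and then title-cases the display name, which alters names such as "McDonald" (it becomes "Mcdonald"). One alternative, offered only as a sketch and not what this PR does, is the standard library's `email.utils.parseaddr`, which returns the display name and the address without changing case:

```python
from email.utils import parseaddr

# parseaddr splits an address header value into (display name, address).
name, address = parseaddr("Matthew Robinson <mrobinson@unstructured.io>")

# A bare address yields an empty display name.
bare_name, bare_address = parseaddr("me@email.com")
```

If lowercase addresses are wanted, as `extract_email_address` produces, `address.lower()` can be applied to the second value only.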
 def partition_email_header(msg: Message) -> List[Element]:
    elements: List[Element] = list()
    for item in msg.raw_items():
        if item[0] == "To":
            text = _parse_email_address(item[1])
            elements.append(Recipient(name=text[0], text=text[1]))
        elif item[0] == "From":
            text = _parse_email_address(item[1])
            elements.append(Sender(name=text[0], text=text[1]))
        elif item[0] == "Subject":
            elements.append(Subject(text=item[1]))
        elif item[0] == "Received":
            elements += _parse_received_data(item[1])
        else:
            elements.append(MetaData(name=item[0], text=item[1]))
    return elements
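The routing in `partition_email_header` can be tried without the element classes. The sketch below uses only the standard library, walks `Message.raw_items()` the same way, and tags each header with a category string in place of the `Recipient`, `Sender`, `Subject`, and `MetaData` classes. `categorize_headers` is a hypothetical name, and `Received` handling is left out.

```python
import email
from typing import List, Tuple

RAW_EMAIL = (
    "MIME-Version: 1.0\n"
    "Date: Fri, 16 Dec 2022 17:04:16 -0500\n"
    "Subject: Test Email\n"
    "From: Matthew Robinson <mrobinson@unstructured.io>\n"
    "To: Matthew Robinson <mrobinson@unstructured.io>\n"
    "\n"
    "This is a test email to use for unit tests.\n"
)


def categorize_headers(raw: str) -> List[Tuple[str, str, str]]:
    """Return (category, header name, raw value) for each header, in order."""
    msg = email.message_from_string(raw)
    categories = {"To": "Recipient", "From": "Sender", "Subject": "Subject"}
    # Anything not explicitly routed falls back to "MetaData", as in the PR.
    return [
        (categories.get(key, "MetaData"), key, value)
        for key, value in msg.raw_items()
    ]


headers = categorize_headers(RAW_EMAIL)
```

The header order in the result follows the order in the message, which is why the expected output in the tests lists `MIME-Version` and `Date` before `Subject`.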
 def extract_attachment_info(
@ -40,7 +107,7 @@ def extract_attachment_info(
                if output_dir:
                    filename = output_dir + "/" + attachment["filename"]
                    with open(filename, "wb") as f:
-                        # mypy wants to just us `w` when opening the file but this
+                        # Note(harrell) mypy wants to just use `w` when opening the file but this
                        # causes an error since the payloads are bytes not str
                        f.write(attachment["payload"])  # type: ignore
    return list_attachments
@ -51,6 +118,7 @@ def partition_email(
    file: Optional[IO] = None,
    text: Optional[str] = None,
    content_source: str = "text/html",
    include_headers: bool = False,
 ) -> List[Element]:
     """Partitions an .eml document into its constituent elements.
    Parameters
@ -61,6 +129,9 @@ def partition_email(
        A file-like object using "r" mode --> open(filename, "r").
    text
        The string representation of the .eml document.
    content_source
        default: "text/html"
        other: "text/plain"
    """
    if content_source not in VALID_CONTENT_SOURCES:
        raise ValueError(
@ -92,7 +163,7 @@ def partition_email(
    content = content_map.get(content_source, "")
    if not content:
-        raise ValueError("text/html content not found in email")
+        raise ValueError(f"{content_source} content not found in email")
    # NOTE(robinson) - In the .eml files, the HTML content gets stored in a format that
     # looks like the following, resulting in extraneous "=" characters in the output if
@ -101,11 +172,19 @@ def partition_email(
    #    <li>Item 1</li>=
    #    <li>Item 2<li>=
    # </ul>
-    content = "".join(content.split("=\n"))
+    list_content = split_by_paragraph(content)
    if content_source == "text/html":
        content = "".join(list_content)
        elements = partition_html(text=content)
        for element in elements:
            if isinstance(element, Text):
                element.apply(replace_mime_encodings)
    elif content_source == "text/plain":
        elements = partition_text(text=content)
-    return elements
+    header: List[Element] = list()
    if include_headers:
        header = partition_email_header(msg)
    all_elements = header + elements
    return all_elements
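The hunk above relies on a `content_map` that is built outside the lines shown. As an assumption about its shape (content type mapped to payload text), the sketch below builds such a map with the standard library from a two-part message modelled on `example-docs/fake-email.txt`. `build_content_map` is a hypothetical helper, not a function from the PR.

```python
import email
from typing import Dict

RAW_EMAIL = """\
MIME-Version: 1.0
Subject: Test Email
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="UTF-8"

This is a test email to use for unit tests.

Important points:

   - Roses are red
   - Violets are blue

--BOUNDARY
Content-Type: text/html; charset="UTF-8"

<div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div>

--BOUNDARY--
"""


def build_content_map(raw: str) -> Dict[str, str]:
    """Map each leaf part's content type to its payload text."""
    msg = email.message_from_string(raw)
    content_map: Dict[str, str] = {}
    for part in msg.walk():
        # Skip the multipart container itself; only leaf parts carry content.
        if part.is_multipart():
            continue
        content_map[part.get_content_type()] = part.get_payload()
    return content_map


content_map = build_content_map(RAW_EMAIL)
plain = content_map.get("text/plain", "")
```

With a map of this shape, `content_map.get(content_source, "")` returns an empty string for a missing part, which is what triggers the "content not found in email" error.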
--- a/unstructured/partition/text.py
+++ b/unstructured/partition/text.py
@ -0,0 +1,66 @@
 import re
 from typing import IO, List, Optional
 from unstructured.documents.elements import Element, ListItem, NarrativeText, Title
 from unstructured.cleaners.core import clean_bullets
 from unstructured.nlp.patterns import PARAGRAPH_PATTERN
 from unstructured.partition.text_type import (
    is_possible_narrative_text,
    is_possible_title,
    is_bulleted_text,
 )
 def split_by_paragraph(content: str) -> List[str]:
    return re.split(PARAGRAPH_PATTERN, content)
 def partition_text(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
    text: Optional[str] = None,
 ) -> List[Element]:
     """Partitions a .txt document into its constituent elements.
    Parameters
    ----------
    filename
        A string defining the target filename path.
    file
        A file-like object using "r" mode --> open(filename, "r").
    text
        The string representation of the .txt document.
    """
    if not any([filename, file, text]):
        raise ValueError("One of filename, file, or text must be specified.")
    if filename is not None and not file and not text:
        with open(filename, "r") as f:
            file_text = f.read()
    elif file is not None and not filename and not text:
        file_text = file.read()
    elif text is not None and not filename and not file:
        file_text = str(text)
    else:
        raise ValueError("Only one of filename, file, or text can be specified.")
    file_content = split_by_paragraph(file_text)
    elements: List[Element] = list()
    for ctext in file_content:
        ctext = ctext.strip()
         if ctext == "":
             continue
        if is_bulleted_text(ctext):
            elements.append(ListItem(text=clean_bullets(ctext)))
        elif is_possible_narrative_text(ctext):
            elements.append(NarrativeText(text=ctext))
        elif is_possible_title(ctext):
            elements.append(Title(text=ctext))
    return elements
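To see what `split_by_paragraph` hands to the loop in `partition_text`, the standalone sketch below applies a copy of `PARAGRAPH_PATTERN` to two inputs. It shows that empty chunks can appear in the middle of the result, not only at the end, so the way the loop treats an empty chunk decides whether later paragraphs are reached. `non_empty_chunks` is a hypothetical helper that skips empty chunks rather than stopping at them.

```python
import re
from typing import List

# Copy of PARAGRAPH_PATTERN from this diff.
PARAGRAPH_PATTERN = "\n\n\n|\n\n|\r\n|\r|\n"


def non_empty_chunks(content: str) -> List[str]:
    """Split on paragraph breaks, strip each chunk, and drop empty chunks."""
    chunks = (chunk.strip() for chunk in re.split(PARAGRAPH_PATTERN, content))
    return [chunk for chunk in chunks if chunk]


# Four newlines split into "\n\n\n" + "\n", leaving an empty chunk in between.
raw_chunks = re.split(PARAGRAPH_PATTERN, "First paragraph.\n\n\n\nSecond paragraph.\n")

cleaned = non_empty_chunks(
    "Important points:\n\n  - Dogs are the best\n  - I love fuzzy blankets\n"
)
```

Here `raw_chunks` contains an empty string between the two paragraphs and another after the trailing newline, while `cleaned` keeps only the three non-empty chunks.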