"""Test suite for `unstructured.partition.email` module."""
# pyright: reportPrivateUsage=false
from __future__ import annotations
import datetime
import email
import os
import pathlib
import tempfile
from email import policy
from email.message import EmailMessage
from typing import cast
import pytest
from pytest_mock import MockFixture
from test_unstructured.unit_utils import (
    LogCaptureFixture,
    assert_round_trips_through_JSON,
    example_doc_path,
    parse_optional_datetime,
)
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import (
    Element,
    ElementMetadata,
    Image,
    ListItem,
    NarrativeText,
    Text,
    Title,
)
from unstructured.documents.email_elements import (
    MetaData,
    ReceivedInfo,
    Recipient,
    Sender,
    Subject,
)
from unstructured.partition.email import (
    _convert_to_iso_8601,
    _extract_attachment_info,
    _partition_email_header,
    partition_email,
)
from unstructured.partition.text import partition_text

EXPECTED_OUTPUT = [
    NarrativeText(text="This is a test email to use for unit tests."),
    Title(text="Important points:"),
    ListItem(text="Roses are red"),
    ListItem(text="Violets are blue"),
]

IMAGE_EXPECTED_OUTPUT = [
    NarrativeText(text="This is a test email to use for unit tests."),
    Title(text="Important points:"),
    NarrativeText(text="hello this is our logo."),
    Image(text="unstructured_logo.png"),
    ListItem(text="Roses are red"),
    ListItem(text="Violets are blue"),
]

RECEIVED_HEADER_OUTPUT = [
    ReceivedInfo(name="ABCDEFG-000.ABC.guide", text="00.0.0.00"),
    ReceivedInfo(name="ABCDEFG-000.ABC.guide", text="ba23::58b5:2236:45g2:88h2"),
    ReceivedInfo(
        name="received_datetimetz",
        text="2023-02-20 10:03:18+12:00",
        datestamp=datetime.datetime(
            2023,
            2,
            20,
            10,
            3,
            18,
            tzinfo=datetime.timezone(datetime.timedelta(seconds=43200)),
        ),
    ),
    MetaData(name="MIME-Version", text="1.0"),
    MetaData(name="Date", text="Fri, 16 Dec 2022 17:04:16 -0500"),
    Recipient(name="Hello", text="hello@unstructured.io"),
    MetaData(
        name="Message-ID",
        text="CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com",
    ),
    Subject(text="Test Email"),
    Sender(name="Matthew Robinson", text="mrobinson@unstructured.io"),
    Recipient(name="Matthew Robinson", text="mrobinson@unstructured.io"),
    Recipient(name="Fake Email", text="fake-email@unstructured.io"),
    Recipient(name="test", text="test@unstructured.io"),
    MetaData(
        name="Content-Type",
        text='multipart/alternative; boundary="00000000000095c9b205eff92630"',
    ),
]

HEADER_EXPECTED_OUTPUT = [
    MetaData(name="MIME-Version", text="1.0"),
    MetaData(name="Date", text="Fri, 16 Dec 2022 17:04:16 -0500"),
    MetaData(
        name="Message-ID",
        text="CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com",
    ),
    Subject(text="Test Email"),
    Sender(name="Matthew Robinson", text="mrobinson@unstructured.io"),
    Recipient(name="Matthew Robinson", text="mrobinson@unstructured.io"),
    MetaData(
        name="Content-Type",
        text='multipart/alternative; boundary="00000000000095c9b205eff92630"',
    ),
]

ALL_EXPECTED_OUTPUT = HEADER_EXPECTED_OUTPUT + EXPECTED_OUTPUT

ATTACH_EXPECTED_OUTPUT = [
    {"filename": "fake-attachment.txt", "payload": b"Hey this is a fake attachment!"},
]
def test_partition_email_from_filename():
    elements = partition_email(filename=example_doc_path("eml/fake-email.eml"))

    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
    for element in elements:
        assert element.metadata.filename == "fake-email.eml"


def test_partition_email_from_filename_malformed_encoding():
    elements = partition_email(filename=example_doc_path("eml/fake-email-malformed-encoding.eml"))

    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT


@pytest.mark.parametrize(
    ("filename", "expected_output"),
    [
        ("fake-email-utf-16.eml", EXPECTED_OUTPUT),
        ("fake-email-utf-16-be.eml", EXPECTED_OUTPUT),
        ("fake-email-utf-16-le.eml", EXPECTED_OUTPUT),
        ("fake-email-b64.eml", EXPECTED_OUTPUT),
        ("email-no-utf8-2008-07-16.062410.eml", None),
        ("email-no-utf8-2014-03-17.111517.eml", None),
        ("email-replace-mime-encodings-error-1.eml", None),
        ("email-replace-mime-encodings-error-2.eml", None),
        ("email-replace-mime-encodings-error-3.eml", None),
        ("email-replace-mime-encodings-error-4.eml", None),
        ("email-replace-mime-encodings-error-5.eml", None),
    ],
)
def test_partition_email_from_filename_default_encoding(
    filename: str, expected_output: list[Element] | None
):
    elements = partition_email(example_doc_path("eml/" + filename))

    assert len(elements) > 0
    # -- only some fixtures have a known expected element list --
    if expected_output:
        assert elements == expected_output
    for element in elements:
        assert element.metadata.filename == filename


def test_partition_email_from_file():
    with open(example_doc_path("eml/fake-email.eml"), "rb") as f:
        elements = partition_email(file=f)

    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
    for element in elements:
        assert element.metadata.filename is None
@pytest.mark.parametrize(
    ("filename", "expected_output"),
    [
        ("fake-email-utf-16.eml", EXPECTED_OUTPUT),
        ("fake-email-utf-16-be.eml", EXPECTED_OUTPUT),
        ("fake-email-utf-16-le.eml", EXPECTED_OUTPUT),
        ("fake-email-b64.eml", EXPECTED_OUTPUT),
        ("email-no-utf8-2008-07-16.062410.eml", None),
        ("email-no-utf8-2014-03-17.111517.eml", None),
        ("email-replace-mime-encodings-error-1.eml", None),
        ("email-replace-mime-encodings-error-2.eml", None),
        ("email-replace-mime-encodings-error-3.eml", None),
        ("email-replace-mime-encodings-error-4.eml", None),
        ("email-replace-mime-encodings-error-5.eml", None),
    ],
)
def test_partition_email_from_file_default_encoding(
    filename: str, expected_output: list[Element] | None
):
    with open(example_doc_path("eml/" + filename), "rb") as f:
        elements = partition_email(file=f)

    assert len(elements) > 0
    if expected_output:
        assert elements == expected_output
    for element in elements:
        assert element.metadata.filename is None


def test_partition_email_from_file_rb():
    with open(example_doc_path("eml/fake-email.eml"), "rb") as f:
        elements = partition_email(file=f)

    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
    for element in elements:
        assert element.metadata.filename is None


@pytest.mark.parametrize(
    ("filename", "expected_output"),
    [
        ("fake-email-utf-16.eml", EXPECTED_OUTPUT),
        ("fake-email-utf-16-be.eml", EXPECTED_OUTPUT),
        ("fake-email-utf-16-le.eml", EXPECTED_OUTPUT),
        ("email-no-utf8-2008-07-16.062410.eml", None),
        ("email-no-utf8-2014-03-17.111517.eml", None),
        ("email-replace-mime-encodings-error-1.eml", None),
        ("email-replace-mime-encodings-error-2.eml", None),
        ("email-replace-mime-encodings-error-3.eml", None),
        ("email-replace-mime-encodings-error-4.eml", None),
        ("email-replace-mime-encodings-error-5.eml", None),
    ],
)
def test_partition_email_from_file_rb_default_encoding(
    filename: str, expected_output: list[Element] | None
):
    with open(example_doc_path("eml/" + filename), "rb") as f:
        elements = partition_email(file=f)

    assert len(elements) > 0
    if expected_output:
        assert elements == expected_output
    for element in elements:
        assert element.metadata.filename is None
def test_partition_email_from_spooled_temp_file():
    filename = example_doc_path("eml/family-day.eml")
    with open(filename, "rb") as test_file:
        spooled_temp_file = tempfile.SpooledTemporaryFile()
        spooled_temp_file.write(test_file.read())
        spooled_temp_file.seek(0)
        elements = partition_email(file=spooled_temp_file)
        assert len(elements) == 9
        assert elements[3].text == "Make sure to RSVP!"


def test_partition_email_from_text_file():
    with open(example_doc_path("eml/fake-email.txt"), "rb") as f:
        elements = partition_email(file=f, content_source="text/plain")

    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
    for element in elements:
        assert element.metadata.filename is None


def test_partition_email_from_text_file_with_headers():
    with open(example_doc_path("eml/fake-email.txt"), "rb") as f:
        elements = partition_email(file=f, content_source="text/plain", include_headers=True)

    assert len(elements) > 0
    assert elements == ALL_EXPECTED_OUTPUT
    for element in elements:
        assert element.metadata.filename is None


def test_partition_email_from_text():
    with open(example_doc_path("eml/fake-email.eml")) as f:
        text = f.read()

    elements = partition_email(text=text)

    assert len(elements) > 0
    assert elements == EXPECTED_OUTPUT
    for element in elements:
        assert element.metadata.filename is None


def test_partition_email_from_text_work_with_empty_string():
    assert partition_email(text="") == []


def test_partition_email_from_filename_with_embedded_image():
    elements = partition_email(
        example_doc_path("eml/fake-email-image-embedded.eml"), content_source="text/plain"
    )

    assert len(elements) > 0
    assert elements == IMAGE_EXPECTED_OUTPUT
    for element in elements:
        assert element.metadata.filename == "fake-email-image-embedded.eml"


def test_partition_email_from_file_with_header():
    with open(example_doc_path("eml/fake-email-header.eml")) as f:
        msg = email.message_from_file(f, policy=policy.default)
    msg = cast(EmailMessage, msg)
    elements = _partition_email_header(msg)
    assert len(elements) > 0
    assert elements == RECEIVED_HEADER_OUTPUT
    assert all(element.metadata.filename is None for element in elements)
def test_extract_email_text_matches_html():
    filename = example_doc_path("eml/fake-email-attachment.eml")
    elements_from_text = partition_email(filename, content_source="text/plain")
    elements_from_html = partition_email(filename, content_source="text/html")

    assert len(elements_from_text) == len(elements_from_html)
    # NOTE(robinson) - checking each individually is necessary because the text/html returns
    # HTMLTitle, HTMLNarrativeText, etc
    for i, element in enumerate(elements_from_text):
        # -- compare against the HTML-derived element, not the text-derived list itself --
        assert element == elements_from_html[i]
        assert element.metadata.filename == "fake-email-attachment.eml"


def test_extract_base64_email_text_matches_html():
    filename = example_doc_path("eml/fake-email-b64.eml")
    elements_from_text = partition_email(filename, content_source="text/plain")
    elements_from_html = partition_email(filename, content_source="text/html")

    assert len(elements_from_text) == len(elements_from_html)
    for i, element in enumerate(elements_from_text):
        # -- compare against the HTML-derived element, not the text-derived list itself --
        assert element == elements_from_html[i]
        assert element.metadata.filename == "fake-email-b64.eml"
def test_partition_email_processes_fake_email_with_header():
    elements = partition_email(example_doc_path("eml/fake-email-header.eml"))

    assert len(elements) > 0
    assert all(element.metadata.filename == "fake-email-header.eml" for element in elements)
    assert all(
        element.metadata.bcc_recipient == ["Hello <hello@unstructured.io>"] for element in elements
    )
    assert all(
        element.metadata.cc_recipient
        == ["Fake Email <fake-email@unstructured.io>", "test@unstructured.io"]
        for element in elements
    )
    assert all(element.metadata.email_message_id is not None for element in elements)


@pytest.mark.parametrize(
    ("time", "expected"),
    [
        ("Thu, 4 May 2023 02:32:49 +0000", "2023-05-04T02:32:49+00:00"),
        ("Thu, 4 May 2023 02:32:49 +0000", "2023-05-04T02:32:49+00:00"),
        ("Thu, 4 May 2023 02:32:49 +0000 (UTC)", "2023-05-04T02:32:49+00:00"),
        ("Thursday 5/3/2023 02:32:49", None),
    ],
)
def test_convert_to_iso_8601(time: str, expected: str | None):
    iso_time = _convert_to_iso_8601(time)
    assert iso_time == expected
def test_partition_email_still_works_with_no_content(caplog: LogCaptureFixture):
    elements = partition_email(example_doc_path("eml/email-no-html-content-1.eml"))

    assert len(elements) == 1
    assert elements[0].text.startswith("Hey there")
    assert "text/html was not found. Falling back to text/plain" in caplog.text


def test_partition_email_with_json():
    elements = partition_email(example_doc_path("eml/fake-email.eml"))
    assert_round_trips_through_JSON(elements)


def test_partition_email_with_pgp_encrypted_message(caplog: LogCaptureFixture):
    elements = partition_email(example_doc_path("eml/fake-encrypted.eml"))

    assert elements == []
    assert "WARNING" in caplog.text
    assert "Encrypted email detected" in caplog.text


def test_partition_email_inline_content_disposition():
    elements = partition_email(
        example_doc_path("eml/email-inline-content-disposition.eml"),
        process_attachments=True,
        attachment_partitioner=partition_text,
    )

    assert isinstance(elements[0], Text)
    assert isinstance(elements[1], Text)


def test_add_chunking_strategy_on_partition_email():
    chunk_elements = partition_email(
        example_doc_path("eml/fake-email.txt"), chunking_strategy="by_title"
    )
    elements = partition_email(example_doc_path("eml/fake-email.txt"))
    chunks = chunk_by_title(elements)

    assert chunk_elements != elements
    assert chunk_elements == chunks


# -- raise error behaviors -----------------------------------------------------------------------


def test_partition_msg_raises_with_no_partitioner():
    with pytest.raises(ValueError):
        partition_email(example_doc_path("eml/fake-email-attachment.eml"), process_attachments=True)


def test_partition_email_raises_with_none_specified():
    with pytest.raises(ValueError):
        partition_email()


def test_partition_email_raises_with_too_many_specified():
    with open(example_doc_path("eml/fake-email.eml")) as f:
        text = f.read()

    with pytest.raises(ValueError):
        partition_email(example_doc_path("eml/fake-email.eml"), text=text)


def test_partition_email_raises_with_invalid_content_type():
    with pytest.raises(ValueError):
        partition_email(example_doc_path("eml/fake-email.eml"), content_source="application/json")
# -- metadata behaviors --------------------------------------------------------------------------


def test_partition_email_from_filename_with_metadata_filename():
    elements = partition_email(example_doc_path("eml/fake-email.eml"), metadata_filename="test")

    assert len(elements) > 0
    assert all(element.metadata.filename == "test" for element in elements)


def test_partition_email_from_filename_has_metadata():
    elements = partition_email(example_doc_path("eml/fake-email.eml"))
    parent_id = elements[0].metadata.parent_id

    assert len(elements) > 0
    assert (
        elements[0].metadata.to_dict()
        == ElementMetadata(
            coordinates=None,
            filename=example_doc_path("eml/fake-email.eml"),
            last_modified="2022-12-16T17:04:16-05:00",
            page_number=None,
            url=None,
            sent_from=["Matthew Robinson <mrobinson@unstructured.io>"],
            sent_to=["NotMatthew <NotMatthew@notunstructured.com>"],
            subject="Test Email",
            filetype="message/rfc822",
            parent_id=parent_id,
            languages=["eng"],
            email_message_id="CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com",
        ).to_dict()
    )
    expected_dt = datetime.datetime.fromisoformat("2022-12-16T17:04:16-05:00")
    assert parse_optional_datetime(elements[0].metadata.last_modified) == expected_dt
    for element in elements:
        assert element.metadata.filename == "fake-email.eml"
# -- .metadata.last_modified ---------------------------------------------------------------------


def test_partition_email_metadata_date_from_header(mocker: MockFixture):
    mocker.patch("unstructured.partition.email.get_last_modified_date", return_value=None)

    elements = partition_email(example_doc_path("eml/fake-email-attachment.eml"))

    assert elements[0].metadata.last_modified == "2022-12-23T12:08:48-06:00"


def test_partition_email_from_file_custom_metadata_date():
    with open(example_doc_path("eml/fake-email-attachment.eml"), "rb") as f:
        elements = partition_email(file=f, metadata_last_modified="2020-07-05T09:24:28")

    assert elements[0].metadata.last_modified == "2020-07-05T09:24:28"


def test_partition_email_custom_metadata_date():
    elements = partition_email(
        example_doc_path("eml/fake-email-attachment.eml"),
        metadata_last_modified="2020-07-05T09:24:28",
    )

    assert elements[0].metadata.last_modified == "2020-07-05T09:24:28"
# ------------------------------------------------------------------------------------------------


def test_partition_eml_add_signature_to_metadata():
    elements = partition_email(example_doc_path("eml/signed-doc.p7s"))

    assert len(elements) == 1
    assert elements[0].text == "This is a test"
    assert elements[0].metadata.signature == "<SIGNATURE>\n"


# -- attachment behaviors ------------------------------------------------------------------------


def test_extract_attachment_info():
    with open(example_doc_path("eml/fake-email-attachment.eml")) as f:
        msg = email.message_from_file(f, policy=policy.default)
    msg = cast(EmailMessage, msg)
    attachment_info = _extract_attachment_info(msg)

    assert len(attachment_info) > 0
    assert attachment_info == ATTACH_EXPECTED_OUTPUT


def test_partition_email_odd_attachment_filename():
    elements = partition_email(
        example_doc_path("eml/email-equals-attachment-filename.eml"),
        process_attachments=True,
        attachment_partitioner=partition_text,
    )

    assert elements[1].metadata.filename == "odd=file=name.txt"


def test_partition_email_can_process_attachments(tmp_path: pathlib.Path):
    output_dir = tmp_path / "output"
    output_dir.mkdir()
    filename = example_doc_path("eml/fake-email-attachment.eml")
    with open(filename) as f:
        msg = email.message_from_file(f, policy=policy.default)
    msg = cast(EmailMessage, msg)
    _extract_attachment_info(msg, output_dir=str(output_dir))
    attachment_filename = os.path.join(
        output_dir,
        str(ATTACH_EXPECTED_OUTPUT[0]["filename"]),
    )
    mocked_last_modification_date = "0000-00-05T09:24:28"
    attachment_elements = partition_text(
        filename=attachment_filename,
        metadata_filename=attachment_filename,
        metadata_last_modified=mocked_last_modification_date,
    )
    expected_metadata = attachment_elements[0].metadata
    expected_metadata.file_directory = None
    expected_metadata.attached_to_filename = filename

    elements = partition_email(
        filename=filename,
        attachment_partitioner=partition_text,
        process_attachments=True,
        metadata_last_modified=mocked_last_modification_date,
    )

    # This test does not need to validate if hierarchy is working
    # Patch to nullify parent_id
    expected_metadata.parent_id = None
    elements[-1].metadata.parent_id = None

    assert [a.name for a in os.scandir(output_dir) if a.is_file()] == ["fake-attachment.txt"]
    assert elements[0].text.startswith("Hello!")
    for element in elements[:-1]:
        assert element.metadata.filename == "fake-email-attachment.eml"
        assert element.metadata.subject == "Fake email with attachment"
    assert elements[-1].text == "Hey this is a fake attachment!"
    assert elements[-1].metadata == expected_metadata


# -- language behaviors --------------------------------------------------------------------------


def test_partition_email_element_metadata_has_languages():
    elements = partition_email(example_doc_path("eml/fake-email.eml"))
    assert elements[0].metadata.languages == ["eng"]


def test_partition_email_respects_languages_arg():
    elements = partition_email(example_doc_path("eml/fake-email.eml"), languages=["deu"])
    assert all(element.metadata.languages == ["deu"] for element in elements)


def test_partition_eml_respects_detect_language_per_element():
    elements = partition_email(
        example_doc_path("language-docs/eng_spa_mult.eml"),
        detect_language_per_element=True,
    )
    # languages other than English and Spanish are detected by this partitioner,
    # so this test is slightly different from the other partition tests
    langs = {e.metadata.languages[0] for e in elements if e.metadata.languages is not None}
    assert "eng" in langs
    assert "spa" in langs