Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and file is provided via `file` parameter, partition
will attempt to infer last modified date from `file`'s contents
otherwise last modified metadata will be set to `None`.
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
### Summary
Closes#2489, which reported an inability to process `.p7s` files. PR
implements two changes:
- If the user selected content type for the email is not available and
there is another valid content type available, fall back to the other
valid content type.
- For signed message, extract the signature and add it to the metadata
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/eml/signed-doc.p7s"
elements = partition(filename=filename) # should get a message about fall back logic
print(elements[0]) # "This is a test"
elements[0].metadata.to_dict() # Will see the signature
```
closes#816
## Description
Added functionality for `partition_email` to automatically decode base64
text before passing it to `partition_text` or `partition_html`.
Also adds base64 encoded email text test cases.
### Executive Summary
The structure of element metadata is currently static, meaning only
predefined fields can appear in the metadata. We would like the
flexibility for end-users, at their own discretion, to define and use
additional metadata fields that make sense for their particular
use-case.
### Concepts
A key concept for dynamic metadata is _known field_. A known-field is
one of those explicitly defined on `ElementMetadata`. Each of these has
a type and can be specified when _constructing_ a new `ElementMetadata`
instance. This is in contrast to an _end-user defined_ (or _ad-hoc_)
metadata field, one not known at "compile" time and added at the
discretion of an end-user to suit the purposes of their application.
An ad-hoc field can only be added by _assignment_ on an already
constructed instance.
### End-user ad-hoc metadata field behaviors
An ad-hoc field can be added to an `ElementMetadata` instance by
assignment:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```
A field added in this way can be accessed by name:
```python
>>> metadata.coefficient
0.536
```
and that field will appear in the JSON/dict for that instance:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```
However, accessing a "user-defined" value that has _not_ been assigned
on that instance raises `AttributeError`:
```python
>>> metadata.coeffcient # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```
This makes "tagging" a metadata item with a value very convenient, but
entails the proviso that if an end-user wants to add a metadata field to
_some_ elements and not others (sparse population), AND they want to
access that field by name on ANY element and receive `None` where it has
not been assigned, they will need to use an expression like this:
```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
```
### Implementation Notes
- **ad-hoc metadata fields** are discarded during consolidation (for
chunking) because we don't have a consolidation strategy defined for
those. We could consider using a default consolidation strategy like
`FIRST` or possibly allow a user to register a strategy (although that
gets hairy in non-private and multiple-memory-space situations.)
- ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields
that might appear in a JSON/dict loaded using `.from_dict()`, so unlike
the original (which only loaded known-fields), we'll rehydrate anything
that we find there.
- No real type-safety is possible on ad-hoc fields but the type-checker
does not complain because the type of all ad-hoc fields is `Any` (which
is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc
fields to "sub" metadata objects too, like `DataSourceMetadata` and
conceivably `CoordinatesMetadata` (although I'm not immediately seeing a
use-case for the second one).
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.
Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.
This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
### Summary
Closes#1534 and #1535
Detects document language using `langdetect` package.
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.
### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or
matches the ruleset
### How it works
Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:
1. Title, "This is the Title"
2. Text, "this is the text"
And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.
### Schema Changes
```
@dataclass
class ElementMetadata:
coordinates: Optional[CoordinatesMetadata] = None
data_source: Optional[DataSourceMetadata] = None
filename: Optional[str] = None
file_directory: Optional[str] = None
last_modified: Optional[str] = None
filetype: Optional[str] = None
attached_to_filename: Optional[str] = None
+ parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+ category_depth: Optional[int] = None
...
```
### Testing
```
from unstructured.partition.auto import partition
from typing import List
elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")
for element in elements:
print(
f"Category: {getattr(element, 'category', '')}\n"\
f"Text: {getattr(element, 'text', '')}\n"
f"ID: {element.id}\n" \
f"Parent ID: {element.metadata.parent_id}\n"\
f"Depth: {element.metadata.category_depth}\n" \
)
```
### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.
I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.
---------
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
### Summary
Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.
Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.
### Testing
from unstructured.partition.html import partition_html
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
### Summary
Closes#1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP detected content is detected, a
warning is emitted and an empty set of lists is returned.
### Testing
```python
from unstructured.partition_email import partition_email
filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```
```python
from unstructured.partition_msg import partition_msg
filename = "example-docs/fake-encrypted.msg"
partition_msgl(filename=filename)
```
### Summary
Closes#1027
The msg test in question was no longer failing after removing the
quick-fix and comment explaining the issue. However, the test was not
functioning as intended. Test was refactored to appropriately test
`metadata_last_modified` of attachments.
`partition_msg` was then updated to pass `metadata_last_modified` to
`attachment_partitioner`.
The same was done for email partitioning.
### Testing
```
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email
filename="example-docs/fake-email-attachment.msg"
elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified
elements[-1].text
elements[-1].metadata.last_modified
email_filename="example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
email_elements[1].metadata.last_modified
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
Handle Content-Disposition: inline and attachment without filename
* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
Fix attachments with = in filename
* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
* add min_partition
* functioning _split_content_to_fit_min_max
* create test and make tidy/check
* fix rebase issues
* fix type hinting, remove unused code, add tests
* various changes and refactoring of methods
* add test, refactor, change var names for debugging purposes
* update test
* make tidy/check
* give more descriptive var names and add comments
* update xml partition via partition_text and create test
* fix <pre> bug for test_partition_html_with_pre_tag
* make tidy
* refactor and fix tests
* make tidy/check
* ingest-test-fixtures-update
* change list comprehension to for loop
* fix error check
* fix conflicts
* add tests and clean metadata_filename in partitions
* fix test_email and remove comments
* make tidy/check
* update changelog and version
* fix tests
* make tidy again
* add include_metadata kwarg and tests to parsers
add exclude_metadata to docx
add test for doc to exclude metadata
add include_metadata kwarg to email
add include_metadata kwarg to epub
add include_metadata kwarg to json
add exclude_metadata tests to md
add include_metadata kwarg and tests for msg parse
add include_metadata kwarg and tests for odt parse
add include_metadata kwarg and tests for org parse
add include_metadata kwarg and tests for ppt and pptx parse
add include_metadata kwarg and tests for rst parse
add include_metadata kwarg and tests for rtf parse
add include_metadata tests for text parse
add include_metadata tests for tsv parse
add include_metadata tests for xlsx parse
add include_metadata tests for xml parse
* WIP add include_metadata to partition_pdf
* add include_metadata tests to partition_pdf
* make tidy/check
* update changelog and version
* change test asserts and move docstring logic to process_metadata
* make tidy
* fix tests asserts
* linting, linting, linting
* sync versions
* skip api call test not on main
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* process attachments for email
* add attachment processing to msg
* fix up metadata for attachments
* add test for processing email attachments
* added test for processing msg attachments
* update docs
* tests for error conditions
* version and changelog
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
* first pass on regex metadata
* fix typing for regex metadata
* add dataclass back in
* add decorators
* fix tests
* update docs
* add tests for regex metadata
* add process metadata to tsv
* changelog and version
* docs typos
* consolidate to using a single kwarg
* fix test
* add support for page numbers in docx when present
* version and changelog
* add comment on page numbers
* add header and footer to doc elements list
* update integrations docs
* include_page_breaks kwarg for doc and docx
* merge element metadata for pagebreaks
* fix typo
* fix changelog typo
* change page number default to None
* add initial_page_number kwarg
* make page number tests in pdf more explicit
* revert test file
* update ingest tests
* update test fixture outputs
* updates to IRS forms fixtures
* ingest-test-fixtures-update
* Update ingest test fixtures (#759)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
---------
Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
This PR adds functionality to try other common encodings for email (.eml) files if an error related to the
encoding is raised and the user has not specified an encoding.
Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata.
Tests are added to make sure:
* When partition is used, any content type or auto file type detection will override file-specific partition function metadata
* Both auto and file-specific partitioning gives the desired filetype metadata
Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .
* added method for extracting datetime
* change filename metadata to the base filename
* fix filename metadata for msg
* changelog and bump version
* fix expected structured output
* newline back in file
* reset outpout file
* update filename output
* update test fixtures
* update fixture
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for emal
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile
* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test