unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-05 08:02:48 +00:00

Author	SHA1	Message	Date
Newel H	cd704e873b	Feat: Create a naive hierarchy for elements (#1268 ) ## Summary By adding hierarchy to unstructured elements, users will have more information for implementing vector db/LLM chunking strategies. For example, text elements could be queried by their preceding title element. The hierarchy is implemented by a parent_id tag in the element's metadata. ### Features - Introduces a parent_id to ElementMetadata (The id of the parent element, not a pointer) - Creates a rule set for assigning hierarchies. Sensible default is assigned, with an optional override parameter - Sets element parent ids if there isn't an existing parent id or matches the ruleset ### How it works Hierarchies are assigned via a parent id field in element metadata. Elements are read sequentially and evaluated against a ruleset. For example take the following elements: 1. Title, "This is the Title" 2. Text, "this is the text" And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of 2 will be the id of 1. The algorithm for determining this is more complex and resolves several edge cases, so please read the code for further details. ### Schema Changes ``` @dataclass class ElementMetadata: coordinates: Optional[CoordinatesMetadata] = None data_source: Optional[DataSourceMetadata] = None filename: Optional[str] = None file_directory: Optional[str] = None last_modified: Optional[str] = None filetype: Optional[str] = None attached_to_filename: Optional[str] = None + parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None + category_depth: Optional[int] = None ... 
``` ### Testing ``` from unstructured.partition.auto import partition from typing import List elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto") for element in elements: print( f"Category: {getattr(element, 'category', '')}\n"\ f"Text: {getattr(element, 'text', '')}\n" f"ID: {element.id}\n" \ f"Parent ID: {element.metadata.parent_id}\n"\ f"Depth: {element.metadata.category_depth}\n" \ ) ``` ### Additional Notes Implementing this feature revealed a possibly undesired side-effect in how element metadata are processed. In `unstructured/partition/common.py` the `_add_element_metadata` is invoked as part of the `add_metadata_with_filetype` decorator for filetype partitioning. This method is intended to add additional information to the metadata generated with the element including filename and filetype, however the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers. This likely warrants a new issue. I'm guessing that the implementation is done this way to avoid issues with deserializing elements, but could be wrong. --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-09-14 11:23:16 -04:00
John	c58b261feb	chunk_by_title decorator (#1304 ) ### Summary Partial solution to #1185. Related to #1222. Creates decorator from `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The chunking function does not split individual elements, so a section can exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element. Combines sections under these conditions: - Sections under `combine_under_n_chars` characters are combined. The default is 500. ### Testing from unstructured.partition.html import partition_html url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" chunks = partition_html(url=url, chunking_strategy="by_title") for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input()	2023-09-11 21:00:14 +00:00
Klaijan	675a10ea69	fix: update test_json to not use auto partition (#1187 ) Update `test_json` to not use auto partition due to dependencies. Previously, running `test_json` required a full installation of the libraries needed to read other file types, including but not limited to docx and pptx, so the test raised an error with the base installation. This fix also adds checks to other test files to verify their invariance under `elements_to_json`.	2023-08-29 16:59:26 -04:00
Matt Robinson	07f76275f1	feat: detect PGP encrypted content in `partition_email` and `partition_msg` (#1205 ) ### Summary Closes #1018. Enables `partition_email` and `partition_msg` to detect if an email has PGP encrypted content. Based on the specification in [RFC 2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based on the example email in the spec. If PGP encrypted content is detected, a warning is emitted and an empty list is returned. ### Testing ```python from unstructured.partition.email import partition_email filename = "example-docs/eml/fake-encrypted.eml" partition_email(filename=filename) ``` ```python from unstructured.partition.msg import partition_msg filename = "example-docs/fake-encrypted.msg" partition_msg(filename=filename) ```	2023-08-25 17:09:25 -07:00
John	69edffb0c0	bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134 ) ### Summary Closes #1027 The msg test in question was no longer failing after removing the quick-fix and comment explaining the issue. However, the test was not functioning as intended. Test was refactored to appropriately test `metadata_last_modified` of attachments. `partition_msg` was then updated to pass `metadata_last_modified` to `attachment_partitioner`. The same was done for email partitioning. ### Testing ``` from unstructured.partition.text import partition_text from unstructured.partition.msg import partition_msg from unstructured.partition.email import partition_email filename="example-docs/fake-email-attachment.msg" elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") # previously, these were different values because last_modified wasn't being updated in attachments elements[1].metadata.last_modified elements[-1].text elements[-1].metadata.last_modified email_filename="example-docs/eml/fake-email-attachment.eml" email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") email_elements[1].metadata.last_modified email_elements[-1].text email_elements[-1].metadata.last_modified ```	2023-08-18 23:21:11 +00:00
Mike Lay	79a1eb8683	Handle inline and lacking filename (#1109 ) Handle Content-Disposition: inline and attachment without filename * Add new email test example and test with Content-Disposition: inline. * Move attachment_info above for loop so it is always defined * Check if item is inline as well as attachment as these both lack an = character to split on * Create filename if filename is not specified and write file. * Update list_attachments with new filename	2023-08-14 18:38:53 +00:00
Mike Lay	2e0ab86c6a	Fix attachments with `=` in filename (#1110 ) Fix attachments with = in filename * Limit split to first match of = to prevent creating a list of more than two parts * Add example email with attachment name and test for issue	2023-08-13 20:35:18 -07:00
cragwolfe	02af625b93	chore: fix fickle test to not be so time sensitive (#1105 )	2023-08-13 10:58:46 -07:00
cragwolfe	13d3559fa4	chore: rename Element's "date" field to "last_modified" (#997 ) Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.	2023-08-01 02:55:43 +00:00
Matt Robinson	d9aed66b65	feat: add document date for remaining file types (#930 ) (#969 ) * feat: add document date for remaining file types (#930) * feat: add functions for getting modification date * feat: add date field to metadata from csv file * feat: add tests for csv partition * feat: add date field to metadata from html file * feat: add tests for html partition * fix: return file name only if possible * feat: add csv tests * fix: renaming * feat: add field metadata_date as date of last mod * feat: add tests for partition_docx * feat: add field metadata_date to .doc file * feat: add tests for partition_doc * feat: add metadata_date to .epub file * feat: add tests for partition_epub * fix: fix test mocking * feat: add metadata_date for image partition * feat: add test for image partition * feat: add coordinate system argument * feat: add date to element metadata * feat: add metadata_date for JSON partition * feat: add test for JSON partition * fix: rename variable * feat: add metadata_date for md partition * feat: add test for md partition * feat: update doc string * feat: add metadata_date for .odt partition * feat: update .odt string * feat: add metadata_date for .org partition * feat: add tests for .org partition * feat: add metadata_date for .pdf partition * feat: add tests for .pdf partition * feat: add metadata_date for .pptx partition * feat: add metadata_date for .ppt partition * feat: add tests for .ppt partition * feat: add tests for .pptx partition * feat: add metadata_date for .rst partition * feat: add tests for .rst partition * fix: get modification date after file checking * feat: add tests for .rtf partition * feat: add tests for .rtf partition * feat: add metadata_date for .txt partition * fix: rename argument * feat: add tests for .txt partition * feat: update doc string rst partition function * feat: add metadata_date for .tsv partition * feat: add tests for .tsv partition * feat: add metadata_date for .xlsx partition * feat: add tests for .xlsx partition * fix: clean up * feat: add tests for .xml partition * feat: add tests for .xml partition * fix: use `or ` instead of `if` * fix: fix epub tests * fix: remove not used code * fix: add try block for getting file name * fix: applying linter changes * fix: fix test_partition_file * feat: add metadata_date for email * feat: add test for email partition * feat: add metadata_date for msg * feat: add tests for msg partition * feat: update CHANGELOG file * fix: update partitions doc string * don't push * fix: clean up code * linting, linting, linting * remove unnecessary example doc * update version and changelog * ingest-test-fixtures-update * set metadata date in test --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> * ingest-test-fixtures-update * Update ingest test fixtures (#970) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * Revert "Update ingest test fixtures (#970)" This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2. * remove date from metadata in outputs * update docstring ordering * remove print * remove print * remove print * linting, linting, linting * fix version and test * fix changelog * fix changelog * update version --------- Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-07-26 15:10:14 -04:00
David Potter	f7e46af22f	feat: adds Outlook connector (#939 ) * bonus: fixes issue with email partitioning where From field was being assigned the To field value.	2023-07-26 04:09:26 +00:00
John	676c50a6ec	feat: add min_partition kwarg that combines elements below a specified threshold (#926 ) * add min_partition * functioning _split_content_to_fit_min_max * create test and make tidy/check * fix rebase issues * fix type hinting, remove unused code, add tests * various changes and refactoring of methods * add test, refactor, change var names for debugging purposes * update test * make tidy/check * give more descriptive var names and add comments * update xml partition via partition_text and create test * fix <pre> bug for test_partition_html_with_pre_tag * make tidy * refactor and fix tests * make tidy/check * ingest-test-fixtures-update * change list comprehension to for loop * fix error check	2023-07-24 15:57:24 +00:00
Matt Robinson	52aced8677	fix: validate encodings from email headers (#881 ) * add validate encoding function * remove extraneous file * added test case for malformed encoding * version and changelog	2023-07-06 13:49:27 +00:00
John	dc6d7d7268	feat: add metadata_filename parameter across all partition functions (#811 ) * fix conflicts * add tests and clean metadata_filename in partitions * fix test_email and remove comments * make tidy/check * update changelog and version * fix tests * make tidy again	2023-07-05 16:02:22 -04:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827 )	2023-07-05 11:25:11 -07:00
John	e9fdbb0943	feat: add include_metadata across all partition functions (#853 ) * add include_metadata kwarg and tests to parsers add exclude_metadata to docx add test for doc to exclude metadata add include_metadata kwarg to email add include_metadata kwarg to epub add include_metadata kwarg to json add exclude_metadata tests to md add include_metadata kwarg and tests for msg parse add include_metadata kwarg and tests for odt parse add include_metadata kwarg and tests for org parse add include_metadata kwarg and tests for ppt and pptx parse add include_metadata kwarg and tests for rst parse add include_metadata kwarg and tests for rtf parse add include_metadata tests for text parse add include_metadata tests for tsv parse add include_metadata tests for xlsx parse add include_metadata tests for xml parse * WIP add include_metadata to partition_pdf * add include_metadata tests to partition_pdf * make tidy/check * update changelog and version * change test asserts and move docstring logic to process_metadata * make tidy * fix tests asserts * linting, linting, linting * sync versions * skip api call test not on main --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-06-30 10:44:46 -04:00
Matt Robinson	c581a33c8a	feat: attachment processing for emails (#855 ) * process attachments for email * add attachment processing to msg * fix up metadata for attachments * add test for processing email attachments * added test for processing msg attachments * update docs * tests for error conditions * version and changelog	2023-06-29 18:01:12 -04:00
Matt Robinson	901ef16835	fix: allow `partition_email` to process emails with no content (#797 ) * version and changelog * ingest-test-fixtures-update	2023-06-22 12:52:27 -04:00
Christine Straub	743482b6d3	Bug/635 unicode decode error eml (#739 ) * Adds functionality to extract charset info from eml files * Adds missed file-like object handling in detect_file_encoding * Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs * Organize the eml example files in the example-docs/eml directory	2023-06-17 00:52:13 +00:00
Matt Robinson	4ea716837d	feat: add ability to extract extra metadata with regex (#763 ) * first pass on regex metadata * fix typing for regex metadata * add dataclass back in * add decorators * fix tests * update docs * add tests for regex metadata * add process metadata to tsv * changelog and version * docs typos * consolidate to using a single kwarg * fix test	2023-06-16 10:10:56 -04:00
Matt Robinson	a800967478	enhancements: add page numbers for word docs when available (#750 ) * add support for page numbers in docx when present * version and changelog * add comment on page numbers * add header and footer to doc elements list * update integrations docs * include_page_breaks kwarg for doc and docx * merge element metadata for pagebreaks * fix typo * fix changelog typo * change page number default to None * add initial_page_number kwarg * make page number tests in pdf more explicit * revert test file * update ingest tests * update test fixture outputs * updates to IRS forms fixtures * ingest-test-fixtures-update * Update ingest test fixtures (#759) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-15 12:21:17 -04:00
Christine Straub	5b5fb3e13b	Issue/encoding error eml (#639 ) This PR adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-30 10:24:02 -07:00
qued	55272eeceb	enhancement: filetype in metadata (#583 ) Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata. Tests are added to make sure: * When partition is used, any content type or auto file type detection will override file-specific partition function metadata * Both auto and file-specific partitioning gives the desired filetype metadata Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .	2023-05-15 13:23:19 -05:00
Matt Robinson	8da1ddc6ec	enhancement: add method for getting datetime; cleanup filename attribute (#575 ) * added method for extracting datetime * change filename metadata to the base filename * fix filename metadata for msg * changelog and bump version * fix expected structured output * newline back in file * reset outpout file * update filename output * update test fixtures * update fixture	2023-05-12 11:33:01 -04:00
Matt Robinson	38f7b652de	fix: add handling for non-standard rfc-2822 formats (#564 ) * fix: add handling for non-standard rfc-2822 formats * version and changelog * linting, linting, linting	2023-05-11 14:36:25 +00:00
Matt Robinson	5ae895051a	feat: add sender and receive info to element metadata for emails (#439 ) * add header metadata for .eml messages * sent to and from are lists * add metadata for outlook emails * version and changelog	2023-04-04 14:23:41 -04:00
Matt Robinson	09b52b4fc4	fix: text kwargs no longer fail with empty string (#413 ) * fix: text kwargs no longer fail with empty string * linting	2023-03-28 21:03:51 +00:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Mallori Harrell	08ccee0acb	chore: Fix parse received data (#143 ) * fix parse_received data	2023-01-17 16:36:44 -06:00
Matt Robinson	9c3c14e94d	fix: resolves `UnicodeDecodeError` in `partition_email` for emails with attachments (#158 ) * split emails by \n= * added test for equivalence between html and plain text * changelog and bump version * add check for content disposition	2023-01-17 11:33:45 -05:00
Mallori Harrell	e0feba83f6	feat: Add Image element and `find_embedded_image` function (#130 ) * add find_embedded_image	2023-01-09 19:49:19 -06:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for email * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Mallori Harrell	d7a00046a9	feat: Add new functionality to parse text and header of emails (#111 ) * partition_text function	2023-01-09 17:08:08 +00:00
Mallori Harrell	509ad4951c	feat: Add `extract_attachment_info` (#112 ) * Adds function to extract attachments and their metadata from eml files	2023-01-03 11:41:54 -06:00
Matt Robinson	7a74cdda86	feat: add `partition_email` cleaning brick (#104 ) * fix for processing deeply embedded list elements * fix types in mime encodings cleaner * first pass on partition_email * tests for email * test for mime encodings * changelog bump * added note about \n= * linting, linting, linting * added email docs * add partition_email to the readme * add one more test	2022-12-19 18:02:44 +00:00
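
### Sketch: naive parent_id assignment (#1268)

The sequential evaluation described in #1268 can be illustrated in plain Python. This is a minimal sketch under simplifying assumptions, not the library's implementation: `Element`, `assign_parent_ids`, and the lowercase category strings are hypothetical names chosen for illustration, and the real algorithm resolves edge cases that this version ignores.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for an element; not the library's class."""

    id: str
    category: str
    text: str
    parent_id: Optional[str] = None


def assign_parent_ids(
    elements: List[Element],
    ruleset: Dict[str, List[str]],
) -> List[Element]:
    """Read elements in order and link each one to the nearest preceding
    element whose category lists it as an allowed child in the ruleset."""
    seen: List[Element] = []
    for element in elements:
        # Leave an existing parent id untouched.
        if element.parent_id is None:
            for candidate in reversed(seen):
                if element.category in ruleset.get(candidate.category, []):
                    element.parent_id = candidate.id
                    break
        seen.append(element)
    return elements


elements = [
    Element(id="1", category="title", text="This is the Title"),
    Element(id="2", category="text", text="this is the text"),
]
assign_parent_ids(elements, ruleset={"title": ["text"]})
print(elements[1].parent_id)  # prints 1
```

With the ruleset `{"title": ["text"]}` from the commit message, the text element receives the id of the title that precedes it, and the title itself keeps `parent_id=None`.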

36 Commits
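
### Sketch: splitting on the first `=` only (#1110)

The fix in #1110 limits the split to the first `=` so that an attachment name containing `=` still yields exactly two parts. The snippet below is a minimal illustration of that idea; `split_attachment_param` is a hypothetical helper, not the function used in the library.

```python
from typing import Tuple


def split_attachment_param(item: str) -> Tuple[str, str]:
    """Split a 'key=value' header parameter on the first '=' only."""
    # maxsplit=1 keeps any later '=' characters inside the value.
    key, value = item.split("=", 1)
    return key.strip(), value.strip().strip('"')


param = 'filename="report=final.txt"'
print(len(param.split("=")))          # prints 3: an unlimited split breaks the name apart
print(split_attachment_param(param))  # prints ('filename', 'report=final.txt')
```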