unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-24 06:10:20 +00:00


Feat: Create a naive hierarchy for elements (#1268)

## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.
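
As a rough illustration of querying text by its preceding title, the sketch below groups element text under the text of its parent. It uses `SimpleNamespace` stand-ins rather than real library elements; only the attribute names `id`, `text`, and `metadata.parent_id` come from this PR, and the helper `group_text_by_parent` is a hypothetical name.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, List


def group_text_by_parent(elements) -> Dict[str, List[str]]:
    """Map each parent element's text to the texts of its direct children."""
    by_id = {el.id: el for el in elements}
    grouped: Dict[str, List[str]] = defaultdict(list)
    for el in elements:
        # parent_id is an id, not a pointer, so resolve it through the lookup table.
        parent = by_id.get(el.metadata.parent_id)
        if parent is not None:
            grouped[parent.text].append(el.text)
    return dict(grouped)


# Stand-in elements (assumption: real elements expose the same attributes).
title = SimpleNamespace(id="t1", text="This is the Title",
                        metadata=SimpleNamespace(parent_id=None))
body = SimpleNamespace(id="x1", text="this is the text",
                       metadata=SimpleNamespace(parent_id="t1"))

print(group_text_by_parent([title, body]))
# → {'This is the Title': ['this is the text']}
```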

### Features
- Introduces a parent_id to ElementMetadata (the id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. A sensible default is
provided, with an optional override parameter
- Sets an element's parent id when the element does not already have one
and a rule in the ruleset matches

### How it works

Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example, take the following elements:

1. Title, "This is the Title"
2. Text, "this is the text"

And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.
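
The simple case above can be sketched as follows. This is a deliberately naive illustration, not the code in this PR: the `Element` dataclass and `assign_parents` function are hypothetical stand-ins, and the edge cases the real algorithm handles (such as `category_depth`) are ignored.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical, simplified stand-in for a library element."""
    category: str
    text: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    parent_id: Optional[str] = None


def assign_parents(elements: List[Element], ruleset: Dict[str, List[str]]) -> List[Element]:
    """Set parent_id to the id of the most recent element allowed to parent it."""
    last_seen: Dict[str, Tuple[int, Element]] = {}  # category -> (position, element)
    for position, element in enumerate(elements):
        category = element.category.lower()
        if element.parent_id is None:  # never overwrite an existing parent id
            candidates = [
                last_seen[parent_category]
                for parent_category, children in ruleset.items()
                if category in children and parent_category in last_seen
            ]
            if candidates:
                # Pick the candidate that appeared latest in the document.
                _, parent = max(candidates, key=lambda item: item[0])
                element.parent_id = parent.id
        last_seen[category] = (position, element)
    return elements


elements = assign_parents(
    [Element("Title", "This is the Title"), Element("Text", "this is the text")],
    {"title": ["text"]},
)
print(elements[1].parent_id == elements[0].id)  # → True
```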

### Schema Changes

```
@dataclass
class ElementMetadata:
      coordinates: Optional[CoordinatesMetadata] = None
      data_source: Optional[DataSourceMetadata] = None
      filename: Optional[str] = None
      file_directory: Optional[str] = None
      last_modified: Optional[str] = None
      filetype: Optional[str] = None
      attached_to_filename: Optional[str] = None
+     parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+     category_depth: Optional[int] = None

...
```

### Testing
```
from unstructured.partition.auto import partition

# Partition the sample HTML file; adjust the path to match your checkout.
elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")

# Print each element with its id, parent id, and depth in the hierarchy.
for element in elements:
    print(
        f"Category:  {getattr(element, 'category', '')}\n"
        f"Text:      {getattr(element, 'text', '')}\n"
        f"ID:        {element.id}\n"
        f"Parent ID: {element.metadata.parent_id}\n"
        f"Depth:     {element.metadata.category_depth}\n"
    )
```

### Additional Notes
Implementing this feature revealed a possibly undesired side effect in
how element metadata is processed. In
`unstructured/partition/common.py`, `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add information such
as filename and filetype to the metadata generated with the element.
However, the existing metadata is merged into a newly created metadata
object rather than the other way around. Because of this structure, new
metadata fields can easily be forgotten, which makes debugging harder
for developers. This likely warrants a new issue.

I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.
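
To make the concern about merge direction concrete, here is a small sketch of the general pitfall. It does not reproduce `_add_element_metadata`; the `Metadata` class and both functions are hypothetical, and the hand-written field list is an assumption about how a field could get dropped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical, trimmed-down stand-in for ElementMetadata."""
    filename: Optional[str] = None
    filetype: Optional[str] = None
    parent_id: Optional[str] = None  # the newly added field


def merge_into_new(existing: Metadata, filename: str, filetype: str) -> Metadata:
    """Copy a hand-written list of fields into a fresh object.

    Any field missing from that list (here: parent_id) is silently lost.
    """
    return Metadata(
        filename=existing.filename or filename,
        filetype=existing.filetype or filetype,
    )


def merge_into_existing(existing: Metadata, filename: str, filetype: str) -> Metadata:
    """Fill only the gaps on the existing object, so other fields survive."""
    for name, value in {"filename": filename, "filetype": filetype}.items():
        if getattr(existing, name) is None:
            setattr(existing, name, value)
    return existing


original = Metadata(parent_id="abc")
print(merge_into_new(original, "fake-html.html", "text/html").parent_id)       # → None
print(merge_into_existing(original, "fake-html.html", "text/html").parent_id)  # → abc
```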

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>

2023-09-14 11:23:16 -04:00

eml

feat: detect PGP encrypted content in partition_email and partition_msg (#1205 )

2023-08-25 17:09:25 -07:00

unsupported

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

book-war-and-peace-1p.txt

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

book-war-and-peace-1225p.txt

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

chevron-page.pdf

fix: group together text from the same bounding box in partition_pdf with fast strategy (#542 )

2023-05-03 18:33:24 -04:00

copy-protected.pdf

enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 )

2023-04-21 21:35:43 +00:00

DA-1p.pdf

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

DA-619p.pdf

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

double-column-A.jpg

chore: custom layout order example notebook (#1024 )

2023-08-02 18:29:04 -06:00

double-column-B.jpg

chore: custom layout order example notebook (#1024 )

2023-08-02 18:29:04 -06:00

emoji.xlsx

fix: extract emojis with partition_xlsx (#1009 )

2023-08-04 10:14:08 -04:00

empty.txt

enhancement: handling for empty files in detect_filetype and partition (#710 )

2023-06-09 16:07:50 -04:00

english-and-korean.png

enhancement: add ocr_only strategy for partition_image (#540 )

2023-05-04 20:23:51 +00:00

example-10k-1p.html

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

example-10k-230p.html

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

example-10k-utf-16.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-10k.html

Initial Release

2022-09-26 14:55:20 -07:00

example-list-items-multiple.docx

fix: detect list items in MS Word documents (#909 )

2023-07-10 15:29:08 +00:00

example-steelJIS-datasheet-utf-16.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-steelJIS-datasheet.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-with-scripts.html

fix: Remove JavaScript from HTML reader output (#313 )

2023-02-28 14:24:24 -08:00

example.jpg

feat: extract metadata from .docx, .xlsx, and .jpg (#113 )

2022-12-26 09:34:36 -05:00

factbook-utf-16.xml

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

factbook.xml

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

fake_table.docx

feat: Read docx tables (#572 )

2023-05-11 18:31:38 +00:00

fake-doc-emphasized-text.doc

feat: track emphasized text msword (#1048 )

2023-08-04 17:04:12 -04:00

fake-doc-emphasized-text.docx

feat: track emphasized text msword (#1048 )

2023-08-04 17:04:12 -04:00

fake-doc.rtf

Table processing test for RTF (#1388 )

2023-09-12 18:27:05 -07:00

fake-email-attachment.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email-multiple-attachments.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email.msg

feat: add partition_msg for MSFT Outlook files (#412 )

2023-03-28 20:15:22 +00:00

fake-encrypted.msg

feat: detect PGP encrypted content in partition_email and partition_msg (#1205 )

2023-08-25 17:09:25 -07:00

fake-html-cp1252.html

chore: Add encoding param to ingest (#955 )

2023-07-24 10:06:13 -07:00

fake-html-lang-de.html

fix: adjust threshold for encoding detection (#894 )

2023-07-07 09:25:03 -04:00

fake-html-pre.htm

feature(html partition): parse pre tag (#642 )

2023-06-27 18:52:39 +00:00

fake-html-with-footer-and-header.html

feat: optionally ignore header and footer tags in partition html (#1013 )

2023-08-04 21:56:33 +00:00

fake-html.html

Feat: Create a naive hierarchy for elements (#1268 )

2023-09-14 11:23:16 -04:00

fake-incomplete-json.txt

enhancement: improve json detection by detect_filetype (#971 )

2023-07-25 12:47:39 -04:00

fake-memo.pdf

feat: Create spacy notebook example (#593 )

2023-05-17 15:42:15 -05:00

fake-power-point-malformed.pptx

fix malformed pptx issue (#761 )

2023-06-15 19:52:44 +00:00

fake-power-point-many-pages.pptx

fix: metadata.page_number of pptx files (#675 )

2023-06-02 13:22:43 +00:00

fake-power-point-table.pptx

feat: table extraction for power points (#664 )

2023-05-31 18:26:32 +00:00

fake-power-point.ppt

feat: add partition_ppt for older power point docs (#238 )

2023-02-17 16:57:08 +00:00

fake-power-point.pptx

feat: basic PowerPoint parsing in partition_pptx (#166 )

2023-01-23 17:03:09 +00:00

fake-text-utf-16-be.txt

Adding optional encoding arg, and text_partition tests (#339 )

2023-03-06 15:07:33 -08:00

fake-text-utf-16-le.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text-utf-16.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text-utf-32.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text.txt

fix: cleanup from live .docx tests (#177 )

2023-01-26 15:52:25 +00:00

fake.doc

feat: add partition_doc for .doc files (#236 )

2023-02-17 09:30:23 -05:00

fake.docx

feat: extract metadata from .docx, .xlsx, and .jpg (#113 )

2022-12-26 09:34:36 -05:00

fake.odt

feat: add partition_odt for open office docs (#548 )

2023-05-04 19:28:08 +00:00

handbook-1p.docx

enhancements: add page numbers for word docs when available (#750 )

2023-06-15 12:21:17 -04:00

handbook-872p.docx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

hebrew-text-base64-iso88598i.txt

fix: format Arabic and Hebrew annotated encodings (#823 )

2023-06-27 18:15:02 -07:00

ideas-page.html

fix: ensure all text is maintained in html output (#335 )

2023-03-02 14:03:13 -05:00

layout-parser-paper-10p.jpg

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

layout-parser-paper-combined.tiff

feat: supports multipage tiff (#1131 )

2023-08-24 15:12:50 +00:00

layout-parser-paper-fast.jpg

docs: add bricks training notebook (#211 )

2023-02-10 14:39:14 +00:00

layout-parser-paper-fast.pdf

revert pdf changes and add new pdf for empty page testing (#1255 )

2023-09-01 22:33:06 +00:00

layout-parser-paper-fast.tiff

feat: supports multipage tiff (#1131 )

2023-08-24 15:12:50 +00:00

layout-parser-paper-with-empty-pages.pdf

revert pdf changes and add new pdf for empty page testing (#1255 )

2023-09-01 22:33:06 +00:00

layout-parser-paper-with-table.jpg

Chore: Pass table support param to partition image (#973 )

2023-07-27 13:33:36 -04:00

layout-parser-paper.pdf

Initial Release

2022-09-26 14:55:20 -07:00

list-item-example.pdf

fix pdf partition of list items being detected as titles in OCR only mode (#1119 )

2023-08-15 09:35:54 -07:00

loremipsum-flat.pdf

fix: better extractable check (#900 )

2023-07-07 23:41:37 -05:00

multi-column-2p.pdf

Feat/1136 elements ordering for pdf (#1161 )

2023-08-24 17:46:19 -07:00

multi-column.pdf

Feat/1136 elements ordering for pdf (#1161 )

2023-08-24 17:46:19 -07:00

norwich-city.txt

enhancement: max_partition kwarg for limiting element size (#818 )

2023-06-28 15:26:01 -04:00

pdf2image-memory-error-test-400p.pdf

fix: 521 pdf2image memory error (#924 )

2023-07-14 15:08:33 -05:00

README.md

Update README.md (#435 )

2023-04-02 09:52:14 -07:00

README.org

feat: partition_org for Org Mode documents (#780 )

2023-06-23 18:45:31 +00:00

README.rst

feat: partition_rst for ReStructured Text documents (#725 )

2023-06-12 19:31:10 +00:00

reliance.pdf

fix: enable partition_pdf to recursively grab text with fast strategy (#796 )

2023-06-22 11:19:54 -04:00

science-exploration-1p.pptx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

science-exploration-369p.pptx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

spring-weather.html.json

fix: update detect_filetype for JSONs with text/plain MIME type (#520 )

2023-04-26 13:52:47 -04:00

stanley-cups-with-emoji.csv

fix: etree parser error (#1077 )

2023-08-10 23:28:57 +00:00

stanley-cups-with-emoji.tsv

Fix/1057 etree parser error tsv (#1106 )

2023-08-14 01:22:36 +00:00

stanley-cups.csv

feat: add partition_csv function (#619 )

2023-05-19 15:57:42 -04:00

stanley-cups.tsv

feat: partition_tsv for tab separated value files (#758 )

2023-06-15 18:50:53 +00:00

stanley-cups.xlsx

feat: add partition_xlsx for MSFT Excel files (#594 )

2023-05-16 19:40:40 +00:00

tests-example.xls

feat: add xls support (#632 )

2023-05-26 01:55:32 -07:00

winter-sports.epub

feat: add partition_epub function (#364 )

2023-03-14 15:52:21 +00:00

README.md

Example Docs

The sample docs directory contains the following files:

example-10k.html - A 10-K SEC filing in HTML format
layout-parser-paper.pdf - A PDF copy of the layout parser paper
factbook.xml/factbook.xsl - Example XML/XSL files that you can use to test stylesheets

These documents can be used to test out the parsers in the library. In addition, here are instructions for pulling in some sample docs that are too big to store in the repo.

XBRL 10-K

You can get an example 10-K in inline XBRL format using the following curl. Note, you need to have the user agent set in the header or the SEC site will reject your request.

curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt

You can parse this document using the HTML parser.