unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2026-01-09 05:41:29 +00:00

feat: add partition_xml for XML files (#596 )

* first pass on partition_xml

* add option to keep xml tags

* added tests for xml

* fix filename

* update filenames

* remove outdated readme

* add xml to auto

* version and changelog

* update readme and docs

* pass through include_metadata

* update include_metadata description

* add README back in

* linting, linting, linting

* more linting

* spooled to bytes doesnt need to be a tuple

* Add tests for newly supported filetypes

* Correct metadata filetype

* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>

2023-05-18 15:40:12 +00:00

unsupported

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

chevron-page.pdf

fix: group together text from the same bounding box in partition_pdf with fast strategy (#542 )

2023-05-03 18:33:24 -04:00

copy-protected.pdf

enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 )

2023-04-21 21:35:43 +00:00

email-with-image.eml

feat: Add Image element and find_embedded_image function (#130 )

2023-01-09 19:49:19 -06:00

english-and-korean.png

enhancement: add ocr_only strategy for partition_image (#540 )

2023-05-04 20:23:51 +00:00

example-10k.html

Initial Release

2022-09-26 14:55:20 -07:00

example-with-scripts.html

fix: Remove JavaScript from HTML reader output (#313 )

2023-02-28 14:24:24 -08:00

example.jpg

feat: extract metadata from .docx, .xlsx, and .jpg (#113 )

2022-12-26 09:34:36 -05:00

factbook.xml

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

fake_table.docx

feat: Read docx tables (#572 )

2023-05-11 18:31:38 +00:00

fake-doc.rtf

feat: add partition_rtf for rich text files (#466 )

2023-04-10 21:25:03 +00:00

fake-email-attachment.eml

feat: Add extract_attachment_info (#112 )

2023-01-03 11:41:54 -06:00

fake-email-attachment.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email-header.eml

chore: Fix parse received data (#143 )

2023-01-17 16:36:44 -06:00

fake-email-image-embedded.eml

feat: Add Image element and find_embedded_image function (#130 )

2023-01-09 19:49:19 -06:00

fake-email-multiple-attachments.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email.eml

feat: add partition_email cleaning brick (#104 )

2022-12-19 18:02:44 +00:00

fake-email.msg

feat: add partition_msg for MSFT Outlook files (#412 )

2023-03-28 20:15:22 +00:00

fake-email.txt

feat: Add new functionality to parse text and header of emails (#111 )

2023-01-09 17:08:08 +00:00

fake-html.html

feat: generic partition brick with filetype detection (#132 )

2023-01-09 16:15:14 -05:00

fake-memo.pdf

feat: Create spacy notebook example (#593 )

2023-05-17 15:42:15 -05:00

fake-power-point.ppt

feat: add partition_ppt for older power point docs (#238 )

2023-02-17 16:57:08 +00:00

fake-power-point.pptx

feat: basic PowerPoint parsing in partition_pptx (#166 )

2023-01-23 17:03:09 +00:00

fake-text-utf-16-be.txt

Adding optional encoding arg, and text_partition tests (#339 )

2023-03-06 15:07:33 -08:00

fake-text.txt

fix: cleanup from live .docx tests (#177 )

2023-01-26 15:52:25 +00:00

fake.doc

feat: add partition_doc for .doc files (#236 )

2023-02-17 09:30:23 -05:00

fake.docx

feat: extract metadata from .docx, .xlsx, and .jpg (#113 )

2022-12-26 09:34:36 -05:00

fake.odt

feat: add partition_odt for open office docs (#548 )

2023-05-04 19:28:08 +00:00

ideas-page.html

fix: ensure all text is maintained in html output (#335 )

2023-03-02 14:03:13 -05:00

layout-parser-paper-fast.jpg

docs: add bricks training notebook (#211 )

2023-02-10 14:39:14 +00:00

layout-parser-paper-fast.pdf

feat: new partitioning brick that calls the document image analysis API (#68 )

2022-11-16 17:48:30 +01:00

layout-parser-paper.pdf

Initial Release

2022-09-26 14:55:20 -07:00

README.md

Update README.md (#435 )

2023-04-02 09:52:14 -07:00

spring-weather.html.json

fix: update detect_filetype for JSONs with text/plain MIME type (#520 )

2023-04-26 13:52:47 -04:00

stanley-cups.xlsx

feat: add partition_xlsx for MSFT Excel files (#594 )

2023-05-16 19:40:40 +00:00

winter-sports.epub

feat: add partition_epub function (#364 )

2023-03-14 15:52:21 +00:00

README.md

Example Docs

The sample docs directory includes the following files, among others:

example-10k.html - A 10-K SEC filing in HTML format
layout-parser-paper.pdf - A PDF copy of the layout parser paper
factbook.xml/factbook.xsl - Example XML/XSL files that you can use to test stylesheets

Use these documents to test the parsers in the library. The instructions below explain how to pull in sample docs that are too big to store in the repo.

XBRL 10-K

You can get an example 10-K in inline XBRL format using the following curl command. You must set a user agent in the header, or the SEC site will reject your request. Replace ${organization} and ${email} with your own values.

curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
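
For scripted downloads, the same request can be sketched with Python's standard library. This is a minimal sketch, not part of the repo: `build_sec_request` and `download_filing` are hypothetical helper names, and the only facts taken from the README are the URL and the requirement that the user agent carry an organization and an email.

```python
import shutil
import urllib.request

# URL taken from the curl command in this README.
FILING_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_sec_request(
    organization: str, email: str, url: str = FILING_URL
) -> urllib.request.Request:
    """Build a GET request whose User-Agent mirrors `curl -A '<organization> <email>'`.

    Hypothetical helper: no network traffic happens here.
    """
    if not organization or not email:
        raise ValueError("organization and email must both be non-empty")
    return urllib.request.Request(
        url, headers={"User-Agent": f"{organization} {email}"}
    )


def download_filing(
    organization: str, email: str, dest: str = "0001171843-21-001344.txt"
) -> str:
    """Send the request and stream the response body to `dest` (network I/O)."""
    request = build_sec_request(organization, email)
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download_filing("Example Corp", "dev@example.com")` would save the filing under the same file name that `curl -O` produces. The organization and email shown are placeholders.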

You can parse this document using the HTML parser.
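
A minimal sketch of that step, assuming the `unstructured` package is installed and that the file was saved under the name `curl -O` gives it. `partition_html` is the library's HTML partitioning function. `parse_filing` and `count_element_types` are hypothetical helpers added here for illustration.

```python
from collections import Counter
from typing import Iterable


def count_element_types(elements: Iterable[object]) -> Counter:
    """Tally elements by class name (for example Title or NarrativeText)."""
    return Counter(type(element).__name__ for element in elements)


def parse_filing(path: str) -> list:
    """Partition the downloaded filing with the library's HTML parser."""
    # Imported lazily so count_element_types stays usable without the package.
    from unstructured.partition.html import partition_html

    return partition_html(filename=path)
```

For example, `count_element_types(parse_filing("0001171843-21-001344.txt"))` gives a quick overview of which element types the parser found in the filing.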