unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-23 05:40:07 +00:00

Author	SHA1	Message	Date
Christine Straub	2d951722df	Feat/1332 save embedded images in pdf (#1371)
Addresses [#1332](https://github.com/Unstructured-IO/unstructured/issues/1332) with `unstructured-inference` PR [#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`
### Testing
    from unstructured.partition.pdf import partition_pdf

    f_path = "example-docs/embedded-images.pdf"

    # set `strategy` to the partitioning strategy under test before running

    # default image output directory
    elements = partition_pdf(
        f_path,
        strategy=strategy,
        extract_images_in_pdf=True,
    )

    # specific image output directory
    elements = partition_pdf(
        f_path,
        strategy=strategy,
        extract_images_in_pdf=True,
        image_output_dir_path=<directory path>,
    )
	2023-09-22 09:16:03 +00:00
Amanda Cameron	e359afafbe	fix: coordinates bug on pdf parsing (#1462)
Addresses: https://github.com/Unstructured-IO/unstructured/issues/1460
We were raising an error on invalid coordinates, which prevented us from returning the element and continuing to parse the PDF. Now, instead of raising the error, we return early.
To test:
```
from unstructured.partition.auto import partition

elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast")
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>	2023-09-19 19:25:31 -07:00
shreyanid	eb8ce89137	chore: function to map between standard and Tesseract language codes (#1421)
### Summary
In order to convert between incompatible language codes from packages used for OCR, this change adds a function to map between any standard language codes and Tesseract OCR specific codes. Users can input language information to `languages` in any Tesseract-supported langcode or any ISO 639 standard language code.
### Details
- Introduces the [python-iso639](https://pypi.org/project/python-iso639/) package for matching standard language codes. Recompiles all dependencies.
- If a language is not already supplied by the user as a Tesseract specific langcode, supplies all possible script/orthography variants of the language to the Tesseract OCR agent.
### Test
Added many unit tests for a variety of language combinations, special cases, and variants. For general testing, call partition functions with any lang codes in the `languages` parameter (Tesseract or standard). For example,
```
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"])
print("\n\n".join([str(el) for el in elements]))
```
should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract	2023-09-18 08:42:02 -07:00
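The mapping described in the commit above can be sketched roughly as follows. This is a minimal illustration and not the library's implementation: the function name `to_tesseract_codes` and the tiny lookup table are hypothetical, and they cover only the `en`/`chi` example from the message rather than the full ISO 639 range handled via `python-iso639`.

```python
from typing import Dict, List

# Hypothetical, deliberately tiny table covering only the example above.
# Keys are standard codes; values are the Tesseract variants to supply.
STANDARD_TO_TESSERACT: Dict[str, List[str]] = {
    "en": ["eng"],
    "chi": ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"],
}

# Every Tesseract-specific code known to this sketch.
TESSERACT_CODES = {
    code for variants in STANDARD_TO_TESSERACT.values() for code in variants
}


def to_tesseract_codes(languages: List[str]) -> str:
    """Return a '+'-joined Tesseract language string for the given codes."""
    codes: List[str] = []
    for lang in languages:
        lang = lang.strip().lower()
        if lang in TESSERACT_CODES:
            # Already a Tesseract-specific code: pass it through unchanged.
            variants = [lang]
        else:
            # Standard code: expand to all known variants (unknown codes are dropped).
            variants = STANDARD_TO_TESSERACT.get(lang, [])
        for code in variants:
            if code not in codes:
                codes.append(code)
    return "+".join(codes)


print(to_tesseract_codes(["en", "chi"]))
# eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert
```

A code that is already Tesseract-specific, such as `chi_sim`, is passed through without expansion, which mirrors the "not already supplied by the user as a Tesseract specific langcode" condition in the message.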
Amanda Cameron	a9f18eddb8	chore: adding test case for odt tables (#1434 ) ODT table extraction is happening! Just added to an existing example-doc and an accompanying test case.	2023-09-16 22:29:44 -07:00
Newel H	cd704e873b	Feat: Create a naive hierarchy for elements (#1268)
## Summary
By adding hierarchy to unstructured elements, users will have more information for implementing vector db/LLM chunking strategies. For example, text elements could be queried by their preceding title element. The hierarchy is implemented by a `parent_id` tag in the element's metadata.
### Features
- Introduces a `parent_id` to `ElementMetadata` (the id of the parent element, not a pointer)
- Creates a rule set for assigning hierarchies. A sensible default is assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or matches the ruleset
### How it works
Hierarchies are assigned via a parent id field in element metadata. Elements are read sequentially and evaluated against a ruleset. For example, take the following elements:
1. Title, "This is the Title"
2. Text, "this is the text"

And the ruleset: `{"title": ["text"]}`. When evaluated, the `parent_id` of 2 will be the id of 1. The algorithm for determining this is more complex and resolves several edge cases, so please read the code for further details.
### Schema Changes
```
@dataclass
class ElementMetadata:
    coordinates: Optional[CoordinatesMetadata] = None
    data_source: Optional[DataSourceMetadata] = None
    filename: Optional[str] = None
    file_directory: Optional[str] = None
    last_modified: Optional[str] = None
    filetype: Optional[str] = None
    attached_to_filename: Optional[str] = None
+   parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+   category_depth: Optional[int] = None
...
```
### Testing
```
from unstructured.partition.auto import partition

elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")

for element in elements:
    print(
        f"Category: {getattr(element, 'category', '')}\n"
        f"Text: {getattr(element, 'text', '')}\n"
        f"ID: {element.id}\n"
        f"Parent ID: {element.metadata.parent_id}\n"
        f"Depth: {element.metadata.category_depth}\n"
    )
```
### Additional Notes
Implementing this feature revealed a possibly undesired side effect in how element metadata are processed. In `unstructured/partition/common.py`, `_add_element_metadata` is invoked as part of the `add_metadata_with_filetype` decorator for filetype partitioning. This method is intended to add additional information, including filename and filetype, to the metadata generated with the element. However, the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers. This likely warrants a new issue. I'm guessing that the implementation is done this way to avoid issues with deserializing elements, but could be wrong.
---------
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-09-14 11:23:16 -04:00
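The sequential ruleset evaluation described under "How it works" can be sketched as below. This is a simplified illustration under stated assumptions, not the code from the PR: `SketchElement` and `assign_parent_ids` are hypothetical names, the element type is a stand-in for the real element classes, and the edge cases and `category_depth` handling that the message mentions are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SketchElement:
    """Hypothetical stand-in for an unstructured element."""

    id: str
    category: str
    text: str
    parent_id: Optional[str] = None


def assign_parent_ids(
    elements: List[SketchElement], ruleset: Dict[str, List[str]]
) -> List[SketchElement]:
    """Set parent_id on each element that has none, reading elements in order."""
    last_seen: Dict[str, str] = {}  # category -> id of the most recent element
    for el in elements:
        if el.parent_id is None:
            for parent_category, child_categories in ruleset.items():
                # A rule applies if this element's category is listed as a child
                # and an element of the parent category has already been seen.
                if el.category in child_categories and parent_category in last_seen:
                    el.parent_id = last_seen[parent_category]
                    break
        last_seen[el.category] = el.id
    return elements


elements = [
    SketchElement(id="1", category="title", text="This is the Title"),
    SketchElement(id="2", category="text", text="this is the text"),
]
assign_parent_ids(elements, {"title": ["text"]})
print(elements[1].parent_id)  # prints 1, the id of the title element
```

Storing the parent's id rather than a reference keeps the sketch consistent with the "not a pointer" note in the feature list.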
Amanda Cameron	7fd81dc7df	Table processing test for RTF (#1388 ) This PR does two things: 1. Adds test case (and alters sample doc) for rtf and epub files with table 2. Adds `xls/x` file extension to `skip_infer_table_types` default list --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-09-12 18:27:05 -07:00
Yao You	1a0b737e9c	revert pdf changes and add new pdf for empty page testing (#1255) - revert the layout parser fast pdf file to the original with just two pages - add a new file for tests that has one empty page and one page that says "this page is intentionally left blank"	2023-09-01 22:33:06 +00:00
Yao You	27773132b7	[issue 1247] fix element and bbox mismatch bug (#1250) This PR resolves #1247 by using the matching elements and bbox for coordinate computation. This PR also updates the example doc `example-docs/layout-parser-paper-fast.pdf` so that it includes a true blank page and a page with the text "this page is intentionally left blank". This change helps us test: - differences between fast and hi_res - code handling empty pages in between pages with contents (which triggers the bug found in #1247) Lastly, this PR updates the names of the variables inside `_partition_pdf_or_image_with_ocr` so that matching inputs all start with `_`, like `_elements`, `_text`, and `_bboxes`, to improve readability. This change also improves partition performance for multi-page pdfs as it reduces the number of iterations inside `add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky docker shows it reduces partition time for the DA-619p.pdf file from around 1min to around 23s.	2023-08-30 23:34:55 +00:00
Matt Robinson	07f76275f1	feat: detect PGP encrypted content in `partition_email` and `partition_msg` (#1205)
### Summary
Closes #1018. Enables `partition_email` and `partition_msg` to detect if an email has PGP encrypted content. Based on the specification in [RFC 2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based on the example email in the spec. If PGP encrypted content is detected, a warning is emitted and an empty list of elements is returned.
### Testing
```python
from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```
```python
from unstructured.partition.msg import partition_msg

filename = "example-docs/fake-encrypted.msg"
partition_msg(filename=filename)
```
	2023-08-25 17:09:25 -07:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
Klaijan	1524841cd9	feat: supports multipage tiff (#1131 ) Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and - confirms that the function reads all the pages in the TIFF. - page number is added to the metadata This PR is branched from and developed on top of 6d6be99 commit.	2023-08-24 15:12:50 +00:00
John	6e5d27c6c3	fix pdf partition of list items being detected as titles in OCR only mode (#1119) Closes GitHub issue #1010. Adds the `group_bullet_paragraph` function to handle grouping of bullet items that are split across multiple lines.	2023-08-15 09:35:54 -07:00
Mike Lay	79a1eb8683	Handle inline and lacking filename (#1109 ) Handle Content-Disposition: inline and attachment without filename * Add new email test example and test with Content-Disposition: inline. * Move attachment_info above for loop so it is always defined * Check if item is inline as well as attachment as these both lack an = character to split on * Create filename if filename is not specified and write file. * Update list_attachments with new filename	2023-08-14 18:38:53 +00:00
Mike Lay	2e0ab86c6a	Fix attachments with `=` in filename (#1110 ) Fix attachments with = in filename * Limit split to first match of = to prevent creating a list of more than two parts * Add example email with attachment name and test for issue	2023-08-13 20:35:18 -07:00
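The "limit split to first match of `=`" fix can be illustrated with a short sketch. The helper name `split_param` is hypothetical and the parsing is simplified. It is not the code from the PR, only a demonstration of why `maxsplit=1` matters when the filename itself contains `=`.

```python
from typing import Tuple


def split_param(param: str) -> Tuple[str, str]:
    """Split a 'key=value' header parameter on the first '=' only."""
    param = param.strip()
    if "=" not in param:
        # Bare tokens such as 'inline' or 'attachment' carry no value.
        return param.lower(), ""
    # maxsplit=1 guarantees exactly two parts, even if the value contains '='.
    key, value = param.split("=", 1)
    return key.strip().lower(), value.strip().strip('"')


print(split_param('filename="a=b=c.txt"'))  # ('filename', 'a=b=c.txt')
print(split_param("inline"))                # ('inline', '')
```

Without the `maxsplit` argument, the first call would produce four parts and a two-name unpacking would raise `ValueError`.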
Christine Straub	fc2699ff06	Fix/1057 etree parser error tsv (#1106 ) * feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji * chore: update changelog & version	2023-08-14 01:22:36 +00:00
Christine Straub	d26ab1deac	fix: etree parser error (#1077 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version	2023-08-10 23:28:57 +00:00
kravetsmic	25ca5744cf	feat: optionally ignore header and footer tags in partition html (#1013 ) --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-08-04 21:56:33 +00:00
Christine Straub	b76d2ee745	feat: track emphasized text msword (#1048 ) * feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph * chore: add docstring * chore: fix lint errors * feat: ignore spaces when extracting emphasized texts from a paragraph * feat: add functionality to track emphasized text (`bold/italic` formatting) from table * test: add test case for grabbing emphasized texts from element metadata * chore: fix lint errors * chore: update changelog & version * Update ingest test fixtures (#1047)	2023-08-04 17:04:12 -04:00
Hynek Kydlíček	47b20119c3	fix: extract emojis with `partition_xlsx` (#1009) * 🐛 fixed emoji xlsx bug * update version and changelog * check if beautifulsoup exists * update docs * fix html parser call * fix failing attachment test * ✅ added emoji test, added requirement, fixed dependency * 🐛 dependency * 🐛 correct dependency * linting, linting, linting * check for bs4 * skip auto xls filename test --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-08-04 10:14:08 -04:00
Sebastian Laverde Alfonso	084ead173a	chore: custom layout order example notebook (#1024 ) * chore: CFR double column sample Federal Regulations document for example notebook in `examples/custom-layout-order` * chore: custom-layout-order example dir * feat: helper methods to plot and reorder layouts Helper methods: `plot_image_with_bounding_boxes_coloured` and `reorder_elements_in_double_columns` * chore: delete __init__.py --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-08-02 18:29:04 -06:00
Yuming Long	d46c1c2d83	Chore: Pass table support param to partition image (#973 ) * add param and test in image table extraction * version and changelog * need to publish this one for api repo * add new param skip_infer_table_types * use warning * clean up with mapping * add test for tsv * fix test fail * weird change from merge * doc nit * don't use mapping * correct conflict	2023-07-27 13:33:36 -04:00
David Potter	f7e46af22f	feat: adds Outlook connector (#939 ) * bonus: fixes issue with email partitioning where From field was being assigned the To field value.	2023-07-26 04:09:26 +00:00
John	f282a10715	enhancement: improve json detection by detect_filetype (#971 ) * update regex pattern * improve json regex pattern checks and add test file * update file name * update tests and formatting * update changelog and version	2023-07-25 12:47:39 -04:00
Jason Scheirer	196efa09b1	chore: Add encoding param to ingest (#955 ) * Add encoding param to ingest	2023-07-24 10:06:13 -07:00
Christine Straub	5b7ae29876	fix: 521 pdf2image memory error (#924 ) Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy. * Add functionality to convert a PDF in small chunks of pages at a time * Add functionality to write images to computer storage temporarily instead of keeping them in memory * Set the file's current position to the beginning after reading the file in convert_to_bytes	2023-07-14 15:08:33 -05:00
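The "small chunks of pages at a time" idea from the commit above can be sketched as a page-range generator. The name `page_chunks` and the chunk size are assumptions for illustration, and the PR's actual chunking logic may differ. Each `(first, last)` pair could then be passed to a converter that accepts a page range and an output directory, for example the `first_page`, `last_page`, and `output_folder` arguments of `pdf2image.convert_from_path`, so that only one chunk of images is in memory at a time.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed, inclusive (first_page, last_page) ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield first, min(first + chunk_size - 1, total_pages)


for first_page, last_page in page_chunks(total_pages=25, chunk_size=10):
    print(first_page, last_page)
# 1 10
# 11 20
# 21 25
```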
John	6173362620	fix: detect list items in MS Word documents (#909 ) * fix merge conflict * update changelog and version	2023-07-10 15:29:08 +00:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
Christine Straub	47bc4009a8	fix: adjust threshold for encoding detection (#894 ) * chore: add example doc * fix: adjust encoding recognition threshold value in `detect_file_encoding` * test: add test cases for German characters * chore: update changelog & version	2023-07-07 09:25:03 -04:00
Matt Robinson	52aced8677	fix: validate encodings from email headers (#881 ) * add validate encoding function * remove extraneous file * added test case for malformed encoding * version and changelog	2023-07-06 13:49:27 +00:00
Matt Robinson	44411ecc59	enhancement: `max_partition` kwarg for limiting element size (#818 ) * add max partition size logic * work splitting logic into split_by_paragraph * pass through max_partition to other functions * added test for splitting long document * add type hint * add documentation * version and changelog * ingest-test-fixtures-update * Update ingest test fixtures (#819) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * retrigger ci * ingest-test-fixtures-update * ingest-test-fixtures-update * Update ingest test fixtures (#821) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * update default for partition_xml * update version for release * update msg doc string --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 15:26:01 -04:00
shreyanid	433d6af1bc	fix: format Arabic and Hebrew annotated encodings (#823 ) * add modified arabic and hebrew encodings * added calls to format_encoding_str so encoding is checked before use * added formatting to detect_filetype() * explicitly provided default value for null encoding parameter * fixed format of annotated encodings list * adding hebrew base64 test file * small lint fixes * update changelog * bump version to -dev2	2023-06-27 18:15:02 -07:00
kravetsmic	58e988e110	feature(html partition): parse pre tag (#642) * feature(html partition): parse pre tag * chore: update CHANGELOG.md * style: black format xml.py * Added tests for html with pre tag * remove skip test, update parse pre tag * fix style * chore: spell check * chore: update changelog & version * chore: update ingest test fixtures * chore: add exception handling if `element.text` is `None` in `_read_xml` * test: add more sanity testing on the `.text` content of the element(s) * refactor: move the conditional logic for <pre> outside of the `try/except` block --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-06-27 18:52:39 +00:00
Martin Mauch	752e78e803	feat: partition_org for Org Mode documents (#780 ) * feat: partition_org for Org Mode documents * update version	2023-06-23 18:45:31 +00:00
Matt Robinson	8683e2695c	fix: enable `partition_pdf` to recursively grab text with fast strategy (#796 ) * initial pass on text in figures * refactor text extraction * update tests * fix title test * add test for docs that require recursive text grab * version and changelog * ingest-test-fixtures-update * there are 8 pdf files now	2023-06-22 11:19:54 -04:00
Christine Straub	743482b6d3	Bug/635 unicode decode error eml (#739 ) * Adds functionality to extract charset info from eml files * Adds missed file-like object handling in detect_file_encoding * Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs * Organize the eml example files in the example-docs/eml directory	2023-06-17 00:52:13 +00:00
Angus Sinclair	ec403e245c	fix malformed pptx issue (#761 ) * fix malformed pptx issue Added a new test to check for the ability to partition a malformed PowerPoint file. Modified the `partition_pptx` function to skip processing shapes that are not on the actual slide, but only if they have top and left positions. Also modified `_order_shapes` function to handle cases where shapes do not have top or left positions. * update changelog * fix lint issue SIM102 nested ifs * fix black linting	2023-06-15 19:52:44 +00:00
John	a9b9b873b1	feat: partition_tsv for tab separated value files (#758) * first pass at partition_tsv * working tests * create constants for tests and debug `make test` failure * make check and tidy * undo changes for testing locally * update changelog and version * fix bricks.rst * refactor if statements * make tidy * fix README and change try/except to if/else * update changelog and version * fix docstring	2023-06-15 18:50:53 +00:00
Matt Robinson	a800967478	enhancements: add page numbers for word docs when available (#750 ) * add support for page numbers in docx when present * version and changelog * add comment on page numbers * add header and footer to doc elements list * update integrations docs * include_page_breaks kwarg for doc and docx * merge element metadata for pagebreaks * fix typo * fix changelog typo * change page number default to None * add initial_page_number kwarg * make page number tests in pdf more explicit * revert test file * update ingest tests * update test fixture outputs * updates to IRS forms fixtures * ingest-test-fixtures-update * Update ingest test fixtures (#759) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-15 12:21:17 -04:00
Matt Robinson	053a6c6e5c	enhancement: extract headers and footers in `partition_docx` (#742 ) * added tests for headers and footers * add docs on headers and footers; tweak to metadata * version and changelog	2023-06-14 09:42:59 -04:00
Matt Robinson	c82fdb6a89	feat: `partition_rst` for ReStructured Text documents (#725 ) * add example rst file * filetype detection for rst files * add partition_rst function * add partition_rst to auto * update readme * update docs * changelog and version * pandocs -> pandoc * fix typo	2023-06-12 19:31:10 +00:00
Matt Robinson	19ab6d960f	enhancement: handling for empty files in `detect_filetype` and `partition` (#710 ) * add empty filetype * add empty handling to partition * changelog and version	2023-06-09 16:07:50 -04:00
Christine Straub	547bb38d86	fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 ) Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding. Change auto.py to have a None default for encoding Remove the unused parameter encoding from partition_pdf Add functionality to the read_txt_file utility function to handle file-like object from URL	2023-06-05 11:27:12 -07:00
ryannikolaidis	7d157c1ede	test: add benchmark script (#638 )	2023-06-05 09:14:43 -07:00
Meir	74a61e33d8	fix: metadata.page_number of pptx files (#675 ) * fix: metadata.page_number of pptx files * update changelog	2023-06-02 13:22:43 +00:00
cshaddox	d23e0d6420	feat: table extraction for power points (#664 ) * Handling tables * updating changelog * Adding accidentally removed code * remove newline * reuse table extraction function; add test --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 18:26:32 +00:00
Christine Straub	5b5fb3e13b	Issue/encoding error eml (#639 ) This PR adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-30 10:24:02 -07:00
cragwolfe	c5d9469001	feat: add xls support (#632 ) Add support for older .XLS files from the partition function in unstructured.partition.auto. Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).	2023-05-26 01:55:32 -07:00
Christine Straub	a1fed6d4c6	Issue/unicode error (#608 ) This PR adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-23 13:35:38 -07:00
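The fallback behaviour described in the commit above (try other common encodings only when the user has not specified one) can be sketched as follows. The function name `decode_with_fallback` and the candidate list are assumptions for illustration. The encodings the library actually tries, and their order, are not stated in the message.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the library's own list may differ.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    fallbacks: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the given bytes."""
    if encoding is not None:
        # The user chose an encoding: respect it and let any error propagate.
        return data.decode(encoding), encoding
    for candidate in fallbacks:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue  # try the next common encoding
    raise ValueError("none of the fallback encodings could decode the data")


text, used = decode_with_fallback("Grüße".encode("cp1252"))
print(text, used)  # Grüße cp1252
```

Because `latin-1` maps every byte to a character, it never fails and acts as a last resort in this sketch. A successful decode is therefore not proof that the guessed encoding is the right one.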
Matt Robinson	21c821d651	feat: add `partition_csv` function (#619 ) * add csv into filetype detection * first pass on csv * add tests for csv * add csv to auto * version bump * update readme and docs * fix doc strings	2023-05-19 15:57:42 -04:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596 ) * first pass on partition_xml * add option to keep xml tags * added tests for xml * fix filename * update filenames * remove outdated readme * add xml to auto * version and changelog * update readme and docs * pass through include_metadata * update include_metadata description * add README back in * linting, linting, linting * more linting * spooled to bytes doesnt need to be a tuple * Add tests for newly supported filetypes * Correct metadata filetype * doc typo Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * keep_xml_tags -> xml_keep_tags --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00

83 Commits