unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-16 10:27:23 +00:00

Author	SHA1	Message	Date
cragwolfe	4c13d12dc3	fix: prevent spammy ListItem's from images and PDF's (#1210 ) The issue was that for blocks detected in an image such as: ![image](https://github.com/Unstructured-IO/unstructured/assets/28578599/a955bf2c-a683-4cef-a19f-546f9378835a) , where the full image is: https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png , many ListItem's would be extracted that were not adding much value to the output (assuming the block was determined to be of type List from the layout model). This particular file is also used in ingest tests, and you can see the prior output here: https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280 Test Instructions: 1. run the following snippet: ``` import json import os from datetime import datetime from unstructured.__version__ import __version__ from unstructured.partition.auto import partition from unstructured.staging.base import elements_to_json filename = "/opt/home/tmp/IRS-form-1987.png" output_dir = "/opt/home/tmp/json" base_name_with_ext = os.path.basename(filename) output_filename_part = os.path.join(output_dir, base_name_with_ext) print(f"unstructured version: {__version__}") #for strategy in ("hi_res", "fast", "auto"): for strategy in ("hi_res",): d1 = datetime.now() elements = partition(filename=filename, strategy=strategy) elems_as_dicts = json.loads(elements_to_json(elements, indent=2)) # strip out metadata for the sake of more readable results for element_dict in elems_as_dicts: del element_dict["metadata"] json_filename=f"{output_filename_part}-{strategy}.json" with open(json_filename, "w") as jsonf: jsonf.write(json.dumps(elems_as_dicts, indent=2)) d2 = datetime.now() print(f"num elements for {strategy}: {len(elements)}") print(f"time elapsed {strategy}: {(d2-d1).total_seconds()}") ``` updating the `filename` and `output_dir` paths for your particular local 
environment. 2. Open the json file that was written to your `output_dir`, named IRS-form-1987.png-hi_res.json Witness the new element: ``` { "type": "ListItem", "element_id": "7d3ba328af2c20ddeef5d2c1d270f60f", "text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87 -61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissio ner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method f or bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving cre dit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not fil e Form 3115 for these changes." }, ``` 0.10.7	2023-08-26 21:01:07 -07:00
cragwolfe	3f1c90eef2	build: bump unstructured-inference==0.5.17, cut release (#1207 ) Pulls in @awalker4's tesseract enhancement: https://github.com/Unstructured-IO/unstructured-inference/pull/185 0.10.6	2023-08-26 01:05:48 +00:00
Matt Robinson	07f76275f1	feat: detect PGP encrypted content in `partition_email` and `partition_msg` (#1205 ) ### Summary Closes #1018. Enables `partition_email` and `partition_msg` to detect if an email has PGP encrypted content. Based on the specification in [RFC 2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based on the example email in the spec. If PGP encrypted content is detected, a warning is emitted and an empty list is returned. ### Testing ```python from unstructured.partition.email import partition_email filename = "example-docs/eml/fake-encrypted.eml" partition_email(filename=filename) ``` ```python from unstructured.partition.msg import partition_msg filename = "example-docs/fake-encrypted.msg" partition_msg(filename=filename) ```	2023-08-25 17:09:25 -07:00
John	5872fa23c3	Extract coordinates from PDFs and images when using OCR only strategy (#1163 ) ### Summary Closes #983 Creates new function `add_pytesseract_bbox_to_elements` Fixes typos in docstrings ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw png_filename="example-docs/english-and-korean.png" png_elements = partition_image(filename=png_filename, strategy="ocr_only") png_image = Image.open(png_filename) draw = ImageDraw.Draw(png_image) draw.polygon(png_elements[0].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[1].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[2].metadata.coordinates.points, outline="red", width=2) output = "example-docs/english-and-korean-box.png" png_image.save(output) png_image.close() ```	2023-08-25 05:32:12 +00:00
Matt Robinson	c578b85699	fix: respect `<pre>` tag order in `partition_html` (#1197 ) ### Summary Closes #1184. Updates `partition_html` to respect the ordering of `<pre>` tags in HTML documents. ### Testing The elements in the following example should be in the correct order. ```python from unstructured.partition.html import partition_html html_text = """ <pre>The Big Brown Bear</pre> <div>The big brown bear is growling.</div> <pre>The big brown bear is sleeping.</pre> <div>The Big Blue Bear</div> """ elements = partition_html(text=html_text) print("\n\n".join([str(el) for el in elements])) ```	2023-08-25 04:14:48 +00:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
Trevor Bossert	f267cef329	feat: Adds in threaded replies (#1188 ) - Puts threaded replies into the same text field as parent message, allowing for a full thread to be under a single element_id - Output is now XML instead of TXT to allow for easier parsing of new format. https://github.com/Unstructured-IO/unstructured/issues/1186	2023-08-24 12:12:29 -07:00
ryannikolaidis	566e947d13	fix: ARM build with constraint for safetensors <=0.3.2 (#1196 )	2023-08-24 18:00:25 +00:00
Klaijan	1524841cd9	feat: supports multipage tiff (#1131 ) Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and - confirms that the function reads all the pages in the TIFF. - page number is added to the metadata This PR is branched from and developed on top of 6d6be99 commit.	2023-08-24 15:12:50 +00:00
Matt Robinson	cdae53cc29	chore: deprecation warning for `file_filename` (#1191 ) ### Summary Closes #1007. Adds a deprecation warning for the `file_filename` kwarg to `partition`, `partition_via_api`, and `partition_multiple_via_api`. Also catches a warning in `ebooklib` that we do not want to emit in `unstructured`. ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/winter-sports.epub" # Should not emit a warning with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should emit a warning with open(filename, "rb") as f: elements = partition(file=f, file_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should raise an error with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub") ```	2023-08-24 07:02:47 +00:00
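The behaviour described in #1191 (no warning for `metadata_filename`, a warning for `file_filename`, an error when both are passed) can be sketched roughly as follows. The helper name `resolve_metadata_filename`, the warning category, and the exception type are assumptions made for illustration; the library's actual code may differ.

```python
import warnings
from typing import Optional


def resolve_metadata_filename(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: pick the filename to record in element metadata."""
    if metadata_filename is not None and file_filename is not None:
        # Passing both the new and the deprecated kwarg is ambiguous.
        raise ValueError("Pass metadata_filename or file_filename, not both.")
    if file_filename is not None:
        # Assumed category: DeprecationWarning.
        warnings.warn(
            "file_filename is deprecated; use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return file_filename
    return metadata_filename
```

Either kwarg on its own yields the same filename; only the deprecated one emits a warning.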
ryannikolaidis	835378aba6	ci: fix documentation build flow (#1181 )	2023-08-24 00:24:03 -05:00
cragwolfe	df4bd459d5	build(deps): bump unstructured-inference==0.5.16 (#1182 ) Pulls in @newelh's fix: https://github.com/Unstructured-IO/unstructured-inference/pull/184	2023-08-23 05:28:45 +00:00
Charles	1ddf542e14	fix: Don't call extractable_elements if strategy is ocr_only (#1160 ) - fixes #1079 where partitioning is happening twice in the case of `strategy="ocr_only"` - only calls `extractable_elements` if we can predetermine that `ocr_only` is not a possible strategy even if it was the intended strategy. - Adds additional assertion test that `_partition_pdf_or_image_with_ocr` is not called when falling back to `fast` from `ocr_only`	2023-08-22 19:43:33 -07:00
cragwolfe	e9c649224e	chore: changelog repair (#1179 ) chaos reigns in the changelog. whyyy * there was no 0.10.3 release, so remove that from the CHANGELOG. * fixup 0.10.5 with a couple that were added (in retrospect)	2023-08-22 16:48:18 -07:00
Austin Walker	e7d189fcc8	chore: Bump inference and set default ocr_mode to entire_page (#1172 ) * pip-compile in order to bump unstructured-inference * Set the default `ocr_mode` back to `entire_page` now that [this error](https://github.com/Unstructured-IO/unstructured-inference/pull/183) is addressed * Explicitly add `sphinx-tabs` to `build.in`. This file provides `docs/requirements.txt`. * Remove a pinned `pydantic` version * Fix a makefile command to `pip-compile` a missing ingest file. 0.10.5	2023-08-22 16:05:02 -07:00
Jack Retterer	05e311651a	doc: add delta tables connector reference (#1177 ) Added delta tables to connectors page for users to discover	2023-08-22 12:50:27 -07:00
ryannikolaidis	ac2313a3fa	doc: fix get-api-key link (#1175 )	2023-08-22 19:31:07 +00:00
ryannikolaidis	ab7fafcb41	doc: add pdf extra note (#1165 )	2023-08-22 18:20:26 +00:00
Roman Isecke	4114022d9d	roman/ingest-custom-errors (#1152 ) ### Description Adds three custom errors to ingest: * `SourceConnectionError` * `DestinationConnectionError` * `PartitionError` Included is a base custom error class that adds a wrapper. This wrapper wraps any raised exception into the custom error.	2023-08-22 12:28:29 -04:00
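A wrapper that converts any raised exception into a custom error, as described in #1152, might look roughly like this. Only the name `SourceConnectionError` comes from the commit message; the base class `CustomError`, its `wrap` classmethod, and the message template are assumptions for illustration, not the ingest code itself.

```python
import functools


class CustomError(Exception):
    """Hypothetical base class; the real ingest base class may differ."""

    error_string = "{}"

    @classmethod
    def wrap(cls, func):
        """Decorator: re-raise any exception from func as an instance of cls."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except cls:
                raise  # already the expected type, pass it through unchanged
            except Exception as exc:
                # Keep the original exception reachable via __cause__.
                raise cls(cls.error_string.format(exc)) from exc

        return wrapper


class SourceConnectionError(CustomError):
    error_string = "Error in getting data from upstream data source: {}"


@SourceConnectionError.wrap
def fetch():
    # Stand-in for a connector call that fails.
    raise TimeoutError("socket timed out")


try:
    fetch()
except SourceConnectionError as err:
    print(type(err).__name__, "caused by", type(err.__cause__).__name__)
```

Chaining with `from exc` preserves the underlying traceback, so callers can catch one error type per pipeline stage without losing the root cause.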
Roman Isecke	106ee965a6	Roman/delta table connector (#1132 ) ### Description Add delta table connector and test against a delta table generated via delta.io and uploaded to s3. Shows an example of how to use the connection options to leverage s3. I was able to get this to work with s3 if I pass in the access and secret keys as storage options. Even though the s3 bucket being used is public, would not work without those. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-22 10:19:46 -04:00
Matt Robinson	ad595d32f6	enhancement: tell users to install missing extras (#1167 ) ### Summary Updates `partition` to let users know to install the appropriate extras if they're missing. Prior to this PR, users would get an exception stating `partition_pdf` (or whichever function that requires extras) does not exist. ### Testing First `pip uninstall ebooklib`. Then run ```python from unstructured.partition.auto import partition partition(filename="example-docs/winter-sports.epub") ``` The error should look like ```python ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]" ```	2023-08-22 03:00:21 +00:00
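One way to produce an error message of the shape shown in #1167 is to check for the optional dependency before dispatching. This is a minimal sketch; the helper `require_extra` is hypothetical and is not claimed to be how `partition` implements the check.

```python
import importlib.util


def require_extra(module_name: str, partitioner: str, extra: str) -> None:
    """Hypothetical helper: raise a helpful ImportError if a dependency is missing.

    module_name must be a top-level module name (no dots).
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{partitioner} is not available. Install the {extra} dependencies "
            f'with pip install "unstructured[{extra}]"'
        )


# A module from the standard library is always found, so this call is a no-op.
require_extra("json", "partition_json", "json")
```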
Jack Retterer	f639d04695	Fixed some typos (#1162 ) The Wikipedia data connector was labeled as Airtable.	2023-08-21 18:03:15 -07:00
Roman Isecke	db8af4f5de	Roman/notion tests (#1072 ) ### Description * Add ingest test for Notion docs * Update default cache dir for connectors to include connector name. Makes debugging the cached content easier. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-21 15:16:50 -04:00
Jack Retterer	a35ff890e0	Update docs jack (#1157 ) Documentation Overhaul - Added documentation hierarchy - Added options for Bash vs Python for API & Upstream Connectors - Added Introduction section (Overview, Key Concepts, Getting Started) - Redid connectors section - Installation is now broken up (needs further work)	2023-08-21 10:27:32 -07:00
ryannikolaidis	6330278839	chore: add ingest diagrams and explanation of flow to ingest README (#1158 )	2023-08-21 16:24:03 +00:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137 ) Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00
John	69edffb0c0	bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134 ) ### Summary Closes #1027 The msg test in question was no longer failing after removing the quick-fix and comment explaining the issue. However, the test was not functioning as intended. Test was refactored to appropriately test `metadata_last_modified` of attachments. `partition_msg` was then updated to pass `metadata_last_modified` to `attachment_partitioner`. The same was done for email partitioning. ### Testing ``` from unstructured.partition.text import partition_text from unstructured.partition.msg import partition_msg from unstructured.partition.email import partition_email filename="example-docs/fake-email-attachment.msg" elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") # previously, these were different values because last_modified wasn't being updated in attachments elements[1].metadata.last_modified elements[-1].text elements[-1].metadata.last_modified email_filename="example-docs/eml/fake-email-attachment.eml" email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") email_elements[1].metadata.last_modified email_elements[-1].text email_elements[-1].metadata.last_modified ```	2023-08-18 23:21:11 +00:00
Austin Walker	dd243b4fd9	chore: pass ocr_mode in partition_pdf_or_image (#1154 ) Set to individual_blocks for now to work around [this bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179). I verified by printing the current ocr_mode in inference. The `entire_page` default is overridden. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: awalker4 <awalker4@users.noreply.github.com> 0.10.4	2023-08-18 20:59:08 +00:00
cragwolfe	1456f06b2d	chore: skip consistently failing test in main (#1150 ) The reason this test is failing is the API is returning "fast" results when "hi_res" is requested, which is being tracked in this ticket: https://github.com/Unstructured-IO/unstructured-api/issues/188 . This failure was only showing up on the `main` branch, per the commented out `pytest` skips.	2023-08-18 10:06:17 -07:00
cragwolfe	6001e2fa62	chore: changelog repair (#1149 ) 0.10.2 had been released, but prior commit co-mingled 0.10.2 and 0.10.3. This corrects the changelog and intentionally skips over 0.10.3. Bonus: remove accidental dupe line in 0.10.0.	2023-08-17 23:41:04 -07:00
Francisco Kurucz	d2a41f462d	doc: fix typo on partition_md function in bricks documentation (#1147 )	2023-08-17 20:54:11 -07:00
ryannikolaidis	668d0f1b01	feat: per-process ingest connections (#1058 ) * adds per process connections for Google Drive connector	2023-08-17 17:34:08 +00:00
cragwolfe	dd0f582585	build(deps): bump unstructured-inference==0.5.13 (#1141 ) Bump to unstructured-inference==0.5.13, which includes: Fix extracted image elements being included in layout merge, addresses the issue where an entire-page image in a PDF was not passed to the layout model when using hi_res. 0.10.2	2023-08-17 06:25:00 +00:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True(#1125 ) Closes Github issue #1121 Adds include_header kwarg to partition_xlsx and change default behavior to True. 0.10.1	2023-08-17 04:16:23 +00:00
cragwolfe	22c12ef806	bump unstructured-inference (#1140 ) Pulls in fix from unstructured-inference==0.5.12: When a pdf page doesn't have much data, it may get buffered in the write to a tempfile. If this happens, we'll hit an error reading the file back. Open to suggestions for a way to unit test this - I was creating some test files with pypdf but I couldn't trigger the error.	2023-08-16 22:29:37 +00:00
cragwolfe	6f1b8d5f28	build(deps): bump unstructured-inference to 0.5.11 (#1138 ) * Bump unstructured-inference==0.5.11: - better defaults for DPI for hi_res and Chipper	2023-08-16 20:52:40 +00:00
Christine Straub	0a23139720	enhancement: implement full-page OCR(#1133 ) *implements full-page OCR as supported in unstructured-inference=0.5.11.	2023-08-16 19:16:35 +00:00
Newel H	be093d2e66	chore: Update dead links to correct pages (#1127 ) Summary Closes #1124 Updates dead links in repository README - Quick Start > Install for local development - Learn more > Batch Processing Updates document dependencies to include tesseract-lang for additional language support (requirement for tests to pass) Testing All tests pass	2023-08-16 10:43:37 -04:00
Christine Straub	0e887cc36b	Feat/1060 update metadata fields (#1099 ) Closes Github Issue #1060. * update the metadata field links * update the metadata field emphasized_texts 0.10.0	2023-08-16 04:33:06 +00:00
Sebastian Laverde Alfonso	fe5048a834	feat: chipper local inference notebook (#1116 ) Download chipper model for local use and demonstrate how to partition a .pdf document through the unstructured and unstructured_inference libraries.	2023-08-15 20:43:23 -07:00
cragwolfe	d19183f442	build(lint): don't check version in main against self (#1123 ) If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version. This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.	2023-08-15 17:57:59 +00:00
John	6e5d27c6c3	fix pdf partition of list items being detected as titles in OCR only mode (#1119 ) Closes Github issue #1010 adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines	2023-08-15 09:35:54 -07:00
qued	cb923b96a2	build(deps): dependency cleanup (#1102 ) Cleans up some pins that were prone to conflicts. All pins belong in constraints.in. 0.9.3	2023-08-15 05:15:44 +00:00
cragwolfe	d835fb1086	chore: bump pip version in published image (#1111 ) for consistency with the development environment, i.e. the Makefile.	2023-08-14 21:59:31 +00:00
Mike Lay	79a1eb8683	Handle inline and lacking filename (#1109 ) Handle Content-Disposition: inline and attachment without filename * Add new email test example and test with Content-Disposition: inline. * Move attachment_info above for loop so it is always defined * Check if item is inline as well as attachment as these both lack an = character to split on * Create filename if filename is not specified and write file. * Update list_attachments with new filename	2023-08-14 18:38:53 +00:00
Christine Straub	80266460fd	fix: GH issue 1057 etree parser error (csv) (#1112 ) Addresses #1057 for CSV. Related to PR #1077. * update partition_csv to always use soupparser_fromstring to parse html text	2023-08-14 17:48:57 +00:00
Mark Risher	612f9da6e8	Update news-of-the-day.ipynb - typo (#1113 ) Fixed typo	2023-08-14 16:48:49 +00:00
Mike Lay	2e0ab86c6a	Fix attachments with `=` in filename (#1110 ) Fix attachments with = in filename * Limit split to first match of = to prevent creating a list of more than two parts * Add example email with attachment name and test for issue	2023-08-13 20:35:18 -07:00
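The idea behind #1110, limiting the split to the first `=` so the result never has more than two parts, can be illustrated with a small sketch. The function name `parse_disposition_param` is hypothetical and is not the library's own code.

```python
from typing import Tuple


def parse_disposition_param(item: str) -> Tuple[str, str]:
    """Split a 'key=value' header parameter on the first '=' only."""
    # maxsplit=1 keeps any later '=' characters inside the value.
    key, value = item.strip().split("=", 1)
    return key.strip().lower(), value.strip().strip('"')


print(parse_disposition_param('filename="report=final.txt"'))
```

Without the `maxsplit` argument, the same input would split into three parts and the two-name unpacking would fail.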
Christine Straub	fc2699ff06	Fix/1057 etree parser error tsv (#1106 ) * feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji * chore: update changelog & version	2023-08-14 01:22:36 +00:00
cragwolfe	b4b8ac4d8a	chore: run make pip-compile on mac (#1107 ) so cuda deps removed.	2023-08-13 20:42:12 +00:00


1109 Commits