unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-03 18:49:53 +00:00

Author	SHA1	Message	Date
Matt Robinson	c578b85699	fix: respect `<pre>` tag order in `partition_html` (#1197)	2023-08-25 04:14:48 +00:00
### Summary
Closes #1184. Updates `partition_html` to respect the ordering of `<pre>` tags in HTML documents.
### Testing
The elements in the following example should be in the correct order.
```python
from unstructured.partition.html import partition_html

html_text = """
<pre>The Big Brown Bear</pre>
<div>The big brown bear is growling.</div>
<pre>The big brown bear is sleeping.</pre>
<div>The Big Blue Bear</div>
"""

elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161)	2023-08-24 17:46:19 -07:00
### Summary
Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates.
- add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
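To illustrate why an `xy-cut` ordering differs from a plain top-to-bottom sort on multi-column pages, here is a minimal sketch of a recursive XY-cut over bounding boxes. It is a simplified variant written for illustration only, not the code from this commit: the names `xy_cut_order` and `_split_by_gaps` and the `(x0, y0, x1, y1)` tuple layout are assumptions, and it always tries vertical cuts before horizontal ones.

```python
def _split_by_gaps(boxes, lo, hi):
    """Group boxes into runs separated by empty gaps along one axis.

    lo/hi are tuple indices: (0, 2) projects onto x, (1, 3) onto y.
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = []
    current = [ordered[0]]
    reach = ordered[0][hi]  # furthest extent covered by the current run
    for box in ordered[1:]:
        if box[lo] > reach:
            # Nothing overlaps this position: an empty gap, so cut here.
            groups.append(current)
            current = [box]
            reach = box[hi]
        else:
            current.append(box)
            reach = max(reach, box[hi])
    groups.append(current)
    return groups


def xy_cut_order(boxes):
    """Return boxes in reading order using a simplified recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try to cut the region into columns (gaps along the x axis).
    columns = _split_by_gaps(boxes, 0, 2)
    if len(columns) > 1:
        return [b for col in columns for b in xy_cut_order(col)]
    # Otherwise try to cut it into rows (gaps along the y axis).
    rows = _split_by_gaps(boxes, 1, 3)
    if len(rows) > 1:
        return [b for row in rows for b in xy_cut_order(row)]
    # No clean cut exists: fall back to a basic top-to-bottom, left-to-right sort.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# Two columns with two blocks each, as (x0, y0, x1, y1).
boxes = [(60, 0, 100, 10), (0, 20, 40, 30), (0, 0, 40, 10), (60, 20, 100, 30)]
print(xy_cut_order(boxes))
```

On this input a sort by `(y0, x0)` interleaves the two columns, while the sketch reads the left column fully before the right one. A full-width title above the columns would make this simplified variant cut by rows first, so a production implementation would need a more careful choice of cut.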
Trevor Bossert	f267cef329	feat: Adds in threaded replies (#1188 ) - Puts threaded replies into the same text field as parent message, allowing for a full thread to be under a single element_id - Output is now XML instead of TXT to allow for easier parsing of new format. https://github.com/Unstructured-IO/unstructured/issues/1186	2023-08-24 12:12:29 -07:00
ryannikolaidis	566e947d13	fix: ARM build with constraint for safetensors <=0.3.2 (#1196 )	2023-08-24 18:00:25 +00:00
Klaijan	1524841cd9	feat: supports multipage tiff (#1131 ) Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and - confirms that the function reads all the pages in the TIFF. - page number is added to the metadata This PR is branched from and developed on top of 6d6be99 commit.	2023-08-24 15:12:50 +00:00
Matt Robinson	cdae53cc29	chore: deprecation warning for `file_filename` (#1191)	2023-08-24 07:02:47 +00:00
### Summary
Closes #1007. Adds a deprecation warning for the `file_filename` kwarg to `partition`, `partition_via_api`, and `partition_multiple_via_api`. Also catches a warning in `ebooklib` that we do not want to emit in `unstructured`.
### Testing
```python
from unstructured.partition.auto import partition

filename = "example-docs/winter-sports.epub"

# Should not emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub")

# Should be test.epub
elements[0].metadata.filename

# Should emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, file_filename="test.epub")

# Should be test.epub
elements[0].metadata.filename

# Should raise an error
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
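The behavior described in this commit (no warning for `metadata_filename`, a warning for `file_filename`, an error when both are given) can be sketched with the standard `warnings` module. The helper name `resolve_metadata_filename`, the choice of `ValueError`, and the message wording are assumptions for illustration; the commit does not show how `partition` implements this.

```python
import warnings
from typing import Optional


def resolve_metadata_filename(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: pick the filename to record in element metadata."""
    if metadata_filename is not None and file_filename is not None:
        # Passing both names is ambiguous, so fail loudly.
        # (ValueError is an assumption; the commit only says "raise an error".)
        raise ValueError(
            "Only one of metadata_filename and file_filename can be specified."
        )
    if file_filename is not None:
        # The deprecated kwarg still works but emits a warning.
        warnings.warn(
            "The file_filename kwarg is deprecated; use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return file_filename
    return metadata_filename
```

`stacklevel=2` makes the warning point at the caller's line rather than at the helper itself, which is usually what a user needs in order to find the deprecated call.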
ryannikolaidis	835378aba6	ci: fix documentation build flow (#1181 )	2023-08-24 00:24:03 -05:00
cragwolfe	df4bd459d5	build(deps): bump unstructured-inference==0.5.16 (#1182 ) Pulls in @newelh's fix: https://github.com/Unstructured-IO/unstructured-inference/pull/184	2023-08-23 05:28:45 +00:00
Charles	1ddf542e14	fix: Don't call extractable_elements if strategy is ocr_only (#1160 ) - fixes #1079 where partitioning is happening twice in the case of `strategy="ocr_only"` - only calls `extractable_elements` if we can predetermine that `ocr_only` is not a possible strategy even if it was the intended strategy. - Adds additional assertion test that `_partition_pdf_or_image_with_ocr` is not called when falling back to `fast` from `ocr_only`	2023-08-22 19:43:33 -07:00
cragwolfe	e9c649224e	chore: changelog repair (#1179 ) chaos reigns in the changelog. whyyy * there was no 0.10.3 release, so remove that from the CHANGELOG. * fixup 0.10.5 with a couple that were added (in retrospect)	2023-08-22 16:48:18 -07:00
Austin Walker	e7d189fcc8	chore: Bump inference and set default ocr_mode to entire_page (#1172 ) * pip-compile in order to bump unstructured-inference * Set the default `ocr_mode` back to `entire_page` now that [this error](https://github.com/Unstructured-IO/unstructured-inference/pull/183) is addressed * Explicitly add `sphinx-tabs` to `build.in`. This file provides `docs/requirements.txt`. * Remove a pinned `pydantic` version * Fix a makefile command to `pip-compile` a missing ingest file. 0.10.5	2023-08-22 16:05:02 -07:00
Jack Retterer	05e311651a	doc: add delta tables connector reference (#1177 ) Added delta tables to connectors page for users to discover	2023-08-22 12:50:27 -07:00
ryannikolaidis	ac2313a3fa	doc: fix get-api-key link (#1175 )	2023-08-22 19:31:07 +00:00
ryannikolaidis	ab7fafcb41	doc: add pdf extra note (#1165 )	2023-08-22 18:20:26 +00:00
Roman Isecke	4114022d9d	roman/ingest-custom-errors (#1152 ) ### Description Adds three custom errors to ingest: * `SourceConnectionError` * `DestinationConnectionError` * `PartitionError` Included is a base custom error class that adds a wrapper. This wrapper wraps any raised exception into the custom error.	2023-08-22 12:28:29 -04:00
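One way a base error class can "wrap any raised exception into the custom error" is a classmethod decorator, sketched below. The three subclass names come from the commit message; the base class name `CustomError`, the `wrap` method, the `error_string` templates, and `fetch_document` are assumptions made for this example and may not match the ingest code.

```python
from functools import wraps


class CustomError(Exception):
    """Hypothetical base class offering a decorator that wraps exceptions."""

    error_string = "{}"

    @classmethod
    def wrap(cls, func):
        """Decorate func so any exception it raises is re-raised as cls."""

        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except cls:
                # Already the right type; do not wrap it a second time.
                raise
            except Exception as error:
                # Chain the original exception so its traceback is preserved.
                raise cls(cls.error_string.format(error)) from error

        return wrapper


class SourceConnectionError(CustomError):
    error_string = "Error getting data from the source: {}"


class DestinationConnectionError(CustomError):
    error_string = "Error writing data to the destination: {}"


class PartitionError(CustomError):
    error_string = "Error partitioning content: {}"


@SourceConnectionError.wrap
def fetch_document(path):
    # Stand-in for a connector call that fails.
    raise FileNotFoundError(path)
```

Callers can then catch `SourceConnectionError` alone and still reach the underlying exception through `__cause__`.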
Roman Isecke	106ee965a6	Roman/delta table connector (#1132 ) ### Description Add delta table connector and test against a delta table generated via delta.io and uploaded to s3. Shows an example of how to use the connection options to leverage s3. I was able to get this to work with s3 if I pass in the access and secret keys as storage options. Even though the s3 bucket being used is public, would not work without those. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-22 10:19:46 -04:00
Matt Robinson	ad595d32f6	enhancement: tell users to install missing extras (#1167)	2023-08-22 03:00:21 +00:00
### Summary
Updates `partition` to tell users to install the appropriate extras if they're missing. Prior to this PR, users would get an exception stating that `partition_pdf` (or whichever function requires the extras) does not exist.
### Testing
First `pip uninstall ebooklib`. Then run
```python
from unstructured.partition.auto import partition

partition(filename="example-docs/winter-sports.epub")
```
The error should look like
```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
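A guard of this kind can be sketched with `importlib.util.find_spec`, which returns `None` when a top-level module is not installed. The helper name `require_extra` and its parameters are hypothetical; only the error message format is taken from the commit above.

```python
import importlib.util


def require_extra(module_name: str, partitioner: str, extra: str) -> None:
    """Hypothetical guard: raise a helpful ImportError if a dependency is missing.

    module_name should be a top-level module name (for example "ebooklib").
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{partitioner} is not available. "
            f'Install the {extra} dependencies with pip install "unstructured[{extra}]"'
        )


# Passes silently because json ships with the standard library.
require_extra("json", "partition_json", "json")
```

Checking for the module up front lets the error name the extra to install instead of surfacing a less specific failure later.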
Jack Retterer	f639d04695	Fixed some typos (#1162 ) The Wikipedia data connector was labeled as Airtable.	2023-08-21 18:03:15 -07:00
Roman Isecke	db8af4f5de	Roman/notion tests (#1072 ) ### Description * Add ingest test for Notion docs * Update default cache dir for connectors to include connector name. Makes debugging the cached content easier. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-21 15:16:50 -04:00
Jack Retterer	a35ff890e0	Update docs jack (#1157 ) Documentation Overhaul - Added documentation hierarchy - Added options for Bash vs Python for API & Upstream Connectors - Added Introduction section (Overview, Key Concepts, Getting Started) - Redid connectors section - Installation is now broken up (needs further work)	2023-08-21 10:27:32 -07:00
ryannikolaidis	6330278839	chore: add ingest diagrams and explanation of flow to ingest README (#1158 )	2023-08-21 16:24:03 +00:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137 ) Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00
John	69edffb0c0	bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134)	2023-08-18 23:21:11 +00:00
### Summary
Closes #1027
The msg test in question was no longer failing after removing the quick-fix and comment explaining the issue. However, the test was not functioning as intended. Test was refactored to appropriately test `metadata_last_modified` of attachments. `partition_msg` was then updated to pass `metadata_last_modified` to `attachment_partitioner`. The same was done for email partitioning.
### Testing
```python
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email

filename = "example-docs/fake-email-attachment.msg"
elements = partition_msg(
    filename=filename,
    attachment_partitioner=partition_text,
    process_attachments=True,
    metadata_last_modified="0000-00-00",
)

# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified
elements[-1].text
elements[-1].metadata.last_modified

email_filename = "example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(
    filename=email_filename,
    attachment_partitioner=partition_text,
    process_attachments=True,
    metadata_last_modified="0000-00-00",
)

email_elements[1].metadata.last_modified
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
Austin Walker	dd243b4fd9	chore: pass ocr_mode in partition_pdf_or_image (#1154 ) Set to individual_blocks for now to work around [this bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179). I verified by printing the current ocr_mode in inference. The `entire_page` default is overridden. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: awalker4 <awalker4@users.noreply.github.com> 0.10.4	2023-08-18 20:59:08 +00:00
cragwolfe	1456f06b2d	chore: skip consistently failing test in main (#1150 ) The reason this test is failing is the API is returning "fast" results when "hi_res" is requested, which is being tracked in this ticket: https://github.com/Unstructured-IO/unstructured-api/issues/188 . This failure was only showing up on the `main` branch, per the commented out `pytest` skips.	2023-08-18 10:06:17 -07:00
cragwolfe	6001e2fa62	chore: changelog repair (#1149 ) 0.10.2 had been released, but prior commit co-mingled 0.10.2 and 0.10.3. This corrects the changelog and intentionally skips over 0.10.3. Bonus: remove accidental dupe line in 0.10.0.	2023-08-17 23:41:04 -07:00
Francisco Kurucz	d2a41f462d	doc: fix typo on partition_md function in bricks documentation (#1147 )	2023-08-17 20:54:11 -07:00
ryannikolaidis	668d0f1b01	feat: per-process ingest connections (#1058 ) * adds per process connections for Google Drive connector	2023-08-17 17:34:08 +00:00
cragwolfe	dd0f582585	build(deps): bump unstructured-inference==0.5.13 (#1141 ) Bump to unstructured-inference==0.5.13, which includes: Fix extracted image elements being included in layout merge, addresses the issue where an entire-page image in a PDF was not passed to the layout model when using hi_res. 0.10.2	2023-08-17 06:25:00 +00:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True(#1125 ) Closes Github issue #1121 Adds include_header kwarg to partition_xlsx and change default behavior to True. 0.10.1	2023-08-17 04:16:23 +00:00
cragwolfe	22c12ef806	bump unstructured-inference (#1140 ) Pulls in fix from unstructured-inference==0.5.12: When a pdf page doesn't have much data, it may get buffered in the write to a tempfile. If this happens, we'll hit an error reading the file back. Open to suggestions for a way to unit test this - I was creating some test files with pypdf but I couldn't trigger the error.	2023-08-16 22:29:37 +00:00
cragwolfe	6f1b8d5f28	build(deps): bump unstructured-inference to 0.5.11 (#1138 ) * Bump unstructured-inference==0.5.11: - better defaults for DPI for hi_res and Chipper	2023-08-16 20:52:40 +00:00
Christine Straub	0a23139720	enhancement: implement full-page OCR(#1133 ) *implements full-page OCR as supported in unstructured-inference=0.5.11.	2023-08-16 19:16:35 +00:00
Newel H	be093d2e66	chore: Update dead links to correct pages (#1127)	2023-08-16 10:43:37 -04:00
Summary
Closes #1124
Updates dead links in the repository README:
- Quick Start > Install for local development
- Learn more > Batch Processing
Updates document dependencies to include tesseract-lang for additional language support (a requirement for tests to pass).
Testing
All tests pass
Christine Straub	0e887cc36b	Feat/1060 update metadata fields (#1099 ) Closes Github Issue #1060. * update the metadata field links * update the metadata field emphasized_texts 0.10.0	2023-08-16 04:33:06 +00:00
Sebastian Laverde Alfonso	fe5048a834	feat: chipper local inference notebook (#1116 ) Download chipper model for local use and demonstrate how to partition a .pdf document through the unstructured and unstructured_inference libraries.	2023-08-15 20:43:23 -07:00
cragwolfe	d19183f442	build(lint): don't check version in main against self (#1123 ) If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version. This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.	2023-08-15 17:57:59 +00:00
John	6e5d27c6c3	fix pdf partition of list items being detected as titles in OCR only mode (#1119 ) Closes Github issue #1010 adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines	2023-08-15 09:35:54 -07:00
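The idea of grouping bullet items that are split across multiple lines can be sketched as follows. This is an illustrative stand-in, not the `group_bullet_paragraph` function from the commit: the name `group_bullet_lines`, the set of bullet characters, and the rule that any non-bullet line continues the previous item are all assumptions.

```python
import re

# Bullet markers recognized by this sketch: a bullet dot, a middle dot, "-" or "*".
BULLET = re.compile(r"^\s*[\u2022\u00b7\-\*]\s+")


def group_bullet_lines(text: str) -> list:
    """Merge wrapped lines into the bullet item they belong to."""
    items = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if BULLET.match(line) or not items:
            # A new bullet starts here: drop the marker and open a new item.
            items.append(BULLET.sub("", line).strip())
        else:
            # A continuation line: append it to the item in progress.
            items[-1] += " " + stripped
    return items


text = "\u2022 The big brown bear\nis growling.\n\u2022 The big blue bear\nis sleeping."
print(group_bullet_lines(text))
```

Each returned string then holds one complete list item, so a downstream classifier sees the whole item rather than a short fragment that might look like a title.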
qued	cb923b96a2	build(deps): dependency cleanup (#1102 ) Cleans up some pins that were prone to conflicts. All pins belong in constraints.in. 0.9.3	2023-08-15 05:15:44 +00:00
cragwolfe	d835fb1086	chore: bump pip version in published image (#1111 ) for consistency with the development environment, i.e. the Makefile.	2023-08-14 21:59:31 +00:00
Mike Lay	79a1eb8683	Handle inline and lacking filename (#1109 ) Handle Content-Disposition: inline and attachment without filename * Add new email test example and test with Content-Disposition: inline. * Move attachment_info above for loop so it is always defined * Check if item is inline as well as attachment as these both lack an = character to split on * Create filename if filename is not specified and write file. * Update list_attachments with new filename	2023-08-14 18:38:53 +00:00
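The two cases this commit handles (an `inline` part and a part with no filename) can be reproduced with the standard library's `email` package. Note that this sketch uses the stdlib header parser (`get_content_disposition` and `get_filename`) rather than the manual splitting the commit describes. The sample message, the function name `list_attachment_names`, and the fallback naming scheme are assumptions.

```python
from email import message_from_string

RAW = """\
From: a@example.com
To: b@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain

Body text.
--XX
Content-Type: text/plain
Content-Disposition: inline

Inline part without a filename.
--XX
Content-Type: text/plain
Content-Disposition: attachment; filename="notes=v2.txt"

Attached notes.
--XX--
"""


def list_attachment_names(raw: str) -> list:
    """Return a filename for every inline or attachment part of a message."""
    message = message_from_string(raw)
    names = []
    for part in message.walk():
        # get_content_disposition() returns "inline", "attachment", or None.
        if part.get_content_disposition() not in ("inline", "attachment"):
            continue
        # get_filename() returns None when the header carries no filename,
        # so generate one (this naming scheme is an assumption).
        fallback = f"unnamed_attachment_{len(names) + 1}"
        names.append(part.get_filename() or fallback)
    return names


print(list_attachment_names(RAW))
```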
Christine Straub	80266460fd	fix: GH issue 1057 etree parser error (csv) (#1112 ) Addresses #1057 for CSV. Related to PR #1077. * update partition_csv to always use soupparser_fromstring to parse html text	2023-08-14 17:48:57 +00:00
Mark Risher	612f9da6e8	Update news-of-the-day.ipynb - typo (#1113 ) Fixed typo	2023-08-14 16:48:49 +00:00
Mike Lay	2e0ab86c6a	Fix attachments with `=` in filename (#1110 ) Fix attachments with = in filename * Limit split to first match of = to prevent creating a list of more than two parts * Add example email with attachment name and test for issue	2023-08-13 20:35:18 -07:00
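The fix described here comes down to passing a maximum split count to `str.split`. A minimal sketch, assuming a `key=value` parameter taken from a `Content-Disposition` header (the function name `split_disposition_param` is hypothetical):

```python
from typing import Tuple


def split_disposition_param(item: str) -> Tuple[str, str]:
    """Split a key=value header parameter on the first "=" only."""
    # maxsplit=1 guarantees exactly two parts even if the value contains "=".
    key, value = item.strip().split("=", 1)
    return key.lower(), value.strip('"')


print(split_disposition_param('filename="report=final.pdf"'))
```

Without the limit, `'filename="report=final.pdf"'.split("=")` yields three parts, and unpacking into two names raises `ValueError`.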
Christine Straub	fc2699ff06	Fix/1057 etree parser error tsv (#1106 ) * feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji * chore: update changelog & version	2023-08-14 01:22:36 +00:00
cragwolfe	b4b8ac4d8a	chore: run make pip-compile on mac (#1107 ) so cuda deps removed.	2023-08-13 20:42:12 +00:00
Christine Straub	4a3176885f	Fix/1057 etree parser error xlsx (#1094 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version * feat: add functionality to switch html text parser based on whether the html text contains emoji * chore: update changelog & version * fix lint errors * test: revert the `EXPECTED_XLS_TEXT_LEN` value back * feat: always use `soupparser_fromstring` to parse `html text` * fix lint error	2023-08-13 12:20:33 -07:00
cragwolfe	02af625b93	chore: fix fickle test to not be so time sensitive (#1105 )	2023-08-13 10:58:46 -07:00
Noah Greer	fa0a5afb71	docs: correct spelling of partition in docs (#1104 ) Fixes a typo in several places where the word `partition` is misspelled as `partiton`	2023-08-12 14:57:27 -07:00
John	f63a66dbef	Capture section and chapter in the metadata for epubs under `epub_section` (#1005 ) Capture section and chapter in the metadata for epubs under epub_section. Closes Github issue #459	2023-08-12 21:02:06 +00:00

805 Commits