unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-20 20:37:24 +00:00

Author	SHA1	Message	Date
Charles	1ddf542e14	fix: Don't call extractable_elements if strategy is ocr_only (#1160 ) - fixes #1079 where partitioning is happening twice in the case of `strategy="ocr_only"` - only calls `extractable_elements` if we can predetermine that `ocr_only` is not a possible strategy even if it was the intended strategy. - Adds additional assertion test that `_partition_pdf_or_image_with_ocr` is not called when falling back to `fast` from `ocr_only`	2023-08-22 19:43:33 -07:00
cragwolfe	e9c649224e	chore: changelog repair (#1179 ) chaos reigns in the changelog. whyyy * there was no 0.10.3 release, so remove that from the CHANGELOG. * fixup 0.10.5 with a couple that were added (in retrospect)	2023-08-22 16:48:18 -07:00
Austin Walker	e7d189fcc8	chore: Bump inference and set default ocr_mode to entire_page (#1172 ) * pip-compile in order to bump unstructured-inference * Set the default `ocr_mode` back to `entire_page` now that [this error](https://github.com/Unstructured-IO/unstructured-inference/pull/183) is addressed * Explicitly add `sphinx-tabs` to `build.in`. This file provides `docs/requirements.txt`. * Remove a pinned `pydantic` version * Fix a makefile command to `pip-compile` a missing ingest file. 0.10.5	2023-08-22 16:05:02 -07:00
Jack Retterer	05e311651a	doc: add delta tables connector reference (#1177 ) Added delta tables to connectors page for users to discover	2023-08-22 12:50:27 -07:00
ryannikolaidis	ac2313a3fa	doc: fix get-api-key link (#1175 )	2023-08-22 19:31:07 +00:00
ryannikolaidis	ab7fafcb41	doc: add pdf extra note (#1165 )	2023-08-22 18:20:26 +00:00
Roman Isecke	4114022d9d	roman/ingest-custom-errors (#1152 ) ### Description Adds three custom errors to ingest: * `SourceConnectionError` * `DestinationConnectionError` * `PartitionError` Included is a base custom error class that adds a wrapper. This wrapper wraps any raised exception into the custom error.	2023-08-22 12:28:29 -04:00
Roman Isecke	106ee965a6	Roman/delta table connector (#1132 ) ### Description Add delta table connector and test against a delta table generated via delta.io and uploaded to s3. Shows an example of how to use the connection options to leverage s3. I was able to get this to work with s3 if I pass in the access and secret keys as storage options. Even though the s3 bucket being used is public, would not work without those. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-22 10:19:46 -04:00
Matt Robinson	ad595d32f6	enhancement: tell users to install missing extras (#1167 ) ### Summary Updates `partition` to let users know to install the appropriate extras if they're missing. Prior to this PR, users would get an exception stating `partition_pdf` (or whichever function that requires extras) does not exist. ### Testing First `pip uninstall ebooklib`. Then run ```python from unstructured.partition.auto import partition partition(filename="example-docs/winter-sports.epub") ``` The error should look like ```python ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]" ```	2023-08-22 03:00:21 +00:00
Jack Retterer	f639d04695	Fixed some typos (#1162 ) The Wikipedia data connector was labeled as Airtable.	2023-08-21 18:03:15 -07:00
Roman Isecke	db8af4f5de	Roman/notion tests (#1072 ) ### Description * Add ingest test for Notion docs * Update default cache dir for connectors to include connector name. Makes debugging the cached content easier. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-21 15:16:50 -04:00
Jack Retterer	a35ff890e0	Update docs jack (#1157 ) Documentation Overhaul - Added documentation hierarchy - Added options for Bash vs Python for API & Upstream Connectors - Added Introduction section (Overview, Key Concepts, Getting Started) - Redid connectors section - Installation is now broken up (needs further work)	2023-08-21 10:27:32 -07:00
ryannikolaidis	6330278839	chore: add ingest diagrams and explanation of flow to ingest README (#1158 )	2023-08-21 16:24:03 +00:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137 ) Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00
John	69edffb0c0	bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134 ) ### Summary Closes #1027 The msg test in question was no longer failing after removing the quick-fix and comment explaining the issue. However, the test was not functioning as intended. Test was refactored to appropriately test `metadata_last_modified` of attachments. `partition_msg` was then updated to pass `metadata_last_modified` to `attachment_partitioner`. The same was done for email partitioning. ### Testing ``` from unstructured.partition.text import partition_text from unstructured.partition.msg import partition_msg from unstructured.partition.email import partition_email filename="example-docs/fake-email-attachment.msg" elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") # previously, these were different values because last_modified wasn't being updated in attachments elements[1].metadata.last_modified elements[-1].text elements[-1].metadata.last_modified email_filename="example-docs/eml/fake-email-attachment.eml" email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") email_elements[1].metadata.last_modified email_elements[-1].text email_elements[-1].metadata.last_modified ```	2023-08-18 23:21:11 +00:00
Austin Walker	dd243b4fd9	chore: pass ocr_mode in partition_pdf_or_image (#1154 ) Set to individual_blocks for now to work around [this bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179). I verified by printing the current ocr_mode in inference. The `entire_page` default is overridden. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: awalker4 <awalker4@users.noreply.github.com> 0.10.4	2023-08-18 20:59:08 +00:00
cragwolfe	1456f06b2d	chore: skip consistently failing test in main (#1150 ) The reason this test is failing is the API is returning "fast" results when "hi_res" is requested, which is being tracked in this ticket: https://github.com/Unstructured-IO/unstructured-api/issues/188 . This failure was only showing up on the `main` branch, per the commented out `pytest` skips.	2023-08-18 10:06:17 -07:00
cragwolfe	6001e2fa62	chore: changelog repair (#1149 ) 0.10.2 had been released, but prior commit co-mingled 0.10.2 and 0.10.3. This corrects the changelog and intentionally skips over 0.10.3. Bonus: remove accidental dupe line in 0.10.0.	2023-08-17 23:41:04 -07:00
Francisco Kurucz	d2a41f462d	doc: fix typo on partition_md function in bricks documentation (#1147 )	2023-08-17 20:54:11 -07:00
ryannikolaidis	668d0f1b01	feat: per-process ingest connections (#1058 ) * adds per process connections for Google Drive connector	2023-08-17 17:34:08 +00:00
cragwolfe	dd0f582585	build(deps): bump unstructured-inference==0.5.13 (#1141 ) Bump to unstructured-inference==0.5.13, which includes: Fix extracted image elements being included in layout merge, addresses the issue where an entire-page image in a PDF was not passed to the layout model when using hi_res. 0.10.2	2023-08-17 06:25:00 +00:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True(#1125 ) Closes Github issue #1121 Adds include_header kwarg to partition_xlsx and change default behavior to True. 0.10.1	2023-08-17 04:16:23 +00:00
cragwolfe	22c12ef806	bump unstructured-inference (#1140 ) Pulls in fix from unstructured-inference==0.5.12: When a pdf page doesn't have much data, it may get buffered in the write to a tempfile. If this happens, we'll hit an error reading the file back. Open to suggestions for a way to unit test this - I was creating some test files with pypdf but I couldn't trigger the error.	2023-08-16 22:29:37 +00:00
cragwolfe	6f1b8d5f28	build(deps): bump unstructured-inference to 0.5.11 (#1138 ) * Bump unstructured-inference==0.5.11: - better defaults for DPI for hi_res and Chipper	2023-08-16 20:52:40 +00:00
Christine Straub	0a23139720	enhancement: implement full-page OCR (#1133 ) * implements full-page OCR as supported in unstructured-inference=0.5.11.	2023-08-16 19:16:35 +00:00
Newel H	be093d2e66	chore: Update dead links to correct pages (#1127 ) Summary Closes #1124 Updates dead links in repository README - Quick Start > Install for local development - Learn more > Batch Processing) Updates document dependencies to include tesseract-lang for additional language support (requirement for tests to pass) Testing All tests pass	2023-08-16 10:43:37 -04:00
Christine Straub	0e887cc36b	Feat/1060 update metadata fields (#1099 ) Closes Github Issue #1060. * update the metadata field links * update the metadata field emphasized_texts 0.10.0	2023-08-16 04:33:06 +00:00
Sebastian Laverde Alfonso	fe5048a834	feat: chipper local inference notebook (#1116 ) Download chipper model for local use and demonstrate how to partition a .pdf document through the unstructured and unstructured_inference libraries.	2023-08-15 20:43:23 -07:00
cragwolfe	d19183f442	build(lint): don't check version in main against self (#1123 ) If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version. This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.	2023-08-15 17:57:59 +00:00
John	6e5d27c6c3	fix pdf partition of list items being detected as titles in OCR only mode (#1119 ) Closes Github issue #1010 adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines	2023-08-15 09:35:54 -07:00
qued	cb923b96a2	build(deps): dependency cleanup (#1102 ) Cleans up some pins that were prone to conflicts. All pins belong in constraints.in. 0.9.3	2023-08-15 05:15:44 +00:00
cragwolfe	d835fb1086	chore: bump pip version in published image (#1111 ) for consistency with the development environment, i.e. the Makefile.	2023-08-14 21:59:31 +00:00
Mike Lay	79a1eb8683	Handle inline and lacking filename (#1109 ) Handle Content-Disposition: inline and attachment without filename * Add new email test example and test with Content-Disposition: inline. * Move attachment_info above for loop so it is always defined * Check if item is inline as well as attachment as these both lack an = character to split on * Create filename if filename is not specified and write file. * Update list_attachments with new filename	2023-08-14 18:38:53 +00:00
Christine Straub	80266460fd	fix: GH issue 1057 etree parser error (csv) (#1112 ) Addresses #1057 for CSV. Related to PR #1077. * update partition_csv to always use soupparser_fromstring to parse html text	2023-08-14 17:48:57 +00:00
Mark Risher	612f9da6e8	Update news-of-the-day.ipynb - typo (#1113 ) Fixed typo	2023-08-14 16:48:49 +00:00
Mike Lay	2e0ab86c6a	Fix attachments with `=` in filename (#1110 ) Fix attachments with = in filename * Limit split to first match of = to prevent creating a list of more than two parts * Add example email with attachment name and test for issue	2023-08-13 20:35:18 -07:00
Christine Straub	fc2699ff06	Fix/1057 etree parser error tsv (#1106 ) * feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji * chore: update changelog & version	2023-08-14 01:22:36 +00:00
cragwolfe	b4b8ac4d8a	chore: run make pip-compile on mac (#1107 ) so cuda deps removed.	2023-08-13 20:42:12 +00:00
Christine Straub	4a3176885f	Fix/1057 etree parser error xlsx (#1094 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version * feat: add functionality to switch html text parser based on whether the html text contains emoji * chore: update changelog & version * fix lint errors * test: revert the `EXPECTED_XLS_TEXT_LEN` value back * feat: always use `soupparser_fromstring` to parse `html text` * fix lint error	2023-08-13 12:20:33 -07:00
cragwolfe	02af625b93	chore: fix fickle test to not be so time sensitive (#1105 )	2023-08-13 10:58:46 -07:00
Noah Greer	fa0a5afb71	docs: correct spelling of partition in docs (#1104 ) Fixes a typo in several places where the word `partition` is misspelled as `partiton`	2023-08-12 14:57:27 -07:00
John	f63a66dbef	Capture section and chapter in the metadata for epubs under `epub_section` (#1005 ) Capture section and chapter in the metadata for epubs under epub_section. Closes Github issue #459	2023-08-12 21:02:06 +00:00
Ronny H	0d5b5a0e79	Revamp README & Bricks documentation (#1103 ) Reorganize README.md	2023-08-12 19:58:51 +00:00
Roman Isecke	9d29f5dc2e	Add init file to make notion module discoverable (#1100 ) One of the added modules was missing an __init__.py file which made it undiscoverable in the path when running as a cli command via console script rather than the PYTHONPATH=. python ... approach.	2023-08-12 12:21:07 -07:00
Ahmet Melek	627f78c16f	feat: airtable connector (#1012 ) * add the first version of airtable connector * change imports as inline to fail gracefully in case of lacking dependency * parse tables as csv rather than plain text * add relevant logic to be able to use --airtable-list-of-paths * add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings * fix ingest test names * add scripts for the large table test * remove large table test from diff test * make base and table ids explicit * add and remove comments * use -ne instead of != * update code based on the recent ingest refactor, update changelog and version * shellcheck fix * update comments * update check-num-rows-and-columns-output error message Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * update help comments * update help comments * update help comments * update workflows to set auth tokens and to run make install * add comments on create_scale_test_components * separate component ids from the test script, add comments to document test component creation * add LARGE_BASE test, implement LARGE_BASE component creation, replace component id * shellcheck fixes * shellcheck fixes * update docs * update comment * bump version * add wrongly deleted file * sort columns before saving to process * Update ingest test fixtures (#1098) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-11 12:02:51 -07:00
Matt Robinson	fa5a3dbd81	feat: `unique_element_ids` kwarg for UUID elements (#1085 ) * added kwarg for unique elements * test for unique ids * update docs * changelog and version	2023-08-11 11:02:37 +00:00
Christine Straub	d26ab1deac	fix: etree parser error (#1077 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version	2023-08-10 23:28:57 +00:00
Ronny H	b31c62fa84	replace Weaviate nearText with BM25 query algorithm (#1078 )	2023-08-10 22:15:27 +00:00
cragwolfe	6779918406	build(release): bump unstructured-inference (#1074 ) * build(release): bump unstructured-inference Related to downstream issue: Unstructured-IO/unstructured-api#182 And upstream PR: Unstructured-IO/unstructured-inference#165 --------- Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com> 0.9.2	2023-08-10 20:57:46 +00:00
Ahmet Melek	64a1930c46	chore[ingest]: fix confluence ingest diff tests (#1082 ) * trigger CI * trigger CI * trigger CI * do not ingest personal spaces in the diff test * fix argument * Update ingest test fixtures (#1083) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-10 17:45:17 +00:00
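
Commit 4114022d9d above describes a base custom error class whose wrapper turns any raised exception into the custom error. The following is a minimal sketch of that idea, not the library's actual implementation: the `CustomError` base name, the `wrap` classmethod, and the `fetch_document` function are hypothetical and chosen only for illustration. The three subclass names come from the commit message.

```python
from functools import wraps


class CustomError(Exception):
    """Hypothetical base class for this sketch."""

    @classmethod
    def wrap(cls, func):
        """Decorator that re-raises any exception from func as an instance of cls."""

        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except cls:
                # Already the target type, so let it propagate unchanged.
                raise
            except Exception as error:
                # Chain the original exception so its traceback is preserved.
                raise cls(f"{type(error).__name__}: {error}") from error

        return wrapper


class SourceConnectionError(CustomError):
    pass


class DestinationConnectionError(CustomError):
    pass


class PartitionError(CustomError):
    pass


@SourceConnectionError.wrap
def fetch_document(path):
    # Hypothetical stand-in for a connector call that fails.
    raise FileNotFoundError(path)
```

With this sketch, a caller can catch `SourceConnectionError` and still reach the underlying failure through the exception's `__cause__` attribute.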


1447 Commits
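
Commits 2e0ab86c6a and 79a1eb8683 above describe two related fixes for email attachments: splitting a `Content-Disposition` parameter on only the first `=`, and handling bare tokens such as `inline` or `attachment` that contain no `=` at all. A minimal sketch of both points follows, assuming a simplified header that is split naively on `;`. The function name `parse_disposition_item` is hypothetical and is not the library's code.

```python
def parse_disposition_item(item):
    """Return (key, value) for one Content-Disposition part; value is None for bare tokens."""
    item = item.strip()
    if "=" not in item:
        # Bare tokens such as "attachment" or "inline" have nothing to split on.
        return item, None
    # maxsplit=1 keeps any further "=" characters inside the value.
    key, value = item.split("=", 1)
    return key.strip(), value.strip().strip('"')


header = 'attachment; filename="report=final.txt"'
parsed = [parse_disposition_item(part) for part in header.split(";")]
print(parsed)
```

Without the `maxsplit` argument, `item.split("=")` on the filename part would return three pieces, and unpacking into two names would raise a `ValueError`.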