unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-16 18:37:25 +00:00

Author	SHA1	Message	Date
Christine Straub	35ec21ecd0	fix: decide table extraction (#3090 ) This PR aims to add backward compatibility for the deprecated `pdf_infer_table_structure` parameter. A missing part of turning table extraction for PDFs and Images off by default in https://github.com/Unstructured-IO/unstructured/pull/3035, which was turned on in https://github.com/Unstructured-IO/unstructured/pull/2588. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-05-23 20:37:15 +00:00
Roman Isecke	3eaf65a8c1	feat: refactor ingest (#3009 ) ### Description This refactors the current ingest CLI process to support better granularity in how the steps are ran * Both multiprocessing and async now supported. Given that a lot of the steps are IO-bound, such as downloading and uploading content, we can achieve better parallelization by using async here * Destination step broken up into a stager step and an upload step. This will allow for steps that require manipulation of the data between formats, such as converting the elements json into a csv format to upload for tabular destinations, to be pulled out of the step that does the actual upload. * The process of writing the content to a local destination was now pulled out as it's own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere. * Quick update to the chunker/partition step to use the python client. * Move the uncompress suppport as a pipeline step since this can arbitrarily apply to any concrete files that have been downloaded, regardless of where they came from. * Leverage last modified date to mark files to be reprocessed, even if the file already exists locally. ### Callouts Retry configs haven't been moved over yet. This is an open question because the intent was for it to wrap potential connection errors but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? This would bloat the input params even more. ### Testing * If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: [example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py) * If you want to run the new code as an isolated CLI: ```shell PYTHONPATH=. python unstructured/ingest/v2/main.py --help ``` * If you want to see which commands have been migrated to the new version, there's now a `v2` short help text next to those commands when running the current cli: ```shell PYTHONPATH=. python unstructured/ingest/main.py --help Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help Options: --help Show this message and exit. Commands: airtable azure biomed box confluence delta-table discord dropbox elasticsearch fsspec gcs github gitlab google-drive hubspot jira local v2 mongodb notion onedrive opensearch outlook reddit s3 v2 salesforce sftp sharepoint slack wikipedia ``` You can run any of the local or s3 specific ingest tests and these should now work. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-05-21 17:01:49 +00:00
Yuming Long	542d442699	chore CORE-4775: remove html page number metadata field (#2942 ) ### Summary Rip off page_number metadata fields until we have page counting for all kinds of html files (not just limited to news articles with multiple `<article>` tag) ### Test Unit tests `test_add_chunking_strategy_on_partition_html_respects_multipage` and `test_add_chunking_strategy_title_on_partition_auto_respects_multipage` removed since they relay on the `page_number` fields from the SEC html file - now test moved to mock test for chunk_by_title -> revisit those tests when we find test file for this Also changed the element ids from partition outputs for html files - element id change due to page number change (in element id hashing) -> todo ticket: update other deterministic element id tests per crag's comment --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2024-04-30 15:20:26 +00:00
Pluto	df1f7bcd0e	Save table prediction in cells format (#2892 ) This pull request allows to return predictions in raw cell representation from table transformer. It will be later used to save prediction in a cells format for simpler metrics calculation. This PR has to be merged, after https://github.com/Unstructured-IO/unstructured-inference/pull/335	2024-04-25 11:14:48 +00:00
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more test for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
Michał Martyniak	001fa17c86	Preparing the foundation for better element IDs (#2842 ) Part one of the issue described here: https://github.com/Unstructured-IO/unstructured/issues/2461 It does not change how hashing algorithm works, just reworks how ids are assigned: > Element ID Design Principles > > 1. A partitioning function can assign only one of two available ID types to a returned element: a hash or UUID. > 2. All elements that are returned come with an ID, which is never None. > 3. No matter which type of ID is used, it will always be in string format. > 4. Partitioning a document returns elements with hashes as their default IDs. Big thanks to @scanny for explaining the current design and suggesting ways to do it right, especially with chunking. Here's the next PR in line: https://github.com/Unstructured-IO/unstructured/pull/2673 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>	2024-04-16 21:14:53 +00:00
MiXiBo	0506aff788	add support for `start_index` in `html` links extraction (#2600 ) add support for start_index in html links extraction (closes #2625) Testing ``` from unstructured.partition.html import partition_html from unstructured.staging.base import elements_to_json html_text = """<html> <p>Hello there I am a <a href="/link">very important link!</a></p> <p>Here is a list of my favorite things</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li> <li>Dogs</li> </ul> <a href="/loner">A lone link!</a> </html>""" elements = partition_html(text=html_text) print(elements_to_json(elements)) ``` --------- Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-04-12 06:14:20 +00:00
Ahmet Melek	d46792214a	feat: add vertexai embeddings (#2693 ) This PR: - Adds VertexAI embeddings as an embedding provider Testing - Tested with pinecone destination connector on [this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693) job run. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-03-28 21:15:36 +00:00
Steve Canny	9ae838e50a	feat: add --include-orig-elements option to Ingest CLI (#2687 ) Summary Add an `--include-orig-elements` option to the Ingest CLI to allow users to specify that corresponding new chunking parameter. Reviewer A lot of this is cleanup, the second commit is where the actual adding of this option are. The first commit fixes a number of inaccuracies in the documentation and does some other clean-up. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-27 06:35:01 +00:00
Christine Straub	08fafc564f	Fix: embedded text not getting merged with inferred elements (#2679 ) This PR is the second part of fixing "embedded text not getting merged with inferred elements", the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/331. ### Summary - replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when removing pdfminer (embedded) elements that were merged with inferred elements - use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD` introduced in the [first part](https://github.com/Unstructured-IO/unstructured-inference/pull/331) when removing pdfminer (embedded) elements that were merged with inferred elements - bump `unstructured-inference` to 0.7.25 ### Testing PDF: [pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf) ``` $ pip uninstall unstructured-inference -y $ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference $ pip install -e . ``` ``` elements = partition_pdf( filename="pwc-financial-statements-p114.pdf", strategy="hi_res", infer_table_structure=True, extract_image_block_types=["Image"], ) table_elements = [el for el in elements if el.category == "Table"] print(table_elements[0].text) ``` --------- Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-03-23 03:59:23 +00:00
Steve Canny	56fbaaed10	feat(chunking): add metadata.orig_elements serde (#2680 ) Summary This final PR in the "orig_elements" series adds the needful such that `.metadata.orig_elements`, when present on a chunk (element), is serialized to JSON when the chunk is serialized, for instance, to be used in an HTTP response payload. It also provides for deserializing such a JSON payload into chunks that contain the `.orig_elements` metadata. Additional Context Note that `.metadata.orig_elements` is always `Optional[list[Element]]` when in memory. However, those original elements are serialized as Base64-encoded gzipped JSON and are in that form (str) when present as JSON or as "element-dicts" which is an intermediate serialization/deserialization format. That is, serialization is `Element -> dict -> JSON` and deserialization is `JSON -> dict -> Element` and `.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-22 21:53:26 +00:00
Filip Knefel	bdfd975115	chore: change table extraction defaults (#2588 ) Change default values for table extraction - works in pair with [this](https://github.com/Unstructured-IO/unstructured-api/pull/370) `unstructured-api` PR We want to move away from `pdf_infer_table_structure` parameter, in this PR: - We change how it's treated wrt `skip_infer_table_types` parameter. Whether to extract tables from pdf now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types` - We set it to `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default - We remove it from the examples in documentation - We describe it as deprecated in favor of `skip_infer_table_types` in documentation More detailed description of how we want parameters to interact - if `pdf_infer_table_structure` is False tables will never extracted from pdf - if `pdf_infer_table_structure` is True tables will be extracted from pdf unless it's skipped via `skip_infer_table_types` - on default `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-22 10:08:49 +00:00
Roman Isecke	4ff6a5b78e	Roman/bugfix support bedrock embeddings (#2650 ) ### Description This PR resolved the following open issue: [bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319). To do so, the following changes were made: * All aws configs were added as input parameters to the CLI * These were mapped to the bedrock embedder when an embedder is generated via `get_embedder` * An ingest test was added to call the aws bedrock service * Requirements for boto were bumped because the first version to introduce the bedrock runtime, which is required to hit the bedrock service, was introduced in version `1.34.63`, which was ahead of the version of boto pinned. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-03-21 18:21:04 +00:00
David Potter	5b92e0bb6b	bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638 ) Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both are microsoft products) https://github.com/Unstructured-IO/unstructured/pull/2591 https://github.com/Unstructured-IO/unstructured/pull/2592/files We are seeing occurrences of inconsistency in the timestamps returned by Onedrive when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. Changes This adds logic to guarantee Onedrive dates will be properly formatted as ISO, regardless of the format provided by the onedrive library. Bumps timestamp format output to include timezone offset (as we do with others) Adds unit tests for isofomat. json_to_dict already unit tested here: https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py Adds small change for AstraDB to allow them to see what source called their api	2024-03-15 14:01:05 +00:00
Steve Canny	8ea203adf7	feat(chunking): composite text gets is_continuation (#2639 ) Summary Add `metadata.is_continuation = True` to metadata of second-and-later text-split chunks formed from an oversized non-table element. Previously this metadata was only present on text-split `TableChunk` elements. This enables downstream filtering of intentionally redundant metadata on chunk elements that may not be desired for all purposes. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-12 19:44:41 +00:00
David Potter	1ca90d209a	bug: update sharepoint-with-permissions test to fix CI (#2589 ) Adding `metadata.data_source.permissions_data` to sharepoint-with-permissions.sh --metadata-exclude to prevent sharepoint deprecation warning from ruining test. Updating expected-structured-output As per Ahmet's comment. We do want to check sharepoint permissions metadata at some point. But that will take a separate type of test. A file diff test is too unstable. Permissions checking will be later down the road.	2024-03-06 17:15:36 +00:00
ryannikolaidis	71d5d513ef	fix: handling of varied SharePoint date formats (#2591 ) We are seeing occurrences of inconsistency in the timestamps returned by office365.sharepoint when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. ## Changes - This adds logic to guarantee SharePoint dates will be properly formatted as ISO, regardless of the format provided by the sharepoint library. - Bumps timestamp format output to include timezone offset (as we do with others) ## Testing Unit test added to validate this datetime handling across various formats. --------- Co-authored-by: David Potter <potterdavidm@gmail.com>	2024-02-28 16:11:53 +00:00
Matt Robinson	b4d9ad8130	enhancement: detect headers in `partition_pdf` with fast strategy (#2455 ) ### Summary Detects headers and footers when using `partition_pdf` with the fast strategy. Identifies elements that are positioned in the top or bottom 5% of the page as headers or footers. If no coordinate information is available, an element won't be detected as a header or footer. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-02-23 16:56:09 +00:00
Steve Canny	d9f8467187	fix(xlsx): xlsx subtable algorithm (#2534 ) Reviewers: It may be easier to review each of the two commits separately. The first adds the new `_SubtableParser` object with its unit-tests and the second one uses that object to replace the flawed existing subtable-parsing algorithm. Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). This PR replaces the flawed existing algorithm with a `_SubtableParser` object that encapsulates all that logic and has thorough unit-tests. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes. This PR fixes all these bugs: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-13 20:29:17 -08:00
Ahmet Melek	f9f2cacb58	build(release): release commit for 0.12.4 (#2525 ) Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2024-02-08 21:18:29 +00:00
David Potter	0c834517d8	fix: change opensearch port (#2517 ) change opensearch port to see if fixes CI. We think there may be a conflict with the elasticsearch docker port. Also adding simple retry to vector query. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-07 21:25:04 +00:00
Yao You	97fb10db4a	fix: default hi_res model rely on inference setting (#2441 ) - there are multiple places setting the default `hi_res_model_name` in both `unstructured` and `unstructured-inference` - they lead to inconsistency and unexpected behaviors - this fix removes a helper in `unstructured` that tries to set the default hi_res layout detection model; instead we rely on the `unstructured-inference` to provide that default when no explicit model name is passed in ## test ```bash UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython ``` ```python from unstructured.partition.auto import partition # find a pdf file elements = partition("foo.pdf", strategy="hi_res") assert elements[0].metadata.detection_origin == "yolox" ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-01-29 16:44:41 +00:00
David Potter	bc791d53f4	feat: add opensearch source and destination connector (#2349 ) Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible. - Adds OpenSearch source connector to be able to ingest documents from OpenSearch. - Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured OpenSearch indexes easily. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 04:31:49 +00:00
David Potter	76e0d10e61	feat: add MongoDB source connector (#2393 ) Adds MongoDB as a source (we already had it as a destination connector) --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-16 20:56:29 +00:00
jakub-sandomierz-deepsense-ai	ee0441efea	enhancement: normalize Salesforce artifact extensions (#2402 ) Connectors use predictable result file naming convention so consumers of library can write code in abstraction of particular connector. This change introduces compatibility with said naming convention. `_output_filename` returns now filename with format.	2024-01-16 10:36:00 +00:00
Christine Straub	ee06260987	feat: keep all image elements when using `hi_res` strategy. (#2382 ) ### Summary The goal of this PR is to keep all image elements when using "hi_res" strategy. Previously, `Image` elements with small chunks of text were ignored unless the image block extraction parameters (`extract_images_in_pdf` or `extract_image_block_types`) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified. ### Testing - on `main` branch, ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print("number of image elements: ", len(image_elements)) ``` The above code will display `number of image elements: 0`. - on this `feature` branch, The same code will display `number of image elements: 3` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-01-15 23:19:17 +00:00
Steve Canny	2f2c48acd5	feat(ingest): add basic chunking to ingest (#2380 ) The new "basic" chunking strategy and overlap options need to be available from the ingest CLI. An ingest test of those features is also welcome, both to verify the ingest feature and to defend against regressions in the chunking code. Add a local ingest test exercising both the "basic" chunking strategy and intra-chunk overlap. Since there is no new source connector involved, use the local ingest source and destination. Update documentation to suit, filling in some details that hadn't made it into the docs yet.	2024-01-12 20:27:34 +00:00
Roman Isecke	b37b4689bc	drop python3.8 (#2372 ) ### Description Remove all uses of python3.8 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-01-09 23:37:30 +00:00
Steve Canny	22cbdce7ca	fix(html): unequal row lengths in HTMLTable.text_as_html (#2345 ) Fixes #2339 Fixes to HTML partitioning introduced with v0.11.0 removed the use of `tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This had several benefits, but part of `tabulate`'s behavior was to make row-length (cell-count) uniform across the rows of the table. Lacking this prior uniformity produced a downstream problem reported in On closer inspection, the method used to "harvest" cell-text was producing more text-nodes than there were cells and was sensitive to where whitespace was used to format the HTML. It also "moved" text to different columns in certain rows. Refine the cell-text gathering mechanism to get exactly one text string for each row cell, eliminating whitespace formatting nodes and producing strict correspondence between the number of cells in the original HTML table row and that placed in HTML.text_as_html. HTML tables that are uniform (every row has the same number of cells) will produce a uniform table in `.text_as_html`. Merged cells may still produce a non-uniform table in `.text_as_html` (because the source table is non-uniform).	2024-01-04 21:53:19 +00:00
Yao You	5f5ff6319f	fix: consider text in cid code as invalid in hi_res (#2259 ) This PR addresses [CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969) - pdfminer sometimes fail to decode text in an pdf file and returns cid codes as text - now those text will be considered invalid and be replaced with ocr results in `hi_res` mode ## test This PR adds unit test for the utility functions. In addition the file below would return elements with text in cid code on main but proper ascii text with this PR: [005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf) This change improves both cct accuracy and %missing scores: before: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.681 0.267 0.266 105 cct-%missing 0.086 0.159 0.159 105 ``` after: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.697 0.251 0.250 105 cct-%missing 0.071 0.123 0.122 105 ``` [CORE-2969]: https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-14 06:49:23 +00:00
Roman Isecke	cc05e948ff	chore: sensitive info connector audit (#2227 ) ### Description All other connectors that were not included in https://github.com/Unstructured-IO/unstructured/pull/2194 are now updated to follow the new pattern and mark any variables as sensitive where it makes sense. Core changes: * All connectors now support an `AccessConfig` to mark data that's needed for auth (i.e. username, password) and those that are sensitive are designated appropriately using the new enhanced field. * All cli configs on the cli definition now inherit from the base config in the connector file to reuse the variables set on that dataclass * The base writer class was updated to better generalize the new approach given better use of dataclasses * The base cli classes were refactored to also take into account the need for a connector and write config when creating the respective runner/writer classes. * Any mismatch between the cli field name and the dataclass field name were updated on the dataclass side to not impact the user but maintain consistency * Add custom redaction logic for mongodb URIs since the password is expected to be a part of it. Now this: `"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"` -> `"mongodb+srv://ingest-test-user:*REDACTED@ingest-test.hgaig.mongodb.net/"` in the logs Bundle all fsspec based files into their own packages. * Refactor custom `_decode_dataclass` used for enhanced json mixin by using a monkey-patch approach. The original approach was breaking on optional nested dataclasses when serializing since the other methods in `dataclasses_json_core` weren't using the new method. By monkey-patching the original method with a new one, all other methods in that library would use the new one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-11 17:37:49 +00:00
Christine Straub	4ad01efe23	feat: improve reading order (#2219 ) Closes GH Issue #2208.	2023-12-07 23:21:10 -08:00
David Potter	cde11d1eb0	feat: Add sftp source connector (#2163 ) Adds source connector for SFTP which uses fsspec and paramiko via fsspec. Paramiko is the standard sftp package for python used in pysftp etc... ``` --username foo \ --password bar \ --remote-url sftp://localhost:47474/upload/ ``` Will only download a specifically requested file if it has an extension. (i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will treat any other remote_url as a folder path. This is intentional. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-07 19:33:19 +00:00
Roman Isecke	f193d3d43b	feat: improve sensitive data handling by fsspec connectors (#2194 ) ### Description Building off of PR https://github.com/Unstructured-IO/unstructured/pull/2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes: * Update the base classes to inherit from the enhanced json mixin * Add in a new access config dataclass that should be used as a nest dataclass in the connector configs * Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config * Update all fsspec connectors with explicit access configs given what each one's SDKs support * Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass. * Update all the writes to use class-based approach and share the same structure of the runner classes * Above update allowed for better code to be used in the base source and destination CLI commands * Add in utility code around paring a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses. Slightly unrelated changes: * session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared, the index used simply makes a new http call each time it's invoked. * Dedicated write configs were created for all destination connectors to better support serialization * Refactor of Elasticsearch connector included, with update to ingest test to use auth TODOs * Left a `#TODO` in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. `googleapiclient.discovery.Resource`) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-05 20:55:19 +00:00
rvztz	50b1431c9e	rvztz/hubspot ingest connector (#1760 ) Closes #1843 Ingest connector for HubSpot. Supports: - Calls: Logs from calls related to contacts, companies and tickets - Communications: Logs from SMS/Whatsapp related to contacts, companies and tickets - Notes: Notes related to CRM notes - Products: CRM products - Emails: Logs from emails sent to CRM objects. - Tasks: CRM tasks From each record, `body/`description`information is grabbed. When a title property is available, this is registered at the beggining of the output file. The CLI receives three params: - `api-token`: [Private app](https://developers.hubspot.com/docs/api/private-apps) token. - `object-types: One of the noted supported objects in the form of a comma separated list: `calls,products,tasks` - `custom-properties`: Custom properties to grab information from. Must be in the form `<object_type>:<custom_property_id>,<object_type>:<custom_property_id>` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rvztz <rvztz@users.noreply.github.com>	2023-11-28 23:07:57 +00:00
Roman Isecke	30cbc420a0	bug: fsspec output filepath including base directory (#2146 ) ### Description When passing in a remote path for fsspec-based source connectors, the base directory was always being included in the output path itself. This was updated to exclude the base directory any only include any child directories relative to the base one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-28 14:19:42 +00:00
Roman Isecke	2bb463d006	feat: support both single and batch ingest docs (#2105 ) ### Description There are some source ingest connectors that would be more efficient to read the content in batches rather than use an entire process per document. For example, reading from ElasticSearch. Given an index with possible hundreds of documents, reading each one individually is not as optimal as reading in batches. To try and maintain as much of the ingest doc paradigm already being supported, a new class `BaseIngestDocBatch` was added to handle reading in batches. It produces a list of `BaseSingleIngestDoc` which is what all current implementations were renamed to. This list is generated after it runs its `get_files` method. Past the source node, all other steps in the pipeline should not be affected, this is just an optimization for the read step. Additional Changes: * Removed use of jq and instead converted this into a fields filter on the content to let the database handle the filtering and limit the amount of data being pulled in. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-27 19:25:30 +00:00
Steve Canny	ee9be2a3b2	fix: assorted partition_html() bugs (#2113 ) Addresses a cluster of HTML-related bugs: - empty table is identified as bulleted-table - `partition_html()` emits empty (no text) tables (#1928) - `.text_as_html` contains inappropriate `<br>` elements in invalid locations. - cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928) - `.text_as_html` contains whitespace padding Each of these is addressed in a separate commit below. Fixes #1928. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com> Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>	2023-11-20 16:29:32 +00:00
ryannikolaidis	13a23deba6	fix: local connector with input path to single file (#2116 ) When passed an absolute file path for the input document path, the local connector incorrectly writes the output file to the wrong directory. Also, in the single file input path cases we are currently including parent path as part of the destination writing, instead when a single file is specified as input the output file should be located directly in the specified outputs directory. Note: this change meant that we needed to bump the file path of some expected results. This fixes such that the output in this case is written to `output-dir/input-filename.json`. ## Changes - Fix for incorrect output path of files partitioned via the local connector when the input path is a file path (rather than directory) - Updated single-local-file test to validate the flow where we specify an absolute file path (since this was particularly broken) ## Testing Note: running the updated `local-single-file` test without the changes to the local connector will result in a final output copy of: ``` Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json ``` where the output path is the input path and not the expected `output-dir/input-filename.json` Running with this change we can now expect the file at that directory. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2023-11-19 18:21:31 +00:00
Steve Canny	b8a8de33f4	fix(ingest): canonicalize ingest JSON (#2080 ) Canonicalize JSON produced for ingest tests such that incidental changes is _form_ of the JSON objects (keys moving around) that does not change the _content_ of that JSON object does not trigger an ingest-test failure.	2023-11-15 00:52:58 -08:00
Christine Straub	475066ba7c	Fix: fast strategy fallback to ocr only (#2055 ) Closes #2038. ### Summary The `fast` strategy should not fall back to a more expensive strategy. ### Testing For [9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf), the following code should return an empty list. ``` elements = partition(filename=filename, strategy="fast") ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-14 18:46:41 +00:00
John	1ead5a27df	Jj/2011 missing languages metadata (#2037 ) ### Summary Closes #2011 `languages` was missing from the metadata when partitioning pdfs via `hi_res` and `fast` strategies and missing from image partitions via `hi_res`. This PR adds `languages` to the relevant function calls so it is included in the resulting elements. ### Testing On the main branch, `partition_image` will include `languages` when `strategy='ocr_only'`, but not when `strategy='hi_res'`: ``` filename = "example-docs/english-and-korean.png" from unstructured.partition.image import partition_image elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor']) elements[0].metadata.languages elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor']) elements[0].metadata.languages ``` For `partition_pdf`, `'ocr_only'` will include `languages` in the metadata, but `'fast'` and `'hi_res'` will not. ``` filename = "example-docs/korean-text-with-tables.pdf" from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename, strategy="ocr_only", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="fast", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="hi_res", languages=['kor']) elements[0].metadata.languages ``` On this branch, `languages` is included in the metadata regardless of strategy --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2023-11-13 16:47:05 +00:00
Steve Canny	d06bcc41bb	fix(docx): improve page-break detection (#2036 ) Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a `w:lastRenderedPageBreak` element so cause over-counting if used. Use rendered page-breaks only.	2023-11-09 20:34:30 +00:00
Christine Straub	3fe480799a	Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961 ) Closes #1875. ### Summary - add functionality to do a second OCR on cropped table images - use `IMAGE_CROP_PAD` env for `individual_blocks` mode ### Testing The test function [`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425) in `test_pdf.py` should pass. ### NOTE: I've tried to experiment with values for scaling ENVs on the following PRs but found that changes to the values for scaling ENVs affect the entire page OCR output(OCR regression) so switched to doing a second OCR for tables. - https://github.com/Unstructured-IO/unstructured/pull/1998/files - https://github.com/Unstructured-IO/unstructured/pull/2004/files - https://github.com/Unstructured-IO/unstructured/pull/2016/files - https://github.com/Unstructured-IO/unstructured/pull/2029/files --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-09 18:29:55 +00:00
Ahmet Melek	ca78dc737a	feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918 ) Closes #1782 This PR: - Extends ingest pipeline so that it is possible to select an embedding provider from a range of providers - Modifies the ingest embedding test to be a diff test, since the embedding vectors are reproducible after supporting multiple providers Additional info on the chosen provider for the test: - Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic even when there's no seed set - Took 6.84s to pass a unit test with the provider (without cache, including model download) - `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it zero cost For all these reasons, testing embedding modules with the Huggingface model seems to be making sense --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-11-06 12:26:12 +00:00
Roman Isecke	ba4477ac20	feat: support table conversion for tabular destination connectors (#1917 ) ### Description * A full schema was introduced to map the type of all output content from the json partition output and mapped to a flattened table structure to leverage table-based destination connectors. The delta table destination connector was updated at the moment to take advantage of this. * Existing method to convert to a dataframe was updated because it had a bug in it. Object content in the metadata would have the key name changed when flattened but then this would be omitted since it didn't exist in the `_get_metadata_table_fieldnames` response. * Unit test was added to make sure we handle all values possible in an Element when converting to a table * Delta table ingest test was split into a source and destination test (looking ahead to split these up in CI) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-03 16:47:21 +00:00
shreyanid	c24e6e056c	chore: add doctype to ingest evaluation functions (#1977 ) ### Summary To combine ingest and holistic metrics efforts, add the `doctype` field to the results from the functions in evaluate.py for use in subsequent aggregation functions. ### Test Run `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` and there will be a new doctype column with the file's doctype extension. <img width="508" alt="Screenshot 2023-11-01 at 2 23 11 PM" src="https://github.com/Unstructured-IO/unstructured/assets/42684285/44583da9-e7ef-4142-be72-c2247b954bcf"> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>	2023-11-02 19:15:53 +00:00
Ahmet Melek	a9a3efd85c	bugfix: SharePoint permissions fetching should be opt-in (#1894 ) Closes: #1891 (check the issue for more info) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> Co-authored-by: Yao You <yao@unstructured.io>	2023-10-31 15:55:07 +00:00
Christine Straub	1f0c563e0c	refactor: `partition_pdf()` for `ocr_only` strategy (#1811 ) ### Summary Update `ocr_only` strategy in `partition_pdf()`. This PR adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy. - Add functionality to perform OCR region grouping based on the OCR text taken from `pytesseract.image_to_string()` - Add functionality to get layout elements from OCR regions (ocr_layout) for both `tesseract` and `paddle` - Add functionality to determine the `source` of merged text regions when merging text regions in `merge_text_regions()` - Merge multiple test functions related to "ocr_only" strategy into `test_partition_pdf_with_ocr_only_strategy()` - This PR also fixes [issue #1792](https://github.com/Unstructured-IO/unstructured/issues/1792) ### Evaluation ``` # Image PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image # PDF PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf ``` ### Test - Before update All elements have the same coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768) - After update All elements have accurate coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-30 20:13:29 +00:00
Roman Isecke	680cfbabd4	expand fsspec downstream connectors (#1777 ) ### Description Replacing PR [1383](https://github.com/Unstructured-IO/unstructured/pull/1383) --------- Co-authored-by: Trevor Bossert <alanboss@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-30 20:09:49 +00:00

1 2 3 4

178 Commits