### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are run:
* Both multiprocessing and async are now supported. Since many of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here (see the sketch after
this list).
* Destination step broken up into a stager step and an upload step. This
allows steps that require manipulating the data between formats, such as
converting the elements JSON into a CSV format for tabular destinations,
to be pulled out of the step that does the actual upload.
* The process of writing content to a local destination has been pulled
out as its own dedicated destination connector, meaning you no longer
need to persist the content locally once the process is done if the
content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the Python client.
* Moved uncompress support into a pipeline step, since it can apply to
any concrete files that have been downloaded, regardless of where they
came from.
* Leverage the last modified date to mark files for reprocessing, even
if the file already exists locally.
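Below is a minimal sketch (not the pipeline's actual code) of how the
IO-bound steps such as download and upload can be fanned out with
`asyncio`; the function names here are illustrative only:
```python
import asyncio

async def download_one(url: str) -> bytes:
    # Stand-in for an IO-bound call (HTTP download, blob fetch, etc.).
    await asyncio.sleep(0.1)
    return b"..."

async def download_all(urls: list[str]) -> list[bytes]:
    # Fan all downloads out concurrently instead of one worker per file.
    return await asyncio.gather(*(download_one(u) for u in urls))

contents = asyncio.run(download_all(["https://example.com/a", "https://example.com/b"]))
```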
### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for them to wrap potential connection errors, but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? That would bloat the input params even more.
### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current CLI:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
airtable
azure
biomed
box
confluence
delta-table
discord
dropbox
elasticsearch
fsspec
gcs
github
gitlab
google-drive
hubspot
jira
local v2
mongodb
notion
onedrive
opensearch
outlook
reddit
s3 v2
salesforce
sftp
sharepoint
slack
wikipedia
```
You can run any of the local or s3-specific ingest tests and these
should now work.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Closes #2959. Updates the dependencies and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. The issue seems unrelated to Python 3.12,
though it's unclear why this didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
### Summary
Closes #3021. Turns table extraction for PDFs and images off by
default. The default behavior originally changed in #2588. The reason
for the reversion is that some users did not realize turning off table
extraction was an option and experienced long processing times for PDFs
and images with the new default behavior.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Thanks to @erichare from AstraDB
Adds support for specifying the indexing options for various columns in
Astra DB, allowing users to avoid a situation where long text columns
are indexed by default.
The changes to test_unstructured_ingest/python/test-ingest-astra-output.py
are forward-looking from AstraDB.
### Summary
Rip out the page_number metadata fields until we have page counting for
all kinds of HTML files (not just news articles with multiple
`<article>` tags).
### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
were removed since they rely on the `page_number` fields from the SEC
HTML file; the tests have been moved to mock tests for `chunk_by_title`.
Revisit those tests when we find a suitable test file.
Also changed the element IDs in partition outputs for HTML files; the
element IDs change because the page number feeds into element ID
hashing. TODO ticket: update other deterministic element ID tests per
Crag's comment.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
This pull request makes it possible to return predictions in raw cell
representation from the table transformer. It will later be used to
save predictions in a cells format for simpler metrics calculation.
This PR has to be merged after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
Update: The CLI shell script works when sending documents to the free
API, but the paid API is down, so waiting to test against it.
- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a json file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the api for chunking.
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
* hash computation includes the element's sequence number on the page,
the page number, the document filename, and its text
* there are more tests for the deterministic behavior of IDs returned by
partitioning functions, plus their uniqueness (guaranteed at the
document level, and with high probability across multiple documents)
This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461
It does not change how the hashing algorithm works, just reworks how IDs
are assigned:
> Element ID Design Principles
>
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.
Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.
Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
This PR aims to update the "Ingest Test Fixtures Update PR" CI to update
the ingest test fixtures only if the OVERWRITE_FIXTURES variable is not
`false` and the OUTPUT_DIR directory is not empty.
Add support for start_index in HTML links extraction (closes #2625).
Testing:
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """<html>
<p>Hello there I am a <a href="/link">very important link!</a></p>
<p>Here is a list of my favorite things</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
<li>Dogs</li>
</ul>
<a href="/loner">A lone link!</a>
</html>"""
elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
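If the new behavior works as described, each anchor's position is
surfaced in the element metadata; the snippet below assumes the field is
named `link_start_indexes`, alongside the existing `link_texts` and
`link_urls`:
```python
from unstructured.partition.html import partition_html

elements = partition_html(text=html_text)  # html_text from the snippet above
for el in elements:
    if el.metadata.link_texts:
        # Each entry in link_start_indexes (field name assumed) should be the
        # character offset of the corresponding link within el.text.
        print(list(zip(el.metadata.link_texts, el.metadata.link_urls,
                       el.metadata.link_start_indexes)))
```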
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
This PR:
- Fixes occasional collection deletion failures for AstraDB by putting
collection deletion statements inside a trap statement, using click
commands to do this.
Testing:
- Run ingest astradb destination test
The MongoDB redact method was created because we wanted part of the URL
to be exposed to the user during logging, so it did not use the
dataclass `enhanced_field(sensitive=True)` solution.
This changes it to use our standard redaction solution, which also
minimizes the amount of work to be done in platform.
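For illustration, a minimal sketch of what the standard redaction
approach looks like on a config dataclass; the class name and import
path are assumptions, with only `enhanced_field(sensitive=True)` taken
from the description above:
```python
from dataclasses import dataclass

from unstructured.ingest.enhanced_dataclass import enhanced_field  # path assumed

@dataclass
class MongoDBAccessConfig:  # hypothetical name
    # Marking the field sensitive lets the shared redaction logic scrub the
    # value from logs, replacing the connector-specific redact method.
    uri: str = enhanced_field(sensitive=True)
```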
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify that corresponding new chunking parameter.
**Reviewer** A lot of this is cleanup; the second commit is where this
option is actually added. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
This PR is the second part of fixing "embedded text not getting merged
with inferred elements"; the first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.
### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)
```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
filename="pwc-financial-statements-p114.pdf",
strategy="hi_res",
infer_table_structure=True,
extract_image_block_types=["Image"],
)
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```
---------
Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
**Summary**
This final PR in the "orig_elements" series adds what is needed so that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.
It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.
**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts", which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
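As a rough illustration of that wire format (these are not the library's
exact helpers), the Base64-encoded gzipped JSON transform looks like
this:
```python
import base64
import gzip
import json

def compress_orig_elements(element_dicts: list[dict]) -> str:
    # Element -> dict -> JSON -> gzip -> Base64 string, per the scheme above.
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def decompress_orig_elements(payload: str) -> list[dict]:
    # The reverse: Base64 string -> gzip -> JSON -> element-dicts.
    return json.loads(gzip.decompress(base64.b64decode(payload)))
```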
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Change default values for table extraction - works in tandem with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR.
We want to move away from the `pdf_infer_table_structure` parameter; in
this PR:
- We change how it's treated with respect to the `skip_infer_table_types`
parameter. Whether to extract tables from a PDF now follows from the
rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation
More detailed description of how we want the parameters to interact:
- if `pdf_infer_table_structure` is False, tables will never be
extracted from PDFs
- if `pdf_infer_table_structure` is True, tables will be extracted from
PDFs unless skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
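The combined rule reduces to a single boolean, sketched here with the
defaults described above:
```python
from typing import Optional

def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[list[str]] = None,
) -> bool:
    # Tables are extracted from a PDF only when inference is on AND "pdf"
    # has not been explicitly skipped.
    skip = skip_infer_table_types if skip_infer_table_types is not None else []
    return pdf_infer_table_structure and "pdf" not in skip

assert should_extract_pdf_tables() is True
assert should_extract_pdf_tables(skip_infer_table_types=["pdf"]) is False
assert should_extract_pdf_tables(pdf_infer_table_structure=False) is False
```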
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
### Description
This PR resolves the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All AWS configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the AWS bedrock service
* Requirements for boto were bumped because the bedrock runtime, which
is required to hit the bedrock service, was first introduced in version
`1.34.63`, ahead of the previously pinned boto version.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Thanks to @mogith-pn from Clarifai we have a new destination connector!
This PR adds Clarifai as an ingest destination connector, including:
- Access via CLI and programmatically
- Documentation and examples
- An integration test script
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.
**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
Fixes a OneDrive bug the same way Ryan fixed the SharePoint error (both
are Microsoft products):
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files
We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.
Changes:
- Adds logic to guarantee OneDrive dates will be properly formatted as
ISO, regardless of the format provided by the onedrive library
- Bumps timestamp format output to include timezone offset (as we do
with others)
- Adds unit tests for isoformat
- json_to_dict is already unit tested here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py
Adds a small change for AstraDB to allow them to see which source called
their API.
**Summary**
Add `metadata.is_continuation = True` to the metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.
This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
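For example, downstream code might use the flag like this (a sketch;
which metadata to drop is up to the consumer):
```python
def drop_redundant_metadata(chunks):
    # `is_continuation` marks second-and-later splits of an oversized element,
    # so redundant per-chunk metadata can be filtered from those chunks.
    for chunk in chunks:
        if getattr(chunk.metadata, "is_continuation", None):
            chunk.metadata.orig_elements = None  # one field a consumer might drop
    return chunks
```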
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
Adding `metadata.data_source.permissions_data` to
sharepoint-with-permissions.sh `--metadata-exclude` to prevent a
SharePoint deprecation warning from ruining the test.
Updating expected-structured-output.
As per Ahmet's comment, we do want to check SharePoint permissions
metadata at some point, but that will take a separate type of test; a
file diff test is too unstable. Permissions checking will come later
down the road.
The Google Drive service account key can be a dict or a file path (str).
We have successfully been using the path, but the dict can also end up
being stored as a string that needs to be deserialized. The
deserialization can have issues with single and double quotes.
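A minimal sketch of the kind of normalization this implies (the helper
name is illustrative):
```python
import ast
import json
from pathlib import Path

def normalize_service_account_key(key) -> dict:
    # Accept a dict directly, a path to a JSON key file, or a dict that was
    # stored as a string (possibly single-quoted, which json.loads rejects).
    if isinstance(key, dict):
        return key
    if Path(key).is_file():
        return json.loads(Path(key).read_text())
    try:
        return json.loads(key)
    except json.JSONDecodeError:
        return ast.literal_eval(key)  # tolerates single-quoted dict strings
```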
We are seeing occurrences of inconsistency in the timestamps returned by
office365.sharepoint when fetching created and modified dates.
Furthermore, in future versions of this library, a datetime object will
be returned rather than a string.
## Changes
- This adds logic to guarantee SharePoint dates will be properly
formatted as ISO, regardless of the format provided by the sharepoint
library.
- Bumps timestamp format output to include timezone offset (as we do
with others)
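A sketch of the normalization idea, assuming `python-dateutil` is
available (the real helper may differ):
```python
from datetime import datetime, timezone

from dateutil import parser  # python-dateutil, assumed available

def to_iso_with_offset(value) -> str:
    # Accept either the string formats office365.sharepoint has returned or a
    # datetime object (as newer versions of the library will return).
    dt = value if isinstance(value, datetime) else parser.parse(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assume UTC when no offset given
    return dt.isoformat()
```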
## Testing
Unit test added to validate this datetime handling across various
formats.
---------
Co-authored-by: David Potter <potterdavidm@gmail.com>
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.
This Pull Request implements an integration with [Astra
DB](https://datastax.com), which allows the Astra DB vector database to
be compatible with Unstructured's set of integrations.
To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT`, follow these
steps:
1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used
Some notes about Astra DB:
- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
### Summary
Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
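Roughly, the positional test amounts to the following (coordinate
conventions simplified; the real implementation works on element
bounding boxes):
```python
from typing import Optional

def positional_label(y0: float, y1: float, page_height: float) -> Optional[str]:
    # y0/y1 are the element's bottom/top coordinates, origin at the page bottom.
    if y1 <= 0.05 * page_height:
        return "Footer"  # entirely within the bottom 5% of the page
    if y0 >= 0.95 * page_height:
        return "Header"  # entirely within the top 5% of the page
    return None  # mid-page, or no coordinates: leave the element type as-is
```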
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
The current `test-ingest-src.sh` and `evaluation-metrics` do not allow
passing the `EXPORT_DIR` (`OUTPUT_ROOT` in `evaluation-metrics`). Output
is currently saved in the current working directory
(`unstructured/test_unstructured_ingest`). When running the eval from
`core-product`, all outputs are saved at
`core-product/upstream-unstructured/test_unstructured_ingest`, which is
undesirable.
This PR modifies two scripts to accommodate this behavior:
1. `test-ingest-src.sh` - assigns `EVAL_OUTPUT_ROOT` to the value set in
the environment if it exists, or to the current working directory if
not, then calls `evaluation-metrics.sh`.
2. `evaluation-metrics.sh` - accepts a param from `test-ingest-src.sh`
if given, falling back to the value set in the environment, or to the
current directory if neither is set.
(Note: I also added a param to `evaluation-metrics.sh` because it makes
sense to allow a separate run to specify an export directory.)
This PR should work in sync with another PR under `core-product`; I will
add the link here later.
**To test:**
Run the script below, change `$SCRIPT_DIR` as needed to see the result.
```
export OVERWRITE_FIXTURES=true
./upstream-unstructured/test_unstructured_ingest/src/s3.sh
SCRIPT_DIR=$(dirname "$(realpath "$0")")
bash -x ./upstream-unstructured/test_unstructured_ingest/evaluation-metrics.sh text-extraction "$SCRIPT_DIR"
```
----
This PR also updates the requirements via `make pip-compile`, since the
`click` module was not found.
**Reviewers:** It may be easier to review each of the two commits
separately. The first adds the new `_SubtableParser` object with its
unit-tests and the second one uses that object to replace the flawed
existing subtable-parsing algorithm.
**Summary**
There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).
This PR replaces the flawed existing algorithm with a `_SubtableParser`
object that encapsulates all that logic and has thorough unit-tests.
**Additional Context**
This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes. This PR fixes all these bugs:
```python
#
# -- ✅ CASE 1: There are no leading or trailing single-cell rows.
# -> the subtable functions never get called; the subtable is emitted as the only element
#
# a b -> Table(a, b, c, d)
# c d
# -- ✅ CASE 2: There is exactly one leading single-cell row.
# -> Leading single-cell row emitted as `Title` element, core-table properly identified.
#
# a -> [ Title(a),
# b c Table(b, c, d, e) ]
# d e
# -- ❌ CASE 3: There are two-or-more leading single-cell rows.
# -> leading single-cell rows are included in subtable
#
# a -> [ Table(a, b, c, d, e, f) ]
# b
# c d
# e f
# -- ❌ CASE 4: There is exactly one trailing single-cell row.
# -> core table is dropped. trailing single-cell row is emitted as Title
# (this is the behavior in the reported bug)
#
# a b -> [ Title(e) ]
# c d
# e
# -- ❌ CASE 5: There are two-or-more trailing single-cell rows.
# -> core table is dropped. trailing single-cell rows are each emitted as a Title
#
# a b -> [ Title(e),
# c d Title(f) ]
# e
# f
# -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b c Table(b, c, d, e),
# d e Title(f) ]
# f
# -- ✅ CASE 7: There are two leading and one trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g) ]
# g
# -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g),
# g Title(h) ]
# h
# -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below.
# -> First cell is mistakenly emitted as title, remaining cells are dropped.
#
# a b c -> [ Title(a) ]
# -- ❌ CASE 10: Single-row subtable with one leading single-cell row.
# -> Leading single-row cell is correctly identified as title, core-table is mis-identified
# as a `Title` and truncated.
#
# a -> [ Title(a),
# b c d Title(b) ]
```
Change the OpenSearch port to see if it fixes CI. We think there may be
a conflict with the Elasticsearch docker port.
Also adds a simple retry to the vector query.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
This PR:
- Moves ingest dependencies into local scopes so that ingest connector
classes can be imported without installing their external dependencies.
This allows lightweight use of the classes (not the instances; to use
the instances as intended you'll still need the dependencies).
- Upgrades the embed module dependencies from `langchain` to the
`langchain-community` module (to pass CI rather than introducing a pin)
- Runs pip-compile
- Does minor refactors in other files to pass the `ruff 2.0` checks
introduced by pip-compile
A small improvement to Vectara requested by Ofer at Vectara.
In the "Document" construct, every document can have a title. If it's
there, the UI will show it above the document (otherwise you get
"Untitled").
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Thanks to Ofer at Vectara, we now have a Vectara destination connector.
- There are no dependencies since it is all REST calls to the API
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
- there are multiple places setting the default `hi_res_model_name` in
both `unstructured` and `unstructured-inference`
- these lead to inconsistency and unexpected behaviors
- this fix removes a helper in `unstructured` that tries to set the
default hi_res layout detection model; instead we rely on
`unstructured-inference` to provide that default when no explicit model
name is passed in
## test
```bash
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython
```
```python
from unstructured.partition.auto import partition
# find a pdf file
elements = partition("foo.pdf", strategy="hi_res")
assert elements[0].metadata.detection_origin == "yolox"
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Formatting of link_texts was breaking metadata storage. It turns out it
didn't need any conforming and came in correctly from JSON.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Connector data source versions should always be string values; however,
we were using the integer checksum value as the version for fsspec
connectors. This casts that value to a string.
## Changes
* Cast the checksum value to a string when assigning the version value
for fsspec connectors.
* Adds test to validate that these connectors will assign a string value
when an integer checksum is fetched.
## Testing
Unit test added.
Adds OpenSearch as a source and destination.
Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.
- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
cd docs && make html
Changelogs:
- point the main readme to the correct connector HTML page
- point the Chroma docs to the correct sample code
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
There are several public interface points for chunking, and they all
provide defaults for arguments like `max_characters`. These defaults are
provided by literal values, and keeping them synchronized has become a
problem.
Declare constant values for chunking argument defaults and use those
wherever a non-trivial default is used in an end-user facing API
function.
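The pattern is simply module-level constants consumed by every public
signature; a hedged sketch (names and values illustrative):
```python
# Single source of truth for chunking argument defaults.
CHUNK_MAX_CHARACTERS = 500
CHUNK_MULTIPAGE_SECTIONS = True

def chunk_by_title(
    elements,
    max_characters: int = CHUNK_MAX_CHARACTERS,
    multipage_sections: bool = CHUNK_MULTIPAGE_SECTIONS,
):
    ...  # every public chunking entry point references the same constants
```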
Connectors now use a predictable result-file naming convention so that
consumers of the library can write code that abstracts over the
particular connector. This change introduces compatibility with said
convention: `_output_filename` now returns the filename including its
format.
### Summary
The goal of this PR is to keep all image elements when using "hi_res"
strategy. Previously, `Image` elements with small chunks of text were
ignored unless the image block extraction parameters
(`extract_images_in_pdf` or `extract_image_block_types`) were specified.
Now, all image elements are kept regardless of whether the image block
extraction parameters are specified.
### Testing
- on `main` branch,
```
from unstructured.documents.elements import ElementType
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
filename="example-docs/embedded-images.pdf",
strategy="hi_res",
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print("number of image elements: ", len(image_elements))
```
The above code will display `number of image elements: 0`.
- on this `feature` branch,
The same code will display `number of image elements: 3`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Currently in the Elasticsearch Destination ingest test we are writing
the embeddings to a "float" type field. In order to leverage this field
for similarity search, it should be mapped as "dense_vector" with the
respective dimensions assigned.
This PR updates that mapping and adds a test query to validate that this
works as expected.
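For reference, a dense_vector mapping looks roughly like this (client
version and dimension value are assumptions):
```python
from elasticsearch import Elasticsearch  # 8.x client assumed

client = Elasticsearch("http://localhost:9200")
client.indices.create(
    index="ingest-test",
    mappings={
        "properties": {
            # "dense_vector" with explicit dims enables similarity search,
            # unlike the previous plain "float" field; 384 is illustrative.
            "embeddings": {"type": "dense_vector", "dims": 384},
        }
    },
)
```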
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.
Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
This PR updates Pinecone index creation in the ingest test due to a
recent update in the Pinecone API.
Due to this change, it is no longer allowed to specify both the number
of replicas and the number of pods:
`Cannot specify both replicas and pods`
We solve this by removing the replica specification from the index
creation request.
```
Creating index ingest-test-28418
Index creation success: 201
```
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.
The simple_salesforce API allows passing a private key path or value.
This PR introduces this support for the ingest connector.
The Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of the following:
- a path to a PEM-encoded key file (as a string)
- the key contents (a PEM-encoded string)
If the provided value cannot be parsed as a PEM-encoded private key,
file existence is checked. This way, private key contents are not
exposed to unnecessary underlying function calls.
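The resolution order described above might look like this (a sketch, not
the connector code):
```python
from pathlib import Path

def resolve_private_key(private_key: str) -> tuple:
    # Returns (key_contents, key_path). Try to treat the value as PEM contents
    # first so the secret is never passed to filesystem calls unnecessarily.
    if "-----BEGIN" in private_key:  # cheap PEM heuristic for illustration
        return private_key, None
    if Path(private_key).is_file():
        return None, private_key
    raise ValueError("private-key is neither PEM contents nor an existing file")
```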
FSSpec serialization converted the JSON token to a string with single
quotes. GCS requires the JSON token in the form of a dict, so this
format is now assured. Other forms of auth are not modified, but there
is improved validation for all of the options.
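Validation along these lines would cover it (illustrative, not the exact
connector code):
```python
import ast
import json

def ensure_token_dict(token) -> dict:
    # GCS needs the service-account token as a dict; a round-trip through
    # fsspec serialization may have left it as a single-quoted string.
    if isinstance(token, dict):
        return token
    try:
        return json.loads(token)
    except json.JSONDecodeError:
        return ast.literal_eval(token)  # tolerates single-quoted dict strings
```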