unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-18 19:44:31 +00:00

Author	SHA1	Message	Date
David Potter	1ca90d209a	bug: update sharepoint-with-permissions test to fix CI (#2589 ) Adding `metadata.data_source.permissions_data` to sharepoint-with-permissions.sh --metadata-exclude to prevent sharepoint deprecation warning from ruining test. Updating expected-structured-output As per Ahmet's comment. We do want to check sharepoint permissions metadata at some point. But that will take a separate type of test. A file diff test is too unstable. Permissions checking will be later down the road.	2024-03-06 17:15:36 +00:00
David Potter	0c834517d8	fix: change opensearch port (#2517 ) change opensearch port to see if fixes CI. We think there may be a conflict with the elasticsearch docker port. Also adding simple retry to vector query. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-07 21:25:04 +00:00
David Potter	c100ce28a7	feat: add Vectara destination connector (#2357 ) Thanks to Ofer at Vectara, we now have a Vectara destination connector. - There are no dependencies since it is all REST calls to API - --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 14:38:34 +00:00
David Potter	bc791d53f4	feat: add opensearch source and destination connector (#2349 ) Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible. - Adds OpenSearch source connector to be able to ingest documents from OpenSearch. - Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured OpenSearch indexes easily. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 04:31:49 +00:00
David Potter	76e0d10e61	feat: add MongoDB source connector (#2393 ) Adds MongoDB as a source (we already had it as a destination connector) --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-16 20:56:29 +00:00
Steve Canny	2f2c48acd5	feat(ingest): add basic chunking to ingest (#2380 ) The new "basic" chunking strategy and overlap options need to be available from the ingest CLI. An ingest test of those features is also welcome, both to verify the ingest feature and to defend against regressions in the chunking code. Add a local ingest test exercising both the "basic" chunking strategy and intra-chunk overlap. Since there is no new source connector involved, use the local ingest source and destination. Update documentation to suit, filling in some details that hadn't made it into the docs yet.	2024-01-12 20:27:34 +00:00
jakub-sandomierz-deepsense-ai	411aa98bbf	feat: Salesforce connector accepts key path or value (#2321 ) (#2327 ) Solution to issue https://github.com/Unstructured-IO/unstructured/issues/2321. simple_salesforce API allows for passing private key path or value. This PR introduces this support for Ingest connector. Salesforce parameter "private-key-file" has been renamed to "private-key". It can contain one of following: - path to PEM encoded key file (as string) - key contents (PEM encoded string) If the provided value cannot be parsed as PEM encoded private key, then the file existence is checked. This way private key contents are not exposed to unnecessary underlying function calls.	2024-01-11 11:15:24 +00:00
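The "key path or value" resolution order described in the commit above can be sketched as follows. This is a minimal illustration, not the connector's code: `looks_like_pem_private_key` and `resolve_private_key` are hypothetical names, and the PEM check is a standard-library stand-in (armor lines plus base64 body) for real key parsing.

```python
import base64
import binascii
import os
import re
from typing import Tuple

# Hypothetical stand-in for real PEM parsing: armor lines plus a base64 body.
_PEM_RE = re.compile(
    r"-----BEGIN ([A-Z ]*PRIVATE KEY)-----\s*(.+?)\s*-----END \1-----",
    re.DOTALL,
)


def looks_like_pem_private_key(value: str) -> bool:
    """Return True if value has PEM private-key armor around valid base64."""
    match = _PEM_RE.search(value)
    if match is None:
        return False
    body = "".join(match.group(2).split())  # drop line breaks inside the body
    try:
        base64.b64decode(body, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def resolve_private_key(value: str) -> Tuple[str, str]:
    """Classify the option value as key contents or a path to a key file.

    The PEM check runs first, so key contents never reach the filesystem call.
    """
    if looks_like_pem_private_key(value):
        return ("contents", value)
    if os.path.isfile(value):
        return ("path", value)
    raise ValueError("private-key is neither PEM contents nor an existing file")
```

Checking for key contents before touching the filesystem mirrors the stated goal of not passing the key material to unrelated function calls.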
Ahmet Melek	fd293b3e78	feat: add elasticsearch destination connector (#2152 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1842 Closes https://github.com/Unstructured-IO/unstructured/issues/2202 Closes https://github.com/Unstructured-IO/unstructured/issues/2203 This PR: - Adds Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured elasticsearch indexes easily. - Includes parallelized upload and lazy processing for elasticsearch destination connector. - Rearranges elasticsearch test helpers to source, destination, and common folders. - Adds util functions to be able to batch iterables in a lazy way for uploads - Fixes a bug where removing the optional parameter `--fields` broke the connector due to an integer processing error. - Fixes a bug where using an [elasticsearch config](`8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35)`) for a destination connector resulted in a serialization issue when optional parameter `--fields` was not provided.	2023-12-20 01:26:58 +00:00
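Lazy batching of an iterable for uploads, as mentioned in the commit above, might look like the sketch below. The function name `batch_lazily` is hypothetical and is not taken from the repository.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_lazily(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, pulling from the source on demand."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice consumes only the next batch_size items from the iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because each batch is built only when the consumer asks for it, a large source never has to be held in memory all at once.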
cragwolfe	bd8a74d686	chore: shell scripts default indent of 2 instead of 4 (#2287 ) Given the tendency for shell scripts to easily enter into a few levels of indentation and long line lengths, update the default to 2 spaces.	2023-12-19 07:48:21 +00:00
Roman Isecke	76efcf4dd7	chore: add shfmt (#2246 ) ### Description Given all the shell files that now exist in the repo, would be nice to have linting/formatting around them (in addition to the existing shellcheck which doesn't do anything to format the shell code). This PR introduces `shfmt` to both check for changes and apply formatting when the associated make targets are called.	2023-12-12 01:04:15 +00:00
Roman Isecke	cc05e948ff	chore: sensitive info connector audit (#2227 ) ### Description All other connectors that were not included in https://github.com/Unstructured-IO/unstructured/pull/2194 are now updated to follow the new pattern and mark any variables as sensitive where it makes sense. Core changes: * All connectors now support an `AccessConfig` to mark data that's needed for auth (i.e. username, password) and those that are sensitive are designated appropriately using the new enhanced field. * All cli configs on the cli definition now inherit from the base config in the connector file to reuse the variables set on that dataclass * The base writer class was updated to better generalize the new approach given better use of dataclasses * The base cli classes were refactored to also take into account the need for a connector and write config when creating the respective runner/writer classes. * Any mismatch between the cli field name and the dataclass field name were updated on the dataclass side to not impact the user but maintain consistency * Add custom redaction logic for mongodb URIs since the password is expected to be a part of it. Now this: `"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"` -> `"mongodb+srv://ingest-test-user:*REDACTED@ingest-test.hgaig.mongodb.net/"` in the logs Bundle all fsspec based files into their own packages. * Refactor custom `_decode_dataclass` used for enhanced json mixin by using a monkey-patch approach. The original approach was breaking on optional nested dataclasses when serializing since the other methods in `dataclasses_json_core` weren't using the new method. By monkey-patching the original method with a new one, all other methods in that library would use the new one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-11 17:37:49 +00:00
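The MongoDB URI redaction shown in the commit above can be approximated with a single regular expression. This is a sketch under assumptions: `redact_mongodb_uri` is a hypothetical name, and the default placeholder simply copies the string as it appears in the log example above.

```python
import re

# scheme://user:password@host... ; only the password group is replaced.
_MONGO_CREDENTIALS = re.compile(r"^(mongodb(?:\+srv)?://[^:/@]+:)([^/@]+)(@)")


def redact_mongodb_uri(uri: str, placeholder: str = "*REDACTED") -> str:
    """Replace the password portion of a MongoDB URI, leaving the rest intact."""
    return _MONGO_CREDENTIALS.sub(
        lambda m: m.group(1) + placeholder + m.group(3), uri, count=1
    )
```

URIs without embedded credentials pass through unchanged, because the pattern requires a `user:password@` prefix before the host.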
David Potter	cde11d1eb0	feat: Add sftp source connector (#2163 ) Adds source connector for SFTP which uses fsspec and paramiko via fsspec. Paramiko is the standard sftp package for python used in pysftp etc... ``` --username foo \ --password bar \ --remote-url sftp://localhost:47474/upload/ ``` Will only download a specifically requested file if it has an extension. (i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will treat any other remote_url as a folder path. This is intentional. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-07 19:33:19 +00:00
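The file-versus-folder rule for `--remote-url` described in the commit above could be sketched like this. `split_remote_url` is a hypothetical helper, not the connector's function; it only illustrates the "has an extension" test.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_remote_url(remote_url: str) -> Tuple[str, Optional[str]]:
    """Return (directory, filename); filename is None for folder URLs.

    A path whose last component has an extension is treated as one file;
    anything else is treated as a folder path.
    """
    path = PurePosixPath(urlparse(remote_url).path)
    if path.suffix:
        return (str(path.parent), path.name)
    return (str(path), None)
```

Under this rule a directory whose name contains a dot would also be classified as a file, which is the trade-off that comes with deciding by extension alone.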
Roman Isecke	f193d3d43b	feat: improve sensitive data handling by fsspec connectors (#2194 ) ### Description Building off of PR https://github.com/Unstructured-IO/unstructured/pull/2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes: * Update the base classes to inherit from the enhanced json mixin * Add in a new access config dataclass that should be used as a nested dataclass in the connector configs * Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config * Update all fsspec connectors with explicit access configs given what each one's SDKs support * Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass. * Update all the writes to use class-based approach and share the same structure of the runner classes * Above update allowed for better code to be used in the base source and destination CLI commands * Add in utility code around parsing a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses. Slightly unrelated changes: * session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared, the index used simply makes a new http call each time it's invoked. * Dedicated write configs were created for all destination connectors to better support serialization * Refactor of Elasticsearch connector included, with update to ingest test to use auth TODOs * Left a `#TODO` in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. `googleapiclient.discovery.Resource`) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-05 20:55:19 +00:00
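Turning a flat options dictionary into a config with a nested access config, as the commit above describes, might work along these lines. All names here (`AccessConfig` fields, `ConnectorConfig`, `from_flat_dict`) are hypothetical and chosen for illustration; the repository's utility may differ substantially.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict, Optional


@dataclass
class AccessConfig:
    # Hypothetical auth fields that would be marked sensitive.
    username: Optional[str] = None
    password: Optional[str] = None


@dataclass
class ConnectorConfig:
    # Hypothetical connector config holding a nested access config.
    remote_url: str
    access_config: AccessConfig


def from_flat_dict(cls: Any, flat: Dict[str, Any]) -> Any:
    """Build cls from a flat dict, recursing into dataclass-typed fields."""
    kwargs: Dict[str, Any] = {}
    for field in fields(cls):
        if is_dataclass(field.type):
            # Nested dataclass: fill it from the same flat dictionary.
            kwargs[field.name] = from_flat_dict(field.type, flat)
        elif field.name in flat:
            kwargs[field.name] = flat[field.name]
    return cls(**kwargs)
```

Keys that match no field are ignored, so unrelated CLI options can live in the same dictionary. The sketch relies on field annotations being real classes, so it would need adjusting under `from __future__ import annotations`.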
Roman Isecke	b951d73a9b	feat: add logging to ingest CLI for tests being skipped at the end (#2174 ) ### Description Often times there are tests being skipped either due to missing env vars or explicitly defined in the base script but these get lost in the logs. This PR updates the scripts to leverage a custom error code if being skipped due to missing env vars and this custom error code is being caught by the base script and logs all files being skipped to a file. At the end of the script, this file gets logged in the CI output.	2023-11-29 13:41:19 +00:00
rvztz	50b1431c9e	rvztz/hubspot ingest connector (#1760 ) Closes #1843 Ingest connector for HubSpot. Supports: - Calls: Logs from calls related to contacts, companies and tickets - Communications: Logs from SMS/Whatsapp related to contacts, companies and tickets - Notes: Notes related to CRM notes - Products: CRM products - Emails: Logs from emails sent to CRM objects. - Tasks: CRM tasks From each record, `body`/`description` information is grabbed. When a title property is available, this is registered at the beginning of the output file. The CLI receives three params: - `api-token`: [Private app](https://developers.hubspot.com/docs/api/private-apps) token. - `object-types`: One or more of the supported objects noted above, as a comma-separated list: `calls,products,tasks` - `custom-properties`: Custom properties to grab information from. Must be in the form `<object_type>:<custom_property_id>,<object_type>:<custom_property_id>` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rvztz <rvztz@users.noreply.github.com>	2023-11-28 23:07:57 +00:00
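The `custom-properties` format from the commit above, `<object_type>:<custom_property_id>` pairs separated by commas, could be parsed as in this sketch. `parse_custom_properties` is a hypothetical name, and grouping the ids per object type is an assumption about what a consumer would want.

```python
from typing import Dict, List


def parse_custom_properties(raw: str) -> Dict[str, List[str]]:
    """Group custom property ids by object type, preserving input order."""
    result: Dict[str, List[str]] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing or doubled commas
        object_type, separator, property_id = pair.partition(":")
        if not separator or not object_type or not property_id:
            raise ValueError(f"expected <object_type>:<custom_property_id>, got {pair!r}")
        result.setdefault(object_type, []).append(property_id)
    return result
```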
Roman Isecke	2bb463d006	feat: support both single and batch ingest docs (#2105 ) ### Description There are some source ingest connectors that would be more efficient to read the content in batches rather than use an entire process per document. For example, reading from ElasticSearch. Given an index with possible hundreds of documents, reading each one individually is not as optimal as reading in batches. To try and maintain as much of the ingest doc paradigm already being supported, a new class `BaseIngestDocBatch` was added to handle reading in batches. It produces a list of `BaseSingleIngestDoc` which is what all current implementations were renamed to. This list is generated after it runs its `get_files` method. Past the source node, all other steps in the pipeline should not be affected, this is just an optimization for the read step. Additional Changes: * Removed use of jq and instead converted this into a fields filter on the content to let the database handle the filtering and limit the amount of data being pulled in. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-27 19:25:30 +00:00
Roman Isecke	6e67c48fd8	feat: update all ingest tests to use huggingface for embeddings (#2071 ) ### Description Update any use of OpenAI for generating embeddings in the ingest tests to use Huggingface Bonus Changes: * Remove duplicate delta table test * Delete delta table destination directory at the beginning of the test to make sure it doesn't exist and prevent the test from breaking.	2023-11-21 18:43:19 +00:00
ryannikolaidis	13a23deba6	fix: local connector with input path to single file (#2116 ) When passed an absolute file path for the input document path, the local connector incorrectly writes the output file to the wrong directory. Also, in the single file input path cases we are currently including parent path as part of the destination writing, instead when a single file is specified as input the output file should be located directly in the specified outputs directory. Note: this change meant that we needed to bump the file path of some expected results. This fixes such that the output in this case is written to `output-dir/input-filename.json`. ## Changes - Fix for incorrect output path of files partitioned via the local connector when the input path is a file path (rather than directory) - Updated single-local-file test to validate the flow where we specify an absolute file path (since this was particularly broken) ## Testing Note: running the updated `local-single-file` test without the changes to the local connector will result in a final output copy of: ``` Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json ``` where the output path is the input path and not the expected `output-dir/input-filename.json` Running with this change we can now expect the file at that directory. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2023-11-19 18:21:31 +00:00
Klaijan	5ba3b9c2c6	chore: get eval metrics from ingest in (#2097 ) Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-17 18:22:36 +00:00
Klaijan	777a428071	chore: for ingest-test metrics, also check subdirs (#2079 ) - Copy script only went through one layer of subdirectory so it did not find the match between manifest file and structured output. Now edited to search all subdirectories. - `set -e` causes the script to exit on any exit code other than 0, fix all scripts that need to run the copy script to be `set +e` right before the check diff, then back to `set -e` after - Edit the default evaluation metrics output from `metrics` to `metrics-tmp` to account for diff check - Add a script that checks the differences between old eval metric output (metrics) and new eval metrics output (metrics-tmp)	2023-11-15 21:02:43 -08:00
ryannikolaidis	d5fd21f0fd	fix: pass partition arguments to api when partitioning with unstructured-ingest and --partition-by-api (#2023 ) Closes #1064 When using the `--partition-by-api` flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. ## Changes * parse and pass relevant partition arguments to the api in unstructured-ingest * bonus: leverage an existing `partition.api` function to call out to the api rather than including duplicative request logic in unstructured ingest * bonus: --pdf-infer-table-structure is now a flag not an arg (it defaults false anyways, this is more succinct and consistent with similar parameters) * bonus: adds `hi_res_model_name` so a user can specify the model to leverage when using a hi_res strategy. ## Testing * update against_api.sh source test script to specify a partition argument and validates that the response from the api respected the argument * manually ran a request and validated that it was processed with chipper as specified (not sure if we want to bake a chipper request into the ci tests) (validated that the response leveraged the chipper model): ``` PYTHONPATH=. ./unstructured/ingest/main.py \ local \ --output-dir /tmp/ingest-requests/chipper \ --verbose \ --reprocess \ --strategy hi_res \ --partition-by-api \ --hi-res-model-name chipper \ --api-key "$API_KEY" \ --input-path 'example-docs/layout-parser-paper-with-table.pdf' ```	2023-11-08 04:47:02 +00:00
ryannikolaidis	0e94dd5d65	fix: ingest destination test failure with missing output (#2031 ) Intermittently the various destination test will fail with: ``` {noformat}--- Cleanup done --- gs://utic-test-ingest-fixtures-output/1699377964/example-docs/ deleting gs://utic-test-ingest-fixtures-output/1699377964 Removing objects: ERROR: (gcloud.storage.rm) The following URLs matched no objects or files: -gs://utic-test-ingest-fixtures-output/1699377964 Last ran script: gcs.sh Error: Process completed with exit code 1.{noformat} ``` Reference trace [here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020) After some investigation it looks like this error is due to collisions that occur because we’re assuming 1s date accuracy is sufficient when generating (and deleting) "unique" test destination location names. The likelihood is actually pretty high given that we run these tests against a test matrix. Instead we should just use a uuid for these unique destinations. ## Changes - Use uuidgen instead of `date +%s` for unique destinations	2023-11-07 23:14:01 +00:00
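The commit above replaces `date +%s` with `uuidgen` in shell. The same idea in Python, shown here only as an analogue with a hypothetical helper name, is to derive the destination from a random UUID so that parallel matrix jobs starting within the same second cannot pick the same location.

```python
import uuid


def unique_destination(prefix: str) -> str:
    """Return prefix plus a random UUID, e.g. for a temporary test bucket path."""
    # uuid4 is random, so concurrent callers do not depend on clock resolution.
    return f"{prefix.rstrip('/')}/{uuid.uuid4()}"
```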
Ahmet Melek	ca78dc737a	feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918 ) Closes #1782 This PR: - Extends ingest pipeline so that it is possible to select an embedding provider from a range of providers - Modifies the ingest embedding test to be a diff test, since the embedding vectors are reproducible after supporting multiple providers Additional info on the chosen provider for the test: - Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic even when there's no seed set - Took 6.84s to pass a unit test with the provider (without cache, including model download) - `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it zero cost For all these reasons, testing embedding modules with the Huggingface model seems to be making sense --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-11-06 12:26:12 +00:00
Klaijan	c471ea3cc7	chore: remove copy line from non-matrix connectors (#1976 )	2023-11-04 10:58:56 -07:00
Roman Isecke	d09c8c0cab	test: update ingest dest tests to follow set pattern (#1991 ) ### Description Update all destination tests to match pattern: * Don't omit any metadata to check full schema * Move azure cognitive dest test from src to dest * Split delta table test into separate src and dest tests * Fix azure cognitive search and add to dest tests being run (wasn't being run originally)	2023-11-03 12:46:56 +00:00
Yao You	db766402a4	test: parametrize ingest test scripts (#1979 ) This PR resolves [CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453): - parametrizes the output folder so that ingest output files can be saved other than the same place where the scripts are; this is set by env `OUTPUT_ROOT` - parametrize the python path `PYTHONPATH` to first check existing definition before default to `.`, the current folder - parametrize the run script that carries out ingest using `RUN_SCRIPT`, default is still `./unstructured/ingest/main.py` These changes allows us to run ingest test with more control. To test: - run `OUTPUT_ROOT=/tmp ./test_unstructured_ingest/src/local-single-file.sh`: the output now should be in `/tmp` instead of in the ingest test folder - run `RUN_SCRIPT=/hope/you/do/not/have/this/folder ./test_unstructured_ingest/src/local-single-file.sh` would raise an error because system can't find `/hope/you/do/not/have/this/folder` - run `RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh` should run as normal - do the following ```bash cp ./unstructured/ingest/main.py /tmp/main.py OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh ``` This will run and generate output at `/tmp` [CORE-2453]: https://unstructured-ai.atlassian.net/browse/CORE-2453?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ	2023-11-02 21:41:56 +00:00
Roman Isecke	6700a7d8c4	feat: support generic inputs for partition kwargs from ingest CLI (#1923 ) ### Description To always support the latest changes to the partition method and the possible kwargs it supports, the ingest CLI has been refactored to take in a valid json string to represent those values to allow a user more flexibility with controlling the partition method.	2023-11-02 21:19:29 +00:00
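Accepting partition kwargs as a JSON string, as the commit above describes, implies a validation step before the values are forwarded. A minimal sketch, assuming a hypothetical helper name and that only a JSON object is acceptable:

```python
import json
from typing import Any, Dict


def parse_partition_kwargs(raw: str) -> Dict[str, Any]:
    """Decode a JSON string into keyword arguments, rejecting non-objects."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"partition kwargs are not valid JSON: {exc}") from exc
    if not isinstance(value, dict):
        raise ValueError("partition kwargs must be a JSON object")
    return value
```

The resulting dictionary could then be unpacked into the partition call with `**kwargs`; the keys used in any example are illustrative and are not validated here.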
Roman Isecke	24a419ece0	separate ingest tests (#1951 ) ### Description This splits the source ingest tests from the destination ingest tests since they follow different patterns: * src tests pull data from a source and compare the partitioned content to the expected results * destination tests leverage the local connector to produce results to push to a destination and leverage overhead to create temporary locations at those destinations to write to and delete when done. Only the src tests create partitioned content that needs to be checked so the update ingest test CI job only needs to run these.	2023-11-01 19:23:44 +00:00

28 Commits