28 Commits

Author SHA1 Message Date
David Potter
1ca90d209a
bug: update sharepoint-with-permissions test to fix CI (#2589)
Adding `metadata.data_source.permissions_data` to
sharepoint-with-permissions.sh --metadata-exclude to prevent the sharepoint
deprecation warning from breaking the test.

Updating expected-structured-output

As per Ahmet's comment, we do want to check sharepoint permissions
metadata at some point, but that will take a separate type of test. A
file diff test is too unstable, so permissions checking will come later
down the road.
2024-03-06 17:15:36 +00:00
David Potter
0c834517d8
fix: change opensearch port (#2517)
Change the opensearch port to see if that fixes CI. We think there may be a
conflict with the elasticsearch docker port.

Also adding a simple retry to the vector query.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-07 21:25:04 +00:00
David Potter
c100ce28a7
feat: add Vectara destination connector (#2357)
Thanks to Ofer at Vectara, we now have a Vectara destination connector.

- There are no dependencies since it is all REST calls to API

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-01 14:38:34 +00:00
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to set up their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
David Potter
76e0d10e61
feat: add MongoDB source connector (#2393)
Adds MongoDB as a source (we already had it as a destination connector)

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-16 20:56:29 +00:00
Steve Canny
2f2c48acd5
feat(ingest): add basic chunking to ingest (#2380)
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.

Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
2024-01-12 20:27:34 +00:00
jakub-sandomierz-deepsense-ai
411aa98bbf
feat: Salesforce connector accepts key path or value (#2321) (#2327)
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.

The simple_salesforce API allows passing either a private key path or
the key value. This PR adds that support to the Ingest connector.

The Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of the following:
- path to PEM encoded key file (as string)
- key contents (PEM encoded string)

If the provided value cannot be parsed as PEM encoded private key, then
the file existence is checked. This way private key contents are not
exposed to unnecessary underlying function calls.
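
The resolution order described above could be sketched roughly as follows. This is only an illustration, not the connector's code: `resolve_private_key` is a hypothetical helper, and a simple PEM-marker check stands in for real private key parsing.

```python
import re
from pathlib import Path
from typing import Tuple

# Assumption: a PEM marker check stands in for real private-key parsing.
_PEM_RE = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.+-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def resolve_private_key(value: str) -> Tuple[str, str]:
    """Classify `value` as key contents or as a path to a key file.

    Hypothetical helper: returns ("contents", value) or ("path", value).
    """
    # Try the value as key contents first, so that key material is never
    # handed to filesystem calls.
    if _PEM_RE.search(value):
        return "contents", value
    # Only if it does not look like a key is it treated as a path.
    if Path(value).is_file():
        return "path", value
    raise ValueError("private key is neither PEM contents nor an existing file")
```

Checking the contents first is what keeps the key out of "unnecessary underlying function calls": the file existence check only ever sees values that did not parse as a key.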
2024-01-11 11:15:24 +00:00
Ahmet Melek
fd293b3e78
feat: add elasticsearch destination connector (#2152)
Closes https://github.com/Unstructured-IO/unstructured/issues/1842
Closes https://github.com/Unstructured-IO/unstructured/issues/2202
Closes https://github.com/Unstructured-IO/unstructured/issues/2203

This PR:
- Adds Elasticsearch destination connector to be able to ingest
documents from any supported source, embed them and write the embeddings
/ documents into Elasticsearch.
- Defines an example unstructured elements schema for users to be able
to set up their unstructured elasticsearch indexes easily.
- Includes parallelized upload and lazy processing for elasticsearch
destination connector.
- Rearranges elasticsearch test helpers to source, destination, and
common folders.
- Adds util functions to be able to batch iterables in a lazy way for
uploads
- Fixes a bug where removing the optional parameter `--fields` broke the
connector due to an integer processing error.
- Fixes a bug where using an [elasticsearch
config](8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35))
for a destination connector resulted in a serialization issue when
optional parameter `--fields` was not provided.
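
As an illustration of the lazy batching utility mentioned in the list above, a minimal sketch might look like this. The name `batch_lazily` is hypothetical and is not taken from the PR.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_lazily(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items without materializing `items`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most `batch_size` items from the shared iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because only one batch is held in memory at a time, an uploader can consume a generator of documents without loading the full set first.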
2023-12-20 01:26:58 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
Roman Isecke
cc05e948ff
chore: sensitive info connector audit (#2227)
### Description
All other connectors that were not included in
https://github.com/Unstructured-IO/unstructured/pull/2194 are now
updated to follow the new pattern and mark any variables as sensitive
where it makes sense.
Core changes:
* All connectors now support an `AccessConfig` to mark data that's
needed for auth (i.e. username, password) and those that are sensitive
are designated appropriately using the new enhanced field.
* All cli configs on the cli definition now inherit from the base config
in the connector file to reuse the variables set on that dataclass
* The base writer class was updated to better generalize the new
approach given better use of dataclasses
* The base cli classes were refactored to also take into account the
need for a connector and write config when creating the respective
runner/writer classes.
* Any mismatches between the cli field name and the dataclass field name
were fixed on the dataclass side, so users are not impacted and
consistency is maintained
* Add custom redaction logic for mongodb URIs since the password is
expected to be a part of it. Now this:
`"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"`
->
`"mongodb+srv://ingest-test-user:***REDACTED***@ingest-test.hgaig.mongodb.net/"`
in the logs
* Bundle all fsspec based files into their own packages. 
* Refactor custom `_decode_dataclass` used for enhanced json mixin by
using a monkey-patch approach. The original approach was breaking on
optional nested dataclasses when serializing since the other methods in
`dataclasses_json_core` weren't using the new method. By monkey-patching
the original method with a new one, all other methods in that library
would use the new one.
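
The MongoDB URI redaction described in the list above could be approximated with a regular expression. This is a sketch under assumptions: `redact_mongodb_uri` is a hypothetical name, and the pattern assumes any `@` inside the password is percent-encoded, as URIs require.

```python
import re

# Captures "mongodb://user:" (or "mongodb+srv://user:") and the "@" that
# follows the password, so that only the password itself is replaced.
_CREDENTIALS_RE = re.compile(r"^(mongodb(?:\+srv)?://[^:/@]+:)[^@]+(@)")


def redact_mongodb_uri(uri: str) -> str:
    """Replace the password portion of a MongoDB URI with a placeholder."""
    return _CREDENTIALS_RE.sub(r"\1***REDACTED***\2", uri)
```

URIs without embedded credentials do not match the pattern and pass through unchanged.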

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-11 17:37:49 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds a source connector for SFTP, which uses fsspec and, through fsspec,
paramiko. Paramiko is the standard SFTP package for Python and is used
by pysftp, among others.

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.
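
The file-versus-folder rule above can be sketched as a check on the last path segment. `is_single_file_url` is a hypothetical helper used only for illustration, not the connector's actual function.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_single_file_url(remote_url: str) -> bool:
    """Return True when the last path segment of the URL has an extension."""
    path = PurePosixPath(urlparse(remote_url).path)
    # No extension (or a trailing slash) means the URL is treated as a folder.
    return path.suffix != ""
```

One consequence of a rule like this is that a folder whose name contains a dot would be treated as a file, and a file without an extension would be treated as a folder.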

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
Roman Isecke
f193d3d43b
feat: improve sensitive data handling by fsspec connectors (#2194)
### Description
Building off of PR
https://github.com/Unstructured-IO/unstructured/pull/2179, updating
fsspec based connectors to use better authentication field handling.
This PR adds in the following changes:

* Update the base classes to inherit from the enhanced json mixin
* Add in a new access config dataclass that should be used as a nested
dataclass in the connector configs
* Update the code extracting configs out of the cli options dictionary
to support the nested access config if it exists on the parent config
* Update all fsspec connectors with explicit access configs given what
each one's SDKs support
* Update the json mixin and enhanced field to support a name override
when serializing/deserializing from json/dicts. This allows a different
name to be used for the CLI option than what the name of the field is on
the dataclass.
* Update all the writers to use a class-based approach and share the same
structure as the runner classes
* Above update allowed for better code to be used in the base source and
destination CLI commands
* Add in utility code around parsing a flat dictionary (coming from the
click based options) into dataclass-based configs with potentially
nested dataclasses.
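
To illustrate the last bullet, here is a minimal sketch of turning a flat options dictionary into a config with a nested access config. Both dataclasses and `from_flat_dict` are hypothetical examples, and the sketch assumes field types are real classes rather than string annotations.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


@dataclass
class AccessConfig:  # hypothetical example config
    username: str
    password: str


@dataclass
class ConnectorConfig:  # hypothetical example config
    remote_url: str
    access_config: AccessConfig


def from_flat_dict(cls: Type[T], flat: Dict[str, Any]) -> T:
    """Build `cls` from a flat dict, recursing into dataclass-typed fields."""
    kwargs = {}
    for f in fields(cls):
        if is_dataclass(f.type):
            # Nested config: pull its fields from the same flat dictionary.
            kwargs[f.name] = from_flat_dict(f.type, flat)
        elif f.name in flat:
            kwargs[f.name] = flat[f.name]
    return cls(**kwargs)
```

For example, `from_flat_dict(ConnectorConfig, {"remote_url": "sftp://host/", "username": "foo", "password": "bar"})` produces a `ConnectorConfig` whose `access_config` holds the two credentials.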

**Slightly unrelated changes:**
* Session handle removed from the pinecone connector, as it was breaking
the serialization of the write config and had no benefit: a
connection was never being shared, and the index simply makes a new
http call each time it's invoked.
* Dedicated write configs were created for all destination connectors to
better support serialization
* Refactor of Elasticsearch connector included, with update to ingest
test to use auth

**TODOs**
* Left a `#TODO` in the code but the way session handler is implemented
right now, it breaks serialization since it adds a generic variable
based on the library being used for a connector (i.e.
`googleapiclient.discovery.Resource`) which is not serializable. This
will need to be updated to omit that from serialization but still
support the current workflow.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-05 20:55:19 +00:00
Roman Isecke
b951d73a9b
feat: add logging to ingest CLI for tests being skipped at the end (#2174)
### Description
Often times there are tests being skipped either due to missing env vars
or explicitly defined in the base script but these get lost in the logs.
This PR updates the scripts to exit with a custom error code when a test
is skipped due to missing env vars. The base script catches this code
and logs every skipped file to a file. At
the end of the script, this file gets logged in the CI output.
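
The actual change lives in shell scripts, but the pattern can be sketched in Python. In this illustration the exit code value `8` and the `run_all` helper are assumptions, not values taken from the PR.

```python
import subprocess
from typing import Dict, List, Sequence

SKIP_EXIT_CODE = 8  # assumption: any code reserved to mean "skipped" would do


def run_all(tests: Dict[str, Sequence[str]]) -> List[str]:
    """Run each command and return the names of tests that reported a skip."""
    skipped = []
    for name, command in tests.items():
        result = subprocess.run(command)
        if result.returncode == SKIP_EXIT_CODE:
            # Remember the skip so it can be reported at the end of the run.
            skipped.append(name)
        elif result.returncode != 0:
            raise RuntimeError(f"{name} failed with exit code {result.returncode}")
    return skipped
```

Printing the returned list once all tests have finished keeps the skips visible instead of buried in the middle of the logs.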
2023-11-29 13:41:19 +00:00
rvztz
50b1431c9e
rvztz/hubspot ingest connector (#1760)
Closes #1843 

Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks

From each record, `body`/`description` information is grabbed. When a
title property is available, this is registered at the beginning of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types`: The supported objects noted above, in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`
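
As a rough sketch of how a `custom-properties` value in that form could be parsed (the function name and the returned shape are assumptions, not the connector's code):

```python
from collections import defaultdict
from typing import Dict, List


def parse_custom_properties(raw: str) -> Dict[str, List[str]]:
    """Parse "<object_type>:<property_id>,..." into {object_type: [ids]}."""
    parsed: Dict[str, List[str]] = defaultdict(list)
    for pair in raw.split(","):
        object_type, sep, property_id = pair.strip().partition(":")
        if not sep or not object_type or not property_id:
            raise ValueError(
                f"expected <object_type>:<custom_property_id>, got {pair!r}"
            )
        # The same object type may appear more than once.
        parsed[object_type].append(property_id)
    return dict(parsed)
```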

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-11-28 23:07:57 +00:00
Roman Isecke
2bb463d006
feat: support both single and batch ingest docs (#2105)
### Description
Some source ingest connectors would be more efficient if they
read content in batches rather than using an entire process per
document, for example when reading from ElasticSearch. Given an index with
possibly hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.
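
The relationship between the two classes might be pictured as below. These are simplified stand-ins written for illustration, not the real `BaseIngestDocBatch` and `BaseSingleIngestDoc` interfaces, and the dictionary-backed source is hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SingleDoc:  # simplified stand-in for BaseSingleIngestDoc
    doc_id: str
    content: str


@dataclass
class DocBatch(ABC):  # simplified stand-in for BaseIngestDocBatch
    doc_ids: List[str]
    ingest_docs: List[SingleDoc] = field(default_factory=list)

    @abstractmethod
    def get_files(self) -> None:
        """Fetch every document in the batch in one read."""


@dataclass
class InMemoryBatch(DocBatch):  # hypothetical source backed by a dict
    store: Dict[str, str] = field(default_factory=dict)

    def get_files(self) -> None:
        # One pass for the whole batch instead of one process per document.
        self.ingest_docs = [SingleDoc(i, self.store[i]) for i in self.doc_ids]
```

After `get_files` runs, downstream steps only see the list of single documents, which is why the rest of the pipeline does not need to change.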

**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-27 19:25:30 +00:00
Roman Isecke
6e67c48fd8
feat: update all ingest tests to use huggingface for embeddings (#2071)
### Description
Update any use of OpenAI for generating embeddings in the ingest tests
to use Huggingface

**Bonus Changes:**
* Remove duplicate delta table test
* Delete delta table destination directory at the beginning of the test
to make sure it doesn't exist and prevent the test from breaking.
2023-11-21 18:43:19 +00:00
ryannikolaidis
13a23deba6
fix: local connector with input path to single file (#2116)
When passed an absolute file path for the input document path, the local
connector writes the output file to the wrong directory. Also, in the
single-file input case we currently include the parent path as part of
the destination; instead, when a single file is specified as input, the
output file should be located directly in the specified output
directory. Note: this change meant that we needed to bump the file path
of some expected results. With this fix, the output in this case is
written to `output-dir/input-filename.json`.

## Changes
- Fix for incorrect output path of files partitioned via the local
connector when the input path is a file path (rather than directory)
- Updated single-local-file test to validate the flow where we specify
an absolute file path (since this was particularly broken)

## Testing
Note: running the updated `local-single-file` test without the changes
to the local connector will result in a final output copy of:

```
Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json
```

where the output path is the input path and not the expected
`output-dir/input-filename.json`

Running with this change we can now expect the file at that directory.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2023-11-19 18:21:31 +00:00
Klaijan
5ba3b9c2c6
chore: get eval metrics from ingest in (#2097)
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-17 18:22:36 +00:00
Klaijan
777a428071
chore: for ingest-test metrics, also check subdirs (#2079)
- The copy script only went through one layer of subdirectories, so it
did not find the match between the manifest file and the structured
output. It now searches all subdirectories.
- `set -e` causes the script to exit on any non-zero exit status, so
all scripts that need to run the copy script now use `set +e` right
before the diff check, then switch back to `set -e` afterwards
- Edit the default evaluation metrics output from `metrics` to
`metrics-tmp` to account for diff check
- Add a script that checks the differences between old eval metric
output (metrics) and new eval metrics output (metrics-tmp)
2023-11-15 21:02:43 -08:00
ryannikolaidis
d5fd21f0fd
fix: pass partition arguments to api when partitioning with unstructured-ingest and --partition-by-api (#2023)
Closes #1064 

When using the `--partition-by-api` flag via unstructured-ingest, none
of the partition arguments are forwarded, meaning that these options are
disregarded. With this change, we now pass through all of the relevant
partition arguments to the api.

## Changes

* parse and pass relevant partition arguments to the api in
unstructured-ingest
* bonus: leverage an existing `partition.api` function to call out to
the api rather than including duplicative request logic in unstructured
ingest
* bonus: --pdf-infer-table-structure is now a flag not an arg (it
defaults false anyways, this is more succinct and consistent with
similar parameters)
* bonus: adds `hi_res_model_name` so a user can specify the model to
leverage when using a hi_res strategy.

## Testing

* update against_api.sh source test script to specify a partition
argument and validates that the response from the api respected the
argument
* manually ran a request and validated that it was processed with
chipper as specified (not sure if we want to bake a chipper request into
the ci tests) (validated that the response leveraged the chipper model):

```
PYTHONPATH=. ./unstructured/ingest/main.py \
    local \
    --output-dir /tmp/ingest-requests/chipper \
    --verbose \
    --reprocess \
    --strategy hi_res \
    --partition-by-api \
    --hi-res-model-name chipper \
    --api-key "$API_KEY" \
    --input-path 'example-docs/layout-parser-paper-with-table.pdf'
```
2023-11-08 04:47:02 +00:00
ryannikolaidis
0e94dd5d65
fix: ingest destination test failure with missing output (#2031)
Intermittently the various destination tests will fail with:

```
--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:
  

ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.
```

Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)

After some investigation it looks like this error is due to collisions
that occur because we’re assuming 1s date accuracy is sufficient when
generating (and deleting) "unique" test destination location names. The
likelihood is actually pretty high given that we run these tests against
a test matrix.

Instead we should just use a uuid for these unique destinations.

## Changes

- Use uuidgen instead of `date +%s` for unique destinations
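
The scripts themselves use `uuidgen`. The same idea expressed in Python, with hypothetical function names and an arbitrary prefix, looks like this:

```python
import time
import uuid


def timestamp_destination(prefix: str) -> str:
    """Old approach: 1-second resolution, so concurrent jobs can collide."""
    return f"{prefix}/{int(time.time())}"


def unique_destination(prefix: str) -> str:
    """New approach: a random UUID makes collisions practically impossible."""
    return f"{prefix}/{uuid.uuid4()}"
```

Two matrix jobs that start within the same second get the same timestamp-based name, so one job's cleanup can delete the other's output. Random UUIDs avoid that.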
2023-11-07 23:14:01 +00:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs locally, making it
zero cost

For all these reasons, testing embedding modules with the Huggingface
model seems to make sense

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Klaijan
c471ea3cc7
chore: remove copy line from non-matrix connectors (#1976)
2023-11-04 10:58:56 -07:00
Roman Isecke
d09c8c0cab
test: update ingest dest tests to follow set pattern (#1991)
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into separate src and dest tests
* Fix azure cognitive search and add to dest tests being run (wasn't
being run originally)
2023-11-03 12:46:56 +00:00
Yao You
db766402a4
test: parametrize ingest test scripts (#1979)
This PR resolves
[CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453):

- parametrizes the output folder so that ingest output files can be
saved somewhere other than where the scripts are; this is set by
env `OUTPUT_ROOT`
- parametrizes the python path `PYTHONPATH` to first check for an
existing definition before defaulting to `.`, the current folder
- parametrizes the run script that carries out ingest using `RUN_SCRIPT`;
the default is still `./unstructured/ingest/main.py`

These changes allow us to run ingest tests with more control. To test:
- run `OUTPUT_ROOT=/tmp
./test_unstructured_ingest/src/local-single-file.sh`: the output now
should be in `/tmp` instead of in the ingest test folder
- run `RUN_SCRIPT=/hope/you/do/not/have/this/folder
./test_unstructured_ingest/src/local-single-file.sh` would raise an
error because system can't find `/hope/you/do/not/have/this/folder`
- run `RUN_SCRIPT=./unstructured/ingest/main.py
./test_unstructured_ingest/src/local-single-file.sh` should run as
normal
- do the following

```bash
cp ./unstructured/ingest/main.py /tmp/main.py
OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh
```
This will run and generate output at `/tmp`

[CORE-2453]:
https://unstructured-ai.atlassian.net/browse/CORE-2453?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
2023-11-02 21:41:56 +00:00
Roman Isecke
6700a7d8c4
feat: support generic inputs for partition kwargs from ingest CLI (#1923)
### Description
To always support the latest changes to the partition method and the
possible kwargs it supports, the ingest CLI has been refactored to take
in a valid json string to represent those values to allow a user more
flexibility with controlling the partition method.
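
A minimal sketch of the idea: decode the JSON string once and forward the result as keyword arguments. Here `fake_partition` is a hypothetical stand-in for the real partition function, and the keys in the usage note are only example values.

```python
import json
from typing import Any, Dict


def parse_partition_kwargs(raw: str) -> Dict[str, Any]:
    """Decode a JSON object string into keyword arguments."""
    kwargs = json.loads(raw)
    if not isinstance(kwargs, dict):
        raise ValueError("partition kwargs must be a JSON object")
    return kwargs


def fake_partition(filename: str, **kwargs: Any) -> Dict[str, Any]:
    # Hypothetical stand-in: a real partition function would process the file.
    return {"filename": filename, **kwargs}
```

With this shape, `fake_partition("a.pdf", **parse_partition_kwargs('{"strategy": "hi_res"}'))` receives `strategy="hi_res"`, and the CLI needs no dedicated option for each new keyword.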
2023-11-02 21:19:29 +00:00
Roman Isecke
24a419ece0
separate ingest tests (#1951)
### Description
This splits the source ingest tests from the destination ingest tests
since they follow different patterns:
* src tests pull data from a source and compare the partitioned content
to the expected results
* destination tests leverage the local connector to produce results to
push to a destination, and rely on extra overhead to create temporary
locations at those destinations to write to and delete when done.

Only the src tests create partitioned content that needs to be checked
so the update ingest test CI job only needs to run these.
2023-11-01 19:23:44 +00:00