10 Commits

Author SHA1 Message Date
ryannikolaidis
1f8768750c
chore: add auth to s3 destination test (#3122)
We should be validating the S3 Destination with authenticated requests,
with credentials from a limited test user.

## Changes

- Updates s3 destination test to point to a bucket that requires
authentication.
- Adds authentication to the s3 destination test request
- Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination
- Bonus: fix S3ConnectionConfig never registered for s3 V2 destination
- Bonus: repair version and changelog version for consistency with -dev
convention

## Testing
Validated by changes to S3 destination ingest test
2024-05-31 07:05:09 +00:00
Roman Isecke
3eaf65a8c1
feat: refactor ingest (#3009)
### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are ran
* Both multiprocessing and async now supported. Given that a lot of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here
* Destination step broken up into a stager step and an upload step. This
will allow for steps that require manipulation of the data between
formats, such as converting the elements json into a csv format to
upload for tabular destinations, to be pulled out of the step that does
the actual upload.
* The process of writing the content to a local destination was now
pulled out as it's own dedicated destination connector, meaning you no
longer need to persist the content locally once the process is done if
the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the python client.
* Move the uncompress suppport as a pipeline step since this can
arbitrarily apply to any concrete files that have been downloaded,
regardless of where they came from.
* Leverage last modified date to mark files to be reprocessed, even if
the file already exists locally.

### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for it to wrap potential connection errors but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? This would bloat the input params even more.

### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current cli:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help   

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia
```

You can run any of the local or s3 specific ingest tests and these
should now work.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-05-21 17:01:49 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
ryannikolaidis
13a23deba6
fix: local connector with input path to single file (#2116)
When passed an absolute file path for the input document path, the local
connector incorrectly writes the output file to the wrong directory.
Also, in the single file input path cases we are currently including
parent path as part of the destination writing, instead when a single
file is specified as input the output file should be located directly in
the specified outputs directory. Note: this change meant that we needed
to bump the file path of some expected results. This fixes such that the
output in this case is written to `output-dir/input-filename.json`.

## Changes
- Fix for incorrect output path of files partitioned via the local
connector when the input path is a file path (rather than directory)
- Updated single-local-file test to validate the flow where we specify
an absolute file path (since this was particularly broken)

## Testing
Note: running the updated `local-single-file` test without the changes
to the local connector will result in a final output copy of:

```
Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json
```

where the output path is the input path and not the expected
`output-dir/input-filename.json`

Running with this change we can now expect the file at that directory.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2023-11-19 18:21:31 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
ryannikolaidis
0e94dd5d65
fix: ingest destination test failure with missing output (#2031)
Intermittently the various destination test will fail with:

```
{noformat}--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:
  

ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.{noformat}
```

Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)

After some investigation it looks like this error is due to collisions
that occur because we’re assuming 1s date accuracy is sufficient when
generating (and deleting) "unique" test destination location names. The
likelihood is actually pretty high given that we run these tests against
a test matrix.

Instead we should just use a uuid for these unique destinations.

## Changes

- Use uuidgen instead of `date +%s` for unique destinations
2023-11-07 23:14:01 +00:00
Roman Isecke
d09c8c0cab
test: update ingest dest tests to follow set pattern (#1991)
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into seperate src and dest tests
* Fix azure cognitive search and add to dest tests being run (wasn't
being run originally)
2023-11-03 12:46:56 +00:00
Yao You
db766402a4
test: parametrize ingest test scripts (#1979)
This PR resolves
[CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453):

- parametrizes the output folder so that ingest output files can be
saved other than the same place where the scripts are; this is set by
env `OUTPUT_ROOT`
- parametrize the python path `PYTHONPATH` to first check existing
definition before default to `.`, the current folder
- parametrize the run script that carries out ingest using `RUN_SCRIPT`,
default is still `./unstructured/ingest/main.py`

These changes allows us to run ingest test with more control. To test:
- run `OUTPUT_ROOT=/tmp
./test_unstructured_ingest/src/local-single-file.sh`: the output now
should be in `/tmp` instead of in the ingest test folder
- run `RUN_SCRIPT=/hope/you/do/not/have/this/folder
./test_unstructured_ingest/src/local-single-file.sh` would raise an
error because system can't find `/hope/you/do/not/have/this/folder`
- run `RUN_SCRIPT=./unstructured/ingest/main.py
./test_unstructured_ingest/src/local-single-file.sh` should run as
normal
- do the following

```bash
cp ./unstructured/ingest/main.py /tmp/main.py
OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh
```
This will run and generate output at `/tmp`

[CORE-2453]:
https://unstructured-ai.atlassian.net/browse/CORE-2453?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
2023-11-02 21:41:56 +00:00
Roman Isecke
24a419ece0
separate ingest tests (#1951)
### Description
This splits the source ingest tests from the destination ingest tests
since they share a different pattern:
* src tests pull data from a source and compare the partitioned content
to the expected results
* destingation tests leverage the local connector to produce results to
push to a destination and leverages overhead to create temporary
locations at those destinations to write to and delete when done.

Only the src tests create partitioned content that needs to be checked
so the update ingest test CI job only needs to run these.
2023-11-01 19:23:44 +00:00