6 Commits

Author SHA1 Message Date
Matt Robinson
3158169585
fix: uninstall bson for mongo connector (#3104)
### Summary

Closes #3049. Reenables the MongoDB connector test, which was disabled
previously in #3047 due to incompatibility between the `pymongo` and the
`bson` package from `pip`, which is a dependency for the Astra
connector. Per the `pymongo` docs below, `pymongo` ships with its own
version of `bson` and installing `bson` from `pip` breaks `pymongo`.

- https://pymongo.readthedocs.io/en/stable/installation.html

### Testing

Ingest tests ran successfully for the [source
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315)
and the [destination
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).
2024-05-28 17:45:18 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
Roman Isecke
b951d73a9b
feat: add logging to ingest CLI for tests being skipped at the end (#2174)
### Description
Often times there are tests being skipped either due to missing env vars
or explicitly defined in the base script but these get lost in the logs.
This PR updates the scripts to leverage a custom error code if being
skipped due to missing env vars and this custom error code is being
caught by the base script and logs all files being skipped to a file. At
the end of the script, this file gets logged in the CI output.
2023-11-29 13:41:19 +00:00
ryannikolaidis
13a23deba6
fix: local connector with input path to single file (#2116)
When passed an absolute file path for the input document path, the local
connector incorrectly writes the output file to the wrong directory.
Also, in the single file input path cases we are currently including
parent path as part of the destination writing, instead when a single
file is specified as input the output file should be located directly in
the specified outputs directory. Note: this change meant that we needed
to bump the file path of some expected results. This fixes such that the
output in this case is written to `output-dir/input-filename.json`.

## Changes
- Fix for incorrect output path of files partitioned via the local
connector when the input path is a file path (rather than directory)
- Updated single-local-file test to validate the flow where we specify
an absolute file path (since this was particularly broken)

## Testing
Note: running the updated `local-single-file` test without the changes
to the local connector will result in a final output copy of:

```
Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json
```

where the output path is the input path and not the expected
`output-dir/input-filename.json`

Running with this change we can now expect the file at that directory.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2023-11-19 18:21:31 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00