49 Commits

Author SHA1 Message Date
fzowl
0fbdd4ea36
Refactoring VoyageAI integration (#3878)
Using VoyageAI's python package directly, allowing more features than
through langchain
2025-01-28 21:45:40 +00:00
Roman Isecke
9049e4e2be
feat/remove ingest code, use new dep for tests (#3595)
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-10-15 10:01:34 -05:00
Steve Canny
cd074bb32b
chore(file): remove dead code (#3645)
**Summary**
Remove dead code in `unstructured.file_utils`.

**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing to reduce unnecessary complexity. These can of course
be recovered from Git history if we decide we want them again in future.
2024-09-19 06:45:33 +00:00
David Potter
ddba928344
Potter/mixedbread embedder (#3513)
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai
embedder!
2024-08-27 14:52:13 +00:00
David Potter
59ec64235b
chore: rename astra to astradb (#3458)
DataStax wanted all references to be astradb instead of astra. As per
@erichare

We'll also have to do the same in unstructured-ingest :)
2024-08-05 20:41:02 +00:00
Christine Straub
0eb461acc2
refactor: restructure PDF/Image example document organization (#3410)
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.

## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-18 22:21:32 +00:00
David Potter
6c78677ebb
feat: add Astra source connector (#3304)
Thanks to @erichare we now have an AstraDB source connector.

updating constant names to be more aligned with AstraDB
2024-07-10 20:29:22 +00:00
David Potter
db1e6993a8
rfctr [P6M-398]: salesforce connector v2 (#3344)
Updates salesforce source connector to v2.
2024-07-09 16:46:58 +00:00
Roman Isecke
b556d6d575
rfctr: Implement Sharepoint V2 Source Connector (#3314)
### Description
Migrate over the sharepoint connector to v2 and in the process refactor
the majority of the connector. It now pulls in much more content from
the SDK on index time, including permissions data is the parameters are
passed in. HTML content generated from the SitePage is isolated to the
html content in the `CanvasContent1` and `LayoutWebpartsContent`
returned by the SDK.

Some TODOs were left in there for future iterations. Currently only
document and site page content is being pulled in from sharepoint, but
sharepoint has more types of content than just that, such as lists. Note
left in there to support other sharepoint types.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: vangheem <vangheem@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-07-09 09:52:59 +00:00
David Potter
9eb4c96b94
fix: update slack test to point to new channel (#3328)
When we switched community Slack from Paid to Free we lost the CI test
bot. Also if messages delete after 90 days then our expected test data
will disappear.

- created a new bot in our paid company slack
(test_unstructured_ingest_bot)
- added a new private channel (test-ingest)
- invited the bot to the channel
- adjusted the end datetime of the test to cover the first few messages
in the channel

Still to do:
- update the CI secrets with the new bot token
- update the LastPass with the new bot token (I don't have write
access.. :(.
2024-07-01 18:11:21 +00:00
Roman Isecke
3f581e6b7d
feat/migrate gdrive source connector (#3239)
### Description
Migrate the google drive source connector over to the new v2 ingest
framework and include a variety of improvements as part of the refactor:
* The ID is no longer limited to a drive id but can also be the id of a
subfolder within a drive or a file directly and each case is handled
appropriately
* More metadata is pulled in from google drive to enrich the partitioned
elements downstream and now the modified date is being set to not
reprocess if the ingest pipeline already has the file cached
* timing information is set on the file created when downloaded based on
the last modified data retrieved from google drive

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-25 12:55:28 +00:00
Roman Isecke
e0f4374386
Roman/bugfix conflicting event loop ingest (#3264)
### Description
In use cases where an external system (such as code being run in a
jupyter notebook) already has a running event loop, run the async code
in a dedicated thread pool to not conflict with the existing event loop.

This also has a variety of fixes that were found when putting together a
demo leveraging the elasticsearch destination connector
2024-06-24 18:47:37 +00:00
David Potter
8610bd3ab9
feat: Kafka source and destination connector (#3176)
Thanks to @tullytim we have a new Kafka source and destination
connector. It also works with hosted Kafka via Confluent.

Documentation will be added to the Docs repo.
2024-06-22 23:26:23 +00:00
Matt Robinson
3158169585
fix: uninstall bson for mongo connector (#3104)
### Summary

Closes #3049. Reenables the MongoDB connector test, which was disabled
previously in #3047 due to incompatibility between the `pymongo` and the
`bson` package from `pip`, which is a dependency for the Astra
connector. Per the `pymongo` docs below, `pymongo` ships with its own
version of `bson` and installing `bson` from `pip` breaks `pymongo`.

- https://pymongo.readthedocs.io/en/stable/installation.html

### Testing

Ingest tests ran successfully for the [source
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315)
and the [destination
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).
2024-05-28 17:45:18 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00
Roman Isecke
3eaf65a8c1
feat: refactor ingest (#3009)
### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are ran
* Both multiprocessing and async now supported. Given that a lot of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here
* Destination step broken up into a stager step and an upload step. This
will allow for steps that require manipulation of the data between
formats, such as converting the elements json into a csv format to
upload for tabular destinations, to be pulled out of the step that does
the actual upload.
* The process of writing the content to a local destination was now
pulled out as it's own dedicated destination connector, meaning you no
longer need to persist the content locally once the process is done if
the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the python client.
* Move the uncompress suppport as a pipeline step since this can
arbitrarily apply to any concrete files that have been downloaded,
regardless of where they came from.
* Leverage last modified date to mark files to be reprocessed, even if
the file already exists locally.

### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for it to wrap potential connection errors but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? This would bloat the input params even more.

### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current cli:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help   

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia
```

You can run any of the local or s3 specific ingest tests and these
should now work.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-05-21 17:01:49 +00:00
John
3843af666e
feat: Enable remote chunking via unstructured-ingest (#2905)
Update: The cli shell script works when sending documents to the free
api, but the paid api is down, so waiting to test against it.

- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a json file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the api for chunking.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-04-25 00:24:58 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Ahmet Melek
d46792214a
feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-03-28 21:15:36 +00:00
Steve Canny
9ae838e50a
feat: add --include-orig-elements option to Ingest CLI (#2687)
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify that corresponding new chunking parameter.

**Reviewer** A lot of this is cleanup, the second commit is where the
actual adding of this option are. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-27 06:35:01 +00:00
Roman Isecke
4ff6a5b78e
Roman/bugfix support bedrock embeddings (#2650)
### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the first version to
introduce the bedrock runtime, which is required to hit the bedrock
service, was introduced in version `1.34.63`, which was ahead of the
version of boto pinned.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-03-21 18:21:04 +00:00
David Potter
1ca90d209a
bug: update sharepoint-with-permissions test to fix CI (#2589)
Adding `metadata.data_source.permissions_data` to
sharepoint-with-permissions.sh --metadata-exclude to prevent sharepoint
deprecation warning from ruining test.

Updating expected-structured-output

As per Ahmet's comment. We do want to check sharepoint permissions
metadata at some point. But that will take a separate type of test. A
file diff test is too unstable. Permissions checking will be later down
the road.
2024-03-06 17:15:36 +00:00
David Potter
0c834517d8
fix: change opensearch port (#2517)
change opensearch port to see if fixes CI. We think there may be a
conflict with the elasticsearch docker port.

Also adding simple retry to vector query.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-07 21:25:04 +00:00
David Potter
c100ce28a7
feat: add Vectara destination connector (#2357)
Thanks to Ofer at Vectara, we now have a Vectara destination connector.

- There are no dependencies since it is all REST calls to API
-

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-01 14:38:34 +00:00
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
David Potter
76e0d10e61
feat: add MongoDB source connector (#2393)
Adds MongoDB as a source (we already had it as a destination connector)

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-16 20:56:29 +00:00
Steve Canny
2f2c48acd5
feat(ingest): add basic chunking to ingest (#2380)
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.

Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
2024-01-12 20:27:34 +00:00
jakub-sandomierz-deepsense-ai
411aa98bbf
feat: Salesforce connector accepts key path or value (#2321) (#2327)
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.

simple_salesforce API allows for passing private key path or value. This
PR introduces this support for Ingest connector.

Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of following:
- path to PEM encoded key file (as string)
- key contents (PEM encoded string)

If the provided value cannot be parsed as PEM encoded private key, then
the file existence is checked. This way private key contents are not
exposed to unnecessary underlying function calls.
2024-01-11 11:15:24 +00:00
Ahmet Melek
fd293b3e78
feat: add elasticsearch destination connector (#2152)
Closes https://github.com/Unstructured-IO/unstructured/issues/1842
Closes https://github.com/Unstructured-IO/unstructured/issues/2202
Closes https://github.com/Unstructured-IO/unstructured/issues/2203

This PR:
- Adds Elasticsearch destination connector to be able to ingest
documents from any supported source, embed them and write the embeddings
/ documents into Elasticsearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured elasticsearch indexes easily.
- Includes parallelized upload and lazy processing for elasticsearch
destination connector.
- Rearranges elasticsearch test helpers to source, destination, and
common folders.
- Adds util functions to be able to batch iterables in a lazy way for
uploads
- Fixes a bug where removing the optional parameter `--fields` broke the
connector due to an integer processing error.
- Fixes a bug where using an [elasticsearch
config](8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35))
for a destination connector resulted in a serialization issue when
optional parameter `--fields` was not provided.
2023-12-20 01:26:58 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
Roman Isecke
cc05e948ff
chore: sensitive info connector audit (#2227)
### Description
All other connectors that were not included in
https://github.com/Unstructured-IO/unstructured/pull/2194 are now
updated to follow the new pattern and mark any variables as sensitive
where it makes sense.
Core changes:
* All connectors now support an `AccessConfig` to mark data that's
needed for auth (i.e. username, password) and those that are sensitive
are designated appropriately using the new enhanced field.
* All cli configs on the cli definition now inherit from the base config
in the connector file to reuse the variables set on that dataclass
* The base writer class was updated to better generalize the new
approach given better use of dataclasses
* The base cli classes were refactored to also take into account the
need for a connector and write config when creating the respective
runner/writer classes.
* Any mismatch between the cli field name and the dataclass field name
were updated on the dataclass side to not impact the user but maintain
consistency
* Add custom redaction logic for mongodb URIs since the password is
expected to be a part of it. Now this:
`"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"`
->
`"mongodb+srv://ingest-test-user:***REDACTED***@ingest-test.hgaig.mongodb.net/"`
in the logs
* Bundle all fsspec based files into their own packages. 
* Refactor custom `_decode_dataclass` used for enhanced json mixin by
using a monkey-patch approach. The original approach was breaking on
optional nested dataclasses when serializing since the other methods in
`dataclasses_json_core` weren't using the new method. By monkey-patching
the original method with a new one, all other methods in that library
would use the new one.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-11 17:37:49 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
Roman Isecke
f193d3d43b
feat: improve sensitive data handling by fsspec connectors (#2194)
### Description
Building off of PR
https://github.com/Unstructured-IO/unstructured/pull/2179, updating
fsspec based connectors to use better authentication field handling.
This PR adds in the following changes:

* Update the base classes to inherit from the enhanced json mixin
* Add in a new access config dataclass that should be used as a nest
dataclass in the connector configs
* Update the code extracting configs out of the cli options dictionary
to support the nested access config if it exists on the parent config
* Update all fsspec connectors with explicit access configs given what
each one's SDKs support
* Update the json mixin and enhanced field to support a name override
when serializing/deserializing from json/dicts. This allows a different
name to be used for the CLI option than what the name of the field is on
the dataclass.
* Update all the writes to use class-based approach and share the same
structure of the runner classes
* Above update allowed for better code to be used in the base source and
destination CLI commands
* Add in utility code around paring a flat dictionary (coming from the
click based options) into dataclass-based configs with potentially
nested dataclasses.

**Slightly unrelated changes:**
* session handle removed from pinecone connector as this was breaking
the serialization of the write config and didn't have any benefit as a
connection was never being shared, the index used simply makes a new
http call each time it's invoked.
* Dedicated write configs were created for all destination connectors to
better support serialization
* Refactor of Elasticsearch connector included, with update to ingest
test to use auth

**TODOs**
* Left a `#TODO` in the code but the way session handler is implemented
right now, it breaks serialization since it adds a generic variable
based on the library being used for a connector (i.e.
`googleapiclient.discovery.Resource`) which is not serializable. This
will need to be updated to omit that from serialization but still
support the current workflow.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-05 20:55:19 +00:00
Roman Isecke
b951d73a9b
feat: add logging to ingest CLI for tests being skipped at the end (#2174)
### Description
Often times there are tests being skipped either due to missing env vars
or explicitly defined in the base script but these get lost in the logs.
This PR updates the scripts to leverage a custom error code if being
skipped due to missing env vars and this custom error code is being
caught by the base script and logs all files being skipped to a file. At
the end of the script, this file gets logged in the CI output.
2023-11-29 13:41:19 +00:00
rvztz
50b1431c9e
rvztz/hubspot ingest connector (#1760)
Closes #1843 

Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks

From each record, `body/`description`information is grabbed. When a
title property is available, this is registered at the beggining of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types: One of the noted supported objects in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-11-28 23:07:57 +00:00
Roman Isecke
2bb463d006
feat: support both single and batch ingest docs (#2105)
### Description
There are some source ingest connectors that would be more efficient to
read the content in batches rather than use an entire process per
document. For example, reading from ElasticSearch. Given an index with
possible hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.

**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-27 19:25:30 +00:00
Roman Isecke
6e67c48fd8
feat: update all ingest tests to use huggingface for embeddings (#2071)
### Description
Update any use of OpenAI for generating embeddings in the ingest tests
to use Huggingface

**Bonus Changes:**
* Remove duplicate delta table test
* Delete delta table destination directory at the beginning of the test
to make sure it doesn't exist and prevent the test from breaking.
2023-11-21 18:43:19 +00:00
ryannikolaidis
13a23deba6
fix: local connector with input path to single file (#2116)
When passed an absolute file path for the input document path, the local
connector incorrectly writes the output file to the wrong directory.
Also, in the single file input path cases we are currently including
parent path as part of the destination writing, instead when a single
file is specified as input the output file should be located directly in
the specified outputs directory. Note: this change meant that we needed
to bump the file path of some expected results. This fixes such that the
output in this case is written to `output-dir/input-filename.json`.

## Changes
- Fix for incorrect output path of files partitioned via the local
connector when the input path is a file path (rather than directory)
- Updated single-local-file test to validate the flow where we specify
an absolute file path (since this was particularly broken)

## Testing
Note: running the updated `local-single-file` test without the changes
to the local connector will result in a final output copy of:

```
Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json
```

where the output path is the input path and not the expected
`output-dir/input-filename.json`

Running with this change we can now expect the file at that directory.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2023-11-19 18:21:31 +00:00
Klaijan
5ba3b9c2c6
chore: get eval metrics from ingest in (#2097)
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-17 18:22:36 +00:00
Klaijan
777a428071
chore: for ingest-test metrics, also check subdirs (#2079)
- Copy script only went through one layer of subdirectory so it did not
found the match between manifest file and structured output. Now edited
to search all subdirectories.
- `set -e` causes the script to exit at any exit rather than `exit 0`,
fix all scripts that needs to run the copy script to be `set +e` right
before the check diff, then back to `set -e` after
- Edit the default evaluation metrics output from `metrics` to
`metrics-tmp` to account for diff check
- Add a script that checks the differences between old eval metric
output (metrics) and new eval metrics output (metrics-tmp)
2023-11-15 21:02:43 -08:00
ryannikolaidis
d5fd21f0fd
fix: pass partition arguments to api when partitioning with unstructured-ingest and --partition-by-api (#2023)
Closes #1064 

When using the `--partition-by-api` flag via unstructured-ingest, none
of the partition arguments are forwarded, meaning that these options are
disregarded. With this change, we now pass through all of the relevant
partition arguments to the api.

## Changes

* parse and pass relevant partition arguments to the api in
unstructured-ingest
* bonus: leverage an existing `partition.api` function to call out to
the api rather than including duplicative request logic in unstructured
ingest
* bonus: --pdf-infer-table-structure is now a flag not an arg (it
defaults false anyways, this is more succinct and consistent with
similar parameters)
* bonus: adds `hi_res_model_name` so a user can specify the model to
leverage when using a hi_res strategy.

## Testing

* update against_api.sh source test script to specify a partition
argument and validates that the response from the api respected the
argument
* manually ran a request and validated that it was processed with
chipper as specified (not sure if we want to bake a chipper request into
the ci tests) (validated that the response leveraged the chipper model):

```
PYTHONPATH=. ./unstructured/ingest/main.py \
    local \
    --output-dir /tmp/ingest-requests/chipper \
    --verbose \
    --reprocess \
    --strategy hi_res \
    --partition-by-api \
    --hi-res-model-name chipper \
    --api-key "$API_KEY" \
    --input-path 'example-docs/layout-parser-paper-with-table.pdf'
```
2023-11-08 04:47:02 +00:00
ryannikolaidis
0e94dd5d65
fix: ingest destination test failure with missing output (#2031)
Intermittently the various destination test will fail with:

```
{noformat}--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:
  

ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.{noformat}
```

Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)

After some investigation it looks like this error is due to collisions
that occur because we’re assuming 1s date accuracy is sufficient when
generating (and deleting) "unique" test destination location names. The
likelihood is actually pretty high given that we run these tests against
a test matrix.

Instead we should just use a uuid for these unique destinations.

## Changes

- Use uuidgen instead of `date +%s` for unique destinations
2023-11-07 23:14:01 +00:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost

For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Klaijan
c471ea3cc7
chore: remove copy line from non-matrix connectors (#1976) 2023-11-04 10:58:56 -07:00
Roman Isecke
d09c8c0cab
test: update ingest dest tests to follow set pattern (#1991)
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into seperate src and dest tests
* Fix azure cognitive search and add to dest tests being run (wasn't
being run originally)
2023-11-03 12:46:56 +00:00
Yao You
db766402a4
test: parametrize ingest test scripts (#1979)
This PR resolves
[CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453):

- parametrizes the output folder so that ingest output files can be
saved other than the same place where the scripts are; this is set by
env `OUTPUT_ROOT`
- parametrize the python path `PYTHONPATH` to first check existing
definition before default to `.`, the current folder
- parametrize the run script that carries out ingest using `RUN_SCRIPT`,
default is still `./unstructured/ingest/main.py`

These changes allows us to run ingest test with more control. To test:
- run `OUTPUT_ROOT=/tmp
./test_unstructured_ingest/src/local-single-file.sh`: the output now
should be in `/tmp` instead of in the ingest test folder
- run `RUN_SCRIPT=/hope/you/do/not/have/this/folder
./test_unstructured_ingest/src/local-single-file.sh` would raise an
error because system can't find `/hope/you/do/not/have/this/folder`
- run `RUN_SCRIPT=./unstructured/ingest/main.py
./test_unstructured_ingest/src/local-single-file.sh` should run as
normal
- do the following

```bash
cp ./unstructured/ingest/main.py /tmp/main.py
OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh
```
This will run and generate output at `/tmp`

[CORE-2453]:
https://unstructured-ai.atlassian.net/browse/CORE-2453?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
2023-11-02 21:41:56 +00:00
Roman Isecke
6700a7d8c4
feat: support generic inputs for partition kwargs from ingest CLI (#1923)
### Description
To always support the latest changed to the partition method and the
possible kwargs it supports, the ingest CLI has been refactored to take
in a valid json string to represent those values to allow a user more
flexibility with controlling the partition method.
2023-11-02 21:19:29 +00:00
Roman Isecke
24a419ece0
separate ingest tests (#1951)
### Description
This splits the source ingest tests from the destination ingest tests
since they share a different pattern:
* src tests pull data from a source and compare the partitioned content
to the expected results
* destingation tests leverage the local connector to produce results to
push to a destination and leverages overhead to create temporary
locations at those destinations to write to and delete when done.

Only the src tests create partitioned content that needs to be checked
so the update ingest test CI job only needs to run these.
2023-11-01 19:23:44 +00:00