22 Commits

Author SHA1 Message Date
David Potter
bc791d53f4
feat: add opensearch source and destination connector (#2349)
Adds OpenSearch as a source and destination.

Since OpenSearch is a fork of Elasticsearch, these connectors rely
heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from
OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents
from any supported source, embed them and write the embeddings /
documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 04:31:49 +00:00
David Potter
d7f4c24e21
fix documentation for chroma (#2403)
To test:

cd docs && make HTML

changelogs:

point main readme to the correct connector html page
point chroma docs to correct sample code

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 01:53:52 +00:00
David Potter
76e0d10e61
feat: add MongoDB source connector (#2393)
Adds MongoDB as a source (we already had it as a destination connector)

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-16 20:56:29 +00:00
ryannikolaidis
2ce829ddd0
test: update test Elasticsearch mappings to validate embedding search (#2397)
Currently in the Elasticsearch Destination ingest test we are writing
the embeddings to a "float" type field. In order to leverage this field
for similarity search it should be mapped as "dense_vector" with the
respective dimensions assigned.

This PR updates that mapping and adds a test query to validate that this
works as expected.
2024-01-14 19:27:56 +00:00
Steve Canny
2f2c48acd5
feat(ingest): add basic chunking to ingest (#2380)
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.

Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
2024-01-12 20:27:34 +00:00
jakub-sandomierz-deepsense-ai
411aa98bbf
feat: Salesforce connector accepts key path or value (#2321) (#2327)
Solution to issue
https://github.com/Unstructured-IO/unstructured/issues/2321.

simple_salesforce API allows for passing private key path or value. This
PR introduces this support for Ingest connector.

Salesforce parameter "private-key-file" has been renamed to
"private-key".
It can contain one of following:
- path to PEM encoded key file (as string)
- key contents (PEM encoded string)

If the provided value cannot be parsed as PEM encoded private key, then
the file existence is checked. This way private key contents are not
exposed to unnecessary underlying function calls.
2024-01-11 11:15:24 +00:00
Ronny H
98a0de30b4
Fix sphinx error (#2384)
To test:
> cd docs && make HTML

changelogs:
- remove unindented line in destination connector's sql.rst file 
- add elasticsearch page into destination_connector.rst file
2024-01-10 22:25:18 +00:00
rvztz
950e5d68f9
feat: adds postgresql/sqlite destination connector (#2005)
- Adds a destination connector to upload processed output into a
PostgreSQL/Sqlite database instance.
- Users are responsible to provide their instances. This PR includes a
couple of configuration examples.
- Defines the scripts required to setup a PostgreSQL instance with the
unstructured elements schema.
- Validates postgres/pgvector embedding storage and retrieval

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-04 19:33:16 +00:00
ryannikolaidis
dd1443ab6f
feat: add Qdrant ingest destination connector (#2338)
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.

- Implements CLI and programmatic usage.
- Documentation update
- Integration test script

---
Clone of #2315 to run with CI secrets

---------

Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-02 22:08:20 +00:00
Ahmet Melek
fd293b3e78
feat: add elasticsearch destination connector (#2152)
Closes https://github.com/Unstructured-IO/unstructured/issues/1842
Closes https://github.com/Unstructured-IO/unstructured/issues/2202
Closes https://github.com/Unstructured-IO/unstructured/issues/2203

This PR:
- Adds Elasticsearch destination connector to be able to ingest
documents from any supported source, embed them and write the embeddings
/ documents into Elasticsearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured elasticsearch indexes easily.
- Includes parallelized upload and lazy processing for elasticsearch
destination connector.
- Rearranges elasticsearch test helpers to source, destination, and
common folders.
- Adds util functions to be able to batch iterables in a lazy way for
uploads
- Fixes a bug where removing the optional parameter `--fields` broke the
connector due to an integer processing error.
- Fixes a bug where using an [elasticsearch
config](8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35))
for a destination connector resulted in a serialization issue when
optional parameter `--fields` was not provided.
2023-12-20 01:26:58 +00:00
David Potter
4b8352e0f5
feat: add chroma destination connector (#2240)
Adds Chroma (also known as ChromaDB) as a vector destination.

Currently Chroma is an in-memory single-process oriented library with
plans of a hosted and/or more production ready solution
-https://docs.trychroma.com/deployment

Though they now claim to support multiple Clients hitting the database
at once, I found that it was inconsistent. Sometimes multiprocessing
worked (maybe 1 out of 3 times) But the other times I would get
different errors. So I kept it single process.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-19 16:58:23 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
ac302689a0
chore: update sphinx ingest docs with new connectors (#2245)
Replacing https://github.com/Unstructured-IO/unstructured/pull/2243
2023-12-11 21:29:41 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
rvztz
ce905dd098
feat: Weaviate destination connector (#1963)
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
2023-12-01 22:27:41 +00:00
Ahmet Melek
ed08773de7
feat: add pinecone destination connector (#1774)
Closes https://github.com/Unstructured-IO/unstructured/issues/1414
Closes #2039 

This PR:
- Uses Pinecone python cli to implement a destination connector for
Pinecone and provides the ingest readme requirements
[(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist)
for the connector
- Updates documentation for the s3 destination connector
- Alphabetically sorts setup.py contents
- Updates logs for the chunking node  in ingest pipeline
- Adds a baseline session handle implementation for destination
connectors, to be able to parallelize their operations
- For the
[bug](https://github.com/Unstructured-IO/unstructured/issues/1892)
related to persisting element data to ingest embedding nodes; this PR
tests the
[solution](https://github.com/Unstructured-IO/unstructured/pull/1893)
with its ingest test
- Solves a bug on ingest chunking params with [bugfix on chunking params
and implementing related
test](69e1949a6f)

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-11-29 22:37:32 +00:00
ryannikolaidis
b718db22b8
docs: add mongodb destination connector to ingest docs (#2136)
The mongodb documentation page exists but is missing from the list of
destination connectors, meaning it's not actually visible in the
documentation.
2023-11-21 19:11:25 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
qued
ad09a869b5
fix: update slack link to link shortener (#2010)
Per @tabossert we're now using a link shortener behind which we can
rotate the link to keep it current. That way we (🤞 ) never have to
update this here again.

#### Testing:

Links should work. No more links should exist in the documentation
except this one.
2023-11-06 15:47:18 +00:00
Matt Robinson
e5bcd36475
docs: update slack links (#1990)
### Summary

A user in the [Community
Slack](https://unstructuredw-kbe4326.slack.com/archives/C043YA29U0J/p1698933003702919)
reported having difficulty signing up for Slack using the links from the
documentation. Updated the links to the use the invite link that worked
from him, which came from [this blog
post](https://medium.com/unstructured-io/setting-up-a-private-retrieval-augmented-generation-rag-system-with-local-vector-database-d42f34692ca7).
2023-11-05 11:26:34 -08:00
Roman Isecke
d09c8c0cab
test: update ingest dest tests to follow set pattern (#1991)
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into seperate src and dest tests
* Fix azure cognitive search and add to dest tests being run (wasn't
being run originally)
2023-11-03 12:46:56 +00:00
Roman Isecke
901704b6c0
update sphinx docs with ingest content (#1969)
### Description
Create a new structure for ingest content in the docs, update with all
configs
2023-11-02 20:40:35 +00:00