unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-13 20:15:54 +00:00

Author	SHA1	Message	Date
David Potter	8610bd3ab9	feat: Kafka source and destination connector (#3176 ) Thanks to @tullytim we have a new Kafka source and destination connector. It also works with hosted Kafka via Confluent. Documentation will be added to the Docs repo.	2024-06-22 23:26:23 +00:00
Matt Robinson	3158169585	fix: uninstall bson for mongo connector (#3104 ) ### Summary Closes #3049. Reenables the MongoDB connector test, which was disabled previously in #3047 due to incompatibility between the `pymongo` and the `bson` package from `pip`, which is a dependency for the Astra connector. Per the `pymongo` docs below, `pymongo` ships with its own version of `bson` and installing `bson` from `pip` breaks `pymongo`. - https://pymongo.readthedocs.io/en/stable/installation.html ### Testing Ingest tests ran successfully for the [source connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315) and the [destination connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).	2024-05-28 17:45:18 +00:00
Matt Robinson	d7608014c0	improve: add Python 3.12 support (#3033) (#3047) ### Summary Closes #2959. Updates the dependency and CI to add support for Python 3.12. The MongoDB ingest tests were disabled due to jobs like [this one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333) failing due to issues with the `bson` package. `bson` is a dependency for the AstraDB connector, but `pymongo` does not work when `bson` is installed from `pip`. This issue is documented by MongoDB [here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun off #3049 to resolve this. Issue seems unrelated to Python 3.12, though unsure why this didn't surface previously. Disables the `argilla` tests because `argilla` does not yet support Python 3.12. We can add the `argilla` tests back in once the PR referenced below is merged. You can still use the `stage_for_argilla` function if you're on `python<3.12` and you install `argilla` yourself. - https://github.com/argilla-io/argilla/pull/4837 --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>	2024-05-19 23:03:15 +00:00
David Potter	9177aa20a8	feature CORE-3985: add Clarifai destination connector (#2633) Thanks to @mogith-pn from Clarifai we have a new destination connector! This PR intends to add Clarifai as an ingest destination connector. It includes CLI and programmatic access, documentation and examples, and an integration test script.	2024-03-21 16:36:21 +00:00
David Potter	e8ec09c8b9	feat: astra dest connector (#2571 ) Thanks to Eric Hare @erichare at DataStax we have a new destination connector. This Pull Request implements an integration with [Astra DB](https://datastax.com) which allows for the Astra DB Vector Database to be compatible with Unstructured's set of integrations. To create your Astra account and authenticate with your `ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these steps: 1. Create an account at https://astra.datastax.com 2. Login and create a new database 3. From the database page, in the right hand panel, you will find your API Endpoint 4. Beneath that, you can create a Token to be used Some notes about Astra DB: - Astra DB is a Vector Database which allows for high-performance database transactions, and enables modern GenAI apps [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html) - It supports similarity search via a number of methods [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics) - It also supports non-vector tables / collections	2024-02-23 20:50:50 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509 ) This PR: - Moves ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances. to use the instances as intended you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
David Potter	c100ce28a7	feat: add Vectara destination connector (#2357) Thanks to Ofer at Vectara, we now have a Vectara destination connector. - There are no dependencies since it is all REST calls to the API --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 14:38:34 +00:00
David Potter	bc791d53f4	feat: add opensearch source and destination connector (#2349 ) Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible. - Adds OpenSearch source connector to be able to ingest documents from OpenSearch. - Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured OpenSearch indexes easily. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 04:31:49 +00:00
rvztz	950e5d68f9	feat: adds postgresql/sqlite destination connector (#2005) - Adds a destination connector to upload processed output into a PostgreSQL/SQLite database instance. - Users are responsible for providing their own instances. This PR includes a couple of configuration examples. - Defines the scripts required to set up a PostgreSQL instance with the unstructured elements schema. - Validates postgres/pgvector embedding storage and retrieval --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-04 19:33:16 +00:00
ryannikolaidis	dd1443ab6f	feat: add Qdrant ingest destination connector (#2338 ) This PR intends to add [Qdrant](https://qdrant.tech/) as a supported ingestion destination. - Implements CLI and programmatic usage. - Documentation update - Integration test script --- Clone of #2315 to run with CI secrets --------- Co-authored-by: Anush008 <anushshetty90@gmail.com> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2024-01-02 22:08:20 +00:00
Ahmet Melek	fd293b3e78	feat: add elasticsearch destination connector (#2152 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1842 Closes https://github.com/Unstructured-IO/unstructured/issues/2202 Closes https://github.com/Unstructured-IO/unstructured/issues/2203 This PR: - Adds Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured elasticsearch indexes easily. - Includes parallelized upload and lazy processing for elasticsearch destination connector. - Rearranges elasticsearch test helpers to source, destination, and common folders. - Adds util functions to be able to batch iterables in a lazy way for uploads - Fixes a bug where removing the optional parameter `--fields` broke the connector due to an integer processing error. - Fixes a bug where using an [elasticsearch config](`8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35)`) for a destination connector resulted in a serialization issue when optional parameter `--fields` was not provided.	2023-12-20 01:26:58 +00:00
David Potter	4b8352e0f5	feat: add chroma destination connector (#2240) Adds Chroma (also known as ChromaDB) as a vector destination. Currently Chroma is an in-memory, single-process library, with plans for a hosted and/or more production-ready solution (see https://docs.trychroma.com/deployment). Though they now claim to support multiple clients hitting the database at once, I found that it was inconsistent. Sometimes multiprocessing worked (maybe 1 out of 3 times), but the other times I would get different errors, so I kept it single-process. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-19 16:58:23 +00:00
cragwolfe	bd8a74d686	chore: shell scripts default indent of 2 instead of 4 (#2287 ) Given the tendency for shell scripts to easily enter into a few levels of indentation and long line lengths, update the default to 2 spaces.	2023-12-19 07:48:21 +00:00
Roman Isecke	76efcf4dd7	chore: add shfmt (#2246 ) ### Description Given all the shell files that now exist in the repo, would be nice to have linting/formatting around them (in addition to the existing shellcheck which doesn't do anything to format the shell code). This PR introduces `shfmt` to both check for changes and apply formatting when the associated make targets are called.	2023-12-12 01:04:15 +00:00
Roman Isecke	f193d3d43b	feat: improve sensitive data handling by fsspec connectors (#2194) ### Description Building off of PR https://github.com/Unstructured-IO/unstructured/pull/2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes: * Update the base classes to inherit from the enhanced json mixin * Add in a new access config dataclass that should be used as a nested dataclass in the connector configs * Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config * Update all fsspec connectors with explicit access configs given what each one's SDKs support * Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass. * Update all the writes to use a class-based approach and share the same structure of the runner classes * Above update allowed for better code to be used in the base source and destination CLI commands * Add in utility code around parsing a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses. Slightly unrelated changes: * session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared; the index used simply makes a new http call each time it's invoked. * Dedicated write configs were created for all destination connectors to better support serialization * Refactor of Elasticsearch connector included, with update to ingest test to use auth. TODOs: * Left a `#TODO` in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. `googleapiclient.discovery.Resource`) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-05 20:55:19 +00:00
rvztz	ce905dd098	feat: Weaviate destination connector (#1963 ) Closes #1781. - Adds a Weaviate destination connector - The connector receives a host for the weaviate instance and a weaviate class name. - Defines a weaviate schema for json elements. - Defines the pre-processing to conform unstructured's schema to the proposed weaviate schema.	2023-12-01 22:27:41 +00:00
Ahmet Melek	ed08773de7	feat: add pinecone destination connector (#1774 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1414 Closes #2039 This PR: - Uses Pinecone python cli to implement a destination connector for Pinecone and provides the ingest readme requirements [(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist) for the connector - Updates documentation for the s3 destination connector - Alphabetically sorts setup.py contents - Updates logs for the chunking node in ingest pipeline - Adds a baseline session handle implementation for destination connectors, to be able to parallelize their operations - For the [bug](https://github.com/Unstructured-IO/unstructured/issues/1892) related to persisting element data to ingest embedding nodes; this PR tests the [solution](https://github.com/Unstructured-IO/unstructured/pull/1893) with its ingest test - Solves a bug on ingest chunking params with [bugfix on chunking params and implementing related test](`69e1949a6f`) --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2023-11-29 22:37:32 +00:00
Roman Isecke	c028a14ebf	chore: enable azure destination CI tests (#2172 ) Add AZURE_DEST_CONNECTION_STR to list of env vars in ci ingest test	2023-11-29 20:44:18 +00:00
Roman Isecke	b951d73a9b	feat: add logging to ingest CLI for tests being skipped at the end (#2174 ) ### Description Often times there are tests being skipped either due to missing env vars or explicitly defined in the base script but these get lost in the logs. This PR updates the scripts to leverage a custom error code if being skipped due to missing env vars and this custom error code is being caught by the base script and logs all files being skipped to a file. At the end of the script, this file gets logged in the CI output.	2023-11-29 13:41:19 +00:00
Roman Isecke	6e67c48fd8	feat: update all ingest tests to use huggingface for embeddings (#2071 ) ### Description Update any use of OpenAI for generating embeddings in the ingest tests to use Huggingface Bonus Changes: * Remove duplicate delta table test * Delete delta table destination directory at the beginning of the test to make sure it doesn't exist and prevent the test from breaking.	2023-11-21 18:43:19 +00:00
Roman Isecke	b8af2f18bb	add mongo db destination connector (#2068) ### Description This adds the basic implementation of pushing the generated json output of partition to MongoDB. None of this code provisions the MongoDB instance, so things like adding a search index around the embedding content must be done by the user. Any sort of schema validation would also have to take place via user-specific configuration on the database. This update makes no assumptions about the configuration of the database itself.	2023-11-16 22:40:22 +00:00
Roman Isecke	d09c8c0cab	test: update ingest dest tests to follow set pattern (#1991) ### Description Update all destination tests to match pattern: * Don't omit any metadata to check full schema * Move azure cognitive dest test from src to dest * Split delta table test into separate src and dest tests * Fix azure cognitive search and add to dest tests being run (wasn't being run originally)	2023-11-03 12:46:56 +00:00
Roman Isecke	24a419ece0	separate ingest tests (#1951) ### Description This splits the source ingest tests from the destination ingest tests since they follow different patterns: * src tests pull data from a source and compare the partitioned content to the expected results * destination tests leverage the local connector to produce results to push to a destination and leverage overhead to create temporary locations at those destinations to write to and delete when done. Only the src tests create partitioned content that needs to be checked, so the update ingest test CI job only needs to run these.	2023-11-01 19:23:44 +00:00
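
Commit fd293b3e78 in the log above mentions utility functions that batch iterables lazily for uploads. The log does not show that code, so what follows is only a minimal sketch of the general technique. The name `batch_lazily` and its signature are hypothetical and are not taken from the repository.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_lazily(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without reading the whole input.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items from the shared iterator,
        # so only one batch is held in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batch_lazily(range(5), 2):
        print(batch)
```

Because the function is a generator, a destination writer could start uploading the first batch before the rest of the input has been produced.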

23 Commits
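
Commit f193d3d43b describes parsing a flat dictionary of CLI options into dataclass-based configs that may contain a nested access config. As a rough illustration of that idea only, the sketch below uses hypothetical names (`ExampleAccessConfig`, `ExampleConnectorConfig`, `from_flat_dict`); none of them come from the repository, and the real utility may work differently.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


@dataclass
class ExampleAccessConfig:
    # Hypothetical sensitive field, kept apart from the parent config.
    token: str = ""


@dataclass
class ExampleConnectorConfig:
    # Hypothetical connector config with a nested access config.
    remote_url: str
    access_config: ExampleAccessConfig


def from_flat_dict(config_cls: Type[T], flat: Dict[str, Any]) -> T:
    """Build config_cls from a flat dict, recursing into nested dataclasses.

    Assumes field annotations are real classes (not string annotations).
    """
    kwargs: Dict[str, Any] = {}
    for f in fields(config_cls):
        if is_dataclass(f.type):
            # Nested dataclass: fill it from the same flat dictionary.
            kwargs[f.name] = from_flat_dict(f.type, flat)
        elif f.name in flat:
            kwargs[f.name] = flat[f.name]
    return config_cls(**kwargs)


if __name__ == "__main__":
    options = {"remote_url": "s3://example-bucket", "token": "abc"}
    print(from_flat_dict(ExampleConnectorConfig, options))
```

Keeping credentials in their own nested dataclass lets serialization code treat the whole access config as sensitive rather than tracking individual fields.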