unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-01 11:21:13 +00:00

Author	SHA1	Message	Date
ryannikolaidis	2ce829ddd0	test: update test Elasticsearch mappings to validate embedding search (#2397) Currently, in the Elasticsearch destination ingest test, we are writing the embeddings to a "float" type field. In order to leverage this field for similarity search, it should be mapped as "dense_vector" with the respective dimensions assigned. This PR updates that mapping and adds a test query to validate that this works as expected.	2024-01-14 19:27:56 +00:00
Ahmet Melek	fd293b3e78	feat: add elasticsearch destination connector (#2152) Closes https://github.com/Unstructured-IO/unstructured/issues/1842 Closes https://github.com/Unstructured-IO/unstructured/issues/2202 Closes https://github.com/Unstructured-IO/unstructured/issues/2203 This PR: - Adds an Elasticsearch destination connector to ingest documents from any supported source, embed them, and write the embeddings / documents into Elasticsearch. - Defines an example unstructured elements schema so users can set up their unstructured Elasticsearch indexes easily. - Includes parallelized upload and lazy processing for the Elasticsearch destination connector. - Rearranges Elasticsearch test helpers into source, destination, and common folders. - Adds util functions to batch iterables lazily for uploads. - Fixes a bug where removing the optional parameter `--fields` broke the connector due to an integer processing error. - Fixes a bug where using an [elasticsearch config](`8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35)`) for a destination connector resulted in a serialization issue when the optional parameter `--fields` was not provided.	2023-12-20 01:26:58 +00:00
Roman Isecke	76efcf4dd7	chore: add shfmt (#2246) ### Description Given all the shell files that now exist in the repo, it would be nice to have linting/formatting around them (in addition to the existing shellcheck, which doesn't do anything to format the shell code). This PR introduces `shfmt` to both check for changes and apply formatting when the associated make targets are called.	2023-12-12 01:04:15 +00:00
Roman Isecke	f193d3d43b	feat: improve sensitive data handling by fsspec connectors (#2194) ### Description Building off of PR https://github.com/Unstructured-IO/unstructured/pull/2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes: * Update the base classes to inherit from the enhanced json mixin * Add in a new access config dataclass that should be used as a nested dataclass in the connector configs * Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config * Update all fsspec connectors with explicit access configs given what each one's SDKs support * Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass. * Update all the writes to use class-based approach and share the same structure of the runner classes * Above update allowed for better code to be used in the base source and destination CLI commands * Add in utility code around parsing a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses. Slightly unrelated changes: * session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared, the index used simply makes a new http call each time it's invoked. * Dedicated write configs were created for all destination connectors to better support serialization * Refactor of Elasticsearch connector included, with update to ingest test to use auth TODOs * Left a `#TODO` in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. `googleapiclient.discovery.Resource`) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-05 20:55:19 +00:00
Roman Isecke	b2e997635f	roman/es ingest test fixes (#1610) ### Description Update elasticsearch docker setup to use docker-compose. Would close out https://github.com/Unstructured-IO/unstructured/issues/1609	2023-10-03 10:39:33 -04:00
Yao You	ad59a879cc	chore: bump inference to 0.6.6 (#1563) - bump `unstructured-inference` to `0.6.6` - specify default model name for element detection to be `detectron2_onnx` to keep current behavior - NOTE: the updated inference package by default would use yolox as element detection model; this will be evaluated and enabled in a separate PR --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 19:09:57 +00:00
Yao You	af7639e23f	ci: add retry to elastic search ingest test (#1581) Occasionally the es test can fail because the index fails to be created on the first try. Experiments show that adding a timeout doesn't help, but adding a retry mitigates the issue. See history of commits in branch: yao/bump-inference-to-0.6.6 https://github.com/Unstructured-IO/unstructured/pull/1563 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2023-09-29 13:42:21 -05:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817)	2023-07-01 17:45:28 +00:00
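
Commit fd293b3e78 mentions util functions that batch iterables lazily for uploads. The log does not show that code, so the following is only a minimal sketch of the general technique, assuming nothing about the repository's actual helper: the name `batch_lazily` and its signature are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_lazily(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items, pulling from `items` on demand.

    Hypothetical helper for illustration; not the function from the commit.
    Because this is a generator, the size check runs on the first `next()`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice consumes at most batch_size items, so the source is never
        # read further ahead than the batch currently being built.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for chunk in batch_lazily(range(5), 2):
        print(chunk)
```

Only one batch is held in memory at a time, which is what makes this shape suitable for uploading a large or streaming set of documents in bulk requests.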

8 Commits
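
Commit 2ce829ddd0 describes remapping the embeddings field from "float" to "dense_vector" and adding a test query. As a rough illustration of what such a mapping and query body can look like, the sketch below builds both as plain dictionaries. It assumes Elasticsearch 8.x conventions; the field name `embeddings`, the dimension count, the cosine similarity setting, and both function names are assumptions, not values taken from the repository.

```python
from typing import Any, Dict, List

# Assumed field name for illustration; the real test mapping may differ.
EMBEDDING_FIELD = "embeddings"


def build_embedding_mapping(dims: int) -> Dict[str, Any]:
    """Return an index mapping that stores embeddings as a dense_vector."""
    return {
        "properties": {
            EMBEDDING_FIELD: {
                "type": "dense_vector",
                "dims": dims,            # must match the embedding model's output size
                "index": True,           # enables approximate kNN search (8.x)
                "similarity": "cosine",  # assumed metric
            }
        }
    }


def build_knn_query(query_vector: List[float], k: int = 5) -> Dict[str, Any]:
    """Return a kNN search body that targets the embedding field."""
    return {
        "knn": {
            "field": EMBEDDING_FIELD,
            "query_vector": query_vector,
            "k": k,
            # num_candidates must be at least k; this multiplier is arbitrary.
            "num_candidates": max(k * 10, 50),
        }
    }


if __name__ == "__main__":
    print(build_embedding_mapping(384))
    print(build_knn_query([0.1] * 384, k=3))
```

A field mapped as "float" can hold the same numbers, but Elasticsearch does not treat it as a vector, so the kNN query above would not be usable against it.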