unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-05 05:11:52 +00:00

Author	SHA1	Message	Date
Roman Isecke	d876a386ed	Roman/fix ingest async connectors (#3210 ) ### Description Marking a connector as async must be done carefully: if a connector is set to use async, the pipeline will not fan out the inputs via multiprocessing but will instead run in a single process, under the assumption that it benefits more from async due to heavy network traffic. This means that blocking code that is not optimized for async will make the pipeline perform worse than if the connector had never been marked as async, since the pipeline would otherwise fan that work out using multiprocessing. All connectors and processes in the pipeline were revisited to make sure this criterion was met and updated accordingly: * Currently the unstructured client does not support making requests async, so this was moved over to use multiprocessing * fsspec connector was updated to use the async client from the fsspec library. This also required that the client be a `@property` fetched on demand, otherwise the client would break the multiprocessing pool since it maintains a thread lock and that can't be pickled when the fsspec connector doesn't support async. * elasticsearch was also updated to use the async client * weaviate only recently came out with async support in their SDK at a version that is higher than we can use in the open source repo, so a TODO was left but otherwise moved to use multiprocessing * none of the underlying embedders use async, so the embedder step must use multiprocessing for now. TODO left to update underlying embedder classes to optionally support async. * Chunking parameters were not accurately being passed through from the CLI to chunker params; this was fixed --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-17 16:55:19 +00:00
ryannikolaidis	6b5d8a9785	fix: revert dropping of filename extension for some connectors (#3109 ) V2 refactor of ingest code introduces the removal of original file extensions. Since the upgrade of connectors is incomplete this means that some connectors will remove the original file extension and some will not. Still TBD whether this is actually something we want at all. This PR reverts specifically that change in the V2 ingest code so that original file extension is preserved downstream. ## Testing CI is passing with filenames updated via `Ingest Test Fixtures Update` workflow. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2024-05-29 19:14:22 +00:00
Roman Isecke	3eaf65a8c1	feat: refactor ingest (#3009 ) ### Description This refactors the current ingest CLI process to support better granularity in how the steps are run * Both multiprocessing and async are now supported. Given that a lot of the steps are IO-bound, such as downloading and uploading content, we can achieve better parallelization by using async here * Destination step broken up into a stager step and an upload step. This will allow for steps that require manipulation of the data between formats, such as converting the elements json into a csv format to upload for tabular destinations, to be pulled out of the step that does the actual upload. * The process of writing the content to a local destination is now pulled out as its own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere. * Quick update to the chunker/partition step to use the python client. * Move the uncompress support into a pipeline step since this can arbitrarily apply to any concrete files that have been downloaded, regardless of where they came from. * Leverage last modified date to mark files to be reprocessed, even if the file already exists locally. ### Callouts Retry configs haven't been moved over yet. This is an open question because the intent was for it to wrap potential connection errors but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? This would bloat the input params even more. 
### Testing * If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: [example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py) * If you want to run the new code as an isolated CLI: ```shell PYTHONPATH=. python unstructured/ingest/v2/main.py --help ``` * If you want to see which commands have been migrated to the new version, there's now a `v2` short help text next to those commands when running the current cli: ```shell PYTHONPATH=. python unstructured/ingest/main.py --help Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help Options: --help Show this message and exit. Commands: airtable azure biomed box confluence delta-table discord dropbox elasticsearch fsspec gcs github gitlab google-drive hubspot jira local v2 mongodb notion onedrive opensearch outlook reddit s3 v2 salesforce sftp sharepoint slack wikipedia ``` You can run any of the local or s3 specific ingest tests and these should now work. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-05-21 17:01:49 +00:00
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes the element's sequence number on the page, page number, document filename and its text * there are more tests for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
Steve Canny	56fbaaed10	feat(chunking): add metadata.orig_elements serde (#2680 ) Summary This final PR in the "orig_elements" series adds the needful such that `.metadata.orig_elements`, when present on a chunk (element), is serialized to JSON when the chunk is serialized, for instance, to be used in an HTTP response payload. It also provides for deserializing such a JSON payload into chunks that contain the `.orig_elements` metadata. Additional Context Note that `.metadata.orig_elements` is always `Optional[list[Element]]` when in memory. However, those original elements are serialized as Base64-encoded gzipped JSON and are in that form (str) when present as JSON or as "element-dicts" which is an intermediate serialization/deserialization format. That is, serialization is `Element -> dict -> JSON` and deserialization is `JSON -> dict -> Element` and `.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-22 21:53:26 +00:00
Steve Canny	8ea203adf7	feat(chunking): composite text gets is_continuation (#2639 ) Summary Add `metadata.is_continuation = True` to metadata of second-and-later text-split chunks formed from an oversized non-table element. Previously this metadata was only present on text-split `TableChunk` elements. This enables downstream filtering of intentionally redundant metadata on chunk elements that may not be desired for all purposes. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-12 19:44:41 +00:00
Steve Canny	2f2c48acd5	feat(ingest): add basic chunking to ingest (#2380 ) The new "basic" chunking strategy and overlap options need to be available from the ingest CLI. An ingest test of those features is also welcome, both to verify the ingest feature and to defend against regressions in the chunking code. Add a local ingest test exercising both the "basic" chunking strategy and intra-chunk overlap. Since there is no new source connector involved, use the local ingest source and destination. Update documentation to suit, filling in some details that hadn't made it into the docs yet.	2024-01-12 20:27:34 +00:00
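
Commit 2d1923ac7e above says the element ID hash includes the element's sequence number on the page, the page number, the document filename and the text. A minimal sketch of that idea follows. The function name `element_id`, the SHA-256 digest, the separator and the 32-character truncation are all assumptions made for illustration, not the library's actual implementation.

```python
import hashlib


def element_id(filename: str, page_number: int, sequence_number: int, text: str) -> str:
    """Return a deterministic hex ID from an element's position and text.

    Hypothetical helper: the field order, separator and digest are assumptions.
    """
    # Join the fields with a separator so that (page 1, sequence 23) and
    # (page 12, sequence 3) cannot produce the same concatenated payload.
    payload = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Same inputs always give the same ID (deterministic).
first = element_id("report.pdf", 1, 0, "Quarterly results")
again = element_id("report.pdf", 1, 0, "Quarterly results")

# Identical text at a different position in the same document gets a different ID.
repeated_text = element_id("report.pdf", 1, 1, "Quarterly results")
```

Including the position fields is what lets two elements with identical text in the same document receive distinct IDs, while re-running on the same document reproduces the same values.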

7 Commits
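
Commit 56fbaaed10 above describes `.metadata.orig_elements` as travelling in serialized form as Base64-encoded gzipped JSON. The round trip can be sketched with the standard library alone. The helper names `encode_orig_elements` and `decode_orig_elements` are hypothetical, and the use of the `gzip` module is an assumption; the library's own functions and compression details may differ.

```python
import base64
import gzip
import json


def encode_orig_elements(element_dicts: list[dict]) -> str:
    """Serialize element dicts to a Base64 string of gzipped JSON (hypothetical helper)."""
    raw = json.dumps(element_dicts, sort_keys=True).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_orig_elements(blob: str) -> list[dict]:
    """Invert encode_orig_elements: Base64-decode, gunzip, then parse the JSON."""
    raw = gzip.decompress(base64.b64decode(blob))
    return json.loads(raw.decode("utf-8"))


# Illustrative element dicts; the keys are made up for this example.
originals = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Chunks keep a record of their sources."},
]
blob = encode_orig_elements(originals)
restored = decode_orig_elements(blob)
```

The encoded value is a plain ASCII string, so it can sit inside both the intermediate `dict` form and the final JSON payload without further escaping.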