Multi-file API Processing
=========================
Introduction
************

This guide demonstrates how to process multiple files with the Unstructured API and the S3 source connector, and how to apply context-aware chunking to the results. The process involves installing dependencies, configuring settings, and running Python scripts to ingest and chunk the data.

Prerequisites
*************

Ensure you have an Unstructured API key and access to an S3 bucket containing the target files.
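
The scripts below read the API key from an environment variable, so it can be exported before running them (the value here is a placeholder, not a real key):

```shell
# Make the API key available to the Python scripts below.
export UNSTRUCTURED_API_KEY="your-api-key"
```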
Step-by-Step Process
********************

Step 1: Install Unstructured and S3 Dependency
----------------------------------------------

Install the ``unstructured`` package with S3 support:

.. code-block:: bash

   pip install "unstructured[s3]"

Step 2: Import Libraries
------------------------

Import ``os`` for reading environment variables, along with the necessary classes and functions from the ``unstructured`` package for chunking and S3 processing:

.. code-block:: python

   import os

   from unstructured.ingest.interfaces import (
       FsspecConfig,
       PartitionConfig,
       ProcessorConfig,
       ReadConfig,
   )
   from unstructured.ingest.runner import S3Runner
   from unstructured.chunking.title import chunk_by_title
   from unstructured.staging.base import dict_to_elements
Step 3: Configuration
---------------------

Set up the API key and the S3 URL for accessing the data:

.. code-block:: python

   UNSTRUCTURED_API_KEY = os.getenv('UNSTRUCTURED_API_KEY')
   S3_URL = "s3://rh-financial-reports/world-development-bank-2023/"
Step 4: Python Runner
---------------------

Configure and run the ``S3Runner`` to process the data:

.. code-block:: python

   runner = S3Runner(
       processor_config=ProcessorConfig(
           verbose=True,
           output_dir="Connector-Output",
           num_processes=8,
       ),
       read_config=ReadConfig(),
       partition_config=PartitionConfig(
           partition_endpoint="https://api.unstructured.io/general/v0/general",
           partition_by_api=True,
           api_key=UNSTRUCTURED_API_KEY,
           strategy="hi_res",
           hi_res_model_name="yolox",
       ),
       fsspec_config=FsspecConfig(
           remote_url=S3_URL,
       ),
   )
   runner.run(anonymous=True)
Step 5: Combine JSON Files from Multi-file Ingestion
----------------------------------------------------

Combine the JSON output files into a single dataset for further processing:

.. code-block:: python

   combined_json_data = read_and_combine_json("Connector-Output/world-development-bank-2023")
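
The helper ``read_and_combine_json`` is not part of the ``unstructured`` package and is not defined above; a minimal sketch of such a helper, assuming each output file holds a JSON list of element dictionaries:

```python
import json
from pathlib import Path


def read_and_combine_json(directory: str) -> list[dict]:
    """Read every .json file in `directory` and concatenate their element lists."""
    combined: list[dict] = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path) as f:
            # Each ingest output file is assumed to contain a JSON array of elements.
            combined.extend(json.load(f))
    return combined
```

Sorting the paths keeps the combined element order deterministic across runs.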
Step 6: Convert into Unstructured Elements for Chunking
-------------------------------------------------------

Convert the combined JSON data into Unstructured ``Element`` objects and apply title-based chunking:

.. code-block:: python

   elements = dict_to_elements(combined_json_data)
   chunks = chunk_by_title(elements)
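
Conceptually, ``chunk_by_title`` starts a new chunk at each ``Title`` element and groups the elements that follow under it. A simplified, pure-Python illustration of that grouping idea (not the library's actual implementation, which also respects section breaks and chunk-size limits):

```python
def group_by_title(elements: list[dict]) -> list[list[dict]]:
    """Start a new group at each Title element; append other elements to the current group."""
    groups: list[list[dict]] = []
    for el in elements:
        if el["type"] == "Title" or not groups:
            groups.append([])
        groups[-1].append(el)
    return groups


# Hypothetical element dictionaries, shaped like ingest output for illustration.
elements = [
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "Global growth slowed in 2023."},
    {"type": "Title", "text": "Outlook"},
    {"type": "NarrativeText", "text": "A modest recovery is projected."},
]
# group_by_title(elements) yields two groups, one per Title.
```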
Conclusion
**********

Following these steps enables efficient processing of multiple files with the Unstructured API and the S3 connector, while context-aware chunking organizes the resulting elements for effective downstream analysis.