mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-11-19 11:57:32 +00:00
37 lines
1.6 KiB
Markdown
37 lines
1.6 KiB
Markdown
|
|
# Batch Processing Documents
|
||
|
|
|
||
|
|
Several classes are provided in the Unstructured library
|
||
|
|
to enable effecient batch processing of documents.
|
||
|
|
|
||
|
|
## The Abstractions
|
||
|
|
|
||
|
|
```mermaid
|
||
|
|
sequenceDiagram
|
||
|
|
participant MainProcess
|
||
|
|
participant DocReader (connector)
|
||
|
|
participant DocProcessor
|
||
|
|
participant StructuredDocWriter (conncector)
|
||
|
|
MainProcess->>DocReader (connector): Initialize / Authorize
|
||
|
|
DocReader (connector)->>MainProcess: All doc metadata (no file content)
|
||
|
|
loop Single doc at a time (allows for multiprocessing)
|
||
|
|
MainProcess->>DocProcessor: Raw document metadata (no file content)
|
||
|
|
DocProcessor->>DocReader (connector): Request document
|
||
|
|
DocReader (connector)->>DocProcessor: Single document payload
|
||
|
|
Note over DocProcessor: Process through Unstructured
|
||
|
|
DocProcessor->>StructuredDocWriter (conncector): Write Structured Data
|
||
|
|
Note over StructuredDocWriter (conncector): <br /> Optionally store version info, filename, etc
|
||
|
|
DocProcessor->>MainProcess: Structured Data (only JSON in V0)
|
||
|
|
end
|
||
|
|
Note over MainProcess: Optional - process structured data from all docs
|
||
|
|
```
|
||
|
|
|
||
|
|
## Sample Connector: S3
|
||
|
|
|
||
|
|
See the sample project [examples/ingest/s3-small-batch/main.py](examples/ingest/s3-small-batch/main.py), which processes all the documents under a given s3 URL with 2 parallel processes, writing the structured json output to `structured-outputs/`.
|
||
|
|
|
||
|
|
You can try it out with
|
||
|
|
|
||
|
|
PYTHONPATH=. python examples/ingest/s3-small-batch/main.py
|
||
|
|
|
||
|
|
The abstractions in the above diagram are honored in this project (though ABC's are not yet written), with the exception of the StructuredDocWriter which may be added more formally at a later time.
|