Batch Processing Documents
Sample Connector: S3
See the sample project examples/ingest/s3-small-batch/main.py, which processes all the documents under a given s3 URL with 2 parallel processes and writes the structured JSON output to structured-outputs/.
You can try it out with:
PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --anonymous
# Note: the --anonymous flag tells boto3 not to use AWS credentials, which is what
# you want for public buckets. Remove this flag when local AWS credentials are required.
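For orientation, the core pattern behind that command can be sketched as follows. This is a simplified illustration rather than the actual main.py: the helper names (list_keys, process_one) and constants are assumptions, but the anonymous boto3 access, the 2-process pool, and the structured JSON output mirror what the sample project does.

# Simplified sketch of the batch-processing pattern (illustrative only; not the real main.py).
import json
import multiprocessing as mp
import os

import boto3
from botocore import UNSIGNED
from botocore.client import Config

from unstructured.partition.auto import partition
from unstructured.staging.base import convert_to_dict

S3_BUCKET = "utic-dev-tech-fixtures"     # from the example URL above
S3_PREFIX = "small-pdf-set/"
DOWNLOAD_DIR = "tmp-ingest"              # stand-in for --download-dir
OUTPUT_DIR = "structured-outputs"        # stand-in for --structured-output-dir


def list_keys(bucket, prefix):
    # List object keys under an s3 prefix using anonymous (unsigned) access.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def process_one(key):
    # Download a single document, partition it with Unstructured, and write JSON output.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    local_path = os.path.join(DOWNLOAD_DIR, os.path.basename(key))
    s3.download_file(S3_BUCKET, key, local_path)

    elements = partition(filename=local_path)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    out_path = os.path.join(OUTPUT_DIR, os.path.basename(key) + ".json")
    with open(out_path, "w") as f:
        json.dump(convert_to_dict(elements), f, indent=2)


if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:   # matches the default --num-processes of 2
        pool.map(process_one, list_keys(S3_BUCKET, S3_PREFIX))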
This utility is ready to use with any s3 prefix!
By default, it will not reprocess files from s3 if their outputs already exist in --structured-output-dir. Naturally, this comes in handy when processing a large number of files. However, you can force reprocessing of all documents with the --reprocess flag.
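For example, to force a fresh run into a specific output directory (the directory name here is only illustrative):

PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --anonymous --structured-output-dir structured-outputs --reprocess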
$ PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --help
Usage: main.py [OPTIONS]
Options:
--s3-url TEXT Prefix of s3 objects (files) to download.
E.g. s3://bucket1/path/. This value may also
be a single file.
--re-download / --no-re-download
Re-download files from s3 even if they are
already present in --download-dir.
--download-dir TEXT Where s3 files are downloaded to, defaults
to tmp-ingest-<6 random chars>.
--preserve-downloads Preserve downloaded s3 files. Otherwise each
file is removed after being processed
successfully.
--structured-output-dir TEXT Where to place structured output .json
files.
--reprocess Reprocess a downloaded file from s3 even if
the relevant structured output .json file in
--structured-output-dir already exists.
--num-processes INTEGER Number of parallel processes to process docs
in. [default: 2]
--anonymous Connect to s3 without local AWS credentials.
-v, --verbose
--help Show this message and exit.
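The options compose as you would expect. For instance, to keep the downloaded files around and use four worker processes (the download directory name is only an example):

PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --anonymous --download-dir s3-downloads --preserve-downloads --num-processes 4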
Developer notes
The Abstractions
sequenceDiagram
participant MainProcess
participant DocReader (connector)
participant DocProcessor
participant StructuredDocWriter (connector)
MainProcess->>DocReader (connector): Initialize / Authorize
DocReader (connector)->>MainProcess: All doc metadata (no file content)
loop Single doc at a time (allows for multiprocessing)
MainProcess->>DocProcessor: Raw document metadata (no file content)
DocProcessor->>DocReader (connector): Request document
DocReader (connector)->>DocProcessor: Single document payload
Note over DocProcessor: Process through Unstructured
DocProcessor->>StructuredDocWriter (connector): Write Structured Data
Note over StructuredDocWriter (connector): <br /> Optionally store version info, filename, etc.
DocProcessor->>MainProcess: Structured Data (only JSON in V0)
end
Note over MainProcess: Optional - process structured data from all docs
The abstractions in the above diagram are honored in the S3 Connector project (though ABCs are not yet written), with the exception of the StructuredDocWriter, which may be added more formally at a later time.
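As a rough illustration of what such ABCs might look like, here is a hypothetical sketch. The class and method names below are assumptions (they do not exist in the codebase yet) and simply mirror the roles in the diagram above.

# Hypothetical sketch of the connector abstractions from the diagram above.
# None of these ABCs exist in the codebase yet; all names are illustrative.
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseDocReader(ABC):
    """Connector that knows how to list and fetch documents from a source (e.g. s3)."""

    @abstractmethod
    def initialize(self) -> None:
        """Initialize / authorize against the source (the first arrow in the diagram)."""

    @abstractmethod
    def get_doc_metadata(self) -> List[Dict[str, Any]]:
        """Return metadata for all docs, with no file content, so it can be handed
        cheaply to worker processes."""

    @abstractmethod
    def fetch_doc(self, doc_metadata: Dict[str, Any]) -> bytes:
        """Return the payload for a single document when a DocProcessor requests it."""


class BaseDocProcessor(ABC):
    """Runs a single document through Unstructured and returns structured data."""

    @abstractmethod
    def process(self, reader: BaseDocReader, doc_metadata: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Fetch one doc via the reader, partition it, and return JSON-serializable elements."""


class BaseStructuredDocWriter(ABC):
    """Persists structured output; optionally stores version info, filename, etc."""

    @abstractmethod
    def write(self, doc_metadata: Dict[str, Any], elements: List[Dict[str, Any]]) -> None:
        """Write the structured data for one document."""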