Batch Processing Documents
The unstructured-ingest CLI
The unstructured library includes a CLI to batch ingest documents from (soon to be various) sources, storing structured outputs locally on the filesystem.
For example, the following command processes all the documents in S3 in the
utic-dev-tech-fixtures bucket with a prefix of small-pdf-set/.
unstructured-ingest \
--s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
--s3-anonymous \
--structured-output-dir s3-small-batch-output \
--num-processes 2
Naturally, --num-processes may be adjusted for better instance utilization with multiprocessing.
Installation note: make sure to install the following extras when installing unstructured, needed for the above command:
pip install "unstructured[s3,local-inference]"
See the Quick Start, which documents how to pip install detectron2 and the other OS dependencies necessary for parsing PDF files.
Developers' Guide
Local testing
When testing from a local checkout rather than a pip-installed version of unstructured,
just execute unstructured/ingest/main.py, e.g.:
PYTHONPATH=. ./unstructured/ingest/main.py \
--s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
--s3-anonymous \
--structured-output-dir s3-small-batch-output \
--num-processes 2
Adding Data Connectors
To add a connector, refer to unstructured/ingest/connector/s3_connector.py as an example that implements the three relevant abstract base classes.
Then, update unstructured/ingest/main.py to instantiate your new connector when its command line options are invoked.
Create a folder under examples/ingest with at least one easily reproducible script that shows the new connector in action.
Finally, to ensure the connector remains stable, add a new script test_unstructured_ingest/test-ingest-<the-new-data-source>.sh similar to test_unstructured_ingest/test-ingest-s3.sh, and append a line invoking the new script in test_unstructured_ingest/test-ingest.sh.
You'll notice that the unstructured outputs for the new documents are expected
to be checked into CI under test_unstructured_ingest/expected-structured-output/<folder-name-relevant-to-your-dataset>. So, you'll need to git add those json outputs so that test-ingest.sh passes in CI.
The main.py flags --re-download/--no-re-download, --download-dir, --preserve-downloads, --structured-output-dir, and --reprocess should be honored by the new connector.
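As a rough illustration of what honoring these flags can mean in practice, the sketch below models a document class with has_output(), get_file(), and cleanup_file() methods. SketchConfig, SketchIngestDoc, and process() are hypothetical names invented for this example and are not the real classes in unstructured/ingest/interfaces.py; the "download" is a placeholder file write rather than a call to any data source.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a connector config (not the real BaseConnectorConfig)."""

    download_dir: str
    output_dir: str
    re_download: bool = False
    preserve_download: bool = False
    reprocess: bool = False


@dataclass
class SketchIngestDoc:
    """Hypothetical document class (not the real BaseIngestDoc interface)."""

    config: SketchConfig
    name: str

    @property
    def download_path(self) -> str:
        return os.path.join(self.config.download_dir, self.name)

    @property
    def output_path(self) -> str:
        return os.path.join(self.config.output_dir, f"{self.name}.json")

    def has_output(self) -> bool:
        # True when a non-empty structured output already exists for this document.
        return os.path.isfile(self.output_path) and os.path.getsize(self.output_path) > 0

    def get_file(self) -> None:
        # Skip the download when a local copy exists and re-download was not requested.
        if os.path.isfile(self.download_path) and not self.config.re_download:
            return
        os.makedirs(self.config.download_dir, exist_ok=True)
        with open(self.download_path, "w") as f:
            f.write("placeholder content")  # a real connector would fetch from its source

    def write_result(self, elements) -> None:
        os.makedirs(self.config.output_dir, exist_ok=True)
        with open(self.output_path, "w") as f:
            json.dump(elements, f)

    def cleanup_file(self) -> None:
        # Remove the downloaded copy unless the user asked to preserve downloads.
        if not self.config.preserve_download and os.path.isfile(self.download_path):
            os.remove(self.download_path)


def process(doc: SketchIngestDoc) -> bool:
    """Return True if the document was processed, False if it was skipped."""
    if doc.has_output() and not doc.config.reprocess:
        return False
    doc.get_file()
    doc.write_result([{"text": "placeholder"}])  # stands in for the partitioning step
    doc.cleanup_file()
    return True
```

In this sketch a second call to process() on the same document is a no-op unless reprocess is set, and the downloaded copy survives only when preserve_download is set.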
The checklist:
In checklist form, the above steps are summarized as:
- Create a new module under unstructured/ingest/connector/ implementing the 3 abstract base classes, similar to unstructured/ingest/connector/s3_connector.py.
  - The subclass of BaseIngestDoc overrides process_file() if extra processing logic is needed other than what is provided by auto.partition().
- Update unstructured/ingest/main.py with support for the new connector.
- Create a folder under examples/ingest that includes at least one well documented script.
- Add a script test_unstructured_ingest/test-ingest-<the-new-data-source>.sh. Its JSON output files should total no more than 100K.
- Git add the expected outputs under test_unstructured_ingest/expected-structured-output/<folder-name-relevant-to-your-dataset> so the above test passes in CI.
- Add a line to test_unstructured_ingest/test-ingest.sh invoking the new test script.
- If additional Python dependencies are needed for the new connector:
  - Add them as an extra to setup.py.
  - Update the Makefile, adding a target for install-ingest-<name> and adding another pip-compile line to the pip-compile make target. See this commit for a reference.
  - The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
  - Add the decorator unstructured.utils.requires_dependencies on top of each class instance or function that uses those connector-specific dependencies, e.g., for S3Connector it should look like @requires_dependencies(dependencies=["boto3"], extras="s3")
- Honor the conventions of BaseConnectorConfig defined in unstructured/ingest/interfaces.py, which is passed through the CLI:
  - If running with an .output_dir where structured outputs already exist for a given file, the file content is not re-downloaded from the data source, nor is it reprocessed. This is made possible by implementing MyIngestDoc.has_output(), which is invoked in MainProcess._filter_docs_with_outputs.
  - The exception is when .reprocess is True, in which case documents are always reprocessed.
  - If .preserve_download is True, documents downloaded to .download_dir are not removed after processing.
  - Else if .preserve_download is False, documents downloaded to .download_dir are removed after they are successfully processed, during the invocation of MyIngestDoc.cleanup_file() in process_document.
  - Documents are not re-downloaded to .download_dir if .re_download is False; this is enforced in MyIngestDoc.get_file().
  - More details are printed if .verbose is set, similar to unstructured/ingest/connector/s3_connector.py.
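To make the dependency items in the checklist more concrete, here is a minimal sketch of how a decorator of this kind could work, paired with a runtime import inside the decorated function. It is an illustrative stand-in only: requires_dependencies_sketch, dump_listing, needs_missing, and the package name hypothetical_missing_pkg_xyz are invented for this example, and the actual implementation in unstructured.utils may differ.

```python
import functools
import importlib
from typing import Callable, List, Optional


def requires_dependencies_sketch(
    dependencies: List[str], extras: Optional[str] = None
) -> Callable:
    """Illustrative stand-in; the real decorator lives in unstructured.utils."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keep the wrapped function's name, docstring, annotations
        def wrapper(*args, **kwargs):
            # Check at call time that every required module can be imported.
            missing = []
            for dep in dependencies:
                try:
                    importlib.import_module(dep)
                except ImportError:
                    missing.append(dep)
            if missing:
                hint = f"unstructured[{extras}]" if extras else " ".join(missing)
                raise ImportError(
                    f"Missing dependencies: {', '.join(missing)}. "
                    f'Try: pip install "{hint}"'
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(dependencies=["json"])
def dump_listing(keys):
    """Serialize a listing of keys."""
    import json  # imported at call time rather than at module top level

    return json.dumps(sorted(keys))


@requires_dependencies_sketch(dependencies=["hypothetical_missing_pkg_xyz"], extras="s3")
def needs_missing():
    # Never reached while the hypothetical package is not installed.
    return "unreachable"
```

Calling dump_listing works because json ships with Python, while calling needs_missing raises an ImportError whose message points at the extra to install. Checking inside the wrapper, rather than when the decorator is applied, keeps the module importable for users who never invoke that connector.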