Mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2025-12-24 21:55:33 +00:00)
feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available (#318)
This is a fairly large PR. It adds an "adapter" class that makes it easy to plug in any connector that has an available fsspec implementation, standardizing how remote filesystems are used within unstructured. Additionally, `s3_connector.py` is renamed to `s3.py` for readability and consistency, and the new approach has been tested to confirm it works as expected and is aligned with expectations.
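The adapter idea behind the PR can be sketched in a few lines: any object exposing a small `fsspec`-style surface (here just `ls` and `get`) can drive one generic connector, so each new backend only supplies a filesystem object. The class and method names below are illustrative stand-ins, not the PR's actual API.

```python
class FakeFilesystem:
    """Stands in for an fsspec filesystem (s3fs, gcsfs, adlfs, ...) in this sketch."""

    def __init__(self, files):
        self.files = files  # mapping of remote path -> bytes

    def ls(self, prefix):
        # List all remote paths under a prefix, like AbstractFileSystem.ls
        return [p for p in self.files if p.startswith(prefix)]

    def get(self, remote_path):
        # Fetch the contents of one remote object
        return self.files[remote_path]


class GenericConnector:
    """Hypothetical stand-in for an FsspecConnector-style adapter: it only talks
    to the filesystem through the shared interface, so any backend plugs in."""

    def __init__(self, fs, prefix):
        self.fs = fs
        self.prefix = prefix

    def fetch_all(self):
        return {path: self.fs.get(path) for path in self.fs.ls(self.prefix)}


fs = FakeFilesystem({"bucket/a.txt": b"A", "bucket/b.txt": b"B", "other/c.txt": b"C"})
docs = GenericConnector(fs, "bucket/").fetch_all()
```

Because the connector never names a concrete backend, swapping S3 for GCS is a matter of handing it a different filesystem object.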
This commit is contained in:
parent 7c619f045b
commit c51adb21e3
.gitignore (vendored): 3 changes
@@ -184,3 +184,6 @@ tags
 [._]*.un~
 
 .DS_Store
+
+# Ruff cache
+.ruff_cache/
@@ -1,7 +1,13 @@
-## 0.5.4-dev0
+## 0.5.4-dev1
 
 ### Enhancements
 
+* Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem as a connector.
+* Rename `s3_connector.py` to `s3.py` for readability and consistency with the
+  rest of the connectors.
+* Now `S3Connector` relies on `s3fs` instead of on `boto3`, and it inherits
+  from `FsspecConnector`.
 * Adds an `UNSTRUCTURED_LANGUAGE_CHECKS` environment variable to control whether or not language
   specific checks like vocabulary and POS tagging are applied. Set to `"true"` for higher
   resolution partitioning and `"false"` for faster processing.
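The `UNSTRUCTURED_LANGUAGE_CHECKS` entry above describes a string-valued environment variable. A minimal sketch of reading such a flag follows; the helper function here is illustrative and is not the library's actual implementation.

```python
import os


def language_checks_enabled(default=False):
    """Interpret UNSTRUCTURED_LANGUAGE_CHECKS as a boolean flag."""
    # Treat common truthy spellings as "true"; an unset variable uses the default.
    value = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS", "")
    if not value:
        return default
    return value.strip().lower() in ("true", "1", "yes")


os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "true"
enabled = language_checks_enabled()

os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "false"
disabled = language_checks_enabled()
```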
Ingest.md: 10 changes
@@ -37,7 +37,9 @@ just execute `unstructured/ingest/main.py`, e.g.:
 
 ## Adding Data Connectors
 
-To add a connector, refer to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py) as an example that implements the three relevant abstract base classes.
+To add a connector, refer to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py) as an example that implements the three relevant abstract base classes.
+
+If the connector has an available `fsspec` implementation, then refer to [unstructured/ingest/connector/s3.py](unstructured/ingest/connector/s3.py).
 
 Then, update [unstructured/ingest/main.py](unstructured/ingest/main.py) to instantiate
 the connector specific to your class if its command line options are invoked.
@@ -56,7 +58,7 @@ The `main.py` flags of --re-download/--no-re-download, --download-dir, --preser
 
 In checklist form, the above steps are summarized as:
 
-- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
+- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py).
 - [ ] The subclass of `BaseIngestDoc` overrides `process_file()` if extra processing logic is needed other than what is provided by [auto.partition()](unstructured/partition/auto.py).
 - [ ] Update [unstructured/ingest/main.py](unstructured/ingest/main.py) with support for the new connector.
 - [ ] Create a folder under [examples/ingest](examples/ingest) that includes at least one well documented script.
@@ -67,7 +69,7 @@ In checklist form, the above steps are summarized as:
 - [ ] Add them as an extra to [setup.py](unstructured/setup.py).
 - [ ] Update the Makefile, adding a target for `install-ingest-<name>` and adding another `pip-compile` line to the `pip-compile` make target. See [this commit](https://github.com/Unstructured-IO/unstructured/commit/ab542ca3c6274f96b431142262d47d727f309e37) for a reference.
 - [ ] The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
-- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class instance or function that uses those connector-specific dependencies, e.g. for `S3Connector` it should look like `@requires_dependencies(dependencies=["boto3"], extras="s3")`
+- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class instance or function that uses those connector-specific dependencies, e.g. for `GitHubConnector` it should look like `@requires_dependencies(dependencies=["github"], extras="github")`
 - [ ] Run `make tidy` and `make check` to ensure linting checks pass.
 - [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
 - [ ] If running with an `.output_dir` where structured outputs already exist for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to `MyIngestDoc.has_output()` which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
@@ -75,4 +77,4 @@ In checklist form, the above steps are summarized as:
 - [ ] If `.preserve_download` is `True`, documents downloaded to `.download_dir` are not removed after processing.
 - [ ] Else if `.preserve_download` is `False`, documents downloaded to `.download_dir` are removed after they are **successfully** processed during the invocation of `MyIngestDoc.cleanup_file()` in [process_document](unstructured/ingest/doc_processor/generalized.py)
 - [ ] Does not re-download documents to `.download_dir` if `.re_download` is False, enforced in `MyIngestDoc.get_file()`
-- [ ] Prints more details if `.verbose` similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
+- [ ] Prints more details if `--verbose` in ingest CLI, similar to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py) logging messages.
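The `requires_dependencies` convention in the checklist above can be approximated with a small decorator that checks imports lazily at call time. This is a sketch of the pattern under stated assumptions, not the actual `unstructured.utils` implementation.

```python
import functools
import importlib


def requires_dependencies(dependencies, extras=None):
    """Fail with an actionable message if a connector's optional deps are missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            missing = []
            for dep in dependencies:
                try:
                    importlib.import_module(dep)
                except ImportError:
                    missing.append(dep)
            if missing:
                hint = f' (try `pip install "unstructured[{extras}]"`)' if extras else ""
                raise ImportError(f"Missing dependencies {missing}{hint}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies(dependencies=["json"], extras="s3")
def ok():
    # "json" is stdlib, so the check passes and the function runs.
    return "ran"


@requires_dependencies(dependencies=["not_a_real_module_xyz"], extras="s3")
def broken():
    return "never"
```

Checking imports at call time rather than module import time is what lets connector-specific dependencies stay optional extras.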
@@ -14,7 +14,7 @@ certifi==2022.12.7
     # via
     #   -r requirements/build.in
     #   requests
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via requests
 docutils==0.18.1
     # via
@@ -1230,12 +1230,12 @@ files to an S3 bucket.
 
     # Upload staged data files to S3 from local output directory.
     def upload_staged_files():
-        import boto3
-        s3 = boto3.client("s3")
+        from s3fs import S3FileSystem
+        fs = S3FileSystem()
         for filename in os.listdir(LOCAL_OUTPUT_DIRECTORY):
             filepath = os.path.join(LOCAL_OUTPUT_DIRECTORY, filename)
             upload_key = os.path.join(S3_BUCKET_KEY_PREFIX, filename)
-            s3.upload_file(filepath, Bucket=S3_BUCKET_NAME, Key=upload_key)
+            fs.put_file(lpath=filepath, rpath=os.path.join(S3_BUCKET_NAME, upload_key))
 
     upload_staged_files()
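The functional change in the upload snippet above is how the destination is addressed: `boto3`'s `upload_file` takes the bucket and key as separate arguments, while `s3fs` `put_file` takes a single combined `bucket/key` remote path. A stdlib-only sketch of that path construction (the bucket and prefix names here are placeholders, not values from the example):

```python
import os

S3_BUCKET_NAME = "example-bucket"        # placeholder bucket name
S3_BUCKET_KEY_PREFIX = "staged/outputs"  # placeholder key prefix
filename = "doc-0001.json"

# The object key within the bucket, as used for boto3's Key= argument.
upload_key = os.path.join(S3_BUCKET_KEY_PREFIX, filename)

# boto3 style: client.upload_file(filepath, Bucket=S3_BUCKET_NAME, Key=upload_key)
# s3fs style:  fs.put_file(lpath=filepath, rpath=rpath) with bucket and key joined:
rpath = os.path.join(S3_BUCKET_NAME, upload_key)
```

Folding the bucket into the path is what lets the same call shape work for any fsspec backend, since each filesystem interprets its own path scheme.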
@@ -6,7 +6,7 @@
 #
 anyio==3.6.2
     # via httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via unstructured (setup.py)
 backoff==2.2.1
     # via argilla
@@ -16,10 +16,12 @@ certifi==2022.12.7
     #   httpx
     #   requests
     #   unstructured (setup.py)
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via requests
 click==8.1.3
     # via nltk
+commonmark==0.9.1
+    # via rich
 deprecated==1.2.13
     # via argilla
 et-xmlfile==1.1.0
@@ -66,8 +68,10 @@ pillow==9.4.0
     # via
     #   python-pptx
     #   unstructured (setup.py)
-pydantic==1.10.5
+pydantic==1.10.6
     # via argilla
+pygments==2.14.0
+    # via rich
 python-dateutil==2.8.2
     # via pandas
 python-docx==0.8.11
@@ -84,6 +88,8 @@ requests==2.28.2
     # via unstructured (setup.py)
 rfc3986[idna2008]==1.5.0
     # via httpx
+rich==13.0.1
+    # via argilla
 six==1.16.0
     # via python-dateutil
 sniffio==1.3.0
@@ -91,12 +97,14 @@ sniffio==1.3.0
     #   anyio
     #   httpcore
     #   httpx
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   argilla
     #   nltk
 typing-extensions==4.5.0
-    # via pydantic
+    # via
+    #   pydantic
+    #   rich
 urllib3==1.26.14
     # via requests
 wrapt==1.14.1
@@ -14,7 +14,7 @@ certifi==2022.12.7
     # via
     #   -r requirements/build.in
     #   requests
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via requests
 docutils==0.18.1
     # via
@@ -55,7 +55,7 @@ filelock==3.9.0
     # via virtualenv
 fqdn==1.5.1
     # via jsonschema
-identify==2.5.18
+identify==2.5.19
     # via pre-commit
 idna==3.4
     # via
@@ -67,7 +67,7 @@ importlib-metadata==6.0.0
     #   nbconvert
 importlib-resources==5.12.0
     # via jsonschema
-ipykernel==6.21.2
+ipykernel==6.21.3
     # via
     #   ipywidgets
     #   jupyter
@@ -115,7 +115,7 @@ jupyter-client==8.0.3
     #   nbclient
     #   notebook
     #   qtconsole
-jupyter-console==6.6.2
+jupyter-console==6.6.3
     # via jupyter
 jupyter-core==5.2.0
     # via
@@ -132,7 +132,7 @@ jupyter-core==5.2.0
     #   qtconsole
 jupyter-events==0.6.3
     # via jupyter-server
-jupyter-server==2.3.0
+jupyter-server==2.4.0
     # via
     #   nbclassic
     #   notebook-shim
@@ -152,7 +152,7 @@ matplotlib-inline==0.1.6
     #   ipython
 mistune==2.0.5
     # via nbconvert
-nbclassic==0.5.2
+nbclassic==0.5.3
     # via notebook
 nbclient==0.7.2
     # via nbconvert
@@ -176,7 +176,7 @@ nest-asyncio==1.5.6
     #   notebook
 nodeenv==1.7.0
     # via pre-commit
-notebook==6.5.2
+notebook==6.5.3
     # via jupyter
 notebook-shim==0.2.2
     # via nbclassic
@@ -199,7 +199,7 @@ pip-tools==6.12.3
     # via -r requirements/dev.in
 pkgutil-resolve-name==1.3.10
     # via jsonschema
-platformdirs==3.0.0
+platformdirs==3.1.0
     # via
     #   jupyter-core
     #   virtualenv
@@ -6,7 +6,7 @@
 #
 anyio==3.6.2
     # via httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via unstructured (setup.py)
 backoff==2.2.1
     # via argilla
@@ -16,12 +16,14 @@ certifi==2022.12.7
     #   httpx
     #   requests
     #   unstructured (setup.py)
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via requests
 click==8.1.3
     # via
     #   nltk
     #   sacremoses
+commonmark==0.9.1
+    # via rich
 deprecated==1.2.13
     # via argilla
 et-xmlfile==1.1.0
@@ -36,7 +38,7 @@ httpcore==0.16.3
     # via httpx
 httpx==0.23.3
     # via argilla
-huggingface-hub==0.12.1
+huggingface-hub==0.13.1
     # via transformers
 idna==3.4
     # via
@@ -82,8 +84,10 @@ pillow==9.4.0
     # via
     #   python-pptx
     #   unstructured (setup.py)
-pydantic==1.10.5
+pydantic==1.10.6
     # via argilla
+pygments==2.14.0
+    # via rich
 python-dateutil==2.8.2
     # via pandas
 python-docx==0.8.11
@@ -110,6 +114,8 @@ requests==2.28.2
     #   unstructured (setup.py)
 rfc3986[idna2008]==1.5.0
     # via httpx
+rich==13.0.1
+    # via argilla
 sacremoses==0.0.53
     # via unstructured (setup.py)
 sentencepiece==0.1.97
@@ -128,7 +134,7 @@ tokenizers==0.13.2
     # via transformers
 torch==1.13.1
     # via unstructured (setup.py)
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   argilla
     #   huggingface-hub
@@ -141,6 +147,7 @@ typing-extensions==4.5.0
     # via
     #   huggingface-hub
     #   pydantic
+    #   rich
     #   torch
 urllib3==1.26.14
     # via requests
@@ -8,7 +8,7 @@ anyio==3.6.2
     # via
     #   -r requirements/base.txt
     #   httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via
     #   -r requirements/base.txt
     #   unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
     #   unstructured (setup.py)
 cffi==1.15.1
     # via pynacl
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via
     #   -r requirements/base.txt
     #   requests
@@ -33,6 +33,10 @@ click==8.1.3
     # via
     #   -r requirements/base.txt
     #   nltk
+commonmark==0.9.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
 deprecated==1.2.13
     # via
     #   -r requirements/base.txt
@@ -111,12 +115,16 @@ pillow==9.4.0
     #   unstructured (setup.py)
 pycparser==2.21
     # via cffi
-pydantic==1.10.5
+pydantic==1.10.6
     # via
     #   -r requirements/base.txt
     #   argilla
 pygithub==1.57.0
     # via unstructured (setup.py)
+pygments==2.14.0
+    # via
+    #   -r requirements/base.txt
+    #   rich
 pyjwt==2.6.0
     # via pygithub
 pynacl==1.5.0
@@ -154,6 +162,10 @@ rfc3986[idna2008]==1.5.0
     # via
     #   -r requirements/base.txt
     #   httpx
+rich==13.0.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
 six==1.16.0
     # via
     #   -r requirements/base.txt
@@ -164,7 +176,7 @@ sniffio==1.3.0
     #   anyio
     #   httpcore
     #   httpx
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   -r requirements/base.txt
     #   argilla
@@ -173,6 +185,7 @@ typing-extensions==4.5.0
     # via
     #   -r requirements/base.txt
     #   pydantic
+    #   rich
 urllib3==1.26.14
     # via
     #   -r requirements/base.txt
@@ -8,7 +8,7 @@ anyio==3.6.2
     # via
     #   -r requirements/base.txt
     #   httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via
     #   -r requirements/base.txt
     #   unstructured (setup.py)
@@ -23,7 +23,7 @@ certifi==2022.12.7
     #   httpx
     #   requests
     #   unstructured (setup.py)
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via
     #   -r requirements/base.txt
     #   requests
@@ -31,10 +31,10 @@ click==8.1.3
     # via
     #   -r requirements/base.txt
     #   nltk
-colorama==0.4.6
-    # via
-    #   click
-    #   tqdm
+commonmark==0.9.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
 deprecated==1.2.13
     # via
     #   -r requirements/base.txt
@@ -110,10 +110,14 @@ pillow==9.4.0
     #   -r requirements/base.txt
     #   python-pptx
     #   unstructured (setup.py)
-pydantic==1.10.5
+pydantic==1.10.6
     # via
     #   -r requirements/base.txt
     #   argilla
+pygments==2.14.0
+    # via
+    #   -r requirements/base.txt
+    #   rich
 python-dateutil==2.8.2
     # via
     #   -r requirements/base.txt
@@ -152,6 +156,10 @@ rfc3986[idna2008]==1.5.0
     # via
     #   -r requirements/base.txt
     #   httpx
+rich==13.0.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
 six==1.16.0
     # via
     #   -r requirements/base.txt
@@ -162,7 +170,7 @@ sniffio==1.3.0
     #   anyio
     #   httpcore
     #   httpx
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   -r requirements/base.txt
     #   argilla
@@ -171,6 +179,7 @@ typing-extensions==4.5.0
     # via
     #   -r requirements/base.txt
     #   pydantic
+    #   rich
 urllib3==1.26.14
     # via
     #   -r requirements/base.txt
@@ -1,5 +1,5 @@
 #
-# This file is autogenerated by pip-compile with Python 3.9
+# This file is autogenerated by pip-compile with Python 3.8
 # by the following command:
 #
 #    pip-compile --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py
@@ -8,7 +8,7 @@ anyio==3.6.2
     # via
     #   -r requirements/base.txt
     #   httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via
     #   -r requirements/base.txt
     #   unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
     #   httpx
     #   requests
     #   unstructured (setup.py)
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via
     #   -r requirements/base.txt
     #   requests
@@ -33,6 +33,10 @@ click==8.1.3
     # via
     #   -r requirements/base.txt
     #   nltk
+commonmark==0.9.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
 deprecated==1.2.13
     # via
     #   -r requirements/base.txt
@@ -125,7 +129,7 @@ pillow==9.4.0
     #   -r requirements/base.txt
     #   python-pptx
     #   unstructured (setup.py)
-protobuf==4.22.0
+protobuf==4.22.1
     # via
     #   google-api-core
     #   googleapis-common-protos
@@ -135,10 +139,14 @@ pyasn1==0.4.8
     #   rsa
 pyasn1-modules==0.2.8
     # via google-auth
-pydantic==1.10.5
+pydantic==1.10.6
     # via
     #   -r requirements/base.txt
     #   argilla
+pygments==2.14.0
+    # via
+    #   -r requirements/base.txt
+    #   rich
 pyparsing==3.0.9
     # via httplib2
 python-dateutil==2.8.2
@@ -174,6 +182,10 @@ rfc3986[idna2008]==1.5.0
     # via
     #   -r requirements/base.txt
     #   httpx
+rich==13.0.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
 rsa==4.9
     # via google-auth
 six==1.16.0
@@ -188,7 +200,7 @@ sniffio==1.3.0
     #   anyio
     #   httpcore
     #   httpx
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   -r requirements/base.txt
     #   argilla
@@ -197,6 +209,7 @@ typing-extensions==4.5.0
     # via
     #   -r requirements/base.txt
     #   pydantic
+    #   rich
 uritemplate==4.1.1
     # via google-api-python-client
 urllib3==1.26.14
@@ -8,7 +8,7 @@ anyio==3.6.2
     # via
     #   -r requirements/base.txt
     #   httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via
     #   -r requirements/base.txt
     #   unstructured (setup.py)
@@ -23,7 +23,7 @@ certifi==2022.12.7
     #   httpx
     #   requests
     #   unstructured (setup.py)
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via
     #   -r requirements/base.txt
     #   requests
@@ -31,6 +31,10 @@ click==8.1.3
     # via
     #   -r requirements/base.txt
     #   nltk
+commonmark==0.9.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
 deprecated==1.2.13
     # via
     #   -r requirements/base.txt
@@ -110,10 +114,14 @@ praw==7.7.0
     # via unstructured (setup.py)
 prawcore==2.3.0
     # via praw
-pydantic==1.10.5
+pydantic==1.10.6
     # via
     #   -r requirements/base.txt
     #   argilla
+pygments==2.14.0
+    # via
+    #   -r requirements/base.txt
+    #   rich
 python-dateutil==2.8.2
     # via
     #   -r requirements/base.txt
@@ -148,6 +156,10 @@ rfc3986[idna2008]==1.5.0
     # via
     #   -r requirements/base.txt
     #   httpx
+rich==13.0.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
 six==1.16.0
     # via
     #   -r requirements/base.txt
@@ -158,7 +170,7 @@ sniffio==1.3.0
     #   anyio
     #   httpcore
     #   httpx
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   -r requirements/base.txt
     #   argilla
@@ -167,6 +179,7 @@ typing-extensions==4.5.0
     # via
     #   -r requirements/base.txt
     #   pydantic
+    #   rich
 update-checker==0.18.0
     # via praw
 urllib3==1.26.14
@@ -4,24 +4,34 @@
 #
 #    pip-compile --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
 #
+aiobotocore==2.4.2
+    # via s3fs
+aiohttp==3.8.4
+    # via
+    #   aiobotocore
+    #   s3fs
+aioitertools==0.11.0
+    # via aiobotocore
+aiosignal==1.3.1
+    # via aiohttp
 anyio==3.6.2
     # via
     #   -r requirements/base.txt
     #   httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via
     #   -r requirements/base.txt
     #   unstructured (setup.py)
+async-timeout==4.0.2
+    # via aiohttp
+attrs==22.2.0
+    # via aiohttp
 backoff==2.2.1
     # via
     #   -r requirements/base.txt
     #   argilla
-boto3==1.26.82
-    # via unstructured (setup.py)
-botocore==1.29.82
-    # via
-    #   boto3
-    #   s3transfer
+botocore==1.27.59
+    # via aiobotocore
 certifi==2022.12.7
     # via
     #   -r requirements/base.txt
@@ -29,14 +39,19 @@ certifi==2022.12.7
     #   httpx
     #   requests
     #   unstructured (setup.py)
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via
     #   -r requirements/base.txt
+    #   aiohttp
     #   requests
 click==8.1.3
     # via
     #   -r requirements/base.txt
     #   nltk
+commonmark==0.9.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
 deprecated==1.2.13
     # via
     #   -r requirements/base.txt
@@ -45,6 +60,14 @@ et-xmlfile==1.1.0
     # via
     #   -r requirements/base.txt
     #   openpyxl
+frozenlist==1.3.3
+    # via
+    #   aiohttp
+    #   aiosignal
+fsspec==2023.3.0
+    # via
+    #   s3fs
+    #   unstructured (setup.py)
 h11==0.14.0
     # via
     #   -r requirements/base.txt
@@ -63,14 +86,13 @@ idna==3.4
     #   anyio
     #   requests
     #   rfc3986
+    #   yarl
 importlib-metadata==6.0.0
     # via
     #   -r requirements/base.txt
     #   markdown
 jmespath==1.0.1
-    # via
-    #   boto3
-    #   botocore
+    # via botocore
 joblib==1.2.0
     # via
     #   -r requirements/base.txt
@@ -89,6 +111,10 @@ monotonic==1.6
     # via
     #   -r requirements/base.txt
     #   argilla
+multidict==6.0.4
+    # via
+    #   aiohttp
+    #   yarl
 nltk==3.8.1
     # via
     #   -r requirements/base.txt
@@ -116,10 +142,14 @@ pillow==9.4.0
     #   -r requirements/base.txt
     #   python-pptx
     #   unstructured (setup.py)
-pydantic==1.10.5
+pydantic==1.10.6
     # via
     #   -r requirements/base.txt
     #   argilla
+pygments==2.14.0
+    # via
+    #   -r requirements/base.txt
+    #   rich
 python-dateutil==2.8.2
     # via
     #   -r requirements/base.txt
@@ -153,8 +183,12 @@ rfc3986[idna2008]==1.5.0
     # via
     #   -r requirements/base.txt
     #   httpx
-s3transfer==0.6.0
-    # via boto3
+rich==13.0.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+s3fs==2023.3.0
+    # via unstructured (setup.py)
 six==1.16.0
     # via
     #   -r requirements/base.txt
@@ -165,7 +199,7 @@ sniffio==1.3.0
     #   anyio
     #   httpcore
     #   httpx
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   -r requirements/base.txt
     #   argilla
@@ -173,7 +207,9 @@ tqdm==4.64.1
 typing-extensions==4.5.0
     # via
     #   -r requirements/base.txt
+    #   aioitertools
     #   pydantic
+    #   rich
 urllib3==1.26.14
     # via
     #   -r requirements/base.txt
@@ -182,12 +218,15 @@ urllib3==1.26.14
 wrapt==1.14.1
     # via
     #   -r requirements/base.txt
+    #   aiobotocore
     #   argilla
     #   deprecated
 xlsxwriter==3.0.8
     # via
     #   -r requirements/base.txt
     #   python-pptx
+yarl==1.8.2
+    # via aiohttp
 zipp==3.15.0
     # via
     #   -r requirements/base.txt
@@ -8,7 +8,7 @@ anyio==3.6.2
     # via
     #   -r requirements/base.txt
     #   httpcore
-argilla==1.3.1
+argilla==1.4.0
     # via
     #   -r requirements/base.txt
     #   unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
     #   httpx
     #   requests
     #   unstructured (setup.py)
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via
     #   -r requirements/base.txt
     #   requests
@@ -33,6 +33,10 @@ click==8.1.3
     # via
     #   -r requirements/base.txt
     #   nltk
+commonmark==0.9.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
 deprecated==1.2.13
     # via
     #   -r requirements/base.txt
@@ -108,10 +112,14 @@ pillow==9.4.0
     #   -r requirements/base.txt
     #   python-pptx
     #   unstructured (setup.py)
-pydantic==1.10.5
+pydantic==1.10.6
     # via
     #   -r requirements/base.txt
     #   argilla
+pygments==2.14.0
+    # via
+    #   -r requirements/base.txt
+    #   rich
 python-dateutil==2.8.2
     # via
     #   -r requirements/base.txt
@@ -145,6 +153,10 @@ rfc3986[idna2008]==1.5.0
     # via
     #   -r requirements/base.txt
     #   httpx
+rich==13.0.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
 six==1.16.0
     # via
     #   -r requirements/base.txt
@@ -157,7 +169,7 @@ sniffio==1.3.0
     #   httpx
 soupsieve==2.4
     # via beautifulsoup4
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   -r requirements/base.txt
     #   argilla
@@ -166,6 +178,7 @@ typing-extensions==4.5.0
     # via
     #   -r requirements/base.txt
     #   pydantic
+    #   rich
 urllib3==1.26.14
     # via
     #   -r requirements/base.txt
@@ -10,7 +10,7 @@ anyio==3.6.2
     # via
     #   httpcore
     #   starlette
-argilla==1.3.1
+argilla==1.4.0
     # via unstructured (setup.py)
 backoff==2.2.1
     # via argilla
@@ -22,7 +22,7 @@ certifi==2022.12.7
     #   unstructured (setup.py)
 cffi==1.15.1
     # via cryptography
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via
     #   pdfminer-six
     #   requests
@@ -32,9 +32,11 @@ click==8.1.3
     #   uvicorn
 coloredlogs==15.0.1
     # via onnxruntime
+commonmark==0.9.1
+    # via rich
 contourpy==1.0.7
     # via matplotlib
-cryptography==39.0.1
+cryptography==39.0.2
     # via pdfminer-six
 cycler==0.11.0
     # via matplotlib
@@ -44,15 +46,15 @@ effdet==0.3.0
     # via layoutparser
 et-xmlfile==1.1.0
     # via openpyxl
-fastapi==0.92.0
+fastapi==0.93.0
     # via unstructured-inference
 filelock==3.9.0
     # via
     #   huggingface-hub
     #   transformers
-flatbuffers==23.1.21
+flatbuffers==23.3.3
     # via onnxruntime
-fonttools==4.38.0
+fonttools==4.39.0
     # via matplotlib
 h11==0.14.0
     # via
@@ -62,7 +64,7 @@ httpcore==0.16.3
     # via httpx
 httpx==0.23.3
     # via argilla
-huggingface-hub==0.12.1
+huggingface-hub==0.13.1
     # via
     #   timm
     #   transformers
@@ -95,11 +97,11 @@ lxml==4.9.2
     #   unstructured (setup.py)
 markdown==3.4.1
     # via unstructured (setup.py)
-matplotlib==3.7.0
+matplotlib==3.7.1
     # via pycocotools
 monotonic==1.6
     # via argilla
-mpmath==1.2.1
+mpmath==1.3.0
     # via sympy
 nltk==3.8.1
     # via unstructured (setup.py)
@@ -157,16 +159,18 @@ pillow==9.4.0
     #   unstructured (setup.py)
 portalocker==2.7.0
     # via iopath
-protobuf==4.22.0
+protobuf==4.22.1
     # via onnxruntime
 pycocotools==2.0.6
     # via effdet
 pycparser==2.21
     # via cffi
-pydantic==1.10.5
+pydantic==1.10.6
     # via
     #   argilla
     #   fastapi
+pygments==2.14.0
+    # via rich
 pyparsing==3.0.9
     # via matplotlib
 pytesseract==0.3.10
@@ -204,6 +208,8 @@ requests==2.28.2
     #   unstructured (setup.py)
 rfc3986[idna2008]==1.5.0
     # via httpx
+rich==13.0.1
+    # via argilla
 scipy==1.10.1
     # via layoutparser
 six==1.16.0
@@ -232,7 +238,7 @@ torchvision==0.14.1
     #   effdet
     #   layoutparser
     #   timm
-tqdm==4.64.1
+tqdm==4.65.0
     # via
     #   argilla
     #   huggingface-hub
@@ -246,6 +252,7 @@ typing-extensions==4.5.0
     #   huggingface-hub
     #   iopath
     #   pydantic
+    #   rich
     #   starlette
     #   torch
     #   torchvision
@@ -12,7 +12,7 @@ certifi==2022.12.7
     # via
     #   -r requirements/test.in
     #   requests
-charset-normalizer==3.0.1
+charset-normalizer==3.1.0
     # via requests
 click==8.1.3
     # via
@@ -40,7 +40,7 @@ mccabe==0.7.0
     # via flake8
 multidict==6.0.4
     # via yarl
-mypy==1.0.1
+mypy==1.1.1
     # via -r requirements/test.in
 mypy-extensions==1.0.0
     # via
@@ -52,17 +52,17 @@ packaging==23.0
     #   pytest
 pathspec==0.11.0
     # via black
-platformdirs==3.0.0
+platformdirs==3.1.0
     # via black
 pluggy==1.0.0
     # via pytest
 pycodestyle==2.10.0
     # via flake8
-pydantic==1.10.5
+pydantic==1.10.6
     # via label-studio-sdk
 pyflakes==3.0.1
     # via flake8
-pytest==7.2.1
+pytest==7.2.2
     # via pytest-cov
 pytest-cov==4.0.0
     # via -r requirements/test.in
@@ -70,7 +70,7 @@ pyyaml==6.0
     # via vcrpy
 requests==2.28.2
     # via label-studio-sdk
-ruff==0.0.253
+ruff==0.0.254
     # via -r requirements/test.in
 six==1.16.0
     # via vcrpy
setup.py: 2 changes
@@ -77,7 +77,7 @@ setup(
         # NOTE(robinson) - Upper bound is temporary due to a multithreading issue
         "unstructured-inference>=0.2.4,<0.2.8",
     ],
-    "s3": ["boto3"],
+    "s3": ["s3fs", "fsspec"],
     "github": [
         # NOTE - pygithub==1.58.0 fails due to https://github.com/PyGithub/PyGithub/issues/2436
         # In the future, we can update this to pygithub>1.58.0
@@ -1,9 +1,9 @@
 #!/usr/bin/env bash

-SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
 cd "$SCRIPT_DIR"/.. || exit 1

-if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch/ -type f -size +20k | wc -l)" != 3 ]]; then
+if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch/ -type f -size +20k | wc -l)" -ne 3 ]]; then
   echo "The test fixtures in test_unstructured_ingest/expected-structured-output/ look suspicious. At least one of the files is too small."
   echo "Did you overwrite test fixtures with bad outputs?"
   exit 1
@@ -12,13 +12,13 @@ fi
 PYTHONPATH=. ./unstructured/ingest/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --s3-anonymous --structured-output-dir s3-small-batch-output

 if ! diff -ru s3-small-batch-output test_unstructured_ingest/expected-structured-output/s3-small-batch ; then
-  echo
-  echo "There are differences from the previously checked-in structured outputs."
-  echo
-  echo "If these differences are acceptable, copy the outputs from"
-  echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
-  echo
-  echo " PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --structured-output-dir s3-small-batch-output"
-  echo
-  exit 1
+  echo
+  echo "There are differences from the previously checked-in structured outputs."
+  echo
+  echo "If these differences are acceptable, copy the outputs from"
+  echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
+  echo
+  echo " PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --structured-output-dir s3-small-batch-output"
+  echo
+  exit 1
 fi
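The `find … -size +20k | wc -l` guard above counts fixture files strictly larger than 20 KiB and fails the script when the count is off. The same check can be exercised against a throwaway directory; the file names below are illustrative, not the repo's fixtures:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a throwaway directory with one large and one small file.
dir=$(mktemp -d)
head -c 25000 /dev/zero > "$dir/big.json"    # > 20 KiB, should be counted
head -c 100 /dev/zero > "$dir/small.json"    # <= 20 KiB, should be ignored

# -size +20k matches files strictly larger than 20 * 1024 bytes.
count=$(find "$dir" -type f -size +20k | wc -l)
echo "$count"
```

Note that `-size +20k` rounds in units of 1 KiB, so a file of exactly 20480 bytes is not matched; that is why the fixture check treats small files as suspicious rather than comparing exact sizes.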
@@ -1 +1 @@
-__version__ = "0.5.4-dev0"  # pragma: no cover
+__version__ = "0.5.4-dev1"  # pragma: no cover
186 unstructured/ingest/connector/fsspec.py (new file)
@@ -0,0 +1,186 @@
import json
import os
import re
from dataclasses import dataclass, field
from pathlib import Path

from unstructured.ingest.interfaces import (
    BaseConnector,
    BaseConnectorConfig,
    BaseIngestDoc,
)
from unstructured.ingest.logger import logger

SUPPORTED_REMOTE_FSSPEC_PROTOCOLS = [
    "s3",
    "gcs",
    "gcfs",
    "abfs",
]


@dataclass
class SimpleFsspecConfig(BaseConnectorConfig):
    # fsspec specific options
    path: str

    # base connector options
    download_dir: str
    output_dir: str
    preserve_downloads: bool = False
    re_download: bool = False

    # fsspec specific options
    access_kwargs: dict = field(default_factory=dict)

    protocol: str = field(init=False)
    path_without_protocol: str = field(init=False)
    dir_path: str = field(init=False)
    file_path: str = field(init=False)

    def __post_init__(self):
        self.protocol, self.path_without_protocol = self.path.split("://")
        if self.protocol not in SUPPORTED_REMOTE_FSSPEC_PROTOCOLS:
            raise ValueError(
                f"Protocol {self.protocol} not supported yet, only "
                f"{SUPPORTED_REMOTE_FSSPEC_PROTOCOLS} are supported.",
            )

        # just a path with no trailing prefix
        match = re.match(rf"{self.protocol}://([^/\s]+?)(/*)$", self.path)
        if match:
            self.dir_path = match.group(1)
            self.file_path = ""
            return

        # valid path with a dir and/or file
        match = re.match(rf"{self.protocol}://([^/\s]+?)/([^\s]*)", self.path)
        if not match:
            raise ValueError(
                f"Invalid path {self.path}. Expected <protocol>://<dir-path>/<file-or-dir-path>.",
            )
        self.dir_path = match.group(1)
        self.file_path = match.group(2) or ""
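The two regexes in `__post_init__` above split a remote URL into a protocol, a top-level dir path (the bucket/container), and an optional file-or-dir path. The parsing can be sketched in isolation; `parse_remote_path` below is a hypothetical standalone helper mirroring those regexes, not part of the connector:

```python
import re


def parse_remote_path(path: str):
    """Split '<protocol>://<dir>/<file-or-dir>' the same way SimpleFsspecConfig does."""
    protocol, _ = path.split("://")

    # just a dir path (e.g. a bucket) with no trailing prefix
    match = re.match(rf"{protocol}://([^/\s]+?)(/*)$", path)
    if match:
        return protocol, match.group(1), ""

    # dir path plus a file or sub-dir path
    match = re.match(rf"{protocol}://([^/\s]+?)/([^\s]*)", path)
    if not match:
        raise ValueError(f"Invalid path {path}")
    return protocol, match.group(1), match.group(2) or ""


print(parse_remote_path("s3://my-bucket"))             # ('s3', 'my-bucket', '')
print(parse_remote_path("s3://my-bucket/docs/a.pdf"))  # ('s3', 'my-bucket', 'docs/a.pdf')
```

The non-greedy `[^/\s]+?` stops the dir-path group at the first slash, so everything after the bucket lands in the file-path group.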
@dataclass
class FsspecIngestDoc(BaseIngestDoc):
    """Class encapsulating fetching a doc and writing processed results (but not
    doing the processing!).

    Also includes a cleanup method. When things go wrong and the cleanup
    method is not called, the file is left behind on the filesystem to assist debugging.
    """

    config: SimpleFsspecConfig
    remote_file_path: str

    def _tmp_download_file(self):
        return Path(self.config.download_dir) / self.remote_file_path.replace(
            f"{self.config.dir_path}/",
            "",
        )

    def _output_filename(self):
        return (
            Path(self.config.output_dir)
            / f"{self.remote_file_path.replace(f'{self.config.dir_path}/', '')}.json"
        )

    def has_output(self):
        """Determine if structured output for this doc already exists."""
        return self._output_filename().is_file() and os.path.getsize(self._output_filename())

    def _create_full_tmp_dir_path(self):
        """Includes "directories" in the object path."""
        self._tmp_download_file().parent.mkdir(parents=True, exist_ok=True)

    def get_file(self):
        """Fetches the file from the current filesystem and stores it locally."""
        from fsspec import AbstractFileSystem, get_filesystem_class

        self._create_full_tmp_dir_path()
        if (
            not self.config.re_download
            and self._tmp_download_file().is_file()
            and os.path.getsize(self._tmp_download_file())
        ):
            logger.debug(f"File exists: {self._tmp_download_file()}, skipping download")
            return

        fs: AbstractFileSystem = get_filesystem_class(self.config.protocol)(
            **self.config.access_kwargs,
        )

        logger.debug(f"Fetching {self} - PID: {os.getpid()}")
        fs.get(rpath=self.remote_file_path, lpath=self._tmp_download_file().as_posix())

    def write_result(self):
        """Write the structured json result for this doc. Result must be json serializable."""
        output_filename = self._output_filename()
        output_filename.parent.mkdir(parents=True, exist_ok=True)
        with open(output_filename, "w") as output_f:
            output_f.write(json.dumps(self.isd_elems_no_filename, ensure_ascii=False, indent=2))
        logger.info(f"Wrote {output_filename}")

    @property
    def filename(self):
        """The filename of the file after downloading from the remote filesystem."""
        return self._tmp_download_file()

    def cleanup_file(self):
        """Removes the local copy of the file after successful processing."""
        if not self.config.preserve_downloads:
            logger.debug(f"Cleaning up {self}")
            os.unlink(self._tmp_download_file())


class FsspecConnector(BaseConnector):
    """Objects of this class support fetching document(s) from a remote filesystem."""

    def __init__(self, config: SimpleFsspecConfig):
        from fsspec import AbstractFileSystem, get_filesystem_class

        self.config = config
        self.fs: AbstractFileSystem = get_filesystem_class(self.config.protocol)(
            **self.config.access_kwargs,
        )
        self.cleanup_files = not config.preserve_downloads

    def cleanup(self, cur_dir=None):
        """Clean up lingering empty sub-dirs from remote paths, but leave remaining files
        (and their paths) intact, as that indicates they were not processed."""
        if not self.cleanup_files:
            return

        if cur_dir is None:
            cur_dir = self.config.download_dir
        sub_dirs = os.listdir(cur_dir)
        os.chdir(cur_dir)
        for sub_dir in sub_dirs:
            # don't traverse symlinks, not that there ever should be any
            if os.path.isdir(sub_dir) and not os.path.islink(sub_dir):
                self.cleanup(sub_dir)
        os.chdir("..")
        if len(os.listdir(cur_dir)) == 0:
            os.rmdir(cur_dir)

    def initialize(self):
        """Verify that we can list objects at the path; validates connection info."""
        ls_output = self.fs.ls(self.config.path_without_protocol)
        if len(ls_output) < 1:
            raise ValueError(
                f"No objects found in {self.config.path}.",
            )

    def _list_files(self):
        return self.fs.ls(self.config.path_without_protocol)

    def get_ingest_docs(self):
        return [
            FsspecIngestDoc(
                self.config,
                file,
            )
            for file in self._list_files()
        ]
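The recursive `cleanup` above prunes empty sub-directories left under the download dir while keeping anything that still holds files. The same behavior can be sketched without `os.chdir` by walking the tree bottom-up; this is an illustration of the idea, not the connector's code:

```python
import os
import tempfile


def prune_empty_dirs(root: str) -> None:
    """Remove empty sub-directories under root, leaving non-empty ones intact."""
    # Walk bottom-up so an emptied child is gone before its parent is checked.
    for dirpath, _, _ in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)


# Demo: an empty chain of dirs gets pruned; a dir holding output is kept.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "a", "b"))
os.makedirs(os.path.join(tmp, "kept"))
open(os.path.join(tmp, "kept", "doc.json"), "w").close()
prune_empty_dirs(tmp)
print(sorted(os.listdir(tmp)))  # ['kept']
```

Checking `os.listdir(dirpath)` at removal time (rather than trusting the names `os.walk` collected earlier) is what lets a parent become removable after its only child was just deleted.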
24 unstructured/ingest/connector/s3.py (new file)
@@ -0,0 +1,24 @@
from dataclasses import dataclass

from unstructured.ingest.connector.fsspec import (
    FsspecConnector,
    FsspecIngestDoc,
    SimpleFsspecConfig,
)
from unstructured.utils import requires_dependencies


@dataclass
class SimpleS3Config(SimpleFsspecConfig):
    pass


class S3IngestDoc(FsspecIngestDoc):
    @requires_dependencies(["s3fs", "fsspec"], extras="s3")
    def get_file(self):
        super().get_file()


@requires_dependencies(["s3fs", "fsspec"], extras="s3")
class S3Connector(FsspecConnector):
    pass
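`requires_dependencies` above guards the optional `s3fs`/`fsspec` imports so a missing extra fails with a helpful message instead of a bare `ModuleNotFoundError`. A minimal standalone sketch of such a decorator (illustrative only; `requires_deps` here is a hypothetical stand-in, not `unstructured.utils`' actual implementation):

```python
import functools
import importlib.util


def requires_deps(deps, extras=None):
    """Raise a helpful ImportError if any optional dependency is missing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None for importable names that aren't installed.
            missing = [d for d in deps if importlib.util.find_spec(d) is None]
            if missing:
                hint = f' (try pip install "unstructured[{extras}]")' if extras else ""
                raise ImportError(f"Missing optional dependencies: {missing}{hint}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@requires_deps(["json"])  # stdlib module, always present
def ok():
    return "ran"


@requires_deps(["definitely_not_installed_pkg"], extras="s3")
def broken():
    return "never"


print(ok())  # ran
```

Checking at call time (not import time) is what lets `unstructured/ingest/connector/s3.py` be imported even when the `s3` extra is absent.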
@@ -1,198 +0,0 @@
import json
import os
import re
from dataclasses import dataclass, field
from pathlib import Path

from unstructured.ingest.interfaces import (
    BaseConnector,
    BaseConnectorConfig,
    BaseIngestDoc,
)
from unstructured.ingest.logger import logger
from unstructured.utils import requires_dependencies


@dataclass
class SimpleS3Config(BaseConnectorConfig):
    """Connector config where s3_url is an s3 prefix to process all documents from."""

    # S3 Specific Options
    s3_url: str

    # Standard Connector options
    download_dir: str
    # where to write structured data, with the directory structure matching s3 path
    output_dir: str
    re_download: bool = False
    preserve_downloads: bool = False

    # S3 Specific (optional)
    anonymous: bool = False

    s3_bucket: str = field(init=False)
    # could be single object or prefix
    s3_path: str = field(init=False)

    def __post_init__(self):
        if not self.s3_url.startswith("s3://"):
            raise ValueError("s3_url must begin with 's3://'")

        # just a bucket with no trailing prefix
        match = re.match(r"s3://([^/\s]+?)$", self.s3_url)
        if match:
            self.s3_bucket = match.group(1)
            self.s3_path = ""
            return

        # bucket with a path
        match = re.match(r"s3://([^/\s]+?)/([^\s]*)", self.s3_url)
        if not match:
            raise ValueError(
                f"s3_url {self.s3_url} does not look like a valid path. "
                "Expected s3://<bucket-name> or s3://<bucket-name>/<path>",
            )
        self.s3_bucket = match.group(1)
        self.s3_path = match.group(2) or ""


@dataclass
class S3IngestDoc(BaseIngestDoc):
    """Class encapsulating fetching a doc and writing processed results (but not
    doing the processing!).

    Also includes a cleanup method. When things go wrong and the cleanup
    method is not called, the file is left behind on the filesystem to assist debugging.
    """

    config: SimpleS3Config
    s3_key: str

    # NOTE(crag): probably doesn't matter, but intentionally not defining tmp_download_file
    # in __post_init__ for multiprocessing simplicity (no Path objects in initially
    # instantiated object)
    def _tmp_download_file(self):
        return Path(self.config.download_dir) / self.s3_key

    def _output_filename(self):
        return Path(self.config.output_dir) / f"{self.s3_key}.json"

    def has_output(self):
        """Determine if structured output for this doc already exists."""
        return self._output_filename().is_file() and os.path.getsize(self._output_filename())

    def _create_full_tmp_dir_path(self):
        """Includes "directories" in the s3 object path."""
        self._tmp_download_file().parent.mkdir(parents=True, exist_ok=True)

    @requires_dependencies(["boto3"], extras="s3")
    def get_file(self):
        """Actually fetches the file from s3 and stores it locally."""
        import boto3

        self._create_full_tmp_dir_path()
        if (
            not self.config.re_download
            and self._tmp_download_file().is_file()
            and os.path.getsize(self._tmp_download_file())
        ):
            logger.debug(f"File exists: {self.filename}, skipping download")
            return

        if self.config.anonymous:
            from botocore import UNSIGNED
            from botocore.client import Config

            s3_cli = boto3.client("s3", config=Config(signature_version=UNSIGNED))
        else:
            s3_cli = boto3.client("s3")
        logger.debug(f"Fetching {self} - PID: {os.getpid()}")
        s3_cli.download_file(self.config.s3_bucket, self.s3_key, self._tmp_download_file())

    def write_result(self):
        """Write the structured json result for this doc. Result must be json serializable."""
        output_filename = self._output_filename()
        output_filename.parent.mkdir(parents=True, exist_ok=True)
        with open(output_filename, "w") as output_f:
            json.dump(self.isd_elems_no_filename, output_f, ensure_ascii=False, indent=2)
        logger.info(f"Wrote {output_filename}")

    @property
    def filename(self):
        """The filename of the file after downloading from s3."""
        return self._tmp_download_file()

    def cleanup_file(self):
        """Removes the local copy of the file after successful processing."""
        if not self.config.preserve_downloads:
            logger.debug(f"Cleaning up {self}")
            os.unlink(self._tmp_download_file())


@requires_dependencies(["boto3"], extras="s3")
class S3Connector(BaseConnector):
    """Objects of this class support fetching document(s) from s3."""

    def __init__(self, config: SimpleS3Config):
        import boto3

        self.config = config
        self._list_objects_kwargs = {"Bucket": config.s3_bucket, "Prefix": config.s3_path}
        if config.anonymous:
            from botocore import UNSIGNED
            from botocore.client import Config

            self.s3_cli = boto3.client("s3", config=Config(signature_version=UNSIGNED))
        else:
            self.s3_cli = boto3.client("s3")
        self.cleanup_files = not config.preserve_downloads

    def cleanup(self, cur_dir=None):
        """Clean up lingering empty sub-dirs from s3 paths, but leave remaining files
        (and their paths) intact, as that indicates they were not processed."""
        if not self.cleanup_files:
            return

        if cur_dir is None:
            cur_dir = self.config.download_dir
        sub_dirs = os.listdir(cur_dir)
        os.chdir(cur_dir)
        for sub_dir in sub_dirs:
            # don't traverse symlinks, not that there ever should be any
            if os.path.isdir(sub_dir) and not os.path.islink(sub_dir):
                self.cleanup(sub_dir)
        os.chdir("..")
        if len(os.listdir(cur_dir)) == 0:
            os.rmdir(cur_dir)

    def initialize(self):
        """Verify that we can get metadata for an object; validates connection info."""
        response = self.s3_cli.list_objects_v2(**self._list_objects_kwargs, MaxKeys=1)
        if response["KeyCount"] < 1:
            raise ValueError(
                f"No objects found in {self.config.s3_url} -- response list object is {response}",
            )

    def _list_objects(self):
        response = self.s3_cli.list_objects_v2(**self._list_objects_kwargs)
        s3_keys = []
        while True:
            s3_keys.extend([s3_item["Key"] for s3_item in response["Contents"]])
            if not response.get("IsTruncated"):
                break
            next_token = response.get("NextContinuationToken")
            response = self.s3_cli.list_objects_v2(
                **self._list_objects_kwargs,
                ContinuationToken=next_token,
            )
        return s3_keys

    def get_ingest_docs(self):
        s3_keys = self._list_objects()
        return [
            S3IngestDoc(
                self.config,
                s3_key,
            )
            for s3_key in s3_keys
        ]
@@ -3,6 +3,7 @@ import hashlib
 import logging
 import multiprocessing as mp
 import sys
+from contextlib import suppress
 from pathlib import Path

 import click
@@ -14,7 +15,7 @@ from unstructured.ingest.connector.google_drive import (
     SimpleGoogleDriveConfig,
 )
 from unstructured.ingest.connector.reddit import RedditConnector, SimpleRedditConfig
-from unstructured.ingest.connector.s3_connector import S3Connector, SimpleS3Config
+from unstructured.ingest.connector.s3 import S3Connector, SimpleS3Config
 from unstructured.ingest.connector.wikipedia import (
     SimpleWikipediaConfig,
     WikipediaConnector,
@@ -22,6 +23,9 @@ from unstructured.ingest.connector.wikipedia import (
 from unstructured.ingest.doc_processor.generalized import initialize, process_document
 from unstructured.ingest.logger import ingest_log_streaming_init, logger

+with suppress(RuntimeError):
+    mp.set_start_method("spawn")
+

 class MainProcess:
     def __init__(
@@ -80,7 +84,6 @@ class MainProcess:
         # Debugging tip: use the below line and comment out the mp.Pool loop
         # block to remain in single process
         # self.doc_processor_fn(docs[0])
-
         with mp.Pool(
             processes=self.num_processes,
             initializer=ingest_log_streaming_init,
@@ -303,11 +306,10 @@ def main(
     if s3_url:
         doc_connector = S3Connector(
             config=SimpleS3Config(
+                path=s3_url,
+                access_kwargs={"anon": s3_anonymous},
                 download_dir=download_dir,
-                s3_url=s3_url,
                 output_dir=structured_output_dir,
-                # set to False to use your AWS creds (not needed for this public s3 url)
-                anonymous=s3_anonymous,
                 re_download=re_download,
                 preserve_downloads=preserve_downloads,
             ),