feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available (#318)

This is a fairly large PR that adds an "adapter" to easily plug in any connector with an available fsspec implementation. It standardizes how remote filesystems are used within unstructured.

I've also renamed s3_connector.py to s3.py for readability and consistency with the other connectors, and verified that the new approach works as expected.
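For reviewers, this is the shape of the pattern: any filesystem with an `fsspec` implementation can be wired up by subclassing the new classes, mirroring what the new `s3.py` does below. A minimal sketch for a hypothetical GCS connector (the `gcsfs` package and a `gcs` extra are illustrative assumptions, not part of this PR):

```python
from dataclasses import dataclass

from unstructured.ingest.connector.fsspec import (
    FsspecConnector,
    FsspecIngestDoc,
    SimpleFsspecConfig,
)
from unstructured.utils import requires_dependencies


@dataclass
class SimpleGcsConfig(SimpleFsspecConfig):
    pass  # path parsing and validation are inherited from SimpleFsspecConfig


class GcsIngestDoc(FsspecIngestDoc):
    @requires_dependencies(["gcsfs", "fsspec"], extras="gcs")  # hypothetical extra
    def get_file(self):
        super().get_file()  # fsspec's get() performs the actual download


@requires_dependencies(["gcsfs", "fsspec"], extras="gcs")  # hypothetical extra
class GcsConnector(FsspecConnector):
    pass  # listing, validation, and cleanup are inherited
```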
Alvaro Bartolome 2023-03-10 07:15:19 +01:00 committed by GitHub
parent 7c619f045b
commit c51adb21e3
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
24 changed files with 448 additions and 301 deletions

.gitignore
View File

@@ -184,3 +184,6 @@ tags
[._]*.un~
.DS_Store
# Ruff cache
.ruff_cache/

View File

@@ -1,7 +1,13 @@
## 0.5.4-dev0
## 0.5.4-dev1
### Enhancements
* Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem as a connector.
* Rename `s3_connector.py` to `s3.py` for readability and consistency with the
rest of the connectors.
* Now `S3Connector` relies on `s3fs` instead of `boto3`, and it inherits
from `FsspecConnector`.
* Add an `UNSTRUCTURED_LANGUAGE_CHECKS` environment variable to control whether or not
language-specific checks like vocabulary and POS tagging are applied. Set to `"true"` for
higher-resolution partitioning and `"false"` for faster processing.
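That last (pre-existing) entry is driven purely by an environment variable; a hedged sketch of toggling it, assuming the variable is read when partitioning runs and that the input file exists:

```python
import os

# Assumption: setting this before partitioning is sufficient; "true" enables the
# slower vocabulary/POS-tagging checks, "false" skips them for speed.
os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "false"

from unstructured.partition.auto import partition

elements = partition(filename="example-docs/fake-text.txt")  # hypothetical input file
```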

View File

@@ -37,7 +37,9 @@ just execute `unstructured/ingest/main.py`, e.g.:
## Adding Data Connectors
To add a connector, refer to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py) as an example that implements the three relevant abstract base classes.
To add a connector, refer to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py) as an example that implements the three relevant abstract base classes.
If the connector has an available `fsspec` implementation, then refer to [unstructured/ingest/connector/s3.py](unstructured/ingest/connector/s3.py).
Then, update [unstructured/ingest/main.py](unstructured/ingest/main.py) to instantiate
your connector when its command line options are invoked.
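Concretely, a new connector boils down to three classes; a skeletal sketch (the method names follow the connectors shown elsewhere on this page, and the `MySource` names are placeholders):

```python
from dataclasses import dataclass

from unstructured.ingest.interfaces import (
    BaseConnector,
    BaseConnectorConfig,
    BaseIngestDoc,
)


@dataclass
class SimpleMySourceConfig(BaseConnectorConfig):
    # connector-specific options plus the standard ones
    # (download_dir, output_dir, re_download, preserve_downloads)
    ...


@dataclass
class MySourceIngestDoc(BaseIngestDoc):
    config: SimpleMySourceConfig

    def get_file(self):
        """Fetch the raw document from the source into the download dir."""

    def write_result(self):
        """Write the structured JSON output for this doc."""

    def cleanup_file(self):
        """Remove the local copy after successful processing."""


class MySourceConnector(BaseConnector):
    def initialize(self):
        """Validate connection info before processing starts."""

    def get_ingest_docs(self):
        """Return one MySourceIngestDoc per remote document."""
```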
@@ -56,7 +58,7 @@ The `main.py` flags of --re-download/--no-re-download, --download-dir, --preser
In checklist form, the above steps are summarized as:
- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py).
- [ ] The subclass of `BaseIngestDoc` overrides `process_file()` if extra processing logic is needed other than what is provided by [auto.partition()](unstructured/partition/auto.py).
- [ ] Update [unstructured/ingest/main.py](unstructured/ingest/main.py) with support for the new connector.
- [ ] Create a folder under [examples/ingest](examples/ingest) that includes at least one well documented script.
@@ -67,7 +69,7 @@ In checklist form, the above steps are summarized as:
- [ ] Add them as an extra to [setup.py](unstructured/setup.py).
- [ ] Update the Makefile, adding a target for `install-ingest-<name>` and adding another `pip-compile` line to the `pip-compile` make target. See [this commit](https://github.com/Unstructured-IO/unstructured/commit/ab542ca3c6274f96b431142262d47d727f309e37) for a reference.
- [ ] The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class or function that uses those connector-specific dependencies, e.g. for `S3Connector` it should look like `@requires_dependencies(dependencies=["boto3"], extras="s3")`
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class or function that uses those connector-specific dependencies, e.g. for `GitHubConnector` it should look like `@requires_dependencies(dependencies=["github"], extras="github")` (see the sketch after this checklist)
- [ ] Run `make tidy` and `make check` to ensure linting checks pass.
- [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
- [ ] If running with an `.output_dir` where structured outputs already exists for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to `MyIngestDoc.has_output()` which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
@@ -75,4 +77,4 @@ In checklist form, the above steps are summarized as:
- [ ] If `.preserve_download` is `True`, documents downloaded to `.download_dir` are not removed after processing.
- [ ] Else if `.preserve_download` is `False`, documents downloaded to `.download_dir` are removed after they are **successfully** processed during the invocation of `MyIngestDoc.cleanup_file()` in [process_document](unstructured/ingest/doc_processor/generalized.py)
- [ ] Does not re-download documents to `.download_dir` if `.re_download` is False, enforced in `MyIngestDoc.get_file()`
- [ ] Prints more details if `.verbose` similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
- [ ] Prints more details if `--verbose` is passed to the ingest CLI, similar to the logging messages in [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py).
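A hedged sketch of the `requires_dependencies` decorator referenced in the checklist above (the function and its body are illustrative; `from github import Github` is pygithub's import path):

```python
from unstructured.utils import requires_dependencies


@requires_dependencies(dependencies=["github"], extras="github")
def fetch_repo(repo_name: str):
    # Imported at call time so the base install does not need pygithub;
    # the decorator raises an informative error naming the missing extra.
    from github import Github

    return Github().get_repo(repo_name)
```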

View File

@@ -14,7 +14,7 @@ certifi==2022.12.7
# via
# -r requirements/build.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
docutils==0.18.1
# via

View File

@@ -1230,12 +1230,12 @@ files to an S3 bucket.
# Upload staged data files to S3 from local output directory.
def upload_staged_files():
import boto3
s3 = boto3.client("s3")
from s3fs import S3FileSystem
fs = S3FileSystem()
for filename in os.listdir(LOCAL_OUTPUT_DIRECTORY):
filepath = os.path.join(LOCAL_OUTPUT_DIRECTORY, filename)
upload_key = os.path.join(S3_BUCKET_KEY_PREFIX, filename)
s3.upload_file(filepath, Bucket=S3_BUCKET_NAME, Key=upload_key)
fs.put_file(lpath=filepath, rpath=os.path.join(S3_BUCKET_NAME, upload_key))
upload_staged_files()

View File

@@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -16,10 +16,12 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via nltk
commonmark==0.9.1
# via rich
deprecated==1.2.13
# via argilla
et-xmlfile==1.1.0
@@ -66,8 +68,10 @@ pillow==9.4.0
# via
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via argilla
pygments==2.14.0
# via rich
python-dateutil==2.8.2
# via pandas
python-docx==0.8.11
@@ -84,6 +88,8 @@ requests==2.28.2
# via unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
six==1.16.0
# via python-dateutil
sniffio==1.3.0
@@ -91,12 +97,14 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# nltk
typing-extensions==4.5.0
# via pydantic
# via
# pydantic
# rich
urllib3==1.26.14
# via requests
wrapt==1.14.1

View File

@@ -14,7 +14,7 @@ certifi==2022.12.7
# via
# -r requirements/build.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
docutils==0.18.1
# via

View File

@@ -55,7 +55,7 @@ filelock==3.9.0
# via virtualenv
fqdn==1.5.1
# via jsonschema
identify==2.5.18
identify==2.5.19
# via pre-commit
idna==3.4
# via
@@ -67,7 +67,7 @@ importlib-metadata==6.0.0
# nbconvert
importlib-resources==5.12.0
# via jsonschema
ipykernel==6.21.2
ipykernel==6.21.3
# via
# ipywidgets
# jupyter
@@ -115,7 +115,7 @@ jupyter-client==8.0.3
# nbclient
# notebook
# qtconsole
jupyter-console==6.6.2
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.2.0
# via
@@ -132,7 +132,7 @@ jupyter-core==5.2.0
# qtconsole
jupyter-events==0.6.3
# via jupyter-server
jupyter-server==2.3.0
jupyter-server==2.4.0
# via
# nbclassic
# notebook-shim
@@ -152,7 +152,7 @@ matplotlib-inline==0.1.6
# ipython
mistune==2.0.5
# via nbconvert
nbclassic==0.5.2
nbclassic==0.5.3
# via notebook
nbclient==0.7.2
# via nbconvert
@@ -176,7 +176,7 @@ nest-asyncio==1.5.6
# notebook
nodeenv==1.7.0
# via pre-commit
notebook==6.5.2
notebook==6.5.3
# via jupyter
notebook-shim==0.2.2
# via nbclassic
@@ -199,7 +199,7 @@ pip-tools==6.12.3
# via -r requirements/dev.in
pkgutil-resolve-name==1.3.10
# via jsonschema
platformdirs==3.0.0
platformdirs==3.1.0
# via
# jupyter-core
# virtualenv

View File

@@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -16,12 +16,14 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via
# nltk
# sacremoses
commonmark==0.9.1
# via rich
deprecated==1.2.13
# via argilla
et-xmlfile==1.1.0
@@ -36,7 +38,7 @@ httpcore==0.16.3
# via httpx
httpx==0.23.3
# via argilla
huggingface-hub==0.12.1
huggingface-hub==0.13.1
# via transformers
idna==3.4
# via
@@ -82,8 +84,10 @@ pillow==9.4.0
# via
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via argilla
pygments==2.14.0
# via rich
python-dateutil==2.8.2
# via pandas
python-docx==0.8.11
@@ -110,6 +114,8 @@ requests==2.28.2
# unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
sacremoses==0.0.53
# via unstructured (setup.py)
sentencepiece==0.1.97
@@ -128,7 +134,7 @@ tokenizers==0.13.2
# via transformers
torch==1.13.1
# via unstructured (setup.py)
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# huggingface-hub
@@ -141,6 +147,7 @@ typing-extensions==4.5.0
# via
# huggingface-hub
# pydantic
# rich
# torch
urllib3==1.26.14
# via requests

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
# unstructured (setup.py)
cffi==1.15.1
# via pynacl
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -33,6 +33,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -111,12 +115,16 @@ pillow==9.4.0
# unstructured (setup.py)
pycparser==2.21
# via cffi
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygithub==1.57.0
# via unstructured (setup.py)
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
pyjwt==2.6.0
# via pygithub
pynacl==1.5.0
@@ -154,6 +162,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -164,7 +176,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -173,6 +185,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -23,7 +23,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -31,10 +31,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
colorama==0.4.6
commonmark==0.9.1
# via
# click
# tqdm
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -110,10 +110,14 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -152,6 +156,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -162,7 +170,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -171,6 +179,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt

View File

@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with Python 3.9
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py
@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -33,6 +33,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -125,7 +129,7 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
protobuf==4.22.0
protobuf==4.22.1
# via
# google-api-core
# googleapis-common-protos
@@ -135,10 +139,14 @@ pyasn1==0.4.8
# rsa
pyasn1-modules==0.2.8
# via google-auth
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
pyparsing==3.0.9
# via httplib2
python-dateutil==2.8.2
@@ -174,6 +182,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
rsa==4.9
# via google-auth
six==1.16.0
@@ -188,7 +200,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -197,6 +209,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
uritemplate==4.1.1
# via google-api-python-client
urllib3==1.26.14

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -23,7 +23,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -31,6 +31,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -110,10 +114,14 @@ praw==7.7.0
# via unstructured (setup.py)
prawcore==2.3.0
# via praw
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -148,6 +156,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -158,7 +170,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -167,6 +179,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
update-checker==0.18.0
# via praw
urllib3==1.26.14

View File

@@ -4,24 +4,34 @@
#
# pip-compile --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
#
aiobotocore==2.4.2
# via s3fs
aiohttp==3.8.4
# via
# aiobotocore
# s3fs
aioitertools==0.11.0
# via aiobotocore
aiosignal==1.3.1
# via aiohttp
anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
async-timeout==4.0.2
# via aiohttp
attrs==22.2.0
# via aiohttp
backoff==2.2.1
# via
# -r requirements/base.txt
# argilla
boto3==1.26.82
# via unstructured (setup.py)
botocore==1.29.82
# via
# boto3
# s3transfer
botocore==1.27.59
# via aiobotocore
certifi==2022.12.7
# via
# -r requirements/base.txt
@@ -29,14 +39,19 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# aiohttp
# requests
click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -45,6 +60,14 @@ et-xmlfile==1.1.0
# via
# -r requirements/base.txt
# openpyxl
frozenlist==1.3.3
# via
# aiohttp
# aiosignal
fsspec==2023.3.0
# via
# s3fs
# unstructured (setup.py)
h11==0.14.0
# via
# -r requirements/base.txt
@@ -63,14 +86,13 @@ idna==3.4
# anyio
# requests
# rfc3986
# yarl
importlib-metadata==6.0.0
# via
# -r requirements/base.txt
# markdown
jmespath==1.0.1
# via
# boto3
# botocore
# via botocore
joblib==1.2.0
# via
# -r requirements/base.txt
@@ -89,6 +111,10 @@ monotonic==1.6
# via
# -r requirements/base.txt
# argilla
multidict==6.0.4
# via
# aiohttp
# yarl
nltk==3.8.1
# via
# -r requirements/base.txt
@@ -116,10 +142,14 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -153,8 +183,12 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
s3transfer==0.6.0
# via boto3
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
s3fs==2023.3.0
# via unstructured (setup.py)
six==1.16.0
# via
# -r requirements/base.txt
@@ -165,7 +199,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -173,7 +207,9 @@ tqdm==4.64.1
typing-extensions==4.5.0
# via
# -r requirements/base.txt
# aioitertools
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt
@@ -182,12 +218,15 @@ urllib3==1.26.14
wrapt==1.14.1
# via
# -r requirements/base.txt
# aiobotocore
# argilla
# deprecated
xlsxwriter==3.0.8
# via
# -r requirements/base.txt
# python-pptx
yarl==1.8.2
# via aiohttp
zipp==3.15.0
# via
# -r requirements/base.txt

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -33,6 +33,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -108,10 +112,14 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -145,6 +153,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -157,7 +169,7 @@ sniffio==1.3.0
# httpx
soupsieve==2.4
# via beautifulsoup4
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -166,6 +178,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt

View File

@@ -10,7 +10,7 @@ anyio==3.6.2
# via
# httpcore
# starlette
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -22,7 +22,7 @@ certifi==2022.12.7
# unstructured (setup.py)
cffi==1.15.1
# via cryptography
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# pdfminer-six
# requests
@@ -32,9 +32,11 @@ click==8.1.3
# uvicorn
coloredlogs==15.0.1
# via onnxruntime
commonmark==0.9.1
# via rich
contourpy==1.0.7
# via matplotlib
cryptography==39.0.1
cryptography==39.0.2
# via pdfminer-six
cycler==0.11.0
# via matplotlib
@@ -44,15 +46,15 @@ effdet==0.3.0
# via layoutparser
et-xmlfile==1.1.0
# via openpyxl
fastapi==0.92.0
fastapi==0.93.0
# via unstructured-inference
filelock==3.9.0
# via
# huggingface-hub
# transformers
flatbuffers==23.1.21
flatbuffers==23.3.3
# via onnxruntime
fonttools==4.38.0
fonttools==4.39.0
# via matplotlib
h11==0.14.0
# via
@@ -62,7 +64,7 @@ httpcore==0.16.3
# via httpx
httpx==0.23.3
# via argilla
huggingface-hub==0.12.1
huggingface-hub==0.13.1
# via
# timm
# transformers
@@ -95,11 +97,11 @@ lxml==4.9.2
# unstructured (setup.py)
markdown==3.4.1
# via unstructured (setup.py)
matplotlib==3.7.0
matplotlib==3.7.1
# via pycocotools
monotonic==1.6
# via argilla
mpmath==1.2.1
mpmath==1.3.0
# via sympy
nltk==3.8.1
# via unstructured (setup.py)
@@ -157,16 +159,18 @@ pillow==9.4.0
# unstructured (setup.py)
portalocker==2.7.0
# via iopath
protobuf==4.22.0
protobuf==4.22.1
# via onnxruntime
pycocotools==2.0.6
# via effdet
pycparser==2.21
# via cffi
pydantic==1.10.5
pydantic==1.10.6
# via
# argilla
# fastapi
pygments==2.14.0
# via rich
pyparsing==3.0.9
# via matplotlib
pytesseract==0.3.10
@@ -204,6 +208,8 @@ requests==2.28.2
# unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
scipy==1.10.1
# via layoutparser
six==1.16.0
@@ -232,7 +238,7 @@ torchvision==0.14.1
# effdet
# layoutparser
# timm
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# huggingface-hub
@@ -246,6 +252,7 @@ typing-extensions==4.5.0
# huggingface-hub
# iopath
# pydantic
# rich
# starlette
# torch
# torchvision

View File

@@ -12,7 +12,7 @@ certifi==2022.12.7
# via
# -r requirements/test.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via
@@ -40,7 +40,7 @@ mccabe==0.7.0
# via flake8
multidict==6.0.4
# via yarl
mypy==1.0.1
mypy==1.1.1
# via -r requirements/test.in
mypy-extensions==1.0.0
# via
@@ -52,17 +52,17 @@ packaging==23.0
# pytest
pathspec==0.11.0
# via black
platformdirs==3.0.0
platformdirs==3.1.0
# via black
pluggy==1.0.0
# via pytest
pycodestyle==2.10.0
# via flake8
pydantic==1.10.5
pydantic==1.10.6
# via label-studio-sdk
pyflakes==3.0.1
# via flake8
pytest==7.2.1
pytest==7.2.2
# via pytest-cov
pytest-cov==4.0.0
# via -r requirements/test.in
@@ -70,7 +70,7 @@ pyyaml==6.0
# via vcrpy
requests==2.28.2
# via label-studio-sdk
ruff==0.0.253
ruff==0.0.254
# via -r requirements/test.in
six==1.16.0
# via vcrpy

View File

@@ -77,7 +77,7 @@ setup(
# NOTE(robinson) - Upper bound is temporary due to a multithreading issue
"unstructured-inference>=0.2.4,<0.2.8",
],
"s3": ["boto3"],
"s3": ["s3fs", "fsspec"],
"github": [
# NOTE - pygithub==1.58.0 fails due to https://github.com/PyGithub/PyGithub/issues/2436
# In the future, we can update this to pygithub>1.58.0

View File

@@ -1,9 +1,9 @@
#!/usr/bin/env bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
cd "$SCRIPT_DIR"/.. || exit 1
if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch/ -type f -size +20k | wc -l)" != 3 ]]; then
if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch/ -type f -size +20k | wc -l)" -ne 3 ]]; then
echo "The test fixtures in test_unstructured_ingest/expected-structured-output/ look suspicious. At least one of the files is too small."
echo "Did you overwrite test fixtures with bad outputs?"
exit 1
@@ -12,13 +12,13 @@ fi
PYTHONPATH=. ./unstructured/ingest/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --s3-anonymous --structured-output-dir s3-small-batch-output
if ! diff -ru s3-small-batch-output test_unstructured_ingest/expected-structured-output/s3-small-batch ; then
echo
echo "There are differences from the previously checked-in structured outputs."
echo
echo "If these differences are acceptable, copy the outputs from"
echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
echo
echo " PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --structured-output-dir s3-small-batch-output"
echo
exit 1
echo
echo "There are differences from the previously checked-in structured outputs."
echo
echo "If these differences are acceptable, copy the outputs from"
echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
echo
echo " PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --structured-output-dir s3-small-batch-output"
echo
exit 1
fi

View File

@@ -1 +1 @@
__version__ = "0.5.4-dev0" # pragma: no cover
__version__ = "0.5.4-dev1" # pragma: no cover

View File

@@ -0,0 +1,186 @@
import json
import os
import re
from dataclasses import dataclass, field
from pathlib import Path
from unstructured.ingest.interfaces import (
BaseConnector,
BaseConnectorConfig,
BaseIngestDoc,
)
from unstructured.ingest.logger import logger
SUPPORTED_REMOTE_FSSPEC_PROTOCOLS = [
"s3",
"gcs",
"gcfs",
"abfs",
]
@dataclass
class SimpleFsspecConfig(BaseConnectorConfig):
# fsspec specific options
path: str
# base connector options
download_dir: str
output_dir: str
preserve_downloads: bool = False
re_download: bool = False
# fsspec specific options
access_kwargs: dict = field(default_factory=dict)
protocol: str = field(init=False)
path_without_protocol: str = field(init=False)
dir_path: str = field(init=False)
file_path: str = field(init=False)
def __post_init__(self):
self.protocol, self.path_without_protocol = self.path.split("://")
if self.protocol not in SUPPORTED_REMOTE_FSSPEC_PROTOCOLS:
raise ValueError(
f"Protocol {self.protocol} not supported yet, only "
f"{SUPPORTED_REMOTE_FSSPEC_PROTOCOLS} are supported.",
)
# just a path with no trailing prefix
match = re.match(rf"{self.protocol}://([^/\s]+?)(/*)$", self.path)
if match:
self.dir_path = match.group(1)
self.file_path = ""
return
# valid path with a dir and/or file
match = re.match(rf"{self.protocol}://([^/\s]+?)/([^\s]*)", self.path)
if not match:
raise ValueError(
f"Invalid path {self.path}. Expected <protocol>://<dir-path>/<file-or-dir-path>.",
)
self.dir_path = match.group(1)
self.file_path = match.group(2) or ""
@dataclass
class FsspecIngestDoc(BaseIngestDoc):
"""Class encapsulating fetching a doc and writing processed results (but not
doing the processing!).
Also includes a cleanup method. When things go wrong and the cleanup
method is not called, the file is left behind on the filesystem to assist debugging.
"""
config: SimpleFsspecConfig
remote_file_path: str
def _tmp_download_file(self):
return Path(self.config.download_dir) / self.remote_file_path.replace(
f"{self.config.dir_path}/",
"",
)
def _output_filename(self):
return (
Path(self.config.output_dir)
/ f"{self.remote_file_path.replace(f'{self.config.dir_path}/', '')}.json"
)
def has_output(self):
"""Determine if structured output for this doc already exists."""
return self._output_filename().is_file() and os.path.getsize(self._output_filename())
def _create_full_tmp_dir_path(self):
"""Includes "directories" in the object path"""
self._tmp_download_file().parent.mkdir(parents=True, exist_ok=True)
def get_file(self):
"""Fetches the file from the current filesystem and stores it locally."""
from fsspec import AbstractFileSystem, get_filesystem_class
self._create_full_tmp_dir_path()
if (
not self.config.re_download
and self._tmp_download_file().is_file()
and os.path.getsize(self._tmp_download_file())
):
logger.debug(f"File exists: {self._tmp_download_file()}, skipping download")
return
fs: AbstractFileSystem = get_filesystem_class(self.config.protocol)(
**self.config.access_kwargs,
)
logger.debug(f"Fetching {self} - PID: {os.getpid()}")
fs.get(rpath=self.remote_file_path, lpath=self._tmp_download_file().as_posix())
def write_result(self):
"""Write the structured json result for this doc. result must be json serializable."""
output_filename = self._output_filename()
output_filename.parent.mkdir(parents=True, exist_ok=True)
with open(output_filename, "w") as output_f:
output_f.write(json.dumps(self.isd_elems_no_filename, ensure_ascii=False, indent=2))
logger.info(f"Wrote {output_filename}")
@property
def filename(self):
"""The filename of the file after downloading from s3"""
return self._tmp_download_file()
def cleanup_file(self):
"""Removes the local copy the file after successful processing."""
if not self.config.preserve_downloads:
logger.debug(f"Cleaning up {self}")
os.unlink(self._tmp_download_file())
class FsspecConnector(BaseConnector):
"""Objects of this class support fetching document(s) from"""
def __init__(self, config: SimpleFsspecConfig):
from fsspec import AbstractFileSystem, get_filesystem_class
self.config = config
self.fs: AbstractFileSystem = get_filesystem_class(self.config.protocol)(
**self.config.access_kwargs,
)
self.cleanup_files = not config.preserve_downloads
def cleanup(self, cur_dir=None):
"""cleanup linginering empty sub-dirs from s3 paths, but leave remaining files
(and their paths) in tact as that indicates they were not processed"""
if not self.cleanup_files:
return
if cur_dir is None:
cur_dir = self.config.download_dir
sub_dirs = os.listdir(cur_dir)
os.chdir(cur_dir)
for sub_dir in sub_dirs:
# don't traverse symlinks, not that there ever should be any
if os.path.isdir(sub_dir) and not os.path.islink(sub_dir):
self.cleanup(sub_dir)
os.chdir("..")
if len(os.listdir(cur_dir)) == 0:
os.rmdir(cur_dir)
def initialize(self):
"""Verify that can get metadata for an object, validates connections info."""
ls_output = self.fs.ls(self.config.path_without_protocol)
if len(ls_output) < 1:
raise ValueError(
f"No objects found in {self.config.path}.",
)
def _list_files(self):
return self.fs.ls(self.config.path_without_protocol)
def get_ingest_docs(self):
return [
FsspecIngestDoc(
self.config,
file,
)
for file in self._list_files()
]
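To make the path parsing above concrete, a small sketch of what `SimpleFsspecConfig.__post_init__` produces for a typical bucket-plus-prefix path (the bucket and prefix names are hypothetical; the expected values follow the regexes above):

```python
from unstructured.ingest.connector.fsspec import SimpleFsspecConfig

config = SimpleFsspecConfig(
    path="s3://my-bucket/some/prefix",  # hypothetical bucket and prefix
    download_dir="downloads",
    output_dir="outputs",
)
assert config.protocol == "s3"
assert config.path_without_protocol == "my-bucket/some/prefix"
assert config.dir_path == "my-bucket"     # first path component
assert config.file_path == "some/prefix"  # remainder; "" for a bare bucket
```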

View File

@@ -0,0 +1,24 @@
from dataclasses import dataclass
from unstructured.ingest.connector.fsspec import (
FsspecConnector,
FsspecIngestDoc,
SimpleFsspecConfig,
)
from unstructured.utils import requires_dependencies
@dataclass
class SimpleS3Config(SimpleFsspecConfig):
pass
class S3IngestDoc(FsspecIngestDoc):
@requires_dependencies(["s3fs", "fsspec"], extras="s3")
def get_file(self):
super().get_file()
@requires_dependencies(["s3fs", "fsspec"], extras="s3")
class S3Connector(FsspecConnector):
pass
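A hedged usage sketch tying the new classes together, mirroring how `main.py` wires them up below and assuming the `s3` extra (s3fs/fsspec) is installed; the bucket URL is the public fixture used by this repo's ingest test, and the local directory names are arbitrary:

```python
from unstructured.ingest.connector.s3 import S3Connector, SimpleS3Config

connector = S3Connector(
    config=SimpleS3Config(
        path="s3://utic-dev-tech-fixtures/small-pdf-set/",
        access_kwargs={"anon": True},  # public bucket, no AWS creds needed
        download_dir="s3-downloads",
        output_dir="s3-small-batch-output",
    ),
)
connector.initialize()  # raises ValueError if nothing is listable at the path
for doc in connector.get_ingest_docs():
    doc.get_file()  # downloads via the fsspec filesystem's get()
```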

View File

@@ -1,198 +0,0 @@
import json
import os
import re
from dataclasses import dataclass, field
from pathlib import Path
from unstructured.ingest.interfaces import (
BaseConnector,
BaseConnectorConfig,
BaseIngestDoc,
)
from unstructured.ingest.logger import logger
from unstructured.utils import requires_dependencies
@dataclass
class SimpleS3Config(BaseConnectorConfig):
"""Connector config where s3_url is an s3 prefix to process all documents from."""
# S3 Specific Options
s3_url: str
# Standard Connector options
download_dir: str
# where to write structured data, with the directory structure matching s3 path
output_dir: str
re_download: bool = False
preserve_downloads: bool = False
# S3 Specific (optional)
anonymous: bool = False
s3_bucket: str = field(init=False)
# could be single object or prefix
s3_path: str = field(init=False)
def __post_init__(self):
if not self.s3_url.startswith("s3://"):
raise ValueError("s3_url must begin with 's3://'")
# just a bucket with no trailing prefix
match = re.match(r"s3://([^/\s]+?)$", self.s3_url)
if match:
self.s3_bucket = match.group(1)
self.s3_path = ""
return
# bucket with a path
match = re.match(r"s3://([^/\s]+?)/([^\s]*)", self.s3_url)
if not match:
raise ValueError(
f"s3_url {self.s3_url} does not look like a valid path. "
"Expected s3://<bucket-name or s3://<bucket-name/path",
)
self.s3_bucket = match.group(1)
self.s3_path = match.group(2) or ""
@dataclass
class S3IngestDoc(BaseIngestDoc):
"""Class encapsulating fetching a doc and writing processed results (but not
doing the processing!).
Also includes a cleanup method. When things go wrong and the cleanup
method is not called, the file is left behind on the filesystem to assist debugging.
"""
config: SimpleS3Config
s3_key: str
# NOTE(crag): probably doesn't matter, but intentionally not defining tmp_download_file
# __post_init__ for multiprocessing simplicity (no Path objects in initially
# instantiated object)
def _tmp_download_file(self):
return Path(self.config.download_dir) / self.s3_key
def _output_filename(self):
return Path(self.config.output_dir) / f"{self.s3_key}.json"
def has_output(self):
"""Determine if structured output for this doc already exists."""
return self._output_filename().is_file() and os.path.getsize(self._output_filename())
def _create_full_tmp_dir_path(self):
"""includes "directories" in s3 object path"""
self._tmp_download_file().parent.mkdir(parents=True, exist_ok=True)
@requires_dependencies(["boto3"], extras="s3")
def get_file(self):
"""Actually fetches the file from s3 and stores it locally."""
import boto3
self._create_full_tmp_dir_path()
if (
not self.config.re_download
and self._tmp_download_file().is_file()
and os.path.getsize(self._tmp_download_file())
):
logger.debug(f"File exists: {self.filename}, skipping download")
return
if self.config.anonymous:
from botocore import UNSIGNED
from botocore.client import Config
s3_cli = boto3.client("s3", config=Config(signature_version=UNSIGNED))
else:
s3_cli = boto3.client("s3")
logger.debug(f"Fetching {self} - PID: {os.getpid()}")
s3_cli.download_file(self.config.s3_bucket, self.s3_key, self._tmp_download_file())
def write_result(self):
"""Write the structured json result for this doc. result must be json serializable."""
output_filename = self._output_filename()
output_filename.parent.mkdir(parents=True, exist_ok=True)
with open(output_filename, "w") as output_f:
json.dump(self.isd_elems_no_filename, output_f, ensure_ascii=False, indent=2)
logger.info(f"Wrote {output_filename}")
@property
def filename(self):
"""The filename of the file after downloading from s3"""
return self._tmp_download_file()
def cleanup_file(self):
"""Removes the local copy the file after successful processing."""
if not self.config.preserve_downloads:
logger.debug(f"Cleaning up {self}")
os.unlink(self._tmp_download_file())
@requires_dependencies(["boto3"], extras="s3")
class S3Connector(BaseConnector):
"""Objects of this class support fetching document(s) from"""
def __init__(self, config: SimpleS3Config):
import boto3
self.config = config
self._list_objects_kwargs = {"Bucket": config.s3_bucket, "Prefix": config.s3_path}
if config.anonymous:
from botocore import UNSIGNED
from botocore.client import Config
self.s3_cli = boto3.client("s3", config=Config(signature_version=UNSIGNED))
else:
self.s3_cli = boto3.client("s3")
self.cleanup_files = not config.preserve_downloads
def cleanup(self, cur_dir=None):
"""cleanup linginering empty sub-dirs from s3 paths, but leave remaining files
(and their paths) in tact as that indicates they were not processed"""
if not self.cleanup_files:
return
if cur_dir is None:
cur_dir = self.config.download_dir
sub_dirs = os.listdir(cur_dir)
os.chdir(cur_dir)
for sub_dir in sub_dirs:
# don't traverse symlinks, not that there every should be any
if os.path.isdir(sub_dir) and not os.path.islink(sub_dir):
self.cleanup(sub_dir)
os.chdir("..")
if len(os.listdir(cur_dir)) == 0:
os.rmdir(cur_dir)
def initialize(self):
"""Verify that can get metadata for an object, validates connections info."""
response = self.s3_cli.list_objects_v2(**self._list_objects_kwargs, MaxKeys=1)
if response["KeyCount"] < 1:
raise ValueError(
f"No objects found in {self.config.s3_url} -- response list object is {response}",
)
def _list_objects(self):
response = self.s3_cli.list_objects_v2(**self._list_objects_kwargs)
s3_keys = []
while True:
s3_keys.extend([s3_item["Key"] for s3_item in response["Contents"]])
if not response.get("IsTruncated"):
break
next_token = response.get("NextContinuationToken")
response = self.s3_cli.list_objects_v2(
**self._list_objects_kwargs,
ContinuationToken=next_token,
)
return s3_keys
def get_ingest_docs(self):
s3_keys = self._list_objects()
return [
S3IngestDoc(
self.config,
s3_key,
)
for s3_key in s3_keys
]

View File

@@ -3,6 +3,7 @@ import hashlib
import logging
import multiprocessing as mp
import sys
from contextlib import suppress
from pathlib import Path
import click
@@ -14,7 +15,7 @@ from unstructured.ingest.connector.google_drive import (
SimpleGoogleDriveConfig,
)
from unstructured.ingest.connector.reddit import RedditConnector, SimpleRedditConfig
from unstructured.ingest.connector.s3_connector import S3Connector, SimpleS3Config
from unstructured.ingest.connector.s3 import S3Connector, SimpleS3Config
from unstructured.ingest.connector.wikipedia import (
SimpleWikipediaConfig,
WikipediaConnector,
@@ -22,6 +23,9 @@ from unstructured.ingest.connector.wikipedia import (
from unstructured.ingest.doc_processor.generalized import initialize, process_document
from unstructured.ingest.logger import ingest_log_streaming_init, logger
with suppress(RuntimeError):
mp.set_start_method("spawn")
class MainProcess:
def __init__(
@@ -80,7 +84,6 @@ class MainProcess:
# Debugging tip: use the below line and comment out the mp.Pool loop
# block to remain in single process
# self.doc_processor_fn(docs[0])
with mp.Pool(
processes=self.num_processes,
initializer=ingest_log_streaming_init,
@@ -303,11 +306,10 @@ def main(
if s3_url:
doc_connector = S3Connector(
config=SimpleS3Config(
path=s3_url,
access_kwargs={"anon": s3_anonymous},
download_dir=download_dir,
s3_url=s3_url,
output_dir=structured_output_dir,
# set to False to use your AWS creds (not needed for this public s3 url)
anonymous=s3_anonymous,
re_download=re_download,
preserve_downloads=preserve_downloads,
),