feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available (#318)

This is a fairly large PR that adds an "adapter" to easily plug in any connector with an available fsspec implementation. It standardizes how remote filesystems are used within unstructured.

I've also renamed s3_connector.py to s3.py for readability and consistency with the other connectors, and verified that the new approach works as expected.
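For reviewers, this is the shape of the pattern: any filesystem with an `fsspec` implementation can be wired up by subclassing the new classes, mirroring what the new `s3.py` does below. A minimal sketch for a hypothetical GCS connector (the `gcsfs` package and a `gcs` extra are illustrative assumptions, not part of this PR):

```python
from dataclasses import dataclass

from unstructured.ingest.connector.fsspec import (
    FsspecConnector,
    FsspecIngestDoc,
    SimpleFsspecConfig,
)
from unstructured.utils import requires_dependencies


@dataclass
class SimpleGcsConfig(SimpleFsspecConfig):
    pass  # path parsing and validation are inherited from SimpleFsspecConfig


class GcsIngestDoc(FsspecIngestDoc):
    @requires_dependencies(["gcsfs", "fsspec"], extras="gcs")  # hypothetical extra
    def get_file(self):
        super().get_file()  # fsspec's get() performs the actual download


@requires_dependencies(["gcsfs", "fsspec"], extras="gcs")  # hypothetical extra
class GcsConnector(FsspecConnector):
    pass  # listing, validation, and cleanup are inherited
```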
Alvaro Bartolome 2023-03-10 07:15:19 +01:00 committed by GitHub
parent 7c619f045b
commit c51adb21e3
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
24 changed files with 448 additions and 301 deletions

.gitignore
View File

@@ -184,3 +184,6 @@ tags
[._]*.un~
.DS_Store
# Ruff cache
.ruff_cache/

View File

@@ -1,7 +1,13 @@
## 0.5.4-dev0
## 0.5.4-dev1
### Enhancements
* Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem as a connector.
* Rename `s3_connector.py` to `s3.py` for readability and consistency with the
rest of the connectors.
* Now `S3Connector` relies on `s3fs` instead of `boto3`, and it inherits
from `FsspecConnector`.
* Add an `UNSTRUCTURED_LANGUAGE_CHECKS` environment variable to control whether or not
language-specific checks like vocabulary and POS tagging are applied. Set to `"true"` for
higher-resolution partitioning and `"false"` for faster processing.
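That last (pre-existing) entry is driven purely by an environment variable; a hedged sketch of toggling it, assuming the variable is read when partitioning runs and that the input file exists:

```python
import os

# Assumption: setting this before partitioning is sufficient; "true" enables the
# slower vocabulary/POS-tagging checks, "false" skips them for speed.
os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "false"

from unstructured.partition.auto import partition

elements = partition(filename="example-docs/fake-text.txt")  # hypothetical input file
```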

View File

@@ -37,7 +37,9 @@ just execute `unstructured/ingest/main.py`, e.g.:
## Adding Data Connectors
To add a connector, refer to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py) as an example that implements the three relevant abstract base classes.
To add a connector, refer to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py) as an example that implements the three relevant abstract base classes.
If the connector has an available `fsspec` implementation, then refer to [unstructured/ingest/connector/s3.py](unstructured/ingest/connector/s3.py).
Then, update [unstructured/ingest/main.py](unstructured/ingest/main.py) to instantiate
your connector when its command line options are invoked.
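Concretely, a new connector boils down to three classes; a skeletal sketch (the method names follow the connectors shown elsewhere on this page, and the `MySource` names are placeholders):

```python
from dataclasses import dataclass

from unstructured.ingest.interfaces import (
    BaseConnector,
    BaseConnectorConfig,
    BaseIngestDoc,
)


@dataclass
class SimpleMySourceConfig(BaseConnectorConfig):
    # connector-specific options plus the standard ones
    # (download_dir, output_dir, re_download, preserve_downloads)
    ...


@dataclass
class MySourceIngestDoc(BaseIngestDoc):
    config: SimpleMySourceConfig

    def get_file(self):
        """Fetch the raw document from the source into the download dir."""

    def write_result(self):
        """Write the structured JSON output for this doc."""

    def cleanup_file(self):
        """Remove the local copy after successful processing."""


class MySourceConnector(BaseConnector):
    def initialize(self):
        """Validate connection info before processing starts."""

    def get_ingest_docs(self):
        """Return one MySourceIngestDoc per remote document."""
```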
@@ -56,7 +58,7 @@ The `main.py` flags of --re-download/--no-re-download, --download-dir, --preser
In checklist form, the above steps are summarized as:
- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py).
- [ ] The subclass of `BaseIngestDoc` overrides `process_file()` if extra processing logic is needed other than what is provided by [auto.partition()](unstructured/partition/auto.py).
- [ ] Update [unstructured/ingest/main.py](unstructured/ingest/main.py) with support for the new connector.
- [ ] Create a folder under [examples/ingest](examples/ingest) that includes at least one well documented script.
@@ -67,7 +69,7 @@ In checklist form, the above steps are summarized as:
- [ ] Add them as an extra to [setup.py](unstructured/setup.py).
- [ ] Update the Makefile, adding a target for `install-ingest-<name>` and adding another `pip-compile` line to the `pip-compile` make target. See [this commit](https://github.com/Unstructured-IO/unstructured/commit/ab542ca3c6274f96b431142262d47d727f309e37) for a reference.
- [ ] The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class or function that uses those connector-specific dependencies, e.g. for `S3Connector` it should look like `@requires_dependencies(dependencies=["boto3"], extras="s3")`
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class or function that uses those connector-specific dependencies, e.g. for `GitHubConnector` it should look like `@requires_dependencies(dependencies=["github"], extras="github")` (see the sketch after this checklist)
- [ ] Run `make tidy` and `make check` to ensure linting checks pass.
- [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
- [ ] If running with an `.output_dir` where structured outputs already exists for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to `MyIngestDoc.has_output()` which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
@@ -75,4 +77,4 @@ In checklist form, the above steps are summarized as:
- [ ] If `.preserve_download` is `True`, documents downloaded to `.download_dir` are not removed after processing.
- [ ] Else if `.preserve_download` is `False`, documents downloaded to `.download_dir` are removed after they are **successfully** processed during the invocation of `MyIngestDoc.cleanup_file()` in [process_document](unstructured/ingest/doc_processor/generalized.py)
- [ ] Does not re-download documents to `.download_dir` if `.re_download` is False, enforced in `MyIngestDoc.get_file()`
- [ ] Prints more details if `.verbose` similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
- [ ] Prints more details if `--verbose` is passed to the ingest CLI, similar to the logging messages in [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py).
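A hedged sketch of the `requires_dependencies` decorator referenced in the checklist above (the function and its body are illustrative; `from github import Github` is pygithub's import path):

```python
from unstructured.utils import requires_dependencies


@requires_dependencies(dependencies=["github"], extras="github")
def fetch_repo(repo_name: str):
    # Imported at call time so the base install does not need pygithub;
    # the decorator raises an informative error naming the missing extra.
    from github import Github

    return Github().get_repo(repo_name)
```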

View File

@@ -14,7 +14,7 @@ certifi==2022.12.7
# via
# -r requirements/build.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
docutils==0.18.1
# via

View File

@@ -1230,12 +1230,12 @@ files to an S3 bucket.
# Upload staged data files to S3 from local output directory.
def upload_staged_files():
import boto3
s3 = boto3.client("s3")
from s3fs import S3FileSystem
fs = S3FileSystem()
for filename in os.listdir(LOCAL_OUTPUT_DIRECTORY):
filepath = os.path.join(LOCAL_OUTPUT_DIRECTORY, filename)
upload_key = os.path.join(S3_BUCKET_KEY_PREFIX, filename)
s3.upload_file(filepath, Bucket=S3_BUCKET_NAME, Key=upload_key)
fs.put_file(lpath=filepath, rpath=os.path.join(S3_BUCKET_NAME, upload_key))
upload_staged_files()

View File

@@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -16,10 +16,12 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via nltk
commonmark==0.9.1
# via rich
deprecated==1.2.13
# via argilla
et-xmlfile==1.1.0
@@ -66,8 +68,10 @@ pillow==9.4.0
# via
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via argilla
pygments==2.14.0
# via rich
python-dateutil==2.8.2
# via pandas
python-docx==0.8.11
@@ -84,6 +88,8 @@ requests==2.28.2
# via unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
six==1.16.0
# via python-dateutil
sniffio==1.3.0
@@ -91,12 +97,14 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# nltk
typing-extensions==4.5.0
# via pydantic
# via
# pydantic
# rich
urllib3==1.26.14
# via requests
wrapt==1.14.1

View File

@@ -14,7 +14,7 @@ certifi==2022.12.7
# via
# -r requirements/build.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
docutils==0.18.1
# via

View File

@@ -55,7 +55,7 @@ filelock==3.9.0
# via virtualenv
fqdn==1.5.1
# via jsonschema
identify==2.5.18
identify==2.5.19
# via pre-commit
idna==3.4
# via
@@ -67,7 +67,7 @@ importlib-metadata==6.0.0
# nbconvert
importlib-resources==5.12.0
# via jsonschema
ipykernel==6.21.2
ipykernel==6.21.3
# via
# ipywidgets
# jupyter
@@ -115,7 +115,7 @@ jupyter-client==8.0.3
# nbclient
# notebook
# qtconsole
jupyter-console==6.6.2
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.2.0
# via
@@ -132,7 +132,7 @@ jupyter-core==5.2.0
# qtconsole
jupyter-events==0.6.3
# via jupyter-server
jupyter-server==2.3.0
jupyter-server==2.4.0
# via
# nbclassic
# notebook-shim
@@ -152,7 +152,7 @@ matplotlib-inline==0.1.6
# ipython
mistune==2.0.5
# via nbconvert
nbclassic==0.5.2
nbclassic==0.5.3
# via notebook
nbclient==0.7.2
# via nbconvert
@@ -176,7 +176,7 @@ nest-asyncio==1.5.6
# notebook
nodeenv==1.7.0
# via pre-commit
notebook==6.5.2
notebook==6.5.3
# via jupyter
notebook-shim==0.2.2
# via nbclassic
@@ -199,7 +199,7 @@ pip-tools==6.12.3
# via -r requirements/dev.in
pkgutil-resolve-name==1.3.10
# via jsonschema
platformdirs==3.0.0
platformdirs==3.1.0
# via
# jupyter-core
# virtualenv

View File

@@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -16,12 +16,14 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via
# nltk
# sacremoses
commonmark==0.9.1
# via rich
deprecated==1.2.13
# via argilla
et-xmlfile==1.1.0
@@ -36,7 +38,7 @@ httpcore==0.16.3
# via httpx
httpx==0.23.3
# via argilla
huggingface-hub==0.12.1
huggingface-hub==0.13.1
# via transformers
idna==3.4
# via
@@ -82,8 +84,10 @@ pillow==9.4.0
# via
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via argilla
pygments==2.14.0
# via rich
python-dateutil==2.8.2
# via pandas
python-docx==0.8.11
@@ -110,6 +114,8 @@ requests==2.28.2
# unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
sacremoses==0.0.53
# via unstructured (setup.py)
sentencepiece==0.1.97
@@ -128,7 +134,7 @@ tokenizers==0.13.2
# via transformers
torch==1.13.1
# via unstructured (setup.py)
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# huggingface-hub
@@ -141,6 +147,7 @@ typing-extensions==4.5.0
# via
# huggingface-hub
# pydantic
# rich
# torch
urllib3==1.26.14
# via requests

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
# unstructured (setup.py)
cffi==1.15.1
# via pynacl
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -33,6 +33,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -111,12 +115,16 @@ pillow==9.4.0
# unstructured (setup.py)
pycparser==2.21
# via cffi
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygithub==1.57.0
# via unstructured (setup.py)
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
pyjwt==2.6.0
# via pygithub
pynacl==1.5.0
@@ -154,6 +162,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -164,7 +176,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -173,6 +185,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -23,7 +23,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -31,10 +31,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
colorama==0.4.6
commonmark==0.9.1
# via
# click
# tqdm
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -110,10 +110,14 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -152,6 +156,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -162,7 +170,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -171,6 +179,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt

View File

@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with Python 3.9
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py
@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -33,6 +33,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -125,7 +129,7 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
protobuf==4.22.0
protobuf==4.22.1
# via
# google-api-core
# googleapis-common-protos
@@ -135,10 +139,14 @@ pyasn1==0.4.8
# rsa
pyasn1-modules==0.2.8
# via google-auth
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
pyparsing==3.0.9
# via httplib2
python-dateutil==2.8.2
@@ -174,6 +182,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
rsa==4.9
# via google-auth
six==1.16.0
@@ -188,7 +200,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -197,6 +209,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
uritemplate==4.1.1
# via google-api-python-client
urllib3==1.26.14

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -23,7 +23,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -31,6 +31,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -110,10 +114,14 @@ praw==7.7.0
# via unstructured (setup.py)
prawcore==2.3.0
# via praw
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -148,6 +156,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -158,7 +170,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -167,6 +179,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
update-checker==0.18.0
# via praw
urllib3==1.26.14

View File

@@ -4,24 +4,34 @@
#
# pip-compile --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
#
aiobotocore==2.4.2
# via s3fs
aiohttp==3.8.4
# via
# aiobotocore
# s3fs
aioitertools==0.11.0
# via aiobotocore
aiosignal==1.3.1
# via aiohttp
anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
async-timeout==4.0.2
# via aiohttp
attrs==22.2.0
# via aiohttp
backoff==2.2.1
# via
# -r requirements/base.txt
# argilla
boto3==1.26.82
# via unstructured (setup.py)
botocore==1.29.82
# via
# boto3
# s3transfer
botocore==1.27.59
# via aiobotocore
certifi==2022.12.7
# via
# -r requirements/base.txt
@@ -29,14 +39,19 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# aiohttp
# requests
click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -45,6 +60,14 @@ et-xmlfile==1.1.0
# via
# -r requirements/base.txt
# openpyxl
frozenlist==1.3.3
# via
# aiohttp
# aiosignal
fsspec==2023.3.0
# via
# s3fs
# unstructured (setup.py)
h11==0.14.0
# via
# -r requirements/base.txt
@@ -63,14 +86,13 @@ idna==3.4
# anyio
# requests
# rfc3986
# yarl
importlib-metadata==6.0.0
# via
# -r requirements/base.txt
# markdown
jmespath==1.0.1
# via
# boto3
# botocore
# via botocore
joblib==1.2.0
# via
# -r requirements/base.txt
@@ -89,6 +111,10 @@ monotonic==1.6
# via
# -r requirements/base.txt
# argilla
multidict==6.0.4
# via
# aiohttp
# yarl
nltk==3.8.1
# via
# -r requirements/base.txt
@@ -116,10 +142,14 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -153,8 +183,12 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
s3transfer==0.6.0
# via boto3
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
s3fs==2023.3.0
# via unstructured (setup.py)
six==1.16.0
# via
# -r requirements/base.txt
@@ -165,7 +199,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -173,7 +207,9 @@ tqdm==4.64.1
typing-extensions==4.5.0
# via
# -r requirements/base.txt
# aioitertools
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt
@@ -182,12 +218,15 @@ urllib3==1.26.14
wrapt==1.14.1
# via
# -r requirements/base.txt
# aiobotocore
# argilla
# deprecated
xlsxwriter==3.0.8
# via
# -r requirements/base.txt
# python-pptx
yarl==1.8.2
# via aiohttp
zipp==3.15.0
# via
# -r requirements/base.txt

View File

@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -25,7 +25,7 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
@@ -33,6 +33,10 @@ click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -108,10 +112,14 @@ pillow==9.4.0
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
python-dateutil==2.8.2
# via
# -r requirements/base.txt
@@ -145,6 +153,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -157,7 +169,7 @@ sniffio==1.3.0
# httpx
soupsieve==2.4
# via beautifulsoup4
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -166,6 +178,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt

View File

@@ -10,7 +10,7 @@ anyio==3.6.2
# via
# httpcore
# starlette
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -22,7 +22,7 @@ certifi==2022.12.7
# unstructured (setup.py)
cffi==1.15.1
# via cryptography
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# pdfminer-six
# requests
@@ -32,9 +32,11 @@ click==8.1.3
# uvicorn
coloredlogs==15.0.1
# via onnxruntime
commonmark==0.9.1
# via rich
contourpy==1.0.7
# via matplotlib
cryptography==39.0.1
cryptography==39.0.2
# via pdfminer-six
cycler==0.11.0
# via matplotlib
@@ -44,15 +46,15 @@ effdet==0.3.0
# via layoutparser
et-xmlfile==1.1.0
# via openpyxl
fastapi==0.92.0
fastapi==0.93.0
# via unstructured-inference
filelock==3.9.0
# via
# huggingface-hub
# transformers
flatbuffers==23.1.21
flatbuffers==23.3.3
# via onnxruntime
fonttools==4.38.0
fonttools==4.39.0
# via matplotlib
h11==0.14.0
# via
@@ -62,7 +64,7 @@ httpcore==0.16.3
# via httpx
httpx==0.23.3
# via argilla
huggingface-hub==0.12.1
huggingface-hub==0.13.1
# via
# timm
# transformers
@@ -95,11 +97,11 @@ lxml==4.9.2
# unstructured (setup.py)
markdown==3.4.1
# via unstructured (setup.py)
matplotlib==3.7.0
matplotlib==3.7.1
# via pycocotools
monotonic==1.6
# via argilla
mpmath==1.2.1
mpmath==1.3.0
# via sympy
nltk==3.8.1
# via unstructured (setup.py)
@@ -157,16 +159,18 @@ pillow==9.4.0
# unstructured (setup.py)
portalocker==2.7.0
# via iopath
protobuf==4.22.0
protobuf==4.22.1
# via onnxruntime
pycocotools==2.0.6
# via effdet
pycparser==2.21
# via cffi
pydantic==1.10.5
pydantic==1.10.6
# via
# argilla
# fastapi
pygments==2.14.0
# via rich
pyparsing==3.0.9
# via matplotlib
pytesseract==0.3.10
@@ -204,6 +208,8 @@ requests==2.28.2
# unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
scipy==1.10.1
# via layoutparser
six==1.16.0
@@ -232,7 +238,7 @@ torchvision==0.14.1
# effdet
# layoutparser
# timm
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# huggingface-hub
@@ -246,6 +252,7 @@ typing-extensions==4.5.0
# huggingface-hub
# iopath
# pydantic
# rich
# starlette
# torch
# torchvision

View File

@@ -12,7 +12,7 @@ certifi==2022.12.7
# via
# -r requirements/test.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via
@@ -40,7 +40,7 @@ mccabe==0.7.0
# via flake8
multidict==6.0.4
# via yarl
mypy==1.0.1
mypy==1.1.1
# via -r requirements/test.in
mypy-extensions==1.0.0
# via
@@ -52,17 +52,17 @@ packaging==23.0
# pytest
pathspec==0.11.0
# via black
platformdirs==3.0.0
platformdirs==3.1.0
# via black
pluggy==1.0.0
# via pytest
pycodestyle==2.10.0
# via flake8
pydantic==1.10.5
pydantic==1.10.6
# via label-studio-sdk
pyflakes==3.0.1
# via flake8
pytest==7.2.1
pytest==7.2.2
# via pytest-cov
pytest-cov==4.0.0
# via -r requirements/test.in
@@ -70,7 +70,7 @@ pyyaml==6.0
# via vcrpy
requests==2.28.2
# via label-studio-sdk
ruff==0.0.253
ruff==0.0.254
# via -r requirements/test.in
six==1.16.0
# via vcrpy

View File

@@ -77,7 +77,7 @@ setup(
# NOTE(robinson) - Upper bound is temporary due to a multithreading issue
"unstructured-inference>=0.2.4,<0.2.8",
],
"s3": ["boto3"],
"s3": ["s3fs", "fsspec"],
"github": [
# NOTE - pygithub==1.58.0 fails due to https://github.com/PyGithub/PyGithub/issues/2436
# In the future, we can update this to pygithub>1.58.0

View File

@@ -1,9 +1,9 @@
#!/usr/bin/env bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
cd "$SCRIPT_DIR"/.. || exit 1
if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch/ -type f -size +20k | wc -l)" != 3 ]]; then
if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch/ -type f -size +20k | wc -l)" -ne 3 ]]; then
echo "The test fixtures in test_unstructured_ingest/expected-structured-output/ look suspicious. At least one of the files is too small."
echo "Did you overwrite test fixtures with bad outputs?"
exit 1
@@ -12,13 +12,13 @@ fi
PYTHONPATH=. ./unstructured/ingest/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --s3-anonymous --structured-output-dir s3-small-batch-output
if ! diff -ru s3-small-batch-output test_unstructured_ingest/expected-structured-output/s3-small-batch ; then
echo
echo "There are differences from the previously checked-in structured outputs."
echo
echo "If these differences are acceptable, copy the outputs from"
echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
echo
echo " PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --structured-output-dir s3-small-batch-output"
echo
exit 1
echo
echo "There are differences from the previously checked-in structured outputs."
echo
echo "If these differences are acceptable, copy the outputs from"
echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
echo
echo " PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --structured-output-dir s3-small-batch-output"
echo
exit 1
fi

View File

@@ -1 +1 @@
__version__ = "0.5.4-dev0" # pragma: no cover
__version__ = "0.5.4-dev1" # pragma: no cover

View File

@@ -0,0 +1,186 @@
import json
import os
import re
from dataclasses import dataclass, field
from pathlib import Path
from unstructured.ingest.interfaces import (
BaseConnector,
BaseConnectorConfig,
BaseIngestDoc,
)
from unstructured.ingest.logger import logger
SUPPORTED_REMOTE_FSSPEC_PROTOCOLS = [
"s3",
"gcs",
"gcfs",
"abfs",
]
@dataclass
class SimpleFsspecConfig(BaseConnectorConfig):
# fsspec specific options
path: str
# base connector options
download_dir: str
output_dir: str
preserve_downloads: bool = False
re_download: bool = False
# fsspec specific options
access_kwargs: dict = field(default_factory=dict)
protocol: str = field(init=False)
path_without_protocol: str = field(init=False)
dir_path: str = field(init=False)
file_path: str = field(init=False)
def __post_init__(self):
self.protocol, self.path_without_protocol = self.path.split("://")
if self.protocol not in SUPPORTED_REMOTE_FSSPEC_PROTOCOLS:
raise ValueError(
f"Protocol {self.protocol} not supported yet, only "
f"{SUPPORTED_REMOTE_FSSPEC_PROTOCOLS} are supported.",
)
# just a path with no trailing prefix
match = re.match(rf"{self.protocol}://([^/\s]+?)(/*)$", self.path)
if match:
self.dir_path = match.group(1)
self.file_path = ""
return
# valid path with a dir and/or file
match = re.match(rf"{self.protocol}://([^/\s]+?)/([^\s]*)", self.path)
if not match:
raise ValueError(
f"Invalid path {self.path}. Expected <protocol>://<dir-path>/<file-or-dir-path>.",
)
self.dir_path = match.group(1)
self.file_path = match.group(2) or ""
@dataclass
class FsspecIngestDoc(BaseIngestDoc):
"""Class encapsulating fetching a doc and writing processed results (but not
doing the processing!).
Also includes a cleanup method. When things go wrong and the cleanup
method is not called, the file is left behind on the filesystem to assist debugging.
"""
config: SimpleFsspecConfig
remote_file_path: str
def _tmp_download_file(self):
return Path(self.config.download_dir) / self.remote_file_path.replace(
f"{self.config.dir_path}/",
"",
)
def _output_filename(self):
return (
Path(self.config.output_dir)
/ f"{self.remote_file_path.replace(f'{self.config.dir_path}/', '')}.json"
)
def has_output(self):
"""Determine if structured output for this doc already exists."""
return self._output_filename().is_file() and os.path.getsize(self._output_filename())
def _create_full_tmp_dir_path(self):
"""Includes "directories" in the object path"""
self._tmp_download_file().parent.mkdir(parents=True, exist_ok=True)
def get_file(self):
"""Fetches the file from the current filesystem and stores it locally."""
from fsspec import AbstractFileSystem, get_filesystem_class
self._create_full_tmp_dir_path()
if (
not self.config.re_download
and self._tmp_download_file().is_file()
and os.path.getsize(self._tmp_download_file())
):
logger.debug(f"File exists: {self._tmp_download_file()}, skipping download")
return
fs: AbstractFileSystem = get_filesystem_class(self.config.protocol)(
**self.config.access_kwargs,
)
logger.debug(f"Fetching {self} - PID: {os.getpid()}")
fs.get(rpath=self.remote_file_path, lpath=self._tmp_download_file().as_posix())
def write_result(self):
"""Write the structured json result for this doc. result must be json serializable."""
output_filename = self._output_filename()
output_filename.parent.mkdir(parents=True, exist_ok=True)
with open(output_filename, "w") as output_f:
output_f.write(json.dumps(self.isd_elems_no_filename, ensure_ascii=False, indent=2))
logger.info(f"Wrote {output_filename}")
@property
def filename(self):
"""The filename of the file after downloading from s3"""
return self._tmp_download_file()
def cleanup_file(self):
"""Removes the local copy the file after successful processing."""
if not self.config.preserve_downloads:
logger.debug(f"Cleaning up {self}")
os.unlink(self._tmp_download_file())
class FsspecConnector(BaseConnector):
"""Objects of this class support fetching document(s) from"""
def __init__(self, config: SimpleFsspecConfig):
from fsspec import AbstractFileSystem, get_filesystem_class
self.config = config
self.fs: AbstractFileSystem = get_filesystem_class(self.config.protocol)(
**self.config.access_kwargs,
)
self.cleanup_files = not config.preserve_downloads
def cleanup(self, cur_dir=None):
"""cleanup linginering empty sub-dirs from s3 paths, but leave remaining files
(and their paths) in tact as that indicates they were not processed"""
if not self.cleanup_files:
return
if cur_dir is None:
cur_dir = self.config.download_dir
sub_dirs = os.listdir(cur_dir)
os.chdir(cur_dir)
for sub_dir in sub_dirs:
# don't traverse symlinks, not that there ever should be any
if os.path.isdir(sub_dir) and not os.path.islink(sub_dir):
self.cleanup(sub_dir)
os.chdir("..")
if len(os.listdir(cur_dir)) == 0:
os.rmdir(cur_dir)
def initialize(self):
"""Verify that can get metadata for an object, validates connections info."""
ls_output = self.fs.ls(self.config.path_without_protocol)
if len(ls_output) < 1:
raise ValueError(
f"No objects found in {self.config.path}.",
)
def _list_files(self):
return self.fs.ls(self.config.path_without_protocol)
def get_ingest_docs(self):
return [
FsspecIngestDoc(
self.config,
file,
)
for file in self._list_files()
]
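To make the path parsing above concrete, a small sketch of what `SimpleFsspecConfig.__post_init__` produces for a typical bucket-plus-prefix path (the bucket and prefix names are hypothetical; the expected values follow the regexes above):

```python
from unstructured.ingest.connector.fsspec import SimpleFsspecConfig

config = SimpleFsspecConfig(
    path="s3://my-bucket/some/prefix",  # hypothetical bucket and prefix
    download_dir="downloads",
    output_dir="outputs",
)
assert config.protocol == "s3"
assert config.path_without_protocol == "my-bucket/some/prefix"
assert config.dir_path == "my-bucket"     # first path component
assert config.file_path == "some/prefix"  # remainder; "" for a bare bucket
```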

View File

@@ -0,0 +1,24 @@
from dataclasses import dataclass
from unstructured.ingest.connector.fsspec import (
FsspecConnector,
FsspecIngestDoc,
SimpleFsspecConfig,
)
from unstructured.utils import requires_dependencies
@dataclass
class SimpleS3Config(SimpleFsspecConfig):
pass
class S3IngestDoc(FsspecIngestDoc):
@requires_dependencies(["s3fs", "fsspec"], extras="s3")
def get_file(self):
super().get_file()
@requires_dependencies(["s3fs", "fsspec"], extras="s3")
class S3Connector(FsspecConnector):
pass
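A hedged usage sketch tying the new classes together, mirroring how `main.py` wires them up below and assuming the `s3` extra (s3fs/fsspec) is installed; the bucket URL is the public fixture used by this repo's ingest test, and the local directory names are arbitrary:

```python
from unstructured.ingest.connector.s3 import S3Connector, SimpleS3Config

connector = S3Connector(
    config=SimpleS3Config(
        path="s3://utic-dev-tech-fixtures/small-pdf-set/",
        access_kwargs={"anon": True},  # public bucket, no AWS creds needed
        download_dir="s3-downloads",
        output_dir="s3-small-batch-output",
    ),
)
connector.initialize()  # raises ValueError if nothing is listable at the path
for doc in connector.get_ingest_docs():
    doc.get_file()  # downloads via the fsspec filesystem's get()
```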

View File

@@ -1,198 +0,0 @@
import json
import os
import re
from dataclasses import dataclass, field
from pathlib import Path
from unstructured.ingest.interfaces import (
BaseConnector,
BaseConnectorConfig,
BaseIngestDoc,
)
from unstructured.ingest.logger import logger
from unstructured.utils import requires_dependencies
@dataclass
class SimpleS3Config(BaseConnectorConfig):
"""Connector config where s3_url is an s3 prefix to process all documents from."""
# S3 Specific Options
s3_url: str
# Standard Connector options
download_dir: str
# where to write structured data, with the directory structure matching s3 path
output_dir: str
re_download: bool = False
preserve_downloads: bool = False
# S3 Specific (optional)
anonymous: bool = False
s3_bucket: str = field(init=False)
# could be single object or prefix
s3_path: str = field(init=False)
def __post_init__(self):
if not self.s3_url.startswith("s3://"):
raise ValueError("s3_url must begin with 's3://'")
# just a bucket with no trailing prefix
match = re.match(r"s3://([^/\s]+?)$", self.s3_url)
if match:
self.s3_bucket = match.group(1)
self.s3_path = ""
return
# bucket with a path
match = re.match(r"s3://([^/\s]+?)/([^\s]*)", self.s3_url)
if not match:
raise ValueError(
f"s3_url {self.s3_url} does not look like a valid path. "
"Expected s3://<bucket-name or s3://<bucket-name/path",
)
self.s3_bucket = match.group(1)
self.s3_path = match.group(2) or ""
@dataclass
class S3IngestDoc(BaseIngestDoc):
"""Class encapsulating fetching a doc and writing processed results (but not
doing the processing!).
Also includes a cleanup method. When things go wrong and the cleanup
method is not called, the file is left behind on the filesystem to assist debugging.
"""
config: SimpleS3Config
s3_key: str
# NOTE(crag): probably doesn't matter, but intentionally not defining tmp_download_file
# __post_init__ for multiprocessing simplicity (no Path objects in initially
# instantiated object)
def _tmp_download_file(self):
return Path(self.config.download_dir) / self.s3_key
def _output_filename(self):
return Path(self.config.output_dir) / f"{self.s3_key}.json"
def has_output(self):
"""Determine if structured output for this doc already exists."""
return self._output_filename().is_file() and os.path.getsize(self._output_filename())
def _create_full_tmp_dir_path(self):
"""includes "directories" in s3 object path"""
self._tmp_download_file().parent.mkdir(parents=True, exist_ok=True)
@requires_dependencies(["boto3"], extras="s3")
def get_file(self):
"""Actually fetches the file from s3 and stores it locally."""
import boto3
self._create_full_tmp_dir_path()
if (
not self.config.re_download
and self._tmp_download_file().is_file()
and os.path.getsize(self._tmp_download_file())
):
logger.debug(f"File exists: {self.filename}, skipping download")
return
if self.config.anonymous:
from botocore import UNSIGNED
from botocore.client import Config
s3_cli = boto3.client("s3", config=Config(signature_version=UNSIGNED))
else:
s3_cli = boto3.client("s3")
logger.debug(f"Fetching {self} - PID: {os.getpid()}")
s3_cli.download_file(self.config.s3_bucket, self.s3_key, self._tmp_download_file())
def write_result(self):
"""Write the structured json result for this doc. result must be json serializable."""
output_filename = self._output_filename()
output_filename.parent.mkdir(parents=True, exist_ok=True)
with open(output_filename, "w") as output_f:
json.dump(self.isd_elems_no_filename, output_f, ensure_ascii=False, indent=2)
logger.info(f"Wrote {output_filename}")
@property
def filename(self):
"""The filename of the file after downloading from s3"""
return self._tmp_download_file()
def cleanup_file(self):
"""Removes the local copy the file after successful processing."""
if not self.config.preserve_downloads:
logger.debug(f"Cleaning up {self}")
os.unlink(self._tmp_download_file())
@requires_dependencies(["boto3"], extras="s3")
class S3Connector(BaseConnector):
"""Objects of this class support fetching document(s) from"""
def __init__(self, config: SimpleS3Config):
import boto3
self.config = config
self._list_objects_kwargs = {"Bucket": config.s3_bucket, "Prefix": config.s3_path}
if config.anonymous:
from botocore import UNSIGNED
from botocore.client import Config
self.s3_cli = boto3.client("s3", config=Config(signature_version=UNSIGNED))
else:
self.s3_cli = boto3.client("s3")
self.cleanup_files = not config.preserve_downloads
def cleanup(self, cur_dir=None):
"""cleanup linginering empty sub-dirs from s3 paths, but leave remaining files
(and their paths) in tact as that indicates they were not processed"""
if not self.cleanup_files:
return
if cur_dir is None:
cur_dir = self.config.download_dir
sub_dirs = os.listdir(cur_dir)
os.chdir(cur_dir)
for sub_dir in sub_dirs:
# don't traverse symlinks, not that there every should be any
if os.path.isdir(sub_dir) and not os.path.islink(sub_dir):
self.cleanup(sub_dir)
os.chdir("..")
if len(os.listdir(cur_dir)) == 0:
os.rmdir(cur_dir)
def initialize(self):
"""Verify that can get metadata for an object, validates connections info."""
response = self.s3_cli.list_objects_v2(**self._list_objects_kwargs, MaxKeys=1)
if response["KeyCount"] < 1:
raise ValueError(
f"No objects found in {self.config.s3_url} -- response list object is {response}",
)
def _list_objects(self):
response = self.s3_cli.list_objects_v2(**self._list_objects_kwargs)
s3_keys = []
while True:
s3_keys.extend([s3_item["Key"] for s3_item in response["Contents"]])
if not response.get("IsTruncated"):
break
next_token = response.get("NextContinuationToken")
response = self.s3_cli.list_objects_v2(
**self._list_objects_kwargs,
ContinuationToken=next_token,
)
return s3_keys
def get_ingest_docs(self):
s3_keys = self._list_objects()
return [
S3IngestDoc(
self.config,
s3_key,
)
for s3_key in s3_keys
]

View File

@@ -3,6 +3,7 @@ import hashlib
import logging
import multiprocessing as mp
import sys
from contextlib import suppress
from pathlib import Path
import click
@@ -14,7 +15,7 @@ from unstructured.ingest.connector.google_drive import (
SimpleGoogleDriveConfig,
)
from unstructured.ingest.connector.reddit import RedditConnector, SimpleRedditConfig
from unstructured.ingest.connector.s3_connector import S3Connector, SimpleS3Config
from unstructured.ingest.connector.s3 import S3Connector, SimpleS3Config
from unstructured.ingest.connector.wikipedia import (
SimpleWikipediaConfig,
WikipediaConnector,
@@ -22,6 +23,9 @@ from unstructured.ingest.connector.wikipedia import (
from unstructured.ingest.doc_processor.generalized import initialize, process_document
from unstructured.ingest.logger import ingest_log_streaming_init, logger
with suppress(RuntimeError):
mp.set_start_method("spawn")
class MainProcess:
def __init__(
@@ -80,7 +84,6 @@ class MainProcess:
# Debugging tip: use the below line and comment out the mp.Pool loop
# block to remain in single process
# self.doc_processor_fn(docs[0])
with mp.Pool(
processes=self.num_processes,
initializer=ingest_log_streaming_init,
@@ -303,11 +306,10 @@ def main(
if s3_url:
doc_connector = S3Connector(
config=SimpleS3Config(
path=s3_url,
access_kwargs={"anon": s3_anonymous},
download_dir=download_dir,
s3_url=s3_url,
output_dir=structured_output_dir,
# set to False to use your AWS creds (not needed for this public s3 url)
anonymous=s3_anonymous,
re_download=re_download,
preserve_downloads=preserve_downloads,
),