* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
favor of `--remote-url`.
---------
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Add GitLab data connector for ingest.
Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.
Prevent code duplication for functionality between GitHub and GitLab ingest connectors.
Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.
These work for GitHub and GitLab.
The connector can process a Wikipedia page
and output the HTML,
the plain text contents,
and the summary.
No API key required
Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
Add Reddit data connector for ingest.
* The connector can process a subreddit.
* Either via a search query,
* or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.
- Creates ABC's for ingest connectors
- Updates the s3_connector classes to inherit from ABC's
- Moves s3 test script to it's own file to establish pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions how how to add a connector
- Updates the example s3 ingest script to use the new location for main.py
Note that there were no logic changes, this is essentially a refactoring PR.
Test instructions:
Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
* Many command line options added. The sample ingest project is now an easy to use CLI (no code editing
necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where text fixtures had been truncated
* Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower,
to make room for future connector outputs.