### Description
Given all the shell files that now exist in the repo, it would be nice to
have linting/formatting around them (in addition to the existing
shellcheck, which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
### Description
Convert the s3 CLI code to also support writing to s3. Writers are added as
optional subcommands to the parent command, each with its own arguments.
A custom `click.Group` is introduced to add custom formatting and text
to help messages.
To limit the scope of this PR, most existing files were not touched but
instead new files were added for the new flow. This allowed _only_ the
s3 connector to be updated without breaking any other ones.
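
A minimal sketch of the writer-as-subcommand layout described above, assuming `click` is installed. The names `WriterGroup`, `s3_source`, `s3_writer`, and the `--remote-url` option are hypothetical and chosen for illustration; they are not taken from this PR's code.

```python
import click
from click.testing import CliRunner


class WriterGroup(click.Group):
    """Hypothetical group that appends a note about writers to --help."""

    def format_help(self, ctx, formatter):
        # Render the standard help first, then add an extra section.
        super().format_help(ctx, formatter)
        with formatter.section("Destinations"):
            formatter.write_text(
                "Append a writer subcommand (e.g. 's3') to also upload results."
            )


@click.group(name="s3", cls=WriterGroup, invoke_without_command=True)
@click.option("--remote-url", required=True, help="Source location to read from.")
@click.pass_context
def s3_source(ctx, remote_url):
    """Read documents from a source; the writer subcommand is optional."""
    ctx.ensure_object(dict)
    ctx.obj["source"] = remote_url
    if ctx.invoked_subcommand is None:
        # No writer requested: only the read side runs.
        click.echo(f"read {remote_url}")


@s3_source.command("s3")
@click.option("--remote-url", required=True, help="Destination to write to.")
@click.pass_context
def s3_writer(ctx, remote_url):
    """Writer subcommand with its own, separate arguments."""
    click.echo(f"read {ctx.obj['source']} -> write {remote_url}")


def run(args):
    """Invoke the CLI in-process and return its output (click's test runner)."""
    return CliRunner().invoke(s3_source, args).output
```

Calling `run(["--remote-url", "s3://in"])` exercises the read-only path, while appending `"s3", "--remote-url", "s3://out"` also triggers the writer. Keeping the writer as a subcommand lets the source and destination each accept an option with the same name without clashing.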
* Pull out s3 code as subcommand
* Pull out dropbox code as subcommand
* Pull out azure code as subcommand
* Pull out fsspec code as subcommand
* Pull out github code as subcommand
* Pull out gitlab code as subcommand
* Pull out reddit code as subcommand
* Pull out slack code as subcommand
* Pull out discord code as subcommand
* Pull out wikipedia code as subcommand
* Pull out gdrive code as subcommand
* Pull out biomed code as subcommand
* rename parameters
* Pull out onedrive code as subcommand
* Pull out outlook code as subcommand
* Pull out local code as subcommand
* Pull out elasticsearch code as subcommand
* Pull out confluence code as subcommand
* Drop previous main file
* update changelog
* Add back in mp.Pool
* Fix mypy issues with click
* Make sure all tests run with verbose flag
* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data
* Pull out some more shared options
* Support running code via python as well as cli
* update ingest readme and move it to the ingest folder
* update usage in connector docs
* move local command arg in test
* Separate out cli code from logic running unstructured
* Make some cli fields required rather than optional
* rename process -> processor
* Improve logger to avoid duplicate handlers
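
One common way to avoid duplicate log handlers, sketched with the standard library only. The helper name `get_logger` and the format string are assumptions for illustration, not the implementation from this change.

```python
import logging


def get_logger(name="ingest", level=logging.INFO):
    """Return a named logger, attaching a stream handler only once."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # logging.getLogger returns the same object for the same name, so
    # repeated calls would otherwise stack handlers and repeat each message.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    # Do not also pass records up to the root logger's handlers.
    logger.propagate = False
    return logger
```

Calling `get_logger("ingest")` from several modules (or several times in one process) then yields one handler and one line per log call.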
---------
Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
- Creates ABCs for ingest connectors
- Updates the s3_connector classes to inherit from the ABCs
- Moves s3 test script to its own file to establish pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions on how to add a connector
- Updates the example s3 ingest script to use the new location for main.py
Note that there were no logic changes; this is essentially a refactoring PR.
Test instructions:
Run `./test_unstructured_ingest/test-ingest.sh` and `./examples/ingest/s3-small-batch/ingest.sh`.
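
To illustrate the ABC pattern for ingest connectors mentioned above, here is a small sketch. The class and method names (`BaseConnector`, `BaseIngestDoc`, `initialize`, `get_ingest_docs`, `get_file`) and the in-memory connector are assumptions made for the example and may not match the actual interfaces in the repo.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseIngestDoc(ABC):
    """One document a connector knows how to fetch (hypothetical interface)."""

    @abstractmethod
    def get_file(self) -> str:
        """Return the raw content of the document."""


class BaseConnector(ABC):
    """Contract every connector is expected to fulfil (hypothetical interface)."""

    @abstractmethod
    def initialize(self) -> None:
        """Prepare the connector, e.g. check credentials or create directories."""

    @abstractmethod
    def get_ingest_docs(self) -> List[BaseIngestDoc]:
        """List the documents to process."""


class InMemoryDoc(BaseIngestDoc):
    """Toy document backed by a string instead of a remote object."""

    def __init__(self, text: str):
        self.text = text

    def get_file(self) -> str:
        return self.text


class InMemoryConnector(BaseConnector):
    """Toy connector standing in for a real one such as the s3 connector."""

    def __init__(self, texts):
        self.texts = list(texts)
        self.ready = False

    def initialize(self) -> None:
        self.ready = True

    def get_ingest_docs(self) -> List[BaseIngestDoc]:
        return [InMemoryDoc(t) for t in self.texts]
```

A new connector only has to subclass the two base classes; forgetting an abstract method makes instantiation fail with a `TypeError`, which surfaces incomplete connectors early.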
* Many command line options added. The sample ingest project is now an easy-to-use CLI (no code editing
necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where text fixtures had been truncated
* Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower,
to make room for future connector outputs.