8 Commits

Author SHA1 Message Date
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
Roman Isecke
ed7f991ab9
Add s3 writer (#1223)
### Description
Convert s3 cli code to also support writing to s3. Writers are added as
optional subcommands to the parent command with their own arguments.
Custom `click.Group` introduced to add some custom formatting and text
in help messages.

To limit the scope of this PR, most existing files were not touched but
instead new files were added for the new flow. This allowed _only_ the
s3 connector to be updated without breaking any other ones.
2023-08-31 22:19:53 +00:00
Roman Isecke
28214a6cc3
Roman/ingest refactor (#978)
* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Separate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-31 13:20:10 -04:00
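The "Improve logger to avoid duplicate handlers" item in #978 can be illustrated with a minimal sketch. It assumes the duplication came from attaching a new handler every time the logger was configured. The function name `get_logger` and the logger name `ingest_sketch` are hypothetical and not taken from the repository.

```python
import logging


def get_logger(name: str = "ingest_sketch", level: int = logging.INFO) -> logging.Logger:
    """Return a configured logger, attaching a handler only once.

    Hypothetical helper: logging.getLogger returns the same object for the
    same name, so calling addHandler unconditionally on every call would
    stack handlers and print each record several times.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # guard: skip if a handler is already attached
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `get_logger()` from several modules, or once per worker setup, then leaves exactly one handler on the logger.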
Yuming Long
ab5f92dd79
Fix(ingest): Deprecate --s3-url in favor of --remote-url (#616)
* deprecation s3-url

* changelog and version

* download dir not now
2023-05-19 12:11:40 -04:00
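One common way to deprecate a flag such as `--s3-url` in favor of `--remote-url` (#616) is to keep accepting the old flag, emit a warning, and copy its value to the new one. The sketch below uses the standard library's `argparse` purely for illustration. It is not the repository's CLI code, and the function name `parse_args` is hypothetical.

```python
import argparse
import warnings


def parse_args(argv):
    """Parse arguments, treating --s3-url as a deprecated alias (illustrative)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--remote-url", help="Remote location to read from")
    parser.add_argument("--s3-url", help="Deprecated; use --remote-url")
    args = parser.parse_args(argv)

    if args.s3_url is not None:
        warnings.warn(
            "--s3-url is deprecated; use --remote-url",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the new flag wins if both are supplied.
        if args.remote_url is None:
            args.remote_url = args.s3_url
    return args
```

Existing invocations keep working under this approach, while users are told which flag to switch to.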
cragwolfe
87fd0d01dc
feat: Ingest refactors, doc updates (#243)
- Creates ABC's for ingest connectors
- Updates the s3_connector classes to inherit from ABC's
- Moves s3 test script to its own file to establish pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions on how to add a connector
- Updates the example s3 ingest script to use the new location for main.py

Note that there were no logic changes, this is essentially a refactoring PR.

Test instructions:

Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
2023-02-21 10:15:33 -08:00
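The ABC-based layout described in #243 might look roughly like the sketch below: an abstract connector that lists documents, an abstract document that knows how to fetch itself, and a concrete pair that inherits from both. All class and method names here are illustrative assumptions, not the repository's actual interfaces, and an in-memory dict stands in for S3.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


class IngestDocSketch(ABC):
    """One document a connector can fetch (hypothetical interface)."""

    @abstractmethod
    def get_file(self) -> str:
        """Fetch the document and return its contents."""


class ConnectorSketch(ABC):
    """A source that enumerates documents (hypothetical interface)."""

    @abstractmethod
    def get_ingest_docs(self) -> List[IngestDocSketch]:
        """Return every document this source exposes."""


@dataclass
class InMemoryDoc(IngestDocSketch):
    key: str
    body: str

    def get_file(self) -> str:
        return self.body


@dataclass
class InMemoryConnector(ConnectorSketch):
    # Maps object key to contents; stands in for a remote bucket.
    objects: Dict[str, str] = field(default_factory=dict)

    def get_ingest_docs(self) -> List[IngestDocSketch]:
        # Sorted so the listing order is deterministic.
        return [InMemoryDoc(k, v) for k, v in sorted(self.objects.items())]
```

A new connector would then only need to implement the two abstract methods. Instantiating either base class directly raises `TypeError`.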
cragwolfe
3c1b089071
feat: Ingest CLI flags and test fixture updates (#227)
* Many command line options added. The sample ingest project is now an easy-to-use CLI (no code editing
  necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where test fixtures had been truncated
  * Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower, 
  to make room for future connector outputs.
2023-02-16 16:45:50 +00:00
cragwolfe
ab542ca3c6
feat: Sample ingest project with S3 connector (#218)
2023-02-14 12:27:45 -08:00