* bump cryptography version
* re pip-compile for latest versions
* update argilla example requirements
* dependency updates
* bump versions
* pin unstructured-inference due to multithreading issue
* linting, linting, linting
* dependency on one line
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
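For reviewers unfamiliar with the rule codes above, here is a minimal sketch of the style the autofixes aim for. The function names are hypothetical and not taken from the codebase; only the rule codes come from the commit list.

```python
def describe(name: str, count: int) -> str:
    # f-string instead of "{} has {} items".format(name, count)
    return f"{name} has {count} items"


def read_pair(path_a: str, path_b: str) -> tuple:
    # SIM: one `with` statement instead of two nested ones.
    # UP015: the default "r" open mode is left out.
    with open(path_a) as file_a, open(path_b) as file_b:
        return file_a.read(), file_b.read()


def empty_containers() -> tuple:
    # C4: literals instead of list() and dict() calls.
    return [], {}


def first_line(path: str) -> str:
    # When a file handle really must outlive the statement that opens it,
    # SIM115 can be silenced explicitly; the handle is still closed here.
    handle = open(path)  # noqa: SIM115
    try:
        return handle.readline()
    finally:
        handle.close()
```

The `noqa` in `first_line` only shows what such a suppression looks like; a `with` block is preferable wherever the code allows it.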
Add Reddit data connector for ingest.
* The connector can process a subreddit.
* Either via a search query,
* or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.
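The markdown conversion described above could look roughly like this. This is a sketch under assumptions: `submission_to_markdown` is a hypothetical helper, and the real connector may format titles and bodies differently.

```python
from typing import Optional


def submission_to_markdown(title: str, selftext: Optional[str]) -> str:
    """Render a submission as markdown: title as a heading, then the text body if any."""
    lines = [f"# {title}"]
    # Image or video posts carry no text body, so only the title is written.
    if selftext and selftext.strip():
        lines.append("")
        lines.append(selftext.strip())
    return "\n".join(lines) + "\n"
```

A post with no text body yields a file that contains only the heading line.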
- Creates ABCs (abstract base classes) for ingest connectors
- Updates the s3_connector classes to inherit from the ABCs
- Moves the s3 test script to its own file to establish a pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions on how to add a connector
- Updates the example s3 ingest script to use the new location for main.py
Note that there were no logic changes; this is essentially a refactoring PR.
Test instructions:
Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
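To illustrate the pattern for readers adding a connector, here is a minimal sketch of what connector ABCs can look like. The class and method names are assumptions chosen for illustration and may not match the interfaces introduced in this PR.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseIngestDoc(ABC):
    """One document a connector knows how to fetch (hypothetical interface)."""

    @abstractmethod
    def get_text(self) -> str:
        """Return the raw text of the document."""


class BaseConnector(ABC):
    """A source of documents (hypothetical interface)."""

    @abstractmethod
    def get_ingest_docs(self) -> List[BaseIngestDoc]:
        """Return every document the connector should process."""


class InMemoryDoc(BaseIngestDoc):
    def __init__(self, text: str) -> None:
        self.text = text

    def get_text(self) -> str:
        return self.text


class InMemoryConnector(BaseConnector):
    """Toy connector backed by a dict; a real one would talk to s3 or similar."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = files

    def get_ingest_docs(self) -> List[BaseIngestDoc]:
        # Sort by name so the processing order is deterministic.
        return [InMemoryDoc(self.files[name]) for name in sorted(self.files)]
```

Because the base classes declare abstract methods, a subclass that forgets to implement one fails at instantiation time rather than midway through an ingest run.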
* add a bigger list of english words
* update thresholds and add tests
* update docs; bump version
* fix version
* add additional english words back in
* linting, linting, linting
* add slashes
* work -> word
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for email
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile
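The fallback from application/octet-stream to the file extension, mentioned in the commits above, can be sketched as follows. The mapping tables and the `resolve_filetype` name are hypothetical; the actual detection relies on libmagic and covers more types.

```python
import os

# Hypothetical lookup tables, kept small for illustration.
MIME_TO_FILETYPE = {
    "application/pdf": "pdf",
    "text/html": "html",
    "message/rfc822": "eml",
}
EXTENSION_TO_FILETYPE = {
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".pptx": "pptx",
}


def resolve_filetype(mime_type: str, filename: str) -> str:
    """Map a detected MIME type to a file type, using the extension as a fallback."""
    if mime_type == "application/octet-stream":
        # Office files are zip archives and can be reported as generic binary
        # data, so the extension decides in that case.
        _, extension = os.path.splitext(filename)
        return EXTENSION_TO_FILETYPE.get(extension.lower(), "unknown")
    return MIME_TO_FILETYPE.get(mime_type, "unknown")
```

Returning "unknown" instead of raising leaves the decision about unsupported files to the caller, which is one possible design rather than the behavior of this PR.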
* initial implementation for translate brick
* more input validation
* tests for translate brick
* added docs
* bumped version
* chinese and arabic tests
* re-run pip-compile
* add torch to dependencies
* cleanup doc string
* fix long string
* fix typo in docs
* take out empty string check
* return string if string is empty
* added huggingface into make install
* Add argilla to dependencies and run pip-compile
* Implement Argilla staging brick and add unit tests
* Update version and changelog
* Update docs with description and usage for Argilla staging brick
* Remove unused fixtures and fix typo in Argilla tests
* add missing quote in docs
* changelog tweak
* doc tweaks
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* add huggingface dependencies and re pip-compile
* first pass on chunk by attention window
* test for chunking function
* completed tests for chunk_by_attention_window
* change default buffer size to 2
* wrapper function for staging
* added docs for transformers
* fix wording and typos
* updated change log and bumped the version
* added docs on huggingface dependencies
* fix typo
* re pip-compile
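As a rough illustration of chunking by attention window with a buffer, the sketch below packs text greedily into chunks that stay under the model's input size minus the buffer. The function name, its parameters, and the pluggable token counter are assumptions; the implementation in this PR works with a Hugging Face tokenizer and may differ in detail.

```python
from typing import Callable, List


def chunk_by_window(
    text: str,
    count_tokens: Callable[[str], int],
    max_input_size: int,
    buffer: int = 2,
    separator: str = " ",
) -> List[str]:
    """Greedily pack segments into chunks of at most max_input_size - buffer tokens."""
    limit = max_input_size - buffer
    if limit <= 0:
        raise ValueError("buffer must be smaller than max_input_size")

    chunks: List[str] = []
    current = ""
    for segment in text.split(separator):
        candidate = segment if not current else current + separator + segment
        if count_tokens(candidate) <= limit:
            current = candidate
            continue
        if count_tokens(segment) > limit:
            raise ValueError("a single segment exceeds the attention window")
        # The candidate is too long, so close the current chunk and start a new one.
        chunks.append(current)
        current = segment
    if current:
        chunks.append(current)
    return chunks
```

The buffer leaves room for special tokens a model adds around each input. With a whitespace token counter, `chunk_by_window("a b c d e", lambda s: len(s.split()), max_input_size=4)` splits the text into chunks of at most two words.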