Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf fallsback to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
favor of `--remote-url`.
---------
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured.
I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.
* environment variable to set language checks
* change log and version
* checks for if language checks are false
* update docs
* changelog type
* add assert to tests
* performance note in docstrings
* docstring tweaks
Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible.
Co-authored-by: cragwolfe <crag@unstructured.io>
Add GitLab data connector for ingest.
Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.
Prevent code duplication for functionality between GitHub and GitLab ingest connectors.
Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.
These work for GitHub and GitLab.
Show a more meaningful error message (and potentially useful for debugging)
when file type is not supported by the auto partition().
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Per the README, provides an optional `pre-commit` configuration
file to ensure code matches the formatting and linting standards used in `unstructured`.
* fix: ensure all text is maintained in html pages
* add back in replace unicode quotes
* changelog and version bump
* apt-get update in ci
* white space differences in output
When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment.
Added the env command so there's no misinterpretation. Tested in docker as both root and user.
* `unstructured-ingest` now uses a default `--download_dir` of `$HOME/.cache/unstructured/ingest`
rather than a "tmp-ingest-" dir in the working directory.
* `unstructured-ingest` no longer re-downloads files when --preserve-downloads
is used without --download-dir.
Fix several issues re. the requires_dependencies decorator:
* There was a missing space between the sentences.
* Crucial brackets were missing in making the error message.
* "pygithub" was used where "github" should have been used.
* fix for missing narrative text in partition_html
* fixes so existing tests pass
* tests for figure caption and narrative text
* bump version; changelog
* Add `requires_dependencies` decorator
* Use `required_dependencies` on Reddit & S3
* Fix bug in `requires_dependencies`
To used named args the decorator needs to be also wrapped
* Add `requires_dependencies` integration tests
* Add `requires_dependencies` in `Competition.md`
* Update `CHANGELOG.md`
* Bump version 0.4.16-dev5
* Ignore `F401` unused imports in `requires_dependencies` tests
* Apply suggestions from code review
* Add `functools.wrap` to keep docs, & annotations
* Use `requires_dependencies` in `GitHubConnector`
The connector can process a Wikipedia page
and output the HTML,
the plain text contents,
and the summary.
No API key required
Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
* add print statement in readme
* elements before bricks
* new preamble to bricks section
* add preamble to bricks section
* add preamble to cleaning section
* descriptions of each documentation page
* non-brick helper functions to the bottom
* fix codeblock
* includes some optional kwargs
* code blocks
* typo fix
* bump cryptography version
* re pip-compile for latest versions
* update argilla example requirements
* dependency updates
* bump versions
* pin unstructured-inference due to multithreading issue
* linting, linting, linting
* dependency on one line
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
Add Reddit data connector for ingest.
* The connector can process a subreddit.
* Either via a search query,
* or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.