Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.
Created a github workflow to create a new issue in JIRA when a github issue is created, mirroring the summary and description.
Pretty simplistic for now with a hardcoded project, and no support for any ongoing sync events.
Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res", which is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because it is already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
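The fallback rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual `partition_pdf` code: `choose_strategy` and `is_available` are hypothetical helper names, and the only behavior taken from the description is that "hi_res" degrades to "fast" when detectron2 cannot be imported.

```python
import importlib.util
from typing import Optional

VALID_STRATEGIES = ("hi_res", "fast")


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def choose_strategy(
    requested: str = "hi_res",
    detectron2_available: Optional[bool] = None,
) -> str:
    """Pick the strategy to run (hypothetical helper, for illustration only).

    "hi_res" needs detectron2; if it is missing we fall back to "fast".
    """
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy: {requested!r}")
    if detectron2_available is None:
        # Probe the environment only when the caller did not tell us.
        detectron2_available = is_available("detectron2")
    if requested == "hi_res" and not detectron2_available:
        return "fast"
    return requested
```

Passing `detectron2_available` explicitly keeps the rule easy to test without installing or uninstalling detectron2.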
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
favor of `--remote-url`.
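One way to start such a deprecation cycle is to keep accepting the old option while warning and forwarding its value to the new one. The sketch below assumes an `argparse`-based parser purely for illustration (the real CLI may be built differently); `build_parser` and `resolve_remote_url` are hypothetical names.

```python
import argparse
import warnings
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that accepts both the old and the new option."""
    parser = argparse.ArgumentParser(prog="unstructured-ingest")
    parser.add_argument("--remote-url", default=None, help="Remote filesystem URL.")
    parser.add_argument("--s3-url", default=None, help="Deprecated: use --remote-url.")
    return parser


def resolve_remote_url(args: argparse.Namespace) -> Optional[str]:
    """Return the URL to use, warning if the deprecated option was given."""
    if args.s3_url is not None:
        warnings.warn(
            "--s3-url is deprecated; use --remote-url instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if args.remote_url is None:
            # Honor the old option until it is removed.
            return args.s3_url
    # The new option wins whenever it is set.
    return args.remote_url
```

In this sketch, an explicit `--remote-url` takes precedence if both options are supplied.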
---------
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
As you may see, this is a pretty big PR that adds an "adapter" to easily plug in any connector with an available fsspec implementation. This standardizes how remote filesystems are used within unstructured.
I've additionally renamed s3_connector.py to s3.py for readability and consistency, and tested that the current approach works as expected.
* environment variable to set language checks
* change log and version
* checks for if language checks are false
* update docs
* changelog type
* add assert to tests
* performance note in docstrings
* docstring tweaks
Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible.
Co-authored-by: cragwolfe <crag@unstructured.io>
Add GitLab data connector for ingest.
Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.
Prevent code duplication for functionality between GitHub and GitLab ingest connectors.
Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.
These work for GitHub and GitLab.
Show a more meaningful error message (one that is also potentially useful for debugging)
when the file type is not supported by the auto partition().
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Per the README, provides an optional `pre-commit` configuration
file to ensure code matches the formatting and linting standards used in `unstructured`.
* fix: ensure all text is maintained in html pages
* add back in replace unicode quotes
* changelog and version bump
* apt-get update in ci
* white space differences in output
When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment.
Added the env command so there's no misinterpretation. Tested in docker as both root and user.
* `unstructured-ingest` now uses a default `--download-dir` of `$HOME/.cache/unstructured/ingest`
rather than a "tmp-ingest-" dir in the working directory.
* `unstructured-ingest` no longer re-downloads files when --preserve-downloads
is used without --download-dir.
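The effect of a stable default directory can be sketched as below. This is an assumption-laden illustration, not the ingest code: `resolve_download_dir` and `needs_download` are hypothetical helpers, and only the default path comes from the notes above.

```python
from pathlib import Path
from typing import Optional


def resolve_download_dir(download_dir: Optional[str] = None) -> Path:
    """Return the explicit directory if given, else the stable default."""
    if download_dir:
        return Path(download_dir)
    # A fixed location under $HOME survives across runs, unlike a
    # freshly named "tmp-ingest-" directory in the working directory.
    return Path.home() / ".cache" / "unstructured" / "ingest"


def needs_download(local_path: Path) -> bool:
    """Hypothetical check: skip files that are already present locally."""
    return not local_path.exists()
```

Because the default no longer changes between runs, a file preserved by an earlier run is found at the same path and does not have to be fetched again.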
Fix several issues re. the requires_dependencies decorator:
* There was a missing space between the sentences.
* Crucial brackets were missing in making the error message.
* "pygithub" was used where "github" should have been used.
* fix for missing narrative text in partition_html
* fixes so existing tests pass
* tests for figure caption and narrative text
* bump version; changelog
* Add `requires_dependencies` decorator
* Use `requires_dependencies` on Reddit & S3
* Fix bug in `requires_dependencies`
To use named args, the decorator also needs to be wrapped
* Add `requires_dependencies` integration tests
* Add `requires_dependencies` in `Competition.md`
* Update `CHANGELOG.md`
* Bump version 0.4.16-dev5
* Ignore `F401` unused imports in `requires_dependencies` tests
* Apply suggestions from code review
* Add `functools.wraps` to keep docs & annotations
* Use `requires_dependencies` in `GitHubConnector`
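For readers unfamiliar with the pattern, a decorator along these lines might look like the sketch below. It is a minimal approximation under stated assumptions, not the code merged here: the parameter names (`dependencies`, `extras`) and the message wording are illustrative. It does show the two points from the list above: the decorator is wrapped in an outer function so it can take named arguments, and `functools.wraps` preserves the decorated function's docstring and annotations.

```python
import functools
import importlib.util
from typing import Callable, List, Optional, Union


def requires_dependencies(
    dependencies: Union[str, List[str]],
    extras: Optional[str] = None,
) -> Callable:
    """Raise ImportError at call time if any listed module is missing."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keep __name__, __doc__ and annotations
        def wrapper(*args, **kwargs):
            # Top-level module names only; find_spec returns None if absent.
            missing = [
                dep for dep in dependencies
                if importlib.util.find_spec(dep) is None
            ]
            if missing:
                message = f"Following dependencies are missing: {', '.join(missing)}. "
                if extras:
                    message += f'Please install them using `pip install "unstructured[{extras}]"`.'
                else:
                    message += f"Please install them using `pip install {' '.join(missing)}`."
                raise ImportError(message)
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Checking at call time rather than import time means a connector module can still be imported when its optional dependency is absent; the error only surfaces when the decorated function is actually used.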
The connector can process a Wikipedia page
and output the HTML,
the plain text contents,
and the summary.
No API key required
Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
* add print statement in readme
* elements before bricks
* new preamble to bricks section
* add preamble to bricks section
* add preamble to cleaning section
* descriptions of each documentation page
* non-brick helper functions to the bottom
* fix codeblock
* includes some optional kwargs
* code blocks
* typo fix