Ran into an error in the unstructured-api tests (see below for output). Somewhere along the line we were reading a txt file in as bytes, and PARAGRAPH_PATTERN (a string pattern) could not be applied to the bytes content.
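For illustration, a minimal sketch of this failure mode and one way to guard against it. The pattern value and the `split_paragraphs` helper are hypothetical stand-ins, not the library's actual code, and UTF-8 is an assumed encoding.

```python
import re

# Hypothetical stand-in; the real PARAGRAPH_PATTERN may differ.
PARAGRAPH_PATTERN = r"\n\s*\n"


def split_paragraphs(content):
    """Split text into paragraphs, accepting either str or bytes input."""
    if isinstance(content, bytes):
        # A str pattern cannot be applied to bytes, so decode first.
        # UTF-8 is an assumption made for this sketch.
        content = content.decode("utf-8")
    return [p for p in re.split(PARAGRAPH_PATTERN, content) if p.strip()]


raw = b"First paragraph.\n\nSecond paragraph."

# Applying the str pattern directly to bytes raises a TypeError.
error_message = None
try:
    re.split(PARAGRAPH_PATTERN, raw)
except TypeError as err:
    error_message = str(err)

# Decoding before matching avoids the error.
paragraphs = split_paragraphs(raw)
```

Decoding at the boundary where the file is read keeps the pattern itself as a plain string.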
* fix: correct order of kwargs in pandoc
* only skip epub tests in Docker
* changelog
---------
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
Updates the characters used to split text when creating candidate English words. Now uses a regex to parse out non-alphabetic characters for each word.
Note: This was originally an attempt to speed up contains_english_word(), but there is no measurable change in performance.
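A rough sketch of the idea, assuming "parse out" means removing every character that is not an ASCII letter. The `candidate_words` helper and the pattern are illustrative only, not the actual implementation.

```python
import re

# Assumed pattern: anything that is not an ASCII letter.
NON_ALPHA = re.compile(r"[^a-zA-Z]")


def candidate_words(text):
    """Return lowercase candidate words with non-alphabetic characters removed."""
    words = []
    for token in text.split():
        cleaned = NON_ALPHA.sub("", token).lower()
        if cleaned:  # skip tokens that were entirely non-alphabetic
            words.append(cleaned)
    return words


words = candidate_words("Hello, world! 1234 it's")
```

Each resulting word could then be looked up in an English vocabulary.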
* update stage_for_transformers to return a list of elements
* bump changelog and version
* flag breaking change
* fix last word bug in chunk_by_attention_window
* added msg-parser dependency
* pass through kwargs in convert_file_to_text
* added partition_msg for processing msft outlook files
* version bump and changelog
* added tests for partition_msg
* added test for msg with plain text
* add partition_msg docs; fix underlines in integration docs
* add .msg to file list
* finish tests for auto msg
* linting, linting, linting
Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and re-ran pip-compile.
Created a GitHub workflow that creates a new Jira issue whenever a GitHub issue is created, mirroring the summary and description.
Pretty simplistic for now: the project is hardcoded and there is no support for ongoing sync events.
Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res", which is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to the "fast" strategy. The implementation uses pdfminer because it is already installed as a dependency with the local-inference extra. There are other options for accomplishing this, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
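The fallback described above could be sketched roughly as follows. `resolve_strategy` and `detectron2_available` are hypothetical helper names used only for illustration; the actual logic inside partition_pdf may be organized differently.

```python
import importlib.util

VALID_STRATEGIES = ("hi_res", "fast")


def detectron2_available():
    """Check whether detectron2 can be imported, without importing it."""
    return importlib.util.find_spec("detectron2") is not None


def resolve_strategy(strategy="hi_res", has_detectron2=None):
    """Return the strategy to use, falling back to "fast" when needed."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    if has_detectron2 is None:
        has_detectron2 = detectron2_available()
    if strategy == "hi_res" and not has_detectron2:
        # "hi_res" needs detectron2; fall back to the pdfminer-based path.
        return "fast"
    return strategy
```

Passing `has_detectron2` explicitly makes the fallback easy to exercise in tests without installing or removing detectron2.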
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
favor of `--remote-url`.
---------
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
As you can see, this is a pretty big PR. It adds an "adapter" that makes it easy to plug in any connector with an available fsspec implementation, which standardizes how remote filesystems are used within unstructured.
I've also renamed s3_connector.py to s3.py for readability and consistency, and tested that the current approach works as expected.
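To give a feel for the adapter idea, here is a minimal sketch. The class names, fields, and methods are hypothetical and do not mirror the actual FsspecConnector; the only real API assumed is `fsspec.filesystem(protocol, **kwargs)` and the filesystem's `find` method.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class FsspecConnectorSketch:
    """Illustrative base class: one adapter for any fsspec-backed filesystem."""

    remote_url: str
    access_kwargs: dict = field(default_factory=dict)

    @property
    def protocol(self):
        # e.g. "s3" for "s3://bucket/prefix"
        return urlparse(self.remote_url).scheme

    @property
    def path(self):
        parsed = urlparse(self.remote_url)
        return f"{parsed.netloc}{parsed.path}"

    def list_files(self):
        # Imported lazily so the sketch loads without fsspec installed.
        import fsspec

        fs = fsspec.filesystem(self.protocol, **self.access_kwargs)
        return fs.find(self.path)


class S3ConnectorSketch(FsspecConnectorSketch):
    """A concrete connector needs no extra code when fsspec supports the protocol."""


connector = S3ConnectorSketch("s3://my-bucket/docs/")
```

The point of the design is that supporting a new remote filesystem mostly reduces to a thin subclass, since the protocol in the URL selects the fsspec implementation.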