* spike for ocr-only strategy for images
* fix for file processing
* extra space
* add korean to ci
* added test for ocr_only strategy
* added docs for ocr_only
* changelog and version
* added test for bad strategy
* skip korean test if in docker
* bump version
* version bump
* document valid strategies
* bump version for release
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* added filetype detection for odt
* add function for partition odt documents
* add odt files to auto
* changelog and version
* docs and readme
* update installation docs
* skip tests if not supported or in docker
* import pytest
* fix docs typos
* switch to using PDF objects
* linting, linting, linting
* couple more tweaks
* added test for chevron-page
* version and changelog
* linting, linting, linting
* now processing 4 files
* added function for multiple files via api
* make multiple work with files
* updated docstrings
* changelog and version
* docs and contextlib for open files
* tests for partition multiple
* add tests for error conditions
* add output example
* check to see if text file is a json
* add json check into filetype detection
* added test for updated file detection logic
* bytes/strings handling
* changelog and version bump
* fix: correct return values of exceeds_cap_ratio() in text_type.py
In some cases is_possible_narrative_text receives an incorrect result from exceeds_cap_ratio and misclassifies the text, so the affected return values of exceeds_cap_ratio are corrected
* Update text_type.py exceeds_cap_ratio()
..
* Update text_type.py
..
* Update CHANGELOG.md
..
* linting, linting, linting ...
* update tests
* more test fixes
* Update text_type.py
..
* bump version and changelog
* add punctuation check
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* function to check if pdf is extractable
* add fallback logic for unextractable pdfs
* tests for docs with copy protection
* add test for unprocessable pdf
* update docs
* changelog and version
* update logic for images; reset file before proceeding
* 3 files for api tests
* docs update
* pip-compile new reqs
* bump inference version
* add language to pdf and image calls
* tests for passing in language
* version bump and changelog
* update docs
* pass ocr_languages in auto
* updated test fixtures
* typo in doc string
* group broken paragraphs with fast strategy
* changelog and version
* fix broken tests for text.py
* formatting for paragraph pattern re
* fix test
* fix whitespace substitution
* one more test tweak
* blurb to account for short lines
* fix for shorter paragraphs
* update changelog
* remove extra line break from auto
* retrigger ci
* try skipping azure
* skip azure (test)
* updated github and azure fixtures
* update slack fixture
Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on
the Unstructured image (and other installs that may have an older libmagic).
Also adds missing json auto partition tests,
including an xfail test for #492.
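As a rough illustration of the fix described above, the sketch below shows one way to promote content that an older libmagic reports as "text/plain" to "application/json" by trying to parse it. The helper name `refine_mime_type` is hypothetical and is not taken from the unstructured codebase; the sketch only assumes that the reported MIME type and the raw file contents are available.

```python
import json


def refine_mime_type(reported_mime: str, content: bytes) -> str:
    """Return "application/json" when text/plain content parses as JSON.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if reported_mime != "text/plain":
        # Only second-guess the generic plain-text result.
        return reported_mime
    try:
        parsed = json.loads(content.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return reported_mime
    # Bare scalars such as "42" also parse as JSON; only treat objects and
    # arrays as JSON documents so ordinary text containing a number stays text.
    if isinstance(parsed, (dict, list)):
        return "application/json"
    return reported_mime
```

Restricting the promotion to objects and arrays is a design choice of this sketch, made so that short plain-text files are not misreported as JSON.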
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
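The control flow described above can be sketched as follows. This is a Python rendering under stated assumptions, not the actual test script: the function name `run_ingest_test` and the `outputs_match` callback are hypothetical stand-ins for the real ingest command and fixture comparison.

```python
import subprocess
import sys
from typing import Callable, List, Tuple


def run_ingest_test(cmd: List[str], outputs_match: Callable[[], bool]) -> Tuple[int, str]:
    """Run an ingest command and compare outputs only if it succeeded."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Surface the real failure and stop here; a hint about regenerating
        # the expected outputs would only distract from it.
        return result.returncode, "ingest command failed"
    if not outputs_match():
        return 1, "outputs differ from fixtures; regenerate them if the change is expected"
    return 0, "ok"


if __name__ == "__main__":
    # A command that exits with status 3 stands in for a failing ingest run.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_ingest_test(failing, lambda: False))
```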
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, they fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
In some cases is_possible_narrative_text receives an incorrect result from exceeds_cap_ratio and misclassifies the text, so the affected return values of exceeds_cap_ratio are corrected.
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
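For readers unfamiliar with the check involved in the fix above, here is a minimal sketch of a capitalization-ratio test of that general kind. It is not the code in text_type.py: the name `cap_ratio_exceeds`, the whitespace tokenization, and the default threshold of 0.5 are all assumptions made for illustration.

```python
def cap_ratio_exceeds(text: str, threshold: float = 0.5) -> bool:
    """Return True when the share of capitalized tokens is above threshold.

    Illustrative only; tokenization and threshold are assumptions.
    """
    # Keep only whitespace-separated tokens that contain a letter.
    tokens = [t for t in text.split() if any(ch.isalpha() for ch in t)]
    if not tokens:
        # No alphabetic tokens: nothing to measure, so do not flag the text.
        return False
    # Count tokens whose first character is an uppercase letter.
    capitalized = sum(1 for t in tokens if t[0].isupper())
    return capitalized / len(tokens) > threshold
```

A caller deciding whether text is narrative would treat a True result as a sign of a title or heading, which is why a wrong return value here leads directly to a wrong classification.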
* add carriage return to html if missing
* test on markdown with embedded html
* changelog and version
* check for html parser
* linting, linting, linting
Attempting to fix formatting of github issues transferred to Jira.
The old format was attempting to use double backslashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo.
Now attempting to use a block scalar (|) for the formatted text in the YAML. This worked in the test repo, but I guess that's no guarantee.
* Add --partition-by-api and --partition-host args to ingest
* Fix error in make check
* Bump changelog
* Add a test ingest script
Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.
* Remove the content type workaround