Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata.
Tests are added to make sure:
* When partition is used, any content type or auto file type detection will override file-specific partition function metadata
* Both auto and file-specific partitioning gives the desired filetype metadata
Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .
* added functions for determining auto stratgy
* change default strategy to auto
* tests for auto strategy
* update docs
* changelog and version
* bump version
* remove ingest file in wrong location
* update jpg output
* typo fix
* spike for ocr-only strategy for images
* fix for file processing
* extra space
* add korean to ci
* added test for ocr_only strategy
* added docs for ocr_only
* changelog and version
* added test for bad strategy
* skip korean test if in docker
* bump version
* version bump
* document valid strategies
* bump version for release
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* pip-compile new reqs
* bump inference version
* add language to pdf and image calls
* tests for passing in language
* version bump and changelog
* update docs
* pass ocr_languages in auto
* updated test fixtures
* typo in doc string
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
* page breaks for pptx
* added page breaks for image/pdf
* tests for images with page breaks
* page breaks for html documents
* linting, linting, linting
* changelog and bump version
* update docs
* fix typo
* refactor reusable code to common.py
* add type back in
Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.