* change strategy arg default to auto in partition
* passing --partition-strategy down
* add strategy="hi_res" to test (default changed)
* made an error on param name, added note
Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata.
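As an illustration of the layered approach described above, here is a minimal sketch. The `Element` class, the `add_filetype` decorator, and the two partition functions are hypothetical stand-ins, not the library's actual code. The outermost decorator runs last, so the function that first touches the file decides the final filetype.

```python
from dataclasses import dataclass, field
from functools import wraps
from typing import Callable, Dict, List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def add_filetype(filetype: str) -> Callable:
    """Hypothetical decorator: stamp `filetype` on every returned element."""

    def decorator(func: Callable[..., List[Element]]) -> Callable[..., List[Element]]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> List[Element]:
            elements = func(*args, **kwargs)
            for element in elements:
                # Overwrites whatever an inner partition function set earlier.
                element.metadata["filetype"] = filetype
            return elements

        return wrapper

    return decorator


@add_filetype("text/html")
def partition_html(text: str) -> List[Element]:
    # One element per non-empty line, purely for illustration.
    return [Element(text=line) for line in text.splitlines() if line.strip()]


@add_filetype("text/markdown")
def partition_md(text: str) -> List[Element]:
    # The inner HTML decorator runs first; this outer decorator runs last,
    # so the elements end up tagged "text/markdown".
    return partition_html(text)


elements = partition_md("# Title\n\nSome text")
```

Calling `partition_html` directly still yields `text/html`, while going through `partition_md` yields `text/markdown` on every element.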
Tests are added to make sure:
* When partition is used, any content type or auto file type detection will override file-specific partition function metadata
* Both auto and file-specific partitioning give the desired filetype metadata
Won't work with image files currently. The plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference.
* added functions for determining auto strategy
* change default strategy to auto
* tests for auto strategy
* update docs
* changelog and version
* bump version
* remove ingest file in wrong location
* update jpg output
* typo fix
* added method for extracting datetime
* change filename metadata to the base filename
* fix filename metadata for msg
* changelog and bump version
* fix expected structured output
* newline back in file
* reset output file
* update filename output
* update test fixtures
* update fixture
* add tests for validating strategy
* refactor into determine_pdf_strategy function
* refactor pdf strategies into strategies
* remove commented out code
* remove unreachable code
* add in handling for image types
* a little more refactoring
* import ocr partitioning for images
* catch warnings, partition type for valid strategies
* fallback to ocr_only from fast
* fallback logic for hi_res
* test for fallback to ocr only
* fallback logic for ocr_only
* more tests for fallback logic
* update doc strings
* version and changelog
* linting, linting, linting
* update docs to include notes about strategy
* fix typos
* change back patched filename
* spike for ocr-only strategy for images
* fix for file processing
* extra space
* add korean to ci
* added test for ocr_only strategy
* added docs for ocr_only
* changelog and version
* added test for bad strategy
* skip korean test if in docker
* bump version
* version bump
* document valid strategies
* bump version for release
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
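The strategy validation and fallback commits above could be sketched roughly as follows. Only the strategy names (`auto`, `fast`, `hi_res`, `ocr_only`) and the fast-to-`ocr_only` fallback come from the commit messages. The function names, the availability flags, and the remaining preference orders are assumptions made for illustration.

```python
VALID_STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def validate_strategy(strategy: str) -> None:
    """Reject anything that is not a known strategy name."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"{strategy!r} is not a valid strategy; choose one of {VALID_STRATEGIES}"
        )


def choose_pdf_strategy(
    strategy: str = "auto",
    *,
    text_extractable: bool = True,
    hi_res_available: bool = True,
    ocr_available: bool = True,
) -> str:
    """Hypothetical helper: return the requested strategy or a usable fallback."""
    validate_strategy(strategy)
    usable = {
        "fast": text_extractable,    # fast needs extractable text
        "hi_res": hi_res_available,  # hi_res needs the layout model installed
        "ocr_only": ocr_available,   # ocr_only needs an OCR engine installed
    }
    # Assumed preference orders; the first usable candidate wins.
    fallbacks = {
        "auto": ("fast", "hi_res", "ocr_only"),
        "fast": ("fast", "ocr_only", "hi_res"),
        "hi_res": ("hi_res", "fast", "ocr_only"),
        "ocr_only": ("ocr_only", "fast", "hi_res"),
    }
    for candidate in fallbacks[strategy]:
        if usable[candidate]:
            return candidate
    raise ValueError("no usable partitioning strategy for this document")
```

Keeping the fallback order in a table makes each path easy to test in isolation, which is the kind of coverage the fallback test commits describe.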
* added filetype detection for odt
* add function for partition odt documents
* add odt files to auto
* changelog and version
* docs and readme
* update installation docs
* skip tests if not supported or in docker
* import pytest
* fix docs typos
* switch to using PDF objects
* linting, linting, linting
* couple more tweaks
* added test for chevron-page
* version and changelog
* linting, linting, linting
* now processing 4 files
* added function for multiple files via api
* make multiple work with files
* updated docs strings
* changelog and version
* docs and contextlib for open files
* tests for partition multiple
* add tests for error conditions
* add output example
* check to see if text file is a json
* add json check into filetype detection
* added test for updated file detection logic
* bytes/strings handling
* changelog and version bump
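A minimal sketch of the JSON check mentioned above, assuming a hypothetical helper named `is_json_text` that accepts either `str` or `bytes`. The real detection code may apply additional or different rules.

```python
import json
from typing import Union


def is_json_text(content: Union[str, bytes]) -> bool:
    """Hypothetical check: does text-like content parse as a JSON object or array?"""
    if isinstance(content, bytes):
        try:
            content = content.decode("utf-8")
        except UnicodeDecodeError:
            return False
    try:
        parsed = json.loads(content)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # Bare scalars such as "42" are valid JSON, but a plain-text file that
    # happens to contain one should still be treated as text (an assumption).
    return isinstance(parsed, (dict, list))
```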
* fix: fix text_type.py exceeds_cap_ratio() returns
In some cases is_possible_narrative_text receives an incorrect result from exceeds_cap_ratio and misclassifies the text, so some of the return values of exceeds_cap_ratio are corrected.
* Update text_type.py exceeds_cap_ratio()
* Update text_type.py
* Update CHANGELOG.md
* linting, linting, linting ...
* update tests
* more test fixes
* Update text_type.py
* bump version and changelog
* add punctuation check
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
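To make the capitalization-ratio idea concrete, here is a simplified sketch. It is not the code in text_type.py: the tokenization, the 0.5 threshold, and the exact punctuation rule are assumptions. The point is that every early exit returns an explicit boolean, so a caller such as is_possible_narrative_text never receives an ambiguous value.

```python
def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """Simplified sketch: True when most words start with a capital letter."""
    stripped = text.strip()
    if not stripped:
        return False
    # Assumed punctuation check: text ending like a sentence is not a title.
    if stripped.endswith((".", "!", "?")):
        return False
    # Only consider tokens that contain at least one letter.
    words = [w for w in stripped.split() if any(c.isalpha() for c in w)]
    if not words:
        return False
    capitalized = sum(
        1 for w in words if next(c for c in w if c.isalpha()).isupper()
    )
    return capitalized / len(words) > threshold
```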
* function to check if pdf is extractable
* add fallback logic for unextractable pdfs
* tests for docs with copy protection
* add test for unprocessable pdf
* update docs
* changelog and version
* update logic for images; reset file before proceeding
* 3 files for api tests
* docs update
* pip-compile new reqs
* bump inference version
* add language to pdf and image calls
* tests for passing in language
* version bump and changelog
* update docs
* pass ocr_languages in auto
* updated test fixtures
* typo in doc string
* group broken paragraphs with fast strategy
* changelog and version
* fix broken tests for text.py
* formatting for paragraph pattern re
* fix test
* fix whitespace substitution
* one more test tweak
* blurb to account for short lines
* fix for shorter paragraphs
* update changelog
* remove extra line break from auto
* retrigger ci
* trying skipping azure
* skip azure (test)
* updated github and azure fixtures
* update slack fixture
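The paragraph grouping added in the commits above can be sketched like this. The two patterns and the function body are assumptions for illustration. The library's own implementation also accounts for short lines, which this sketch does not attempt.

```python
import re

# Assumed patterns: a blank line separates paragraphs, and a single
# newline inside a paragraph is treated as a soft line break.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
LINE_BREAK = re.compile(r"\s*\n\s*")


def group_broken_paragraphs(text: str) -> str:
    """Join lines that were split mid-paragraph, keeping paragraph breaks."""
    paragraphs = PARAGRAPH_BREAK.split(text)
    cleaned = [LINE_BREAK.sub(" ", p.strip()) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)


raw = (
    "The big brown fox\nwas walking down the lane.\n\n"
    "At the end of the lane,\nthe fox met a bear."
)
grouped = group_broken_paragraphs(raw)
```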
Fixes an issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic).
Also adds missing json auto partition tests.
Including an xfail test for #492.
In some cases is_possible_narrative_text receives an incorrect result from exceeds_cap_ratio and misclassifies the text, so some of the return values of exceeds_cap_ratio are corrected.
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* add carriage return to html if missing
* test on markdown with embedded html
* changelog and version
* check for html parser
* linting, linting, linting
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file into bytes, and the PARAGRAPH_PATTERN (a string) could then not be matched against the bytes content.
* fix: correct order of kwargs in pandoc
* only skip epub tests in Docker
* changelog
---------
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
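The str-versus-bytes mismatch described in the regular-expressions commit above is easy to reproduce. The sketch below uses an assumed value for `PARAGRAPH_PATTERN` and a hypothetical `split_paragraphs` helper. Decoding the bytes first is one way to avoid the error, not necessarily the fix that was applied.

```python
import re
from typing import List, Union

PARAGRAPH_PATTERN = r"\n\s*\n"  # assumed value; a str pattern


def split_paragraphs(content: Union[str, bytes], encoding: str = "utf-8") -> List[str]:
    """Hypothetical helper: normalize to str before applying a str pattern."""
    if isinstance(content, bytes):
        content = content.decode(encoding)
    return [p for p in re.split(PARAGRAPH_PATTERN, content) if p.strip()]


def raw_split_fails(content: bytes) -> bool:
    """Show that a str pattern cannot be applied to bytes directly."""
    try:
        re.split(PARAGRAPH_PATTERN, content)
    except TypeError:
        return True
    return False
```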
Updates the characters to split on when creating candidate English words. Now uses a regex to strip non-alphabetic characters from each word.
Note: This was originally an attempt to speed up contains_english_word(), but there is no measurable change in performance.
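A rough sketch of the regex-based candidate extraction described above. The tiny `ENGLISH_WORDS` set, the `NON_ALPHA` pattern, and the length filter are placeholders chosen for illustration. The real function consults a full word list.

```python
import re

ENGLISH_WORDS = {"the", "parrot", "is", "here"}  # hypothetical tiny vocabulary
NON_ALPHA = re.compile(r"[^a-zA-Z]")


def contains_english_word(text: str) -> bool:
    """Return True if any whitespace-separated token reduces to a known word."""
    for token in text.lower().split():
        # Strip digits, punctuation, and any other non-alphabetic characters.
        candidate = NON_ALPHA.sub("", token)
        if len(candidate) > 1 and candidate in ENGLISH_WORDS:
            return True
    return False
```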
* update stage_for_transformers to return a list of elements
* bump changelog and version
* flag breaking change
* fix last word bug in chunk_by_attention_window
* added msg-parser dependency
* pass through kwargs in convert_file_to_text
* added partition_msg for processing msft outlook files
* version bump and changelog
* added tests for partition_msg
* added test for msg with plain text
* add partition_msg docs; fix underlines in integration docs
* add .msg to file list
* finish tests for auto msg
* linting, linting, linting
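For the last-word fix in chunk_by_attention_window listed above, the commit does not say what the bug was. One common way such a bug arises is forgetting to flush the final partial chunk. The sketch below uses whitespace tokens and a hypothetical `chunk_by_window` function instead of a real tokenizer, purely to show the flush step.

```python
from typing import List


def chunk_by_window(text: str, max_tokens: int) -> List[str]:
    """Split text into chunks of at most `max_tokens` whitespace tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        if len(current) == max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    # Flush the trailing partial chunk; without this the last words are lost.
    if current:
        chunks.append(" ".join(current))
    return chunks
```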