- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 .
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* fix(ingest): import connector-specific modules on demand
* unstructured-ingest --flatten-metadata supported for local connector.
* unstructured-ingest fix runtime error when using --metadata-include.
The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does.
Bonus: docx mimetype added in lookup table.
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
Some document elements may have a null style element which triggers an exception
when trying to access the name of the style.
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Add --download-only parameter so that files may be downloaded if they are not already present (as usual, in either --download-dir or the default download ~/.cache/... location if --download-dir is not specified) and skip processing them through unstructured.
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file.
* fix: correct order of kwargs in pandoc
* only skip epub tests in Docker
* changelog
---------
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
Updates the characters to split when creating candidate english words. Now uses regex to parse out non-alphabetic characters for each word
Note: This was originally an attempt to speedup contains_english_word() but there is no measurable change in performance.
* update stage_for_transformers to return a list of elements
* bump changelog and version
* flag breaking change
* fix last word bug in chunk_by_attention_window
* added msg-parser dependency
* pass through kwargs in convert_file_to_text
* added partition_msg for processing msft outlook files
* version bump and changelog
* added tests for partition_msg
* added test for msg with plain text
* add partition_msg docs; fix underlines in integration docs
* add .msg to file list
* finish tests for auto msg
* linting, linting, linting
Closes#200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.