* group broken paragraphs with fast strategy
* changelog and version
* fix broken tests for text.py
* formatting for paragraph pattern re
* fix test
* fix whitespace substitution
* one more test tweak
* blurb to account for short lines
* fix for shorter paragraphs
* update changelog
* remove extra line break from auto
* retrigger ci
* trying skipping azure
* skip azure (test)
* updated github and azure fixtures
* update slack fixture
Fixes an issue where .json files were recognized as "text/plain" rather than "application/json" on
the Unstructured image (and other installs that may have an older libmagic).
Also adds missing json auto partition tests, including an xfail test for #492.
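
A minimal sketch of one way to work around an older libmagic reporting "text/plain" for JSON content. The helper name `refine_mime_type` is hypothetical and is not taken from the code base; it only illustrates the idea of checking whether the bytes actually parse as JSON.

```python
import json


def refine_mime_type(detected: str, raw: bytes) -> str:
    """Promote a "text/plain" guess to "application/json" when the content parses as JSON.

    Hypothetical helper for illustration only. Only JSON objects and arrays are
    promoted, so ordinary text such as "123" stays "text/plain".
    """
    if detected != "text/plain":
        return detected
    try:
        parsed = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not valid UTF-8 or not valid JSON: keep the original guess.
        return detected
    return "application/json" if isinstance(parsed, (dict, list)) else detected
```

Restricting the promotion to objects and arrays is a design choice in this sketch, not something the commit states.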
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
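
The intended ordering of checks can be sketched as follows. The real test scripts are shell scripts; this Python version, with the hypothetical names `run_ingest_check` and `outputs_match`, only illustrates that a non-zero exit code is reported before any fixture comparison happens.

```python
import subprocess
from typing import Callable, Sequence


def run_ingest_check(command: Sequence[str], outputs_match: Callable[[], bool]) -> str:
    """Run an ingest command and classify the result.

    Hypothetical helper. Returns "ingest-failed" when the command exits non-zero,
    without consulting the fixtures at all; otherwise returns "fixtures-mismatch"
    or "ok" depending on the comparison.
    """
    result = subprocess.run(list(command))
    if result.returncode != 0:
        # The ingest failure itself is what needs debugging; skip the fixture hint.
        return "ingest-failed"
    if not outputs_match():
        return "fixtures-mismatch"
    return "ok"
```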
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, they fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
In some cases is_possible_narrative_text receives an incorrect return value from exceeds_cap_ratio and misclassifies the text, so some of the return values of exceeds_cap_ratio are corrected.
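
To illustrate how the two functions interact, here is a rough sketch. The `_sketch` functions, the 0.5 threshold, and the tokenization are assumptions made for illustration; the library's actual logic and the specific return values corrected by this commit are not reproduced here.

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """Return True when more than `threshold` of the alphabetic words are capitalized.

    Hypothetical stand-in for exceeds_cap_ratio; the threshold is an assumption.
    """
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        # Nothing to measure, so the ratio cannot be exceeded.
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold


def is_possible_narrative_text_sketch(text: str) -> bool:
    """Treat heavily capitalized text (likely a title) as non-narrative."""
    return bool(text.strip()) and not exceeds_cap_ratio_sketch(text)
```

Because the narrative check depends directly on the cap-ratio result, a wrong return value from the latter flips the classification.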
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* add carriage return to html if missing
* test on markdown with embedded html
* changelog and version
* check for html parser
* linting, linting, linting
Attempting to fix formatting of github issues transferred to Jira.
The old format was attempting to use double backslashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo.
Now attempting to use formatted text in the yaml with |. This worked in the test repo, but I guess that's no guarantee.
* Add --partition-by-api and --partition-host args to ingest
* Fix error in make check
* Bump changelog
* Add a test ingest script
Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.
* Remove the content type workaround
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for "Some PDF's with scanned images return empty elements" (#346).
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* fix(ingest): import connector-specific modules on demand
* unstructured-ingest --flatten-metadata supported for local connector.
* unstructured-ingest fix runtime error when using --metadata-include.
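
Importing connector-specific modules on demand can be sketched with `importlib`. The registry below uses standard-library modules as stand-ins, and `load_connector` is a hypothetical name; the actual ingest code may be organized differently.

```python
import importlib

# Stand-in registry: maps a connector name to the module that implements it.
# Standard-library modules are used here so the sketch runs anywhere.
CONNECTOR_MODULES = {"json": "json", "csv": "csv"}


def load_connector(name: str):
    """Import the module for `name` only when it is requested."""
    try:
        module_path = CONNECTOR_MODULES[name]
    except KeyError:
        raise ValueError(f"unknown connector: {name}") from None
    # The import happens here, so unused connectors never need their dependencies.
    return importlib.import_module(module_path)
```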
The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does.
Bonus: docx mimetype added in lookup table.
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
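
A minimal sketch of grouping broken paragraphs with regular expressions, assuming paragraphs are separated by blank lines and that single line breaks inside a paragraph are just hard wraps. The pattern names and the function name are hypothetical and may differ from what the library uses.

```python
import re

# Hypothetical patterns for illustration only.
PARAGRAPH_SPLIT = re.compile(r"\s*\n\s*\n\s*")  # one or more blank lines
LINE_BREAK = re.compile(r"\s*\n\s*")  # a single break inside a paragraph


def group_broken_paragraphs_sketch(text: str) -> str:
    """Join hard-wrapped lines within each paragraph, keeping paragraph breaks."""
    paragraphs = PARAGRAPH_SPLIT.split(text.strip())
    cleaned = [LINE_BREAK.sub(" ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

For example, `"The big red fox\nis walking.\n\nThe bear\nwaved."` becomes two paragraphs, each on a single line.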
Some document elements may have a null style element which triggers an exception
when trying to access the name of the style.
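
A sketch of the kind of guard this implies, using `SimpleNamespace` objects as stand-ins for document elements. The helper name `style_name` is hypothetical.

```python
from types import SimpleNamespace
from typing import Optional


def style_name(element) -> Optional[str]:
    """Return the element's style name, or None when no style is attached."""
    style = getattr(element, "style", None)
    return style.name if style is not None else None


# Stand-in elements: one with a style, one whose style is null.
styled = SimpleNamespace(style=SimpleNamespace(name="Heading 1"))
unstyled = SimpleNamespace(style=None)
```

With this guard, `style_name(unstyled)` returns `None` instead of raising an `AttributeError`.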
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Add --download-only parameter so that files may be downloaded if they are not already present (as usual, in either --download-dir or the default download ~/.cache/... location if --download-dir is not specified) and skip processing them through unstructured.
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file as bytes, and the PARAGRAPH_PATTERN (a string) could not be applied to the bytes content.
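
The underlying behavior is that Python's `re` module raises a `TypeError` when a string pattern is applied to a bytes object. A minimal sketch of decoding first, where `PARAGRAPH_PATTERN_SKETCH` is a hypothetical stand-in for the real pattern and UTF-8 is an assumed encoding:

```python
import re
from typing import List, Union

# Hypothetical stand-in for the real PARAGRAPH_PATTERN.
PARAGRAPH_PATTERN_SKETCH = r"\s*\n\s*\n\s*"


def split_paragraphs(content: Union[str, bytes]) -> List[str]:
    """Split text into paragraphs, decoding bytes first so the str pattern applies."""
    if isinstance(content, bytes):
        # Assumes UTF-8; a str pattern cannot be used on bytes directly.
        content = content.decode("utf-8")
    return re.split(PARAGRAPH_PATTERN_SKETCH, content)
```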
* fix: correct order of kwargs in pandoc
* only skip epub tests in Docker
* changelog
---------
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
Updates the characters to split on when creating candidate English words. Now uses regex to parse out non-alphabetic characters for each word.
Note: This was originally an attempt to speed up contains_english_word() but there is no measurable change in performance.
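
A rough sketch of stripping non-alphabetic characters from each word with a regex. The function names and the tiny vocabulary passed in are hypothetical; the library's actual word list and splitting rules are not shown here.

```python
import re
from typing import List, Set

NON_ALPHA = re.compile(r"[^a-zA-Z]")


def candidate_words(text: str) -> List[str]:
    """Split on whitespace, then drop non-alphabetic characters from each token."""
    words = []
    for token in text.split():
        cleaned = NON_ALPHA.sub("", token).lower()
        if cleaned:
            words.append(cleaned)
    return words


def contains_english_word_sketch(text: str, vocabulary: Set[str]) -> bool:
    """Return True when any candidate word appears in the supplied vocabulary."""
    return any(word in vocabulary for word in candidate_words(text))
```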
* update stage_for_transformers to return a list of elements
* bump changelog and version
* flag breaking change
* fix last word bug in chunk_by_attention_window