Attempting to fix formatting of GitHub issues transferred to Jira.
The old format used double backslashes (\\) to specify line breaks. This worked in the test repo but didn't render correctly when merged to this repo.
Now attempting to use a YAML block scalar (|) for formatted text. This worked in the test repo, but I guess that's no guarantee it will work here.
* Add --partition-by-api and --partition-host args to ingest
* Fix error in make check
* Bump changelog
* Add a test ingest script
Also add a workaround for the test causing 400s from our API. It seems we need to make sure
unstructured-api can handle a file.content_type of None.
* Remove the content type workaround
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways, including performance).
- Adds Azure expected output fixtures for more useful reference points and as a repro for "Some PDF's with scanned images return empty elements" (#346).
- Adds a script to regenerate ingest test fixtures that is run in an Ubuntu Docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with the above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
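The OVERWRITE_FIXTURES behavior described above can be sketched roughly as follows. This is an illustration of the pattern only, not the actual test-ingest scripts (which may well be shell scripts); the function name `check_or_update_fixture` and the JSON comparison are assumptions.

```python
import json
import os
from pathlib import Path


def check_or_update_fixture(actual: dict, expected_path: Path) -> bool:
    """Compare output to a stored fixture, or rewrite the fixture on request.

    Returns True if the fixture matches the output (or was just rewritten).
    """
    # Assumption: the flag is read from the environment under the name
    # used in the commit message, and only the value "true" enables it.
    overwrite = os.environ.get("OVERWRITE_FIXTURES", "false").lower() == "true"
    if overwrite:
        # Regenerate the expected .json output instead of comparing.
        expected_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return True
    expected = json.loads(expected_path.read_text())
    return expected == actual
```

With the variable unset, the function only compares, so a normal test run never modifies the fixtures.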
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* fix(ingest): import connector-specific modules on demand
* unstructured-ingest --flatten-metadata supported for local connector.
* unstructured-ingest fix runtime error when using --metadata-include.
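Importing connector-specific modules on demand could look something like the sketch below. The registry `CONNECTOR_MODULES` and the function `load_connector` are hypothetical, and standard-library modules stand in for real connector modules so that the example runs anywhere.

```python
import importlib
from types import ModuleType

# Hypothetical registry: connector name -> module path.
# Standard-library modules are used here as stand-ins for connector modules.
CONNECTOR_MODULES = {
    "json": "json",
    "csv": "csv",
}


def load_connector(name: str) -> ModuleType:
    """Import the module for one connector only when it is requested."""
    try:
        module_path = CONNECTOR_MODULES[name]
    except KeyError:
        raise ValueError(f"Unknown connector: {name!r}") from None
    # The import happens here, not at program start, so optional
    # dependencies of unused connectors are never touched.
    return importlib.import_module(module_path)
```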
The CentOS 7 mailcap package provides the file /etc/mime.types, which is used by the Python mimetypes module. The unstructured code base does not make much use of this, but the upstream unstructured-api does.
Bonus: docx mimetype added in lookup table.
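A rough sketch of how a lookup table can back up the mimetypes module when the system has no /etc/mime.types. The table name `FALLBACK_MIMETYPES` and the function `guess_mimetype` are illustrative assumptions, not the names used in the code base.

```python
import mimetypes
import os
from typing import Optional

# Hypothetical fallback table for extensions the system may not know about.
FALLBACK_MIMETYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def guess_mimetype(filename: str) -> Optional[str]:
    """Guess a MIME type, falling back to a local table if needed."""
    # mimetypes consults system files such as /etc/mime.types when present.
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed is not None:
        return guessed
    extension = os.path.splitext(filename)[1].lower()
    return FALLBACK_MIMETYPES.get(extension)
```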
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
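The idea of grouping broken paragraphs with regular expressions can be sketched as below. This is not the implementation of group_broken_paragraphs; the function name `join_broken_paragraphs` and both patterns are assumptions chosen for illustration (a blank line is treated as a paragraph boundary, a single newline as a broken line).

```python
import re

# Assumption: a blank line separates paragraphs.
PARAGRAPH_SPLIT = re.compile(r"\s*\n\s*\n\s*")
# Assumption: a single newline inside a paragraph is an unwanted line break.
LINE_BREAK = re.compile(r"\s*\n\s*")


def join_broken_paragraphs(text: str) -> str:
    """Rejoin lines that were broken inside a paragraph."""
    paragraphs = PARAGRAPH_SPLIT.split(text)
    cleaned = [LINE_BREAK.sub(" ", paragraph).strip() for paragraph in paragraphs]
    # Drop empty paragraphs and restore a single blank line between the rest.
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)
```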
Some document elements may have a null style element which triggers an exception
when trying to access the name of the style.
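A minimal sketch of guarding against a missing style. The `Style` and `Paragraph` classes are hypothetical stand-ins for the document library's objects, and `style_name` is an illustrative helper rather than the actual fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Style:
    """Hypothetical stand-in for a document style object."""
    name: str


@dataclass
class Paragraph:
    """Hypothetical stand-in for a document element; style may be None."""
    text: str
    style: Optional[Style] = None


def style_name(paragraph: Paragraph, default: str = "") -> str:
    """Return the style name, or a default when the element has no style."""
    style = paragraph.style
    # Accessing style.name directly would raise AttributeError on None.
    return style.name if style is not None else default
```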
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Add a --download-only parameter that downloads files if they are not already present and skips processing them through unstructured. As usual, files go to --download-dir, or to the default ~/.cache/... download location if --download-dir is not specified.
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file as bytes, and PARAGRAPH_PATTERN (a string) could not be matched against the bytes content.
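The mismatch can be reproduced in isolation: Python's re module raises a TypeError when a string pattern is applied to a bytes object. The sketch below shows one possible guard; the pattern value, the UTF-8 decoding, and the function `split_paragraphs` are assumptions, not necessarily the fix that was applied.

```python
import re
from typing import List, Union

# Hypothetical value; the real PARAGRAPH_PATTERN is not shown in this message.
PARAGRAPH_PATTERN = r"\n\s*\n"


def split_paragraphs(content: Union[str, bytes]) -> List[str]:
    """Split text into paragraphs, accepting either str or bytes input."""
    if isinstance(content, bytes):
        # Assumption: the file is UTF-8 encoded.
        content = content.decode("utf-8")
    # Without the decode above, re.split would raise TypeError for bytes.
    return re.split(PARAGRAPH_PATTERN, content)
```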
* fix: correct order of kwargs in pandoc
* only skip epub tests in Docker
* changelog
---------
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
Updates the characters to split on when creating candidate English words. Now uses a regex to strip non-alphabetic characters from each word.
Note: This was originally an attempt to speed up contains_english_word(), but there is no measurable change in performance.
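A rough sketch of stripping non-alphabetic characters with a regex to build candidate words. The pattern and the function `candidate_words` are illustrative assumptions and do not reproduce contains_english_word() itself.

```python
import re
from typing import List

# Assumption: anything outside a-z / A-Z is removed from each token.
NON_ALPHA = re.compile(r"[^a-zA-Z]")


def candidate_words(text: str) -> List[str]:
    """Split on whitespace, then strip non-alphabetic characters per token."""
    words = []
    for token in text.split():
        cleaned = NON_ALPHA.sub("", token).lower()
        if cleaned:  # tokens such as "123" become empty and are dropped
            words.append(cleaned)
    return words
```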
* update stage_for_transformers to return a list of elements
* bump changelog and version
* flag breaking change
* fix last word bug in chunk_by_attention_window
* added msg-parser dependency
* pass through kwargs in convert_file_to_text
* added partition_msg for processing Microsoft Outlook files
* version bump and changelog
* added tests for partition_msg
* added test for msg with plain text
* add partition_msg docs; fix underlines in integration docs
* add .msg to file list
* finish tests for auto msg
* linting, linting, linting