auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.
* chore: add example doc
* fix: adjust encoding recognition threshold value in `detect_file_encoding`
* test: add test cases for German characters
* chore: update changelog & version
* add max partition size logic
* work splitting logic into split_by_paragraph
* pass through max_partition to other functions
* added test for splitting long document
* add type hint
* add documentation
* version and changelog
* ingest-test-fixtures-update
* Update ingest test fixtures (#819)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* retrigger ci
* ingest-test-fixtures-update
* ingest-test-fixtures-update
* Update ingest test fixtures (#821)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* update default for partition_xml
* update version for release
* update msg doc string
---------
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* add modified arabic and hebrew encodings
* added calls to format_encoding_str so encoding is checked before use
* added formatting to detect_filetype()
* explicitly provided default value for null encoding parameter
* fixed format of annotated encodings list
* adding hebrew base64 test file
* small lint fixes
* update changelog
* bump version to -dev2
* feature(html partition): parse pre tag
* chore: update CHANGELOG.md
* style: black format xml.py
* Added tests dor html with pre tag
* remove skip test, update parse pre tag
* fix style
* chore: spell check
* chore: update changelog & version
* chore: update ingest test fixtures
* chore: add exception handling if `element.text` is `None` in `_read_xml`
* test: add more sanity testing on the `.text` content of the element(s)
* refactor: move the conditional logic for <pre> outside of the `try/except` block
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
* initial pass on text in figures
* refactor text extraction
* update tests
* fix title test
* add test for docs that require recursive text grab
* version and changelog
* ingest-test-fixtures-update
* there are 8 pdf files now
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
* fix malformed pptx issue
Added a new test to check for the ability to partition a malformed PowerPoint file. Modified the `partition_pptx` function to skip processing shapes that are not on the actual slide, but only if they have top and left positions. Also modified `_order_shapes` function to handle cases where shapes do not have top or left positions.
* update changelog
* fix lint issue SIM102 nested ifs
* fix black linting
* first pass at partition_tsv
* working tests
* create constants for tests and debug `make test` failure
* make check and tidy
* undo changes for testing locally
* update changelog and version
* fix bricks.rst
* refactor if statements
* make tidy
* fix README and change try/except to if/else
* update changelog and version
* fix\ docstring
* add support for page numbers in docx when present
* version and changelog
* add comment on page numbers
* add header and footer to doc elements list
* update integrations docs
* include_page_breaks kwarg for doc and docx
* merge element metadata for pagebreaks
* fix typo
* fix changelog typo
* change page number default to None
* add initial_page_number kwarg
* make page number tests in pdf more explicit
* revert test file
* update ingest tests
* update test fixture outputs
* updates to IRS forms fixtures
* ingest-test-fixtures-update
* Update ingest test fixtures (#759)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
---------
Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.
Change auto.py to have a None default for encoding
Remove the unused parameter encoding from partition_pdf
Add functionality to the read_txt_file utility function to handle file-like object from URL
This PR adds functionality to try other common encodings for email (.eml) files if an error related to the
encoding is raised and the user has not specified an encoding.
Add support for older .XLS files from the partition function in unstructured.partition.auto.
Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesnt need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* first pass on partition_xlsx
* add support for files
* add test for xlsx from filename
* added filetype metadata
* add xlsx to auto
* remove fake excel from unsupported
* version and changelog
* update docs
* update readme
* fix removed file reference
* fix some more tests
* pass in metadata filename
* add include_metadata flag
* add table parsing
* import paragraph
* update changelog
* add example docx
* revert changelog formatting
* update function name for consistency
* add both text and html metadata for table
* update with metadata in docx table note
---------
Co-authored-by: kevin pan <kevin.pan@strivr.com>
* spike for ocr-only strategy for images
* fix for file processing
* extra space
* add korean to ci
* added test for ocr_only strategy
* added docs for ocr_only
* changelog and version
* added test for bad strategy
* skip korean test if in docker
* bump version
* version bump
* document valid strategies
* bump version for release
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* added filetype detection for odt
* add function for partition odt documents
* add odt files to auto
* changelog and version
* docs and readme
* update installation docs
* skip tests if not supported or in docker
* import pytest
* fix docs typos
* switch to using PDF objects
* linting, linting, linting
* couple more tweaks
* added test for chevron-page
* version and changelog
* linting, linting, linting
* now processing 4 files
* check to see if text file is a json
* add json check into filetype detection
* added test for updated file detection logic
* bytes/strings handling
* changlog and version bump
* function to check if pdf is extractable
* add fallback logic for unextractable pdfs
* tests for docs with copy protection
* add test for unprocessable pdf
* update docs
* changelog and version
* update logic for images; reset file before proceeding
* 3 files for api tests
* docs update
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* added msg-parser dependency
* pass through kwargs in convert_file_to_text
* added partition_msg for processing msft outlook files
* version bump and changelog
* added tests for partition_msg
* added test for msg with plain text
* add partition_msg docs; fix underlines in integration docs
* add .msg to file list
* finish tests for auto msg
* linting, linting, linting
* fix: ensure all text is maintained in html pages
* add back in replace unicode quotes
* changelog and version bump
* apt-get update in ci
* white space differences in output
* added partition_ppt function and tests
* add ppt support to auto
* version bump
* update docs
* doc fixes
* update changelog
* `.docx` -> `.pptx`
* its -> their
* remove whitespace
* first pass on doc partitioning
* add libreoffice to deps
* update docs and readme
* add .doc to auto
* changelog bump
* value error with missing doc
* doc updates
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example