Updated to the latest version of unstructured-inference. detectron2 is now implemented with onnxruntime, yay!
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesn't need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* first pass on partition_xlsx
* add support for files
* add test for xlsx from filename
* added filetype metadata
* add xlsx to auto
* remove fake excel from unsupported
* version and changelog
* update docs
* update readme
* fix removed file reference
* fix some more tests
* pass in metadata filename
* add include_metadata flag
* added functions for determining auto strategy
* change default strategy to auto
* tests for auto strategy
* update docs
* changelog and version
* bump version
* remove ingest file in wrong location
* update jpg output
* typo fix
* add tests for validating strategy
* refactor into determine_pdf_strategy function
* refactor pdf strategies into strategies
* remove commented out code
* remove unreachable code
* add in handling for image types
* a little more refactoring
* import ocr partitioning for images
* catch warnings, partition type for valid strategies
* fallback to ocr_only from fast
* fallback logic for hi_res
* test for fallback to ocr only
* fallback logic for ocr_only
* more tests for fallback logic
* update doc strings
* version and changelog
* linting, linting, linting
* update docs to include notes about strategy
* fix typos
* change back patched filename
* spike for ocr-only strategy for images
* fix for file processing
* extra space
* add korean to ci
* added test for ocr_only strategy
* added docs for ocr_only
* changelog and version
* added test for bad strategy
* skip korean test if in docker
* bump version
* version bump
* document valid strategies
* bump version for release
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* added filetype detection for odt
* add function for partition odt documents
* add odt files to auto
* changelog and version
* docs and readme
* update installation docs
* skip tests if not supported or in docker
* import pytest
* fix docs typos
* added function for multiple files via api
* make multiple work with files
* updated docs strings
* changelog and version
* docs and contextlib for open files
* tests for partition multiple
* add tests for error conditions
* add output example
* function to check if pdf is extractable
* add fallback logic for unextractable pdfs
* tests for docs with copy protection
* add test for unprocessable pdf
* update docs
* changelog and version
* update logic for images; reset file before proceeding
* 3 files for api tests
* docs update
* pip-compile new reqs
* bump inference version
* add language to pdf and image calls
* tests for passing in language
* version bump and changelog
* update docs
* pass ocr_languages in auto
* updated test fixtures
* typo in doc string
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* cleaning brick to group broken paragraphs
* docs for group_broken_paragraphs
* add docs for partition_text with grouper
* partition_text and auto with paragraph_grouper
* version and changelog
* typo in the docs
* linting, linting, linting
* switch to using regular expressions
* added msg-parser dependency
* pass through kwargs in convert_file_to_text
* added partition_msg for processing msft outlook files
* version bump and changelog
* added tests for partition_msg
* added test for msg with plain text
* add partition_msg docs; fix underlines in integration docs
* add .msg to file list
* finish tests for auto msg
* linting, linting, linting
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res", which is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to the "fast" strategy. The implementation uses pdfminer because it is already installed as a dependency with the local-inference extra. Other options would work as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
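The fallback rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in partition_pdf: the names `resolve_pdf_strategy`, `is_detectron2_available`, and `VALID_STRATEGIES` are hypothetical, and only the two strategies mentioned in this commit are modeled.

```python
import importlib.util
import warnings

# Hypothetical constant: only the two strategies named in this commit message.
VALID_STRATEGIES = ("hi_res", "fast")


def is_detectron2_available() -> bool:
    """Return True if a top-level `detectron2` package can be found."""
    return importlib.util.find_spec("detectron2") is not None


def resolve_pdf_strategy(strategy: str = "hi_res", detectron2_available=None) -> str:
    """Pick the strategy to actually run, falling back from hi_res to fast.

    `detectron2_available` can be passed explicitly (useful in tests);
    when left as None, availability is detected at call time.
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"{strategy!r} is not a valid strategy. Choose one of {VALID_STRATEGIES}."
        )
    if detectron2_available is None:
        detectron2_available = is_detectron2_available()
    if strategy == "hi_res" and not detectron2_available:
        # Assumed behavior: warn and degrade rather than raise.
        warnings.warn(
            "detectron2 is not installed; falling back to the 'fast' strategy."
        )
        return "fast"
    return strategy
```

Keeping the availability check behind a parameter makes the fallback path testable on machines where detectron2 happens to be installed.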
This is a fairly large PR. It adds an "adapter" that makes it easy to plug in any connector with an available fsspec implementation, which standardizes how remote filesystems are used within unstructured.
I've also renamed s3_connector.py to s3.py for readability and consistency, and tested that the current approach works as expected.
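To illustrate the adapter idea, here is a rough sketch of a connector that depends only on the fsspec-style `find()` and `open()` methods, so any filesystem exposing them could be plugged in. The class names `FsspecLikeConnector` and `InMemoryFS` are hypothetical and are not the names used in this PR. `InMemoryFS` is a tiny stand-in so the example runs without fsspec installed.

```python
import io
import os
from dataclasses import dataclass


class InMemoryFS:
    """Stand-in filesystem exposing the two fsspec-style methods used below."""

    def __init__(self, files: dict):
        self.files = files  # maps remote path -> bytes

    def find(self, root: str) -> list:
        # Return every stored file path located under `root`.
        prefix = root.rstrip("/") + "/"
        return [path for path in self.files if path.startswith(prefix)]

    def open(self, path: str, mode: str = "rb"):
        # Read-only access is enough for this sketch.
        return io.BytesIO(self.files[path])


@dataclass
class FsspecLikeConnector:
    """Hypothetical adapter: lists and downloads files from any such filesystem."""

    fs: object  # anything with fsspec-style find() and open()
    remote_root: str
    download_dir: str

    def list_files(self) -> list:
        return sorted(self.fs.find(self.remote_root))

    def download(self, remote_path: str) -> str:
        # Copy one remote file into download_dir and return the local path.
        os.makedirs(self.download_dir, exist_ok=True)
        local_path = os.path.join(self.download_dir, os.path.basename(remote_path))
        with self.fs.open(remote_path, "rb") as src, open(local_path, "wb") as dst:
            dst.write(src.read())
        return local_path
```

With fsspec installed, passing something like `fsspec.filesystem("s3")` as `fs` should behave the same way, since fsspec filesystems provide `find` and `open`. Treat that as an assumption about how the pieces fit, not a description of this PR's code.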
* environment variable to set language checks
* change log and version
* checks for if language checks are false
* update docs
* changelog type
* add assert to tests
* performance note in docstrings
* docstring tweaks
* add print statement in readme
* elements before bricks
* new preamble to bricks section
* add preamble to bricks section
* add preamble to cleaning section
* descriptions of each documentation page
* non-brick helper functions to the bottom
* fix codeblock
* includes some optional kwargs
* code blocks
* typo fix
* added type to text element map
* add element_id and coordinates
* added test for serialization
* added serialization for check boxes
* add dict_to_elements and convert_to_dict aliases
* helpers for serializing and deserializing elements
* bump version; changelog
* add Text to tests
* aliases for isd functions
* remove test elements json
* changelog updates
* make indent a kwarg
* update expected structured output
* docs update
* use new function in ingest code
* pop coordinates due to floating point differences
* pop coordinates
* code for downloading nltk packages
* don't run nltk make command in ci
* test for model downloads
* remove nltk install from docs
* update changelog and bump version
* added partition_ppt function and tests
* add ppt support to auto
* version bump
* update docs
* doc fixes
* update changelog
* `.docx` -> `.pptx`
* its -> their
* remove whitespace
* first pass on doc partitioning
* add libreoffice to deps
* update docs and readme
* add .doc to auto
* changelog bump
* value error with missing doc
* doc updates