* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* contraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-comple
* make tidy
* make tidy
* Pull out s3 code as subcommand
* Pull out dropbox code as subcommand
* Pull out azure code as subcommand
* Pull out fsspec code as subcommand
* Pull out github code as subcommand
* Pull out gitlab code as subcommand
* Pull out reddit code as subcommand
* Pull out slack code as subcommand
* Pull out discord code as subcommand
* Pull out wikipedia code as subcommand
* Pull out gdrive code as subcommand
* Pull out biomed code as subcommand
* rename parameters
* Pull out onedrive code as subcommand
* Pull out outlook code as subcommand
* Pull out local code as subcommand
* Pull out elasticsearch code as subcommand
* Pull out confluence code as subcommand
* Drop previous main file
* update changelog
* Add back in mp.Pool
* Fix mypy issues with click
* Make sure all tests run with verbose flag
* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data
* Pull out some more shared options
* Support running code via python as well as cli
* update ingest readme and move it to the ingest folder
* update usage in connector docs
* move local command arg in test
* Seperate out cli code from logic running unstructured
* Make some cli fields required rather than optional
* rename process -> processor
* Improve logger to avoid duplicate handlers
---------
Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* track tags in html
* pass through links as metadata
* add test for grabbing links
* one more link
* changelog and version
* update docs
* fix tests
* update empty link assertion
* ingest-test-fixtures-update
* Update ingest test fixtures (#961)
* fran-unstructured/Add unstructured API documentation
* fran-unstructured/add api docs to index.rst
* fran-unstructured/add api docs with changes requested
* process attachments for email
* add attachment processing to msg
* fix up metadata for attachments
* add test for processing email attachments
* added test for processing msg attachments
* update docs
* tests for error conditions
* version and changelog
* add max partition size logic
* work splitting logic into split_by_paragraph
* pass through max_partition to other functions
* added test for splitting long document
* add type hint
* add documentation
* version and changelog
* ingest-test-fixtures-update
* Update ingest test fixtures (#819)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* retrigger ci
* ingest-test-fixtures-update
* ingest-test-fixtures-update
* Update ingest test fixtures (#821)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* update default for partition_xml
* update version for release
* update msg doc string
---------
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* optionally dont assemble articles
* add test for content outside of articles
* pass kwargs in partition
* changelog and version
* update default to False
* bump version for release
* back to dev version to get another fix in the release
* first pass on regex metadata
* fix typing for regex metadata
* add dataclass back in
* add decorators
* fix tests
* update docs
* add tests for regex metadata
* add process metadata to tsv
* changelog and version
* docs typos
* consolidate to using a single kwarg
* fix test
* first pass at partition_tsv
* working tests
* create constants for tests and debug `make test` failure
* make check and tidy
* undo changes for testing locally
* update changelog and version
* fix bricks.rst
* refactor if statements
* make tidy
* fix README and change try/except to if/else
* update changelog and version
* fix\ docstring
* add support for page numbers in docx when present
* version and changelog
* add comment on page numbers
* add header and footer to doc elements list
* update integrations docs
* include_page_breaks kwarg for doc and docx
* merge element metadata for pagebreaks
* fix typo
* fix changelog typo
* change page number default to None
* add initial_page_number kwarg
* make page number tests in pdf more explicit
* revert test file
* update ingest tests
* update test fixture outputs
* updates to IRS forms fixtures
* ingest-test-fixtures-update
* Update ingest test fixtures (#759)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
---------
Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Updated to the the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesnt need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* first pass on partition_xlsx
* add support for files
* add test for xlsx from filename
* added filetype metadata
* add xlsx to auto
* remove fake excel from unsupported
* version and changelog
* update docs
* update readme
* fix removed file reference
* fix some more tests
* pass in metadata filename
* add include_metadata flag
* added functions for determining auto stratgy
* change default strategy to auto
* tests for auto strategy
* update docs
* changelog and version
* bump version
* remove ingest file in wrong location
* update jpg output
* typo fix
* add tests for validating strategy
* refactor into determine_pdf_strategy function
* refactor pdf strategies into strategies
* remove commented out code
* remove unreachable code
* add in handling for image types
* a little more refactoring
* import ocr partioning for images
* catch warnings, partition type for valid strategies
* fallback to ocr_only from fast
* fallback logic for hi_res
* test for fallback to ocr only
* fallback logic ofr ocr_only
* more tests for fallback logic
* update doc strings
* version and changelog
* linting, linting, linting
* update docs to include notes about strategy
* fix typos
* change back patched filename
* spike for ocr-only strategy for images
* fix for file processing
* extra space
* add korean to ci
* added test for ocr_only strategy
* added docs for ocr_only
* changelog and version
* added test for bad strategy
* skip korean test if in docker
* bump version
* version bump
* document valid strategies
* bump version for release
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* added filetype detection for odt
* add function for partition odt documents
* add odt files to auto
* changelog and version
* docs and readme
* update installation docs
* skip tests if not supported or in docker
* import pytest
* fix docs typos
* added function for multiple files via api
* make multiple work with files
* updated docs strings
* changelog and version
* docs and contextlib for open files
* tests for partition multiple
* add tests for error conditions
* add output example
* function to check if pdf is extractable
* add fallback logic for unextractable pdfs
* tests for docs with copy protection
* add test for unprocessable pdf
* update docs
* changelog and version
* update logic for images; reset file before proceeding
* 3 files for api tests
* docs update
* pip-compile new reqs
* bump inference version
* add language to pdf and image calls
* tests for passing in language
* version bump and changelog
* update docs
* pass ocr_languages in auto
* updated test fixtures
* typo in doc string