* fix: ensure all text is maintained in html pages
* add back in replace unicode quotes
* changelog and version bump
* apt-get update in ci
* white space differences in output
* added partition_ppt function and tests
* add ppt support to auto
* version bump
* update docs
* doc fixes
* update changelog
* `.docx` -> `.pptx`
* its -> their
* remove whitespace
* first pass on doc partitioning
* add libreoffice to deps
* update docs and readme
* add .doc to auto
* changelog bump
* value error with missing doc
* doc updates
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for emal
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile
* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py
* feat: new partition to call the document image analysis API
* fix: remove duplicated dependency on partition.py
* fix: linting error due to line-lenght > 100
* test: add test to call partition_pdf brick
* chore: new short example-doc pdf for speed up in test X8
* fix: add missing return statement to _read to pass check
* feat: new partitioning brick to call doc parse API
* docs: version update fix in CHANGELOG
* refactor: no nested ifs
* docs: documentation for new brick partition_pdf
* refactor: made tidy
* docs: minor doc refactor
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>