Addressed the issue #494 .
Updated the `_detect_filetype_from_octet_stream()` function to use
libmagic to infer the content type of file when it is not a zip file.
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* contraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-comple
* make tidy
* make tidy
* add param and test in image table extraction
* version and changelog
* need to publish this one for api repo
* add new param skip_infer_table_types
* use warning
* clean up with mapping
* add test for tsv
* fix test fail
* weird change from merge
* doc nit
* don't use mapping
* correct conflict
* Bump inference version
* Pass through the dpi param if available
* Update CHANGELOG
* Check dpi param passed in via unit test
* Bump inference version
* Fix unit test around file info to work on mac as well
* update regex pattern
* improve json regex pattern checks and add test file
* update file name
* update tests and formatting
* update changelog and version
This PR is to reflect changes in the unstructured-inference PR #152
* Update functionality to retrieve image metadata from a page for document_to_element_list
* remove argilla; bump reqs
* enable py 3.11
* add 3.11 to setup.py
* make pip-compile
* ignore cli mypy errors
* install argilla
* fix constraints
* install argilla
* changelog and version
* skip argilla in docker
* dont import argilla in docker
* skip all of argilla if in container
* only import argilla if outside docker
* more docker skips
* remove weird pypi settings
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
* first pass at partition_tsv
* working tests
* create constants for tests and debug `make test` failure
* make check and tidy
* undo changes for testing locally
* update changelog and version
* fix bricks.rst
* refactor if statements
* make tidy
* fix README and change try/except to if/else
* update changelog and version
* fix\ docstring
* fix: Filetype detection if a CSV has a text/plain MIME type #621
* bug: fix csv detection and create _read_file_start_for_type_check func
* fix: Make call to _is_text_file_a_csv from detect_filetype
* docker works
* more epub tests
* changelog version
* support epub + odt + rtf
* update dockerfile
* revert..
* install pandoc on ci env
* pandoc docker grab bashed on arch
* move arch into image
* move back to base image
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesnt need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* Update detect_filetype() to use hashmap for mime type return
* fix: text mime type and linting
* fix: declare docx and xlsx mime types locally and also fix linting
* Update CHANGELOG.md
* tweaks for failing tests
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* first pass on partition_xlsx
* add support for files
* add test for xlsx from filename
* added filetype metadata
* add xlsx to auto
* remove fake excel from unsupported
* version and changelog
* update docs
* update readme
* fix removed file reference
* fix some more tests
* pass in metadata filename
* add include_metadata flag
* added filetype detection for odt
* add function for partition odt documents
* add odt files to auto
* changelog and version
* docs and readme
* update installation docs
* skip tests if not supported or in docker
* import pytest
* fix docs typos
* check to see if text file is a json
* add json check into filetype detection
* added test for updated file detection logic
* bytes/strings handling
* changlog and version bump
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks
* Apply import sorting
ruff . --select I --fix
* Remove unnecessary open mode parameter
ruff . --select UP015 --fix
* Use f-string formatting rather than .format
* Remove extraneous parentheses
Also use "" instead of str()
* Resolve missing trailing commas
ruff . --select COM --fix
* Rewrite list() and dict() calls using literals
ruff . --select C4 --fix
* Add () to pytest.fixture, use tuples for parametrize, etc.
ruff . --select PT --fix
* Simplify code: merge conditionals, context managers
ruff . --select SIM --fix
* Import without unnecessary alias
ruff . --select PLR0402 --fix
* Apply formatting via black
* Rewrite ValueError somewhat
Slightly unrelated to the rest of the PR
* Apply formatting to tests via black
* Update expected exception message to match
0d81564
* Satisfy E501 line too long in test
* Update changelog & version
* Add ruff to make tidy and test deps
* Run 'make tidy'
* Update changelog & version
* Update changelog & version
* Add ruff to 'check' target
Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
* included the previous PR changes and verified black
* resolved the issues mentioned
* make tidy and add tests
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* added function for exploring a list of files
* file info from file contents
* added tests for file info from contents
* bump version and add tests
* add dev to version
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for emal
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile