mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00

**Summary**

In preparation for fixing a cluster of bugs in automatic file-type detection, and to pave the way for some reliability improvements, refactor the `unstructured.file_utils.filetype` module and improve the thoroughness of its tests.

**Additional Context**

Factor the type-recognition process into three distinct strategies, attempted in order of preference. Recognition falls through to the next strategy when the one before it is not applicable or cannot determine the file type. This provides a clear basis for organizing the code and tests at the top level.

Consolidate the existing tests around these strategies, adding cases to achieve better coverage. Several bugs were uncovered in the process: small ones are fixed here, and bigger ones will be remedied in follow-up PRs.
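The fall-through arrangement described above can be illustrated with a minimal sketch. This is not the code in `unstructured.file_utils.filetype`: the description does not name the three strategies, so the ones below (`by_magic_bytes`, `by_extension`, `by_text_heuristic`), the `detect_filetype` function, and the string return values are all hypothetical stand-ins chosen only to show the pattern of trying strategies in order of preference.

```python
from typing import Callable, Optional, Sequence

# Hypothetical strategy signature (an assumption, not the library's API):
# a strategy returns a file-type name, or None when it is not applicable
# or cannot determine the type.
Strategy = Callable[[bytes, Optional[str]], Optional[str]]


def by_magic_bytes(head: bytes, filename: Optional[str]) -> Optional[str]:
    """Illustrative strategy: match a couple of well-known leading signatures."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return None


def by_extension(head: bytes, filename: Optional[str]) -> Optional[str]:
    """Illustrative strategy: map the filename extension, if there is one."""
    if not filename or "." not in filename:
        return None  # not applicable without a usable filename
    ext = filename.rsplit(".", 1)[1].lower()
    return {"pdf": "pdf", "png": "png", "txt": "txt", "md": "md"}.get(ext)


def by_text_heuristic(head: bytes, filename: Optional[str]) -> Optional[str]:
    """Illustrative strategy: treat non-empty, UTF-8-decodable bytes as text."""
    if not head:
        return None
    try:
        head.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "txt"


def detect_filetype(
    head: bytes,
    filename: Optional[str] = None,
    strategies: Sequence[Strategy] = (by_magic_bytes, by_extension, by_text_heuristic),
) -> Optional[str]:
    """Try each strategy in order of preference; the first non-None answer wins."""
    for strategy in strategies:
        file_type = strategy(head, filename)
        if file_type is not None:
            return file_type
    return None  # no strategy could determine the type
```

Because each strategy is an independent callable with the same signature, tests can be grouped per strategy and then once more for the ordering logic, which mirrors the organization the description proposes.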
34 KiB