Steve Canny 32bb77aafb
fix(file): no default OLE subtype (#3516)
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extention-based
identification in that case.

**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.

An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it not perfect and does not identify MSG files.

Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.

Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and filetype
detection falls back to use filename-extension.

Note a file with no filename and no metadata_filename or an incorrect
extension will not be correctly identified in this case, however we're
assuming for now that will be rare in practice.
2024-08-22 19:16:53 +00:00
..