mirror of https://github.com/Unstructured-IO/unstructured.git, synced 2025-11-03 19:43:24 +00:00
			
		
		
		
**Summary** A DOC, PPT, or XLS file passed to `partition()` as a file-like object is misidentified as an MSG file and raises an exception in python-oxmsg (the library used to process MSG files).

**Fix** DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, also known as Compound File Binary Format (CFBF). They can be reliably distinguished by inspecting magic bytes at certain locations. `libmagic` is unreliable at this or doesn't try: it reports the generic `"application/x-ole-storage"`, which corresponds to the CFBF "container" format (vaguely like a Microsoft Zip format) that all of these document types are stored in. Unconditionally use `filetype.guess_mime()`, provided by the `filetype` package that is part of the base unstructured install. Unlike `libmagic`, this package reliably detects the specific MIME type (e.g. `"application/msword"`) for OLE file subtypes.

Fixes #3364
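To illustrate why a check on the container alone cannot tell these formats apart, here is a minimal sketch using only the standard library. It is not the code from this change: the helper name `is_cfbf_container` and the synthetic byte strings are assumptions made for illustration. It tests only for the 8-byte CFBF signature that DOC, PPT, XLS, and MSG files share, which is roughly the level of detail a generic `"application/x-ole-storage"` result reflects.

```python
import io
from typing import IO

# The 8-byte signature at offset 0 of every CFBF (OLE compound) file.
CFBF_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_cfbf_container(file: IO[bytes]) -> bool:
    """Return True if `file` starts with the CFBF signature.

    Hypothetical helper for illustration only. The caller's read
    position is restored so later partitioning code sees the same
    stream state.
    """
    start = file.tell()
    try:
        file.seek(0)
        header = file.read(len(CFBF_MAGIC))
    finally:
        file.seek(start)
    return header == CFBF_MAGIC


# Synthetic stand-ins, not real documents: only the leading bytes matter here.
fake_ole = io.BytesIO(CFBF_MAGIC + b"\x00" * 504)
fake_zip = io.BytesIO(b"PK\x03\x04" + b"\x00" * 26)

print(is_cfbf_container(fake_ole))  # True
print(is_cfbf_container(fake_zip))  # False
```

A `True` result here says nothing about whether the stream is a DOC, PPT, XLS, or MSG file. Telling those apart requires looking further into the file, which is the step the fix above hands to `filetype.guess_mime()`.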
		
			
				
	
	
		
6 lines, 94 B, JSON
		
	
	
	
	
	
			
		
		
	
	
{
    "id": "Sample-1",
    "name": "Sample 1",
    "description": "This is sample data #1"
}