mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-28 11:31:08 +00:00

Remedy disk-space leak where `partition_odt()` would leave an on-disk copy of each `.odt` file passed as a file-like object. `partition_odt()` creates a temporary file in which it writes each source-document provided as a file-like object. This file is not deleted and disk consumption grows without bound. The `convert_and_partition_docx()` function used to convert ODT->DOCX uses `pandoc` (a command-line program) to do the conversion. Because this command-line program operates in a different memory space, the source file cannot be passed as an in-memory object and needs to be on the filesystem. When the ODT source-document is passed as a file-like object, it is written to disk so the conversion program has access to it. It is not deleted afterward. Fix this by writing the temporary source ODT file in a `TemporaryDirectory` and also use that location to write the conversion-target DOCX file. That directory is automatically removed when `partition_odt()` completes. While we're in there, improve the factoring of `partition_odt()`. - Extract `convert_and_partition_docx()` from `partition.docx` (used only by `partition_odt()`) to `_convert_odt_to_docx()` in `partition.odt` where it is used. Decouple file conversion from calling `partition_docx()` with the converted file as the `partition_docx()` call is `partition_odt()`'s natural responsibility. - Improve docstrings, typing, and comments. - All tests pass both before and after.