2 Commits

Author SHA1 Message Date
Steve Canny
8644a3b09a
fix(odt): fix disk-space leak in partition_odt() (#3037)
Remedy disk-space leak where `partition_odt()` would leave an on-disk
copy of each `.odt` file passed as a file-like object.

`partition_odt()` creates a temporary file to which it writes each
source-document provided as a file-like object. This file is never
deleted, so disk consumption grows without bound.

The `convert_and_partition_docx()` function used to convert ODT->DOCX
uses `pandoc` (a command-line program) to do the conversion. Because
`pandoc` runs as a separate process, the source file cannot be passed
as an in-memory object and needs to be on the filesystem. When the ODT
source-document is passed as a file-like object, it is written to disk
so the conversion program has access to it, but it is not deleted
afterward.

Fix this by writing the temporary source ODT file to a
`TemporaryDirectory` and writing the conversion-target DOCX file to
that same location. That directory is automatically removed when
`partition_odt()` completes.
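
A minimal sketch of this pattern, not the actual implementation: the
helper name `convert_via_temp_dir()`, the `convert` callable, and the
file names inside the directory are all hypothetical. The real code
would hand the two paths to `pandoc`; here the converter is injected so
the cleanup behavior can be seen in isolation.

```python
import os
import shutil
import tempfile
from typing import IO, Callable


def convert_via_temp_dir(
    file: IO[bytes],
    convert: Callable[[str, str], None],
) -> bytes:
    """Spool `file` to disk, run `convert(src_path, dst_path)`, return the output bytes.

    Both the spooled source and the conversion target live inside one
    TemporaryDirectory, so neither survives after this function returns.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical file names; only their location matters.
        src_path = os.path.join(tmp_dir, "source.odt")
        dst_path = os.path.join(tmp_dir, "source.docx")

        # The external converter needs a real file, so write the stream out.
        with open(src_path, "wb") as f:
            shutil.copyfileobj(file, f)

        # Stand-in for the real conversion step (e.g. invoking pandoc).
        convert(src_path, dst_path)

        # Read the result before the directory is removed on exit.
        with open(dst_path, "rb") as f:
            return f.read()
```

Because the context manager removes the whole directory on exit, even
when `convert` raises, no per-file cleanup code is needed.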

While we're in there, improve the factoring of `partition_odt()`.

- Move `convert_and_partition_docx()` out of `partition.docx` (where it
was used only by `partition_odt()`) and into `partition.odt` as
`_convert_odt_to_docx()`. Decouple file conversion from the
`partition_docx()` call, since calling `partition_docx()` with the
converted file is `partition_odt()`'s natural responsibility.
- Improve docstrings, typing, and comments.
- All tests pass both before and after.
2024-05-16 20:04:10 +00:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection and
list-item detection, and to move python-docx internals upstream to that
library so our code doesn't depend on that domain knowledge.
2023-09-19 15:32:46 -07:00