3 Commits

Author SHA1 Message Date
Steve Canny
77a9e1b54d
rfctr(html): drop convert_and_partition_html() (#3215)
**Summary**
Remove `unstructured.partition.html.convert_and_partition_html()`. Move
file-type conversion (to HTML) responsibility to each brokering
partitioner that uses that strategy and let them call `partition_html()`
for themselves with the result.

**Additional Context**

Rationale:
- `partition_html()` does not want or need to know which partitioners
might broker partitioning to it.
- Different brokering partitioners have their own methods to convert
their format to HTML and quirks that may be involved for their format.
Avoid coupling them so they can evolve independently.
- The core of the conversion work is already encapsulated in
`unstructured.partition.common.convert_file_to_html_text_using_pandoc()`.
- `convert_and_partition_html()` is an additional brokering layer, with
the complexity that entails: one more site where default parameter
values can be (mis-)applied or dropped, and one more location to which
new parameters would need to be added.
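
A minimal sketch of the resulting shape, under stated assumptions: the
brokering partitioner converts its own format to HTML and then calls the
HTML partitioner itself. `partition_html_stub()`,
`partition_example_format()`, and the `convert_to_html` parameter are
hypothetical stand-ins for illustration, not the library's actual
functions or signatures.

```python
from typing import Callable, List


def partition_html_stub(text: str) -> List[str]:
    """Hypothetical stand-in for an HTML partitioner.

    Returns one "element" per non-empty line of the HTML text.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def partition_example_format(
    filename: str,
    convert_to_html: Callable[[str], str],
) -> List[str]:
    """Hypothetical brokering partitioner for some non-HTML format."""
    # The brokering partitioner owns the conversion step, including any
    # quirks specific to its format.
    html_text = convert_to_html(filename)
    # It then calls the HTML partitioner directly; no shared
    # convert-and-partition layer sits in between.
    return partition_html_stub(text=html_text)
```

With this arrangement the HTML partitioner only ever receives HTML text
and has no knowledge of which partitioners delegate to it.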
2024-06-17 19:43:18 +00:00
Steve Canny
8644a3b09a
fix(odt): fix disk-space leak in partition_odt() (#3037)
Remedy disk-space leak where `partition_odt()` would leave an on-disk
copy of each `.odt` file passed as a file-like object.

`partition_odt()` creates a temporary file in which it writes each
source-document provided as a file-like object. This file is not deleted
and disk consumption grows without bound.

The `convert_and_partition_docx()` function used to convert ODT->DOCX
uses `pandoc` (a command-line program) to do the conversion. Because
this command-line program operates in a different memory space, the
source file cannot be passed as an in-memory object and needs to be on
the filesystem. When the ODT source-document is passed as a file-like
object, it is written to disk so the conversion program has access to
it. It is not deleted afterward.

Fix this by writing the temporary source ODT file in a
`TemporaryDirectory` and also use that location to write the
conversion-target DOCX file. That directory is automatically removed
when `partition_odt()` completes.
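
The approach can be sketched roughly as follows. This is an
illustration, not the code in `partition.odt`: `convert_in_temp_dir()`
and its parameters are hypothetical, and the `convert` callable stands
in for whatever invokes the external conversion program.

```python
import os
import tempfile
from typing import IO, Callable


def convert_in_temp_dir(
    file: IO[bytes],
    convert: Callable[[str, str], None],
    source_name: str = "document.odt",
    target_name: str = "document.docx",
) -> bytes:
    """Run `convert(source_path, target_path)` on a file-like object.

    Both the on-disk copy of the source and the conversion target live
    in a temporary directory that is removed when the `with` block
    exits, including when an exception is raised.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        source_path = os.path.join(tmp_dir, source_name)
        target_path = os.path.join(tmp_dir, target_name)

        # The external converter needs a real file, so write the
        # in-memory source to disk first.
        with open(source_path, "wb") as f:
            f.write(file.read())

        convert(source_path, target_path)

        # Read the result before the directory is cleaned up.
        with open(target_path, "rb") as f:
            return f.read()
```

Because both files are created inside the temporary directory, nothing
needs to be deleted by hand and no copy survives the call.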

While we're in there, improve the factoring of `partition_odt()`.

- Extract `convert_and_partition_docx()` from `partition.docx` (used
only by `partition_odt()`) to `_convert_odt_to_docx()` in
`partition.odt` where it is used. Decouple file conversion from calling
`partition_docx()` with the converted file as the `partition_docx()`
call is `partition_odt()`'s natural responsibility.
- Improve docstrings, typing, and comments.
- All tests pass both before and after.
2024-05-16 20:04:10 +00:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just reading the
final version of `partition/docx.py` using View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00