unstructured/test_unstructured
qued e5d08662d4
enhancement: memory efficient xml partitioning (#1547)
Closes #1236. Partitions XML documents iteratively in most cases*, never
loading the entire tree into memory. This ends up being much faster.

(* The exception is when the argument `xml_path` is passed to filter
elements. I was not able to find a way in Python to compare XPaths while
streaming the elements, aside from writing a custom XPath parser. So the
shortest way forward was to bite the bullet and load the whole tree in
memory when filtering by XPath.)

Memory usage is about 20% of usage on `main` when processing a 470MB XML
file. Time to process is 10s vs 900s.

Output is slightly different, but appears to be an improvement, adding
lines of text that are skipped in current partitioning. No text is lost.
2023-09-28 02:34:06 +00:00
..