unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-12 08:37:51 +00:00

enhancement: memory efficient xml partitioning (#1547)

Closes #1236. Partitions XML documents iteratively in most cases*, never
loading the entire tree into memory. This ends up being much faster.

(* The exception is when the argument `xml_path` is passed to filter
elements. I was not able to find a way in Python to compare XPaths while
streaming the elements, aside from writing a custom XPath parser. So the
shortest way forward was to bite the bullet and load the whole tree in
memory when filtering by XPath.)
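
As a rough illustration of the trade-off described above, the sketch below streams elements with the standard library's `xml.etree.ElementTree.iterparse` and falls back to loading the whole tree only when a path filter is given. It is not the code from this commit: the function name `iter_text` is hypothetical, the `xml_path` parameter merely borrows the argument name mentioned above, and ElementTree's limited XPath subset stands in for whatever XPath engine the library actually uses.

```python
import io
import xml.etree.ElementTree as ET


def iter_text(source, xml_path=None):
    """Yield the stripped text of elements in an XML document.

    Hypothetical helper for illustration only, not part of unstructured.
    """
    if xml_path is not None:
        # Filtering by path: load the full tree, then query it.
        # ElementTree supports only a limited XPath subset.
        root = ET.parse(source).getroot()
        for elem in root.findall(xml_path):
            text = (elem.text or "").strip()
            if text:
                yield text
        return

    # Streaming case: elements are delivered as their closing tags are read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        text = (elem.text or "").strip()
        if text:
            yield text
        # Drop the element's text, attributes and children once handled.
        # The parent still holds a reference to the emptied element, so this
        # reduces memory use rather than eliminating it.
        elem.clear()


if __name__ == "__main__":
    doc = b"<root><a>one</a><b>two</b></root>"
    print(list(iter_text(io.BytesIO(doc))))
    print(list(iter_text(io.BytesIO(doc), xml_path="./b")))
```

Because `end` events fire when an element closes, children are yielded before their parent, and tail text is ignored here; a real partitioner would need to handle both.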

Memory usage is about 20% of usage on `main` when processing a 470MB XML
file. Processing time is 10s, versus 900s on `main`.

Output is slightly different, but appears to be an improvement, adding
lines of text that are skipped in current partitioning. No text is lost.

2023-09-28 02:34:06 +00:00

chunking

fix: coordinates metadata hinders chunking (#1374)

2023-09-14 10:10:03 +00:00

cleaners

Add clean_ligatures to core cleaners (#1326)

2023-09-07 21:30:18 +00:00

documents

fix: add backwards compatibility to ElementMetadata (#1526)

2023-09-27 18:40:56 +00:00

file_utils

Chore: Libmagic detection for "application/octet-stream" when it is not a zip file. (#1347)