unstructured/partition at cd8c6a2e0941af09426de749c0720a6e1af1e2e3

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-12 08:37:51 +00:00

qued e5d08662d4

enhancement: memory efficient xml partitioning (#1547)

Closes #1236. Partitions XML documents iteratively in most cases*, never
loading the entire tree into memory. This ends up being much faster.

(* The exception is when the argument `xml_path` is passed to filter
elements. I was not able to find a way in Python to compare XPaths while
streaming the elements, aside from writing a custom XPath parser. So the
shortest way forward was to bite the bullet and load the whole tree in
memory when filtering by XPath.)

Memory usage is about 20% of the usage on `main` when processing a 470MB XML
file, and processing time drops from 900s to 10s.

Output is slightly different, but appears to be an improvement, adding
lines of text that are skipped in current partitioning. No text is lost.
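
The commit message describes the approach but does not show it. As a rough illustration only, here is a minimal sketch of iterative XML text extraction using the standard library's `xml.etree.ElementTree.iterparse`. The function name `iter_leaf_text` is hypothetical, and the actual implementation in #1547 may use a different parser and different element-selection rules. The sketch yields text from leaf elements only and ignores mixed content.

```python
import io
import xml.etree.ElementTree as ET


def iter_leaf_text(source):
    """Yield stripped text of each leaf element without keeping the whole tree.

    `source` is a filename or a binary file object. This is an illustrative
    sketch, not the code from the commit.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root element
    depth = 1  # the root is currently open
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1  # an element has just been closed
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:
            yield text
        # Release the element's own content now that it has been processed.
        elem.clear()
        # When a direct child of the root closes, detach it from the root so
        # that finished subtrees do not accumulate in memory.
        if depth == 1:
            root.clear()


sample = io.BytesIO(b"<root><a><b>x</b><c> y </c></a><d>z</d></root>")
for text in iter_leaf_text(sample):
    print(text)
```

By contrast, a call such as `ET.parse(path).findall(xpath)` builds the complete tree before filtering, which corresponds to the fallback the commit message describes for the `xml_path` case.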

2023-09-28 02:34:06 +00:00

fix: update test_json to not use auto partition (#1187)

2023-08-29 16:59:26 -04:00

Feat: Native hierarchies for docx element types (#1505)

2023-09-27 11:32:46 -04:00

fix: updating element types (#1394)

2023-09-15 11:51:22 -05:00

Adds data source properties to onedrive, reddit and slack (#1281)

2023-09-20 04:26:36 +00:00

fix: updating element types (#1394)

2023-09-15 11:51:22 -05:00

chore: adding test case for odt tables (#1434)

2023-09-16 22:29:44 -07:00

feat: get embedded url, associate text and start index for pdf (#1539)

2023-09-27 13:43:32 -04:00

Feat: Native hierarchies for docx element types (#1505)

2023-09-27 11:32:46 -04:00

fix: updating element types (#1394)

2023-09-15 11:51:22 -05:00

fix: avoid PDF sorting error on negative coords (#1361)

2023-09-10 19:29:49 -07:00

fix: update test_json to not use auto partition (#1187)

2023-08-29 16:59:26 -04:00

test_api.py

chore: deprecation warning for file_filename (#1191)

2023-08-24 07:02:47 +00:00

test_auto.py

feat: introduce language detection function for text partitioning function (#1453)

2023-09-26 18:09:27 +00:00

test_common.py

feat: improved chipper elements mapping and new category_depth metadata (#1308)

2023-09-15 14:43:17 +00:00

test_constants.py

fix: etree parser error (#1077)

2023-08-10 23:28:57 +00:00

test_email.py

Feat: Create a naive hierarchy for elements (#1268)

2023-09-14 11:23:16 -04:00

test_html_partition.py

Adds data source properties to onedrive, reddit and slack (#1281)

2023-09-20 04:26:36 +00:00

test_json.py

fix: update test_json to not use auto partition (#1187)

2023-08-29 16:59:26 -04:00

test_lang.py

feat: introduce language detection function for text partitioning function (#1453)

2023-09-26 18:09:27 +00:00

test_strategies.py

Set default strategy for images to be "hi_res" (#968)

2023-08-02 09:22:20 -07:00

test_text_type.py

chore: refactor languages parameter for text_type functions (#1399)

2023-09-13 19:46:36 +00:00

test_text.py

feat: introduce language detection function for text partitioning function (#1453)

2023-09-26 18:09:27 +00:00

test_xml_partition.py

enhancement: memory efficient xml partitioning (#1547)

2023-09-28 02:34:06 +00:00