Mirror of https://github.com/Unstructured-IO/unstructured.git, synced 2025-06-27 02:30:08 +00:00
feat: use block matrix to reduce peak memory usage for matmul (#3947)
This PR targets the most memory-expensive operation in partitioning PDFs and images: deduplicating pdfminer elements. On large pages the number of elements can exceed 10k, which generates multiple 10k x 10k square double-float matrices during deduplication, pushing peak memory usage close to 13GB. This PR breaks that computation down by computing partial IOU: more precisely, it computes the IOU of 2000 elements at a time against all the elements, reducing peak memory usage by about 10x, to around 1.6GB. The block size is configurable based on the user's preferred peak memory usage and is set via the env variable `UNST_MATMUL_MEMORY_CAP_IN_GB`.
parent 19373de5ff
commit 961c8d5b11
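To make the idea in the description concrete, here is a minimal, self-contained sketch of block-wise IOU deduplication. It is an illustration only: `pairwise_iou` and `deduplicate_blockwise` are hypothetical helper names, not the library's `boxes_iou`/`boxes_self_iou`, and the sketch offsets the upper-triangle mask by each block's starting row so every element is still compared only against the elements that come after it, matching the semantics of the original full-matrix version.

```python
import numpy as np


def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """IOU of every box in `boxes_a` against every box in `boxes_b`.

    Boxes are (x1, y1, x2, y2) rows; the result has shape (len(boxes_a), len(boxes_b)).
    """
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)

    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-10)


def deduplicate_blockwise(
    boxes: np.ndarray, threshold: float = 0.5, block_size: int = 2000
) -> np.ndarray:
    """Keep-mask for duplicate boxes, computed `block_size` rows at a time."""
    n = len(boxes)
    keep = []
    for start in range(0, n, block_size):
        block = boxes[start : start + block_size]
        # only a (block_size, n) temporary is materialized, never the full (n, n) matrix
        overlaps = pairwise_iou(block, boxes) > threshold
        cols = np.arange(n)[None, :]
        rows = np.arange(start, start + len(block))[:, None]
        # drop a box only if some *later* box overlaps it above the threshold,
        # i.e. the strict upper triangle of the full n x n IOU matrix
        keep.append(~(overlaps & (cols > rows)).any(axis=1))
    return np.concatenate(keep)
```

With a block size of 2000 and 10k elements, each temporary is a 2000 x 10000 matrix instead of 10000 x 10000, so peak memory scales with `block_size * n` rather than `n * n`.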
@@ -1,4 +1,4 @@
-## 0.16.24-dev5
+## 0.16.24

 ### Enhancements

@@ -6,6 +6,7 @@
 in unstructured and `register_partitioner` to enable registering your own partitioner for any file type.

 - **`extract_image_block_types` now also works for CamelCase element type names**. Previously `NarrativeText` and similar CamelCase element types could not be extracted using this parameter in `partition`; now figures for those elements can be extracted just like `Image` and `Table` elements.
+- **use block matrix to reduce peak memory usage for pdf/image partition**.

 ### Features

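The `extract_image_block_types` entry in the hunk above concerns saving cropped images of selected element types during partitioning. A hedged usage sketch, with placeholder file name and output directory, might look like the following; the parameter names are assumed from the library's image-extraction options and should be checked against the installed version.

```python
from unstructured.partition.pdf import partition_pdf

# With the change described above, CamelCase type names such as "NarrativeText"
# can be listed alongside "Image" and "Table". "example.pdf" and "./figures"
# are placeholders.
elements = partition_pdf(
    filename="example.pdf",
    strategy="hi_res",
    extract_image_block_types=["Image", "Table", "NarrativeText"],
    extract_image_block_output_dir="./figures",
)
```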
@@ -1 +1 @@
-__version__ = "0.16.24-dev5"  # pragma: no cover
+__version__ = "0.16.24"  # pragma: no cover

@ -1,5 +1,6 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
from typing import TYPE_CHECKING, Any, BinaryIO, Iterable, List, Optional, Union, cast
|
||||
|
||||
import numpy as np
|
||||
@@ -708,10 +709,16 @@ def remove_duplicate_elements(
 ) -> TextRegions:
     """Removes duplicate text elements extracted by PDFMiner from a document layout."""

-    iou = boxes_self_iou(elements.element_coords, threshold)
-    # this is equivalent of finding those rows where `not iou[i, i + 1 :].any()`, i.e., any element
-    # that has no overlap above the threshold with any other elements
-    return elements.slice(~np.triu(iou, k=1).any(axis=1))
+    coords = elements.element_coords
+    # experiments show 2e3 is the block size that constrains the peak memory around 1Gb for this
+    # function; that accounts for all the intermediate matrices allocated and memory for storing
+    # final results
+    memory_cap_in_gb = os.getenv("UNST_MATMUL_MEMORY_CAP_IN_GB", 1)
+    n_split = np.ceil(coords.shape[0] / 2e3 / memory_cap_in_gb)
+    splits = np.array_split(coords, n_split, axis=0)
+
+    ious = [~np.triu(boxes_iou(split, coords, threshold), k=1).any(axis=1) for split in splits]
+    return elements.slice(np.concatenate(ious))


 def aggregate_embedded_text_by_block(
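Finally, a sketch of how the new knob could be exercised end to end. The hunk above reads `UNST_MATMUL_MEMORY_CAP_IN_GB` inside `remove_duplicate_elements`, so setting it before partitioning is enough; the PDF path is a placeholder, and the snippet assumes the value is parsed as a number downstream.

```python
import os

# Tighten the cap on the blocked IOU computation; the PR description quotes
# roughly 1.6GB at the default of 1, so "0.5" is only an illustrative value.
os.environ["UNST_MATMUL_MEMORY_CAP_IN_GB"] = "0.5"

from unstructured.partition.pdf import partition_pdf

# pdfminer elements are extracted and deduplicated as part of the hi_res pipeline
elements = partition_pdf(filename="large_scanned_report.pdf", strategy="hi_res")
print(len(elements))  # final document elements
```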