feat: use block matrix to reduce peak memory usage for matmul (#3947)

This PR targets the most memory-expensive operation in partitioning PDFs and images: deduplicating pdfminer elements. On large pages the number of elements can exceed 10k, which generates multiple 10k x 10k double-precision float matrices during deduplication, pushing peak memory usage close to 13 GB.

![Screenshot 2025-03-06 at 3 22 52 PM](https://github.com/user-attachments/assets/fdc26806-947b-4b5a-9d8e-4faeb0179b9f)


This PR breaks the computation into blocks by computing partial IOUs: the IOU of 2000 elements at a time against all elements. This reduces peak memory usage by about 10x, to around 1.6 GB.

![image](https://github.com/user-attachments/assets/e7b9f149-2b6a-4fc9-83c7-652e20849b76)
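A back-of-the-envelope check of the matrix sizes makes the savings plausible (a sketch only; the 13 GB and 1.6 GB figures above come from profiling, and `n` / `block` are illustrative values):

```python
import numpy as np

n = 10_000      # elements on a large page
block = 2_000   # rows computed per partial-IOU pass
bytes_per = np.dtype(np.float64).itemsize  # 8 bytes per double

# one full n x n IOU matrix; several intermediates like this drive the old peak
full_matrix_gb = n * n * bytes_per / 1e9

# one block x n partial matrix; only this much is live per pass
block_matrix_gb = block * n * bytes_per / 1e9

print(full_matrix_gb)   # 0.8 GB per full matrix
print(block_matrix_gb)  # 0.16 GB per partial matrix, a 5x smaller working set
```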


The block size is configurable to match the user's preferred peak memory usage; it is set via the env var `UNST_MATMUL_MEMORY_CAP_IN_GB`.
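The blockwise deduplication can be sketched as follows. This is not the library's code: `boxes_iou` and `dedupe_keep_mask` are simplified stand-ins written for illustration; only the env var name and the 2e3 block size come from the PR.

```python
import os

import numpy as np


def boxes_iou(boxes_a, boxes_b, threshold):
    """Pairwise IOU between (m, 4) and (n, 4) arrays of [x1, y1, x2, y2] boxes,
    thresholded to a boolean (m, n) matrix; a stand-in for the library helper."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter) > threshold


def dedupe_keep_mask(coords, threshold=0.5):
    """Boolean keep-mask: drop any box that overlaps a later box above threshold,
    computed block by block so only a (block, n) matrix is live at a time."""
    memory_cap_in_gb = float(os.getenv("UNST_MATMUL_MEMORY_CAP_IN_GB", "1"))
    n_split = max(int(np.ceil(coords.shape[0] / 2e3 / memory_cap_in_gb)), 1)
    keep, offset = [], 0
    for block in np.array_split(coords, n_split, axis=0):
        iou = boxes_iou(block, coords, threshold)
        # shift the triangular mask by the block's global row offset so each row
        # is compared only against strictly later columns
        keep.append(~np.triu(iou, k=offset + 1).any(axis=1))
        offset += block.shape[0]
    return np.concatenate(keep)


boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 10], [20, 20, 30, 30]], dtype=float)
print(dedupe_keep_mask(boxes))  # drops the first of the duplicate pair
```

Concatenating the per-block masks reproduces the result of the single full-matrix pass while never materializing the full n x n matrix.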
Author: Yao You · 2025-03-06 18:28:36 -06:00 · committed by GitHub
parent 19373de5ff
commit 961c8d5b11
3 changed files with 14 additions and 6 deletions


@@ -1,4 +1,4 @@
-## 0.16.24-dev5
+## 0.16.24
 ### Enhancements
@@ -6,6 +6,7 @@
   in unstructured and `register_partitioner` to enable registering your own partitioner for any file type.
 - **`extract_image_block_types` now also works for CamelCase element type names**. Previously `NarrativeText` and similar CamelCase element types couldn't be extracted using the mentioned parameter in `partition`. Now figures for those elements can be extracted like `Image` and `Table` elements
+- **use block matrix to reduce peak memory usage for pdf/image partition**.
 ### Features


@@ -1 +1 @@
-__version__ = "0.16.24-dev5" # pragma: no cover
+__version__ = "0.16.24" # pragma: no cover


@@ -1,5 +1,6 @@
 from __future__ import annotations
+import os
 from typing import TYPE_CHECKING, Any, BinaryIO, Iterable, List, Optional, Union, cast
 import numpy as np
@@ -708,10 +709,16 @@ def remove_duplicate_elements(
 ) -> TextRegions:
     """Removes duplicate text elements extracted by PDFMiner from a document layout."""
-    iou = boxes_self_iou(elements.element_coords, threshold)
-    # this is equivalent of finding those rows where `not iou[i, i + 1 :].any()`, i.e., any element
-    # that has no overlap above the threshold with any other elements
-    return elements.slice(~np.triu(iou, k=1).any(axis=1))
+    coords = elements.element_coords
+    # experiments show 2e3 is the block size that constrains the peak memory to around 1 GB for
+    # this function; that accounts for all the intermediate matrices allocated and the memory for
+    # storing the final results
+    memory_cap_in_gb = float(os.getenv("UNST_MATMUL_MEMORY_CAP_IN_GB", "1"))
+    n_split = int(np.ceil(coords.shape[0] / 2e3 / memory_cap_in_gb))
+    ious, offset = [], 0
+    for split in np.array_split(coords, max(n_split, 1), axis=0):
+        # offset the triangular mask by each block's starting row so every row is only
+        # compared against strictly later elements
+        ious.append(~np.triu(boxes_iou(split, coords, threshold), k=offset + 1).any(axis=1))
+        offset += split.shape[0]
+    return elements.slice(np.concatenate(ious))
 def aggregate_embedded_text_by_block(