unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Yao You 8f2a719873

Feat/refactor layoutelement textregion to vectorized data structure (#3881)

This PR refactors the data structure for `list[LayoutElement]` and
`list[TextRegion]` used when partitioning PDF and image files.

- the new data structure replaces a list of objects with a single object
that stores the data in `numpy` arrays
- this only affects internal partition steps and does not change the input
or output signature of the `partition` function itself, i.e., `partition`
still returns `list[Element]`
- internally `list[LayoutElement]` -> `LayoutElements`;
`list[TextRegion]` -> `TextRegions`
- the current refactor stops before the step that cleans up pdfminer
elements inside inferred layout elements, because that clean-up algorithm
needs to be refactored before the data structure refactor can move
forward. For now, the refactor converts the array data structure back into
a list data structure with an `element_array.as_list()` call. This is the
last step before turning `list[LayoutElement]` into the returned
`list[Element]`
- a future PR will update this last step so that `list[Element]` is built
directly from the `LayoutElements` data structure instead.
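
The list-of-objects to array-backed conversion described above can be illustrated with a minimal sketch. The `Box` and `Boxes` classes and their fields are hypothetical stand-ins invented for this example; they are not the actual `LayoutElements` or `TextRegions` implementation, and only the `as_list()` name is taken from the description above.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """Hypothetical per-object record, standing in for one text region."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: str


@dataclass
class Boxes:
    """Hypothetical array-backed container: one object holds every region."""

    coords: np.ndarray  # shape (n, 4), columns are x1, y1, x2, y2
    texts: np.ndarray  # shape (n,), dtype=object

    @classmethod
    def from_list(cls, boxes):
        # Pack the per-object fields into column-oriented arrays.
        coords = np.array(
            [[b.x1, b.y1, b.x2, b.y2] for b in boxes], dtype=float
        ).reshape(-1, 4)
        texts = np.array([b.text for b in boxes], dtype=object)
        return cls(coords=coords, texts=texts)

    def as_list(self):
        # Convert back to a list of objects for code that still expects one.
        return [
            Box(*map(float, row), text=text)
            for row, text in zip(self.coords, self.texts)
        ]


boxes = [Box(0, 0, 10, 5, "title"), Box(1, 6, 9, 8, "body")]
array_form = Boxes.from_list(boxes)
round_trip = array_form.as_list()
```

In this sketch the round trip through `as_list()` plays the same bridging role the description assigns to `element_array.as_list()`: downstream code that has not been refactored yet still receives a plain list.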

The goal of this PR is to replace the data structure as much as possible
without changing the underlying logic. There are a few places where the
slicing or filtering logic was simple enough to be converted into vector
data structure operations. Those are refactored to be vector based. As a
result, some small improvements were observed in the ingest tests. This is
likely because the vector operations cleaned up some previous
inconsistencies in data types and operations.
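
As an illustration of what converting simple filtering logic into vector operations can look like, here is a hedged sketch that filters bounding boxes by area, once with a loop and once with a `numpy` boolean mask. The function names, the area threshold, and the sample coordinates are assumptions made for this example and do not come from the PR.

```python
import numpy as np

# Hypothetical bounding boxes, one row per region: x1, y1, x2, y2.
coords = np.array(
    [
        [0.0, 0.0, 10.0, 5.0],
        [2.0, 2.0, 3.0, 3.0],
        [5.0, 1.0, 9.0, 4.0],
    ]
)


def filter_by_area_loop(rows, min_area):
    """List-style filtering: inspect one box at a time."""
    kept = []
    for x1, y1, x2, y2 in rows:
        if (x2 - x1) * (y2 - y1) >= min_area:
            kept.append([x1, y1, x2, y2])
    return kept


def filter_by_area_vectorized(arr, min_area):
    """Array-style filtering: compute all areas at once, then apply a mask."""
    areas = (arr[:, 2] - arr[:, 0]) * (arr[:, 3] - arr[:, 1])
    return arr[areas >= min_area]


large_loop = filter_by_area_loop(coords.tolist(), min_area=10.0)
large_vectorized = filter_by_area_vectorized(coords, min_area=10.0)
```

Both versions keep the same rows. The vectorized one also returns a single array with one dtype, which is one way such a rewrite can remove inconsistencies in data types.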

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>

2025-01-23 17:11:38 +00:00

ocr_models

feat: partition_pdf() support language specification for PaddleOCR (#3400)

2024-07-16 22:19:25 +00:00

__init__.py

rfctr: prepare to add orig_elements serde (#2668)

2024-03-20 21:27:59 +00:00

test_config.py

feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameters (#3014)

2024-05-17 19:16:10 +00:00

test_sorting.py

Feat/refactor layoutelement textregion to vectorized data structure (#3881)

2025-01-23 17:11:38 +00:00

test_xycut.py

refactor: partition_pdf() for ocr_only strategy (#1811)

2023-10-30 20:13:29 +00:00