unstructured/test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure
Yao You 723c0740e0
Feat/vectorize layout merging (#3900)
This PR rewrites the logic in `unstructured_inference` that merges
extracted with inferred layout using vectorized operations. The goal is
to:
- vectorize the operation to improve memory and cpu efficiency
- apply logic equally without order being a factor (the
`unstructured_inference` version uses loops and modifies the content of
the inner loop on the fly -> order of the out loop, which is the order
of extracted elements becomes a factor) determining the merging results
- rewrite the loop into clear steps with clear rules
- setup stage for followup improvements

While this PR aim to reproduce the existing behavior as much as possible
it is not an exact replica of the looped version. Because order is not a
factor any more some extracted elements that used to be not considered
part of a larger inferred element (due to processing order being not
optimum) are now properly merged. This lead to changes in one ingest
test. For example, the change shows that now we properly merge the
section numerical number with the section title as the full title
element.

## Test:

Since the goal of this refactor is to preserve as much existing behavior
as possible we rely on existing tests. As mentioned above the one file
that changed output during ingest test is a net positive change.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-02-07 20:25:57 +00:00
..