qued 15253a53ea
feat: track text source (#4112)
The purpose of this PR is to use the newly created `is_extracted`
parameter in `TextRegion` (and the corresponding vector version
`is_extracted_array` in `TextRegions`), flagging elements that were
extracted directly from PDFs as such.

This also involved:
- New tests
- A version update to bring in the new `unstructured-inference`
- An ingest fixtures update
- An optimization from Codeflash that's not directly related

One important thing to review is that all avenues by which an element is
extracted and ends up in the output of a partition are covered... fast,
hi_res, etc.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: luke-kucing <luke@unstructured.io>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: qued <qued@users.noreply.github.com>
2025-11-11 19:04:01 +00:00
..
2025-04-29 00:58:05 +00:00