unstructured/expected-structured-output at 15253a53eaf7f987d2bd5ba97c043e92297261c8

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-18 19:37:29 +00:00

qued 15253a53ea

feat: track text source (#4112)

The purpose of this PR is to use the newly created `is_extracted`
parameter in `TextRegion` (and the corresponding vector version
`is_extracted_array` in `TextRegions`), flagging elements that were
extracted directly from PDFs as such.

This also involved:
- New tests
- A version update to bring in the new `unstructured-inference`
- An ingest fixtures update
- An optimization from Codeflash that's not directly related

One important thing to review is that all avenues by which an element is
extracted and ends up in the output of a partition are covered: fast,
hi_res, etc.
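
As a rough illustration of the idea described in this commit message, the sketch below pairs a per-region `is_extracted` flag with a parallel `is_extracted_array` on a collection. The classes `ToyTextRegion` and `ToyTextRegions` and the helper `extracted_texts` are hypothetical stand-ins written for this example; they are not the actual `unstructured-inference` classes, whose fields and constructors may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTextRegion:
    """Hypothetical stand-in for a single text region."""
    text: str
    # True: text was read directly from the PDF; False: it came from
    # another source (for example OCR); None: the source is unknown.
    is_extracted: Optional[bool] = None


@dataclass
class ToyTextRegions:
    """Hypothetical stand-in for the vectorized collection of regions."""
    texts: List[str]
    is_extracted_array: List[Optional[bool]]

    @classmethod
    def from_list(cls, regions: List[ToyTextRegion]) -> "ToyTextRegions":
        # Keep the flags in the same order as the texts so index i
        # always describes the same region in both lists.
        return cls(
            texts=[r.text for r in regions],
            is_extracted_array=[r.is_extracted for r in regions],
        )


def extracted_texts(regions: ToyTextRegions) -> List[str]:
    """Return only the texts flagged as extracted directly from the PDF."""
    return [
        text
        for text, flag in zip(regions.texts, regions.is_extracted_array)
        if flag is True
    ]


if __name__ == "__main__":
    regions = ToyTextRegions.from_list(
        [
            ToyTextRegion("Embedded PDF text", is_extracted=True),
            ToyTextRegion("Text recovered by OCR", is_extracted=False),
            ToyTextRegion("Text of unknown origin"),
        ]
    )
    print(extracted_texts(regions))
```

Keeping the flag optional lets a caller distinguish "not extracted" from "source unknown", which is one way to check that every partitioning path sets the value explicitly.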

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: luke-kucing <luke@unstructured.io>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: qued <qued@users.noreply.github.com>

2025-11-11 19:04:01 +00:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

feat: support pdf link extraction in hi_res strategy (#3753)

2024-10-31 16:52:27 +00:00

feat: track text source (#4112)

2025-11-11 19:04:01 +00:00

fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975)

2025-04-04 14:38:23 -07:00

biomed-path/07/07

fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975)

2025-04-04 14:38:23 -07:00

fix: improve false-positive Title elements on Chinese text (#3836)

2024-12-18 01:16:42 +00:00

confluence-diff

feat: support extracting image url in html (#3955)

2025-03-13 22:41:10 +00:00

rfctr(csv): minify HTML and table text is cct (#3733)

2024-10-19 06:49:09 +00:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

fix: improve false-positive Title elements on Chinese text (#3836)

2024-12-18 01:16:42 +00:00

rfctr(part): remove double-decoration 4 (#3690)

2024-10-03 16:41:31 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

rfctr(part): remove double-decoration 4 (#3690)

2024-10-03 16:41:31 +00:00

embed-mixedbreadai

Potter/mixedbread embedder (#3513)

2024-08-27 14:52:13 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

feat: add VoyageAI embeddings (#3069) (#3099)

2024-05-24 21:48:35 +00:00

fix(xlsx): XLSX emits std minified .text_as_html (#3558)

2024-10-17 22:05:11 +00:00

fix: html incorrectly categorizing text (#3841)

2024-12-18 18:46:54 +00:00

fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975)

2025-04-04 14:38:23 -07:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

feat: allow extraction of camel cased element type names (#3938)

2025-03-04 01:33:05 +00:00

feat: Kafka source and destination connector (#3176)

2024-06-22 23:26:23 +00:00

local-single-file

fix cve (#3989)

2025-04-29 00:58:05 +00:00

local-single-file-basic-chunking

enhancement: Speed up function _assign_hash_ids by 34% (#4101)

2025-09-25 20:49:15 +00:00

local-single-file-chunk-no-orig-elements

fix cve (#3989)

2025-04-29 00:58:05 +00:00

local-single-file-with-encoding

chore: switch to charset normalizer (#4060)

2025-07-22 19:02:40 +00:00

local-single-file-with-pdf-infer-table-structure

feat: track text source (#4112)

2025-11-11 19:04:01 +00:00

rfctr(csv): minify HTML and table text is cct (#3733)

2024-10-19 06:49:09 +00:00

feat: support extracting image url in html (#3955)

2025-03-13 22:41:10 +00:00

onedrive/utic-test-ingest-fixtures

fix(xlsx): XLSX emits std minified .text_as_html (#3558)

2024-10-17 22:05:11 +00:00

rfctr(part): remove double-decoration 4 (#3690)

2024-10-03 16:41:31 +00:00

chore: dependency bumps, release commit for 0.16.12 (#3831)

2025-01-05 13:50:19 -08:00

pdf-fast-reprocess/azure

feat: detect language for PDFs (#4051)

2025-07-15 18:53:28 +00:00

feat: support pdf link extraction in hi_res strategy (#3753)

2024-10-31 16:52:27 +00:00

fix cve (#3989)

2025-04-29 00:58:05 +00:00

feat: support extracting image url in html (#3955)

2025-03-13 22:41:10 +00:00

rfctr(csv): minify HTML and table text is cct (#3733)

2024-10-19 06:49:09 +00:00

rfctr: Implement Sharepoint V2 Source Connector (#3314)

2024-07-09 09:52:59 +00:00

Sharepoint-with-permissions

rfctr: Implement Sharepoint V2 Source Connector (#3314)

2024-07-09 09:52:59 +00:00

fix: update slack test to point to new channel (#3328)

2024-07-01 18:11:21 +00:00