unstructured/expected-structured-output at 43b682ad3f66cb9c1fa55a30a7ae827087f4e50f

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-14 01:17:36 +00:00


Yao You 43b682ad3f

feat: allow extraction of camel cased element type names (#3938)

This PR allows element types with CamelCase names to be extracted
using the `extract_image_block_types` variable.

Before: specifying `extract_image_block_types=["NarrativeText"]` (or any
casing of `NarrativeText`) would raise a warning that it doesn't match
any available types, and no image would be extracted for this element
type.

Now: specifying `extract_image_block_types=["NarrativeText"]` extracts
images for this element type.

## testing

```python
from unstructured.partition.auto import partition
f = "example-docs/pdf/embedded-images-tables.pdf"
elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"])
```

Without this PR, no figures are extracted. With this PR, a local
folder is created containing images of the narrative text elements,
at paths like `./figures/figure-1-1.jpg`.
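
The matching behavior described above can be illustrated with a minimal sketch. This is not the library's implementation: `KNOWN_ELEMENT_TYPES` and `resolve_block_types` are hypothetical names introduced here, and the type list is only an illustrative subset. It assumes the fix amounts to comparing requested names against known element type names without regard to case.

```python
# Hypothetical sketch; the names below are NOT part of the unstructured API.
# Illustrative subset of element type names (assumption).
KNOWN_ELEMENT_TYPES = ["Image", "Table", "NarrativeText", "Title"]


def resolve_block_types(requested, known=KNOWN_ELEMENT_TYPES):
    """Map user-supplied names to canonical type names, ignoring case.

    Returns (matched, unmatched): canonical names that were found, and the
    raw inputs that match no known type (candidates for a warning).
    """
    by_lower = {name.lower(): name for name in known}
    matched, unmatched = [], []
    for name in requested:
        canonical = by_lower.get(name.lower())
        if canonical is None:
            unmatched.append(name)
        elif canonical not in matched:
            matched.append(canonical)
    return matched, unmatched


matched, unmatched = resolve_block_types(["narrativetext", "TABLE", "Figure"])
print(matched)    # ['NarrativeText', 'Table']
print(unmatched)  # ['Figure']
```

Comparing lowercased forms on both sides means `"narrativetext"`, `"NarrativeText"`, and `"NARRATIVETEXT"` all resolve to the same canonical name, which is the behavior the test above relies on.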

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

2025-03-04 01:33:05 +00:00


Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

feat: support pdf link extraction in hi_res strategy (#3753)

2024-10-31 16:52:27 +00:00

fix: fix multiple values for infer_table_structure (#3870)

2025-01-17 18:41:04 +00:00

Feat: Add pdfminer parameters configuration (#3918)

2025-02-17 11:41:20 +00:00

biomed-path/07/07

feat: improve pdfminer element processing (#3618)

2024-09-12 21:17:27 +00:00

fix: improve false-positive Title elements on Chinese text (#3836)

2024-12-18 01:16:42 +00:00

confluence-diff

fix: html incorrectly categorizing text (#3841)

2024-12-18 18:46:54 +00:00

rfctr(csv): minify HTML and table text is cct (#3733)

2024-10-19 06:49:09 +00:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

fix: improve false-positive Title elements on Chinese text (#3836)

2024-12-18 01:16:42 +00:00

rfctr(part): remove double-decoration 4 (#3690)

2024-10-03 16:41:31 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

rfctr(part): remove double-decoration 4 (#3690)

2024-10-03 16:41:31 +00:00

embed-mixedbreadai

Potter/mixedbread embedder (#3513)

2024-08-27 14:52:13 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

feat: add VoyageAI embeddings (#3069) (#3099)

2024-05-24 21:48:35 +00:00

fix(xlsx): XLSX emits std minified .text_as_html (#3558)

2024-10-17 22:05:11 +00:00

fix: html incorrectly categorizing text (#3841)

2024-12-18 18:46:54 +00:00

feat: allow extraction of camel cased element type names (#3938)

2025-03-04 01:33:05 +00:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

feat: allow extraction of camel cased element type names (#3938)

2025-03-04 01:33:05 +00:00

feat: Kafka source and destination connector (#3176)

2024-06-22 23:26:23 +00:00

local-single-file

rfctr(part): remove double-decoration 4 (#3690)

2024-10-03 16:41:31 +00:00

local-single-file-basic-chunking

fix: improve false-positive Title elements on Chinese text (#3836)

2024-12-18 01:16:42 +00:00

local-single-file-chunk-no-orig-elements

Feat: Add pdfminer parameters configuration (#3918)

2025-02-17 11:41:20 +00:00

local-single-file-with-encoding

fix: html incorrectly categorizing text (#3841)

2024-12-18 18:46:54 +00:00

local-single-file-with-pdf-infer-table-structure

Feat: Add pdfminer parameters configuration (#3918)

2025-02-17 11:41:20 +00:00

rfctr(csv): minify HTML and table text is cct (#3733)

2024-10-19 06:49:09 +00:00

fix: html incorrectly categorizing text (#3841)

2024-12-18 18:46:54 +00:00

onedrive/utic-test-ingest-fixtures

fix(xlsx): XLSX emits std minified .text_as_html (#3558)

2024-10-17 22:05:11 +00:00

rfctr(part): remove double-decoration 4 (#3690)

2024-10-03 16:41:31 +00:00

chore: dependency bumps, release commit for 0.16.12 (#3831)

2025-01-05 13:50:19 -08:00

pdf-fast-reprocess

Feat: Add pdfminer parameters configuration (#3918)

2025-02-17 11:41:20 +00:00

feat: support pdf link extraction in hi_res strategy (#3753)

2024-10-31 16:52:27 +00:00

rfctr(csv): minify HTML and table text is cct (#3733)

2024-10-19 06:49:09 +00:00

rfctr(email): eml partitioner rewrite (#3694)

2024-10-16 02:02:33 +00:00

rfctr(csv): minify HTML and table text is cct (#3733)

2024-10-19 06:49:09 +00:00

rfctr: Implement Sharepoint V2 Source Connector (#3314)

2024-07-09 09:52:59 +00:00

Sharepoint-with-permissions

rfctr: Implement Sharepoint V2 Source Connector (#3314)

2024-07-09 09:52:59 +00:00

fix: update slack test to point to new channel (#3328)

2024-07-01 18:11:21 +00:00