unstructured/expected-structured-output at 87a88a3c8787b999dad1a7071c36d64c61154025

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-19 11:57:32 +00:00

Christine Straub 87a88a3c87

feat: improve pdfminer element processing (#3618)

This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

2024-09-12 21:17:27 +00:00
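
The idea described in the commit message above (splitting a group of text chunks into per-line bounding boxes, then dropping lines already covered by a detected region) can be sketched roughly as follows. This is an illustrative sketch only, not the code from PR #3618: the names `TextLine`, `TextGroup`, `split_group_into_lines`, and `remove_duplicated_lines`, as well as the 0.5 overlap threshold, are hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x0, y0, x1, y1) in page coordinates; a hypothetical convention for this sketch.
BBox = Tuple[float, float, float, float]


@dataclass
class TextLine:
    text: str
    bbox: BBox


@dataclass
class TextGroup:
    """A group of text chunks, standing in for one extracted text element."""
    lines: List[TextLine]


def area(box: BBox) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a: BBox, b: BBox) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def split_group_into_lines(group: TextGroup) -> List[TextLine]:
    """Return one element per non-empty text line instead of one per group."""
    return [line for line in group.lines if line.text.strip()]


def remove_duplicated_lines(
    lines: List[TextLine], detected_regions: List[BBox], threshold: float = 0.5
) -> List[TextLine]:
    """Drop lines whose box is mostly covered by any detected region.

    The threshold is an assumed value for illustration.
    """
    kept = []
    for line in lines:
        line_area = area(line.bbox)
        covered = line_area > 0 and any(
            intersection_area(line.bbox, region) / line_area >= threshold
            for region in detected_regions
        )
        if not covered:
            kept.append(line)
    return kept


group = TextGroup(
    lines=[
        TextLine("Title already found by the detector", (0, 0, 100, 10)),
        TextLine("Body text found only by text extraction", (0, 12, 100, 22)),
    ]
)
lines = split_group_into_lines(group)
remaining = remove_duplicated_lines(lines, detected_regions=[(0, 0, 100, 11)])
```

Working at line granularity means a group that only partly overlaps a detected region loses just the overlapping lines, rather than being kept or discarded as a whole.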

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

astradb/ingest_test_src

chore: rename astra to astradb (#3458)

2024-08-05 20:41:02 +00:00

feat: improve pdfminer element processing (#3618)

2024-09-12 21:17:27 +00:00

feat: improve pdfminer element processing (#3618)

2024-09-12 21:17:27 +00:00

biomed-path/07/07

feat: improve pdfminer element processing (#3618)

2024-09-12 21:17:27 +00:00

feat(chunk): split tables on even row boundaries (#3504)

2024-08-19 18:56:53 +00:00

confluence-diff

feat(chunk): split tables on even row boundaries (#3504)

2024-08-19 18:56:53 +00:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

feat(chunk): split tables on even row boundaries (#3504)

2024-08-19 18:56:53 +00:00

Feat/migrate elasticsearch src connector (#3174)

2024-06-13 17:57:59 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

embed-mixedbreadai

Potter/mixedbread embedder (#3513)

2024-08-27 14:52:13 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

feat: add VoyageAI embeddings (#3069) (#3099)

2024-05-24 21:48:35 +00:00

feat(chunk): split tables on even row boundaries (#3504)

2024-08-19 18:56:53 +00:00

rfctr(html): replace html parser (#3218)

2024-07-11 00:14:28 +00:00

feat: improve pdfminer element processing (#3618)

2024-09-12 21:17:27 +00:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

feat: Kafka source and destination connector (#3176)

2024-06-22 23:26:23 +00:00

local-single-file

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

local-single-file-basic-chunking

Roman/fix ingest async connectors (#3210)

2024-06-17 16:55:19 +00:00

local-single-file-chunk-no-orig-elements

refactor: restructure PDF/Image example document organization (#3410)

2024-07-18 22:21:32 +00:00

local-single-file-with-encoding

rfctr(html): replace html parser (#3218)

2024-07-11 00:14:28 +00:00

local-single-file-with-pdf-infer-table-structure

feat: improve pdfminer element processing (#3618)

2024-09-12 21:17:27 +00:00

mongodb/sample-mongodb-data

Better element IDs - deterministic and document-unique hashes (#2673)

2024-04-24 00:05:20 -07:00

add support for start_index in html links extraction (#2600)

2024-04-12 06:14:20 +00:00

onedrive/utic-test-ingest-fixtures

feat/migrate onedrive src (#3295)

2024-06-26 23:59:51 +00:00

rfctr [P6M-397]: opensearch source connector v2 (#3302)

2024-07-01 20:35:26 +00:00

feat: msg and email metadata (#3444)

2024-08-01 19:24:17 +00:00

pdf-fast-reprocess

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

feat: improve pdfminer element processing (#3618)

2024-09-12 21:17:27 +00:00

fix: revert dropping of filename extension for some connectors (#3109)

2024-05-29 19:14:22 +00:00

feat: msg and email metadata (#3444)

2024-08-01 19:24:17 +00:00

feat: Migrate over fsspec connectors (#3066)

2024-06-05 19:12:06 +00:00

rfctr: Implement Sharepoint V2 Source Connector (#3314)

2024-07-09 09:52:59 +00:00

Sharepoint-with-permissions

rfctr: Implement Sharepoint V2 Source Connector (#3314)

2024-07-09 09:52:59 +00:00

fix: update slack test to point to new channel (#3328)

2024-07-01 18:11:21 +00:00