mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-25 06:04:53 +00:00
bugfix: ingest pipeline with chunking and embedding does not persist data to the embedding step (#1893)
Closes: #1892 (check the issue for more info)
parent
450e7f0614
commit
c249d02fa8
@@ -1,4 +1,4 @@
## 0.10.28-dev0
## 0.10.28-dev1

### Enhancements

@@ -8,6 +8,7 @@

### Fixes

* **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
* **Fix unnecessary mid-text chunk-splitting.** The "pre-chunker" did not consider separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated producing a over-sized chunk that required mid-text splitting.

## 0.10.27
@@ -31,7 +32,7 @@

### Features

* **Functionality to catch and classify overlapping/nested elements** Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
* **Functionality to catch and classify overlapping/nested elements** Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
* **Add Local connector source metadata** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
* **Add Local connector source metadata.** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.

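The pre-chunker fix noted in the changelog entry above comes down to counting the `"\n\n"` separator when deciding whether another element still fits in a chunk. The following is a minimal sketch of that idea, not the library's implementation; `group_texts`, `SEPARATOR`, and `max_chars` are hypothetical names chosen for illustration.

```python
from typing import List

SEPARATOR = "\n\n"  # blank line assumed to join element texts inside one chunk


def group_texts(texts: List[str], max_chars: int) -> List[List[str]]:
    """Greedily group texts so each group, once joined, fits in max_chars.

    Hypothetical helper: a single text longer than max_chars still forms
    its own (oversized) group and would need mid-text splitting later.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for text in texts:
        # Every text after the first in a group also costs one separator.
        added = len(text) + (len(SEPARATOR) if current else 0)
        if current and current_len + added > max_chars:
            groups.append(current)
            current, current_len = [], 0
            added = len(text)  # first text in the new group: no separator
        current.append(text)
        current_len += added
    if current:
        groups.append(current)
    return groups


# Two 40-character texts sum to 80, but joined they need 82 characters,
# so with a limit of 81 they must land in separate groups.
groups = group_texts(["a" * 40, "b" * 40, "c" * 40], max_chars=81)
joined = [SEPARATOR.join(group) for group in groups]
```

Ignoring the separator in the size check would place the first two texts together and yield an 82-character chunk, which is the over-population the changelog describes.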
@@ -72,6 +72,7 @@ python_version=$(python --version 2>&1)
tests_to_ignore=(
  'test-ingest-notion.sh'
  'test-ingest-dropbox.sh'
  'test-ingest-sharepoint.sh'
)

for test in "${all_tests[@]}"; do
@@ -106,4 +107,4 @@ for eval in "${all_eval[@]}"; do
  echo "--------- RUNNING SCRIPT $eval ---------"
  ./test_unstructured_ingest/evaluation-metrics.sh "$eval"
  echo "--------- FINISHED SCRIPT $eval ---------"
done
done

@@ -1 +1 @@
__version__ = "0.10.28-dev0"  # pragma: no cover
__version__ = "0.10.28-dev1"  # pragma: no cover

@@ -640,4 +640,5 @@ TYPE_TO_TEXT_ELEMENT_MAP: Dict[str, Any] = {
    "Field-Name": Title,
    "Value": NarrativeText,
    "Link": NarrativeText,
    "CompositeElement": Text,
}

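To illustrate why the one-line map change above matters, here is a self-contained sketch of rehydrating serialized elements through a type-to-class map. It does not use the real `unstructured` classes or `isd_to_elements`; `Text`, `Title`, `NarrativeText`, `TYPE_MAP`, and `dicts_to_elements` are stand-ins defined here, and the assumption that unrecognized types are skipped is inferred from the empty embedding output described in the changelog.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Type


@dataclass
class Text:
    """Stand-in for a generic text element (not the library class)."""
    text: str


class Title(Text):
    pass


class NarrativeText(Text):
    pass


# Stand-in for the map before the fix: no "CompositeElement" entry.
TYPE_MAP: Dict[str, Type[Text]] = {
    "Title": Title,
    "NarrativeText": NarrativeText,
}


def dicts_to_elements(
    items: List[Dict[str, Any]], type_map: Dict[str, Type[Text]]
) -> List[Text]:
    """Rebuild element objects from dicts, skipping unknown types (assumed)."""
    elements: List[Text] = []
    for item in items:
        cls = type_map.get(item.get("type"))
        if cls is None:
            continue  # unknown type: the element is dropped
        elements.append(cls(text=item["text"]))
    return elements


# Chunking output serialized to dicts, as a later pipeline step would read it.
chunked = [{"type": "CompositeElement", "text": "Intro\n\nBody"}]

before = dicts_to_elements(chunked, TYPE_MAP)
after = dicts_to_elements(chunked, {**TYPE_MAP, "CompositeElement": Text})
```

Under these assumptions, `before` is empty because the chunk type has no mapping, while `after` contains one `Text` element that an embedding step could consume.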