bugfix: ingest pipeline with chunking and embedding does not persist data to the embedding step (#1893)

Closes: #1892 (check the issue for more info)
Ahmet Melek 2023-10-27 14:07:00 +01:00 committed by GitHub
parent 450e7f0614
commit c249d02fa8
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 7 additions and 4 deletions


@ -1,4 +1,4 @@
-## 0.10.28-dev0
+## 0.10.28-dev1
### Enhancements
@ -8,6 +8,7 @@
### Fixes
+* **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
* **Fix unnecessary mid-text chunk-splitting.** The "pre-chunker" did not consider separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated producing a over-sized chunk that required mid-text splitting.
## 0.10.27
@ -31,7 +32,7 @@
### Features
* **Functionality to catch and classify overlapping/nested elements** Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
* **Functionality to catch and classify overlapping/nested elements** Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
-* **Add Local connector source metadata** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
+* **Add Local connector source metadata.** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.


@ -72,6 +72,7 @@ python_version=$(python --version 2>&1)
tests_to_ignore=(
'test-ingest-notion.sh'
'test-ingest-dropbox.sh'
'test-ingest-sharepoint.sh'
)
for test in "${all_tests[@]}"; do
@ -106,4 +107,4 @@ for eval in "${all_eval[@]}"; do
echo "--------- RUNNING SCRIPT $eval ---------"
./test_unstructured_ingest/evaluation-metrics.sh "$eval"
echo "--------- FINISHED SCRIPT $eval ---------"
done
done


@ -1 +1 @@
-__version__ = "0.10.28-dev0" # pragma: no cover
+__version__ = "0.10.28-dev1" # pragma: no cover


@ -640,4 +640,5 @@ TYPE_TO_TEXT_ELEMENT_MAP: Dict[str, Any] = {
"Field-Name": Title,
"Value": NarrativeText,
"Link": NarrativeText,
+"CompositeElement": Text,
}
}
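
Note on why the one-line map change is enough: the changelog entry says chunk output could not be re-read into memory before the embedding step. A minimal sketch of that failure mode follows. It is a simplified stand-in, not the library's code: `dicts_to_elements`, `TYPE_TO_ELEMENT`, and the toy element classes are hypothetical names, and the sketch assumes that the deserializer silently skips any dict whose `type` is missing from the map.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Type


@dataclass
class Text:
    """Toy element class; the real library's Text carries more fields."""
    text: str


class Title(Text):
    pass


class NarrativeText(Text):
    pass


# Hypothetical, trimmed-down stand-in for TYPE_TO_TEXT_ELEMENT_MAP.
TYPE_TO_ELEMENT: Dict[str, Type[Text]] = {
    "Title": Title,
    "NarrativeText": NarrativeText,
}


def dicts_to_elements(
    items: List[Dict[str, Any]], type_map: Dict[str, Type[Text]]
) -> List[Text]:
    """Rebuild elements from serialized dicts.

    Assumption: entries whose type is not in the map are skipped without
    raising, which is how chunk output could vanish unnoticed.
    """
    elements: List[Text] = []
    for item in items:
        cls = type_map.get(item["type"])
        if cls is not None:
            elements.append(cls(text=item["text"]))
    return elements


# Serialized output of a chunking step: every entry is a CompositeElement.
chunked = [
    {"type": "CompositeElement", "text": "Intro\n\nFirst paragraph."},
    {"type": "CompositeElement", "text": "Second chunk."},
]

# Without the map entry, nothing survives to be embedded.
before = dicts_to_elements(chunked, TYPE_TO_ELEMENT)

# With the entry added (mapped to Text, as in this commit), both chunks load.
after = dicts_to_elements(chunked, {**TYPE_TO_ELEMENT, "CompositeElement": Text})

print(len(before), len(after))
```

Under those assumptions the embedding step receives an empty list before the change and both chunks after it, which matches the "embedding outputs were empty" symptom described in the changelog.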