diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5f74285f7..6da779501 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.28-dev0
+## 0.10.28-dev1
 
 ### Enhancements
 
@@ -8,6 +8,7 @@
 
 ### Fixes
 
+* **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
 * **Fix unnecessary mid-text chunk-splitting.** The "pre-chunker" did not consider separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated producing a over-sized chunk that required mid-text splitting.
 
 ## 0.10.27
@@ -31,7 +32,7 @@
 
 ### Features
 
-* **Functionality to catch and classify overlapping/nested elements** Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
+* **Functionality to catch and classify overlapping/nested elements** Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
 * **Add Local connector source metadata** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
 * **Add Local connector source metadata.** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
 
diff --git a/test_unstructured_ingest/test-ingest.sh b/test_unstructured_ingest/test-ingest.sh
index f7c7260e4..1a9379f17 100755
--- a/test_unstructured_ingest/test-ingest.sh
+++ b/test_unstructured_ingest/test-ingest.sh
@@ -72,6 +72,7 @@ python_version=$(python --version 2>&1)
 tests_to_ignore=(
   'test-ingest-notion.sh'
   'test-ingest-dropbox.sh'
+  'test-ingest-sharepoint.sh'
 )
 
 for test in "${all_tests[@]}"; do
@@ -106,4 +107,4 @@ for eval in "${all_eval[@]}"; do
   echo "--------- RUNNING SCRIPT $eval ---------"
   ./test_unstructured_ingest/evaluation-metrics.sh "$eval"
   echo "--------- FINISHED SCRIPT $eval ---------"
-done
\ No newline at end of file
+done
diff --git a/unstructured/__version__.py b/unstructured/__version__.py
index 1afb2323b..e6b64d42d 100644
--- a/unstructured/__version__.py
+++ b/unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.28-dev0" # pragma: no cover
+__version__ = "0.10.28-dev1" # pragma: no cover
diff --git a/unstructured/documents/elements.py b/unstructured/documents/elements.py
index d6e7da601..7f9d22631 100644
--- a/unstructured/documents/elements.py
+++ b/unstructured/documents/elements.py
@@ -640,4 +640,5 @@ TYPE_TO_TEXT_ELEMENT_MAP: Dict[str, Any] = {
     "Field-Name": Title,
     "Value": NarrativeText,
     "Link": NarrativeText,
+    "CompositeElement": Text,
 }
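
Reviewer note: the following is a minimal, self-contained sketch of why a missing map entry could lead to empty embedding inputs. It does not import `unstructured`. The classes `Text` and `NarrativeText`, the maps `TYPE_MAP_BEFORE` and `TYPE_MAP_AFTER`, and the helper `dicts_to_elements` are all hypothetical stand-ins, and the sketch assumes the deserializer silently skips any `type` that is absent from the map, which would be consistent with the symptom described in the changelog entry.

```python
from typing import Any, Dict, List, Type


class Text:
    """Stand-in element class; not the library's real implementation."""

    def __init__(self, text: str) -> None:
        self.text = text


class NarrativeText(Text):
    """Stand-in subclass, used only to populate the example map."""


# Hypothetical miniature of a type-name-to-class map, before and after the fix.
TYPE_MAP_BEFORE: Dict[str, Type[Text]] = {"NarrativeText": NarrativeText}
TYPE_MAP_AFTER: Dict[str, Type[Text]] = {**TYPE_MAP_BEFORE, "CompositeElement": Text}


def dicts_to_elements(
    items: List[Dict[str, Any]], type_map: Dict[str, Type[Text]]
) -> List[Text]:
    """Rebuild elements from dicts, skipping any type the map does not know.

    The skip-on-unknown behavior is an assumption made for this sketch.
    """
    elements: List[Text] = []
    for item in items:
        cls = type_map.get(item.get("type", ""))
        if cls is not None:
            elements.append(cls(text=item["text"]))
    return elements


# Serialized chunking output, as it might be re-read before embedding.
chunked = [{"type": "CompositeElement", "text": "chunk one"}]

before = dicts_to_elements(chunked, TYPE_MAP_BEFORE)
after = dicts_to_elements(chunked, TYPE_MAP_AFTER)
print(len(before), len(after))  # prints "0 1"
```

Under that assumption, the one-line map addition is enough for chunked output to survive the round trip, so the embedding step receives elements instead of an empty list.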