mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-09-18 21:10:01 +00:00
chore: update CHANGELOG.md (#1420)
Move to a new CHANGELOG.md convention to more fully describe changes. Bullets should address: what was broken? what was fixed? why does it matter? To assist with scanning changes, the first sentence in each bullet is in **bold**.

Note: it's also worth looking at the rendered markdown in the branch: https://github.com/Unstructured-IO/unstructured/blob/crag/changelog-tweak/CHANGELOG.md rather than just the git diff.

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
parent
3655a752bc
commit
8f60784178
23
CHANGELOG.md
@@ -2,27 +2,24 @@
### Enhancements
* Clarify message when sentence is not counted toward sentence count b/c there aren't enough words
* Adds numbered ListItem grouping when pdfminer broke down by line-by-line using coordinates
* Use text-based classification when elements come back uncategorized from PDF/Image partitioning
* Updated HTML Partitioning to extract tables
* Create and add `add_chunking_strategy` decorator to partition functions
* Adds `languages` as an input parameter and marks `ocr_languages` kwarg for deprecation in pdf partitioning functions
* Adds `xlsx` and `xls` to `skip_infer_table_types` default list in `partition`
* Adds `languages` as an input parameter and marks `ocr_languages` kwarg for deprecation in image partitioning functions
* Adds `languages` as an input parameter and marks `ocr_languages` kwarg for deprecation in auto partition
* Replaces `language` with `languages` as an input parameter to unstructured-partition-text_type functions
* Removes `UNSTRUCTURED_LANGUAGE` env var. To skip English specific checks, set the `languages` parameter to non-English language(s).
* **Better ListItem grouping for PDFs (fast strategy).** `partition_pdf` with the `fast` strategy previously split the lines of some numbered list items into separate elements. This enhancement uses the x,y coordinates and bbox sizes to decide whether the following chunk of text continues the immediately preceding ListItem element, rather than detecting it as its own non-ListItem element.
* **Fall back to text-based classification for uncategorized Layout elements for Images and PDFs.** Improves element classification by running the existing text-based rules on elements that were previously classified as UncategorizedText.
* **Adds table partitioning for many doc types including: .html, .epub, .md, .rst, .odt, and .msg.** At the core of this change is the .html partition functionality, which is leveraged by the other affected doc types. This impacts many scenarios where `Table` Elements are now properly extracted.
* **Create and add `add_chunking_strategy` decorator to partition functions.** Previously, users were responsible for their own chunking after partitioning elements, a step often required for downstream applications. Now, if `chunking_strategy=by_title`, individual elements may be combined into right-sized chunks, and min and max character size may be specified. Relevant elements are grouped together for better downstream results. This enables users to immediately use partitioned results effectively in downstream applications (e.g. RAG architecture apps) without any additional post-processing.
* **Adds `languages` as an input parameter and marks `ocr_languages` kwarg for deprecation in pdf, image, and auto partitioning functions.** Previously, language information was only used for Tesseract OCR on image-based documents and was in a Tesseract-specific string format. By refactoring it into a list of standard language codes independent of Tesseract, the `unstructured` library will better support `languages` for other non-image pipelines and/or other OCR engines.
* **Removes `UNSTRUCTURED_LANGUAGE` env var usage and replaces `language` with `languages` as an input parameter to unstructured-partition-text_type functions.** The previous parameter/input setup was not user-friendly or scalable to the variety of elements being processed. By refactoring the input language information into a list of standard language codes, we can support future applications of the element language such as detection, metadata, and multi-language elements. Now, to skip English-specific checks, set the `languages` parameter to any non-English language(s).
* **Adds `xlsx` and `xls` filetype extensions to the `skip_infer_table_types` default list in `partition`.** With these file types in the default list, such files do not go through table extraction by default. Users can still extract tables from these filetypes, but will have to set `skip_infer_table_types` so that it excludes the desired filetype extension. This avoids misrepresenting complex spreadsheets where there may be multiple sub-tables and other content.
* **Better debug output related to sentence counting internals.** Clarifies the message shown when a sentence is not counted toward the sentence count because there aren't enough words. Relevant for developers focused on `unstructured`'s NLP internals.
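
The by-title chunking described in the `add_chunking_strategy` bullet can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the `Element` class, the `chunk_by_title_sketch` function, and the `max_chars` parameter are hypothetical names chosen for illustration, and only a maximum size is modeled.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def chunk_by_title_sketch(elements, max_chars=500):
    """Group elements into chunks.

    A new chunk starts at every Title, or when adding the next element
    would push the current chunk past max_chars (an assumed parameter).
    """
    chunks, current, size = [], [], 0
    for el in elements:
        starts_section = el.category == "Title"
        too_big = size + len(el.text) > max_chars
        if current and (starts_section or too_big):
            chunks.append(current)
            current, size = [], 0
        current.append(el)
        size += len(el.text)
    if current:
        chunks.append(current)
    return chunks


elements = [
    Element("Title", "Intro"),
    Element("NarrativeText", "First paragraph."),
    Element("NarrativeText", "Second paragraph."),
    Element("Title", "Methods"),
    Element("NarrativeText", "Details."),
]
chunks = chunk_by_title_sketch(elements)
print([[el.text for el in chunk] for chunk in chunks])
```

With the default size limit, the sample input yields two chunks, one per title section. Lowering `max_chars` splits long sections further.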
### Features
* Adds a naive hierarchy for elements via a `parent_id` on the element's metadata
* **Adds a naive hierarchy for elements via a `parent_id` on the element's metadata**
* Users will now have more metadata for implementing vectordb/LLM chunking strategies. For example, text elements could be queried by their preceding title element.
* Title elements created from HTML headings will properly nest
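
As a rough illustration of the `parent_id` idea, the sketch below links each text element to its preceding title and then queries text by that title. It is an assumption-laden toy: the `Element` class here is hypothetical and keeps `parent_id` as a plain field, whereas the changelog says the real value lives on the element's metadata, and the nesting of HTML headings is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical element; the real library keeps parent_id on element metadata."""
    id: str
    category: str
    text: str
    parent_id: Optional[str] = None


def assign_parent_ids(elements):
    """Naive hierarchy: every non-title element points at the most recent Title."""
    current_title_id = None
    for el in elements:
        if el.category == "Title":
            current_title_id = el.id
        else:
            el.parent_id = current_title_id
    return elements


def children_of(elements, title_id):
    """Return the text of all elements whose parent is the given title."""
    return [el.text for el in elements if el.parent_id == title_id]


doc = assign_parent_ids([
    Element("t1", "Title", "Installation"),
    Element("e1", "NarrativeText", "Run the installer."),
    Element("t2", "Title", "Usage"),
    Element("e2", "NarrativeText", "Call partition on a file."),
    Element("e3", "NarrativeText", "Inspect the elements."),
])
print(children_of(doc, "t2"))
```

A downstream chunking or retrieval step could use a lookup like `children_of` to fetch all text under a given title.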
### Fixes
* Fixes a chunking issue via dropping the field "coordinates". Problem: the chunk_by_title function was placing each element in its own individual chunk when it needed to group elements into fewer chunks. We discovered that this happens because of the metadata matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk. At the same time, any element with "coordinates" essentially had different metadata than other elements, because each element is located in a different place and has different coordinates. Fix: We have therefore included the key "coordinates" in a list of excluded metadata keys used during this "metadata_matches" comparison. Importance: This change is crucial to be able to chunk by title for documents which include "coordinates" metadata in their elements.
* **Fixes a chunking issue via dropping the field "coordinates".** Problem: the chunk_by_title function was placing each element in its own individual chunk when it needed to group elements into fewer chunks. We discovered that this happens because of the metadata matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk. At the same time, any element with "coordinates" essentially had different metadata than other elements, because each element is located in a different place and has different coordinates. Fix: We have therefore included the key "coordinates" in a list of excluded metadata keys used during this "metadata_matches" comparison. Importance: This change is crucial to be able to chunk by title for documents which include "coordinates" metadata in their elements.
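
The effect of excluding "coordinates" from the comparison can be sketched with plain dictionaries. This is only an illustration of the idea under stated assumptions, not the library's code: the function signature, the `EXCLUDED_KEYS` name, and the dict-based metadata are hypothetical.

```python
# Hypothetical name for the list of metadata keys ignored during comparison.
EXCLUDED_KEYS = frozenset({"coordinates"})


def metadata_matches(a: dict, b: dict, excluded=EXCLUDED_KEYS) -> bool:
    """Compare two metadata dicts while ignoring the excluded keys."""
    def strip(meta):
        return {k: v for k, v in meta.items() if k not in excluded}

    return strip(a) == strip(b)


first = {"filename": "report.pdf", "page_number": 1, "coordinates": (10, 20)}
second = {"filename": "report.pdf", "page_number": 1, "coordinates": (10, 80)}

# Without the exclusion, the differing coordinates keep the elements apart.
print(metadata_matches(first, second, excluded=frozenset()))
# With "coordinates" excluded, the two elements can share a chunk.
print(metadata_matches(first, second))
```

Because every element sits at a different position, a comparison that includes "coordinates" would never match, which is the one-element-per-chunk behavior described above.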
## 0.10.14