mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-08-17 21:29:05 +00:00

### Summary

An initial pass on smart chunking for RAG applications. Breaks a document into sections based on the presence of `Title` elements. Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to `True`.
- If the length of the section exceeds `new_after_n_chars` characters. The default is `1500`. The chunking function does not split individual elements, so a section can exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The default is `500`.

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"

# Partition the page into elements, then group them into title-based chunks.
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()  # Pause until Enter is pressed before showing the next chunk.
```
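
The section-breaking rules from the summary can be illustrated with a small standalone sketch. This is not the code in `unstructured.chunking.title`. The `Element` dataclass and the `split_into_sections`, `combine_small_sections`, and `section_length` helpers are hypothetical names made up for illustration. Only the page number is modeled as metadata, and the exact merge rule (a section is folded into the preceding one while that preceding one is still under `combine_under_n_chars`) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str
    page: int = 1  # the only piece of metadata this sketch tracks


def section_length(section: List[Element]) -> int:
    """Total number of characters across the elements of a section."""
    return sum(len(el.text) for el in section)


def split_into_sections(
    elements: List[Element],
    multipage_sections: bool = True,
    new_after_n_chars: int = 1500,
) -> List[List[Element]]:
    """Start a new section at each Title, on a page change (unless sections
    may span pages), or once the current section exceeds new_after_n_chars.
    Individual elements are never split."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for el in elements:
        if current:
            is_title = el.category == "Title"
            page_changed = (not multipage_sections) and el.page != current[-1].page
            too_long = section_length(current) > new_after_n_chars
            if is_title or page_changed or too_long:
                sections.append(current)
                current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def combine_small_sections(
    sections: List[List[Element]],
    combine_under_n_chars: int = 500,
) -> List[List[Element]]:
    """Assumed merge rule: append a section to the previous one while the
    previous one is still under combine_under_n_chars characters."""
    combined: List[List[Element]] = []
    for section in sections:
        if combined and section_length(combined[-1]) < combine_under_n_chars:
            combined[-1] = combined[-1] + section
        else:
            combined.append(section)
    return combined


demo = [
    Element("Title", "Intro"),
    Element("NarrativeText", "a" * 600),
    Element("Title", "Next"),
    Element("NarrativeText", "b" * 10),
    Element("Title", "Last"),
    Element("NarrativeText", "c" * 2000),  # a single oversized element stays whole
    Element("NarrativeText", "d" * 5),
]
sections = split_into_sections(demo)
combined = combine_small_sections(sections)
for section in combined:
    print(section_length(section), [el.category for el in section])
```

In the demo, the 2000-character element is kept intact even though it pushes its section past `new_after_n_chars`, which mirrors the caveat in the summary about long `NarrativeText` elements.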