mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-19 15:06:21 +00:00

### Summary

An initial pass on smart chunking for RAG applications. Breaks a document into sections based on the presence of `Title` elements. Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to `True`.
- If the length of the section exceeds `new_after_n_chars` characters. The default is `1500`. The chunking function does not split individual elements, so a section can exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The default is `500`.

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

# Page through the chunks one at a time; press Enter to see the next one.
for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
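
The sectioning rules described in the summary can be illustrated with a small, self-contained sketch in plain Python. This is not the implementation of `chunk_by_title`. The `Element` class and the `section_length` and `split_into_sections` functions are hypothetical names introduced only for illustration, and the exact threshold and merge behavior shown here is one possible reading of the rules, assuming a new section starts once the current one already exceeds `new_after_n_chars`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str          # e.g. "Title" or "NarrativeText"
    text: str
    page_number: int = 1   # simplified stand-in for element metadata


def section_length(section: List[Element]) -> int:
    """Total number of characters across all elements in a section."""
    return sum(len(el.text) for el in section)


def split_into_sections(
    elements: List[Element],
    multipage_sections: bool = True,
    new_after_n_chars: int = 1500,
    combine_under_n_chars: int = 500,
) -> List[List[Element]]:
    """Group elements into sections (illustrative sketch only)."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for el in elements:
        page_changed = bool(current) and el.page_number != current[-1].page_number
        starts_new = bool(current) and (
            el.category == "Title"
            or (page_changed and not multipage_sections)
            or section_length(current) > new_after_n_chars
        )
        if starts_new:
            sections.append(current)
            current = []
        # Elements are never split, so one long element can push a
        # section past new_after_n_chars.
        current.append(el)
    if current:
        sections.append(current)

    # Second pass: fold a section into the previous one while the
    # previous one is still under combine_under_n_chars.
    combined: List[List[Element]] = []
    for section in sections:
        can_merge = (
            bool(combined)
            and section_length(combined[-1]) < combine_under_n_chars
            and (
                multipage_sections
                or combined[-1][-1].page_number == section[0].page_number
            )
        )
        if can_merge:
            combined[-1].extend(section)
        else:
            combined.append(section)
    return combined


elements = [
    Element("Title", "Intro"),
    Element("NarrativeText", "a" * 600),
    Element("Title", "Next"),
    Element("NarrativeText", "b" * 10),
    Element("Title", "Last"),
    Element("NarrativeText", "c" * 10),
]
for section in split_into_sections(elements):
    print([el.category for el in section], section_length(section))
```

In this sketch the two short trailing sections are merged because the first of them is under `combine_under_n_chars`, while the long opening section is left on its own.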
22 lines
837 B
ReStructuredText
Bricks
======

Bricks are functions that live in ``unstructured`` and are the primary public API for the library.
There are several types of bricks in ``unstructured``, corresponding to the different stages of document pre-processing: partitioning, cleaning, chunking and staging.
After reading this section, you should understand the following:

* How to partition a document into json or csv.
* How to remove unwanted content from document elements using cleaning bricks.
* How to extract content from a document using the extraction bricks.
* How to prepare data for downstream use cases using staging bricks.
* How to chunk partitioned documents for use cases such as Retrieval Augmented Generation (RAG).

.. toctree::
   :maxdepth: 1

   bricks/partition
   bricks/cleaning
   bricks/extracting
   bricks/staging
   bricks/chunking