unstructured/docs/source/bricks.rst
Matt Robinson f6a745a74f
feat: chunk elements based on titles (#1222)
### Summary

An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element is over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The
default is `500`.
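
The rules above can be illustrated with a small, pure-Python sketch. This is not the library's implementation: `Element` and `sketch_chunk_by_title` are hypothetical stand-ins, only the page number is treated as metadata, and "sections under `combine_under_n_chars` are combined" is interpreted here as "a `Title` does not start a new section until the current one has reached that length".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str                      # e.g. "Title" or "NarrativeText"
    text: str
    page_number: Optional[int] = None  # the only metadata this sketch tracks


def sketch_chunk_by_title(
    elements: List[Element],
    multipage_sections: bool = True,
    combine_under_n_chars: int = 500,
    new_after_n_chars: int = 1500,
) -> List[List[Element]]:
    """Group elements into sections using simplified versions of the rules above."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for element in elements:
        if current:
            length = sum(len(e.text) for e in current)
            # A page change only matters when sections may not span pages.
            page_changed = (
                not multipage_sections
                and element.page_number != current[-1].page_number
            )
            # Assumption: short sections absorb the next title instead of ending.
            title_break = (
                element.category == "Title" and length >= combine_under_n_chars
            )
            too_long = length > new_after_n_chars
            if title_break or page_changed or too_long:
                sections.append(current)
                current = []
        # Elements are never split, so one long element can exceed the limit.
        current.append(element)
    if current:
        sections.append(current)
    return sections
```

With the defaults, a title followed by 600 characters of text forms its own section, while two later short title-plus-text pairs are merged into one. Passing `combine_under_n_chars=0` makes every title start a new section.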

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

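# Partition a live web page into elements, then group them into title-based chunks.
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

# Print one chunk at a time; press Enter to advance to the next.
for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()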
```
2023-08-29 16:04:57 +00:00


Bricks
======

Bricks are functions that live in ``unstructured`` and are the primary public API for the library.
There are several types of bricks in ``unstructured``, corresponding to the different stages of document pre-processing: partitioning, cleaning, extracting, staging, and chunking.
After reading this section, you should understand the following:

* How to partition a document into JSON or CSV.
* How to remove unwanted content from document elements using cleaning bricks.
* How to extract content from a document using the extraction bricks.
* How to prepare data for downstream use cases using staging bricks.
* How to chunk partitioned documents for use cases such as Retrieval Augmented Generation (RAG).

.. toctree::
   :maxdepth: 1

   bricks/partition
   bricks/cleaning
   bricks/extracting
   bricks/staging
   bricks/chunking