unstructured/docs/source/core/chunking.rst

52 lines
2.1 KiB
ReStructuredText
Raw Normal View History

########
Chunking
########
Chunking functions in ``unstructured`` use metadata and document elements
detected with ``partition`` functions to split a document into subsections
for uses cases such as Retrieval Augmented Generation (RAG).
``chunk_by_title``
------------------
The ``chunk_by_title`` function combines elements into sections by looking
for the presence of titles. When a title is detected, a new section is created.
Tables and non-text elements (such as page breaks or images) are always their
own section.
New sections are also created if changes in metadata occure. Examples of when
this occurs include when the section of the document or the page number changes
or when an element comes from an attachment instead of from the main document.
If you set ``multipage_sections=True``, ``chunk_by_title`` will allow for sections
that span between pages. This kwarg is ``True`` by default.
``chunk_by_title`` will start a new section if the length of a section exceed
``new_after_n_chars``. The default value is ``1500``. ``chunk_by_title`` does
not split elements, it is possible for a section to exceed that lenght, for
example if a ``NarrativeText`` elements exceeds ``1500`` characters on its on.
Similarly, sections under ``combine_text_under_n_chars`` will be combined if they
do not exceed the specified threshold, which defaults to ``500``. This will combine
a series of ``Title`` elements that occur one after another, which sometimes
happens in lists that are not detected as ``ListItem`` elements. Set
``combine_text_under_n_chars=0`` to turn off this behavior.
The following shows an example of how to use ``chunk_by_title``. You will
see the document chunked into sections instead of elements.
.. code:: python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()