feature(chunking): add basic strategy and overlap (#2367)

This PR culminates the restructuring of chunking over my prior
dozen-or-so commits by adding the new options to the API and
documentation.

Separately I'll be adding a new ingest test to defend against
regression, although the integration test included in this PR will do a
pretty good job of that too.
Steve Canny 2024-01-10 14:19:24 -08:00 committed by GitHub
parent a8a103bc5c
commit 23edf2e911
8 changed files with 402 additions and 45 deletions

View File

@ -2,6 +2,9 @@
### Enhancements
* **Add "basic" chunking strategy.** Add baseline chunking strategy that includes all shared chunking behaviors without breaking chunks on section or page boundaries.
* **Add overlap option for chunking.** Add option to overlap chunks. Intra-chunk and inter-chunk overlap are requested separately. Intra-chunk overlap is applied only to the second and later chunks formed by text-splitting an oversized chunk. Inter-chunk overlap may also be specified; this applies overlap between "normal" (not-oversized) chunks.
### Features
### Fixes

View File

@ -2,50 +2,171 @@
Chunking
########
Chunking functions in ``unstructured`` use metadata and document elements
detected with ``partition`` functions to split a document into subsections
for use cases such as Retrieval Augmented Generation (RAG).
Chunking functions in ``unstructured`` use metadata and document elements detected with
``partition`` functions to split a document into smaller parts for use cases such as Retrieval
Augmented Generation (RAG).
Chunking Basics
---------------
Chunking in ``unstructured`` differs from other chunking mechanisms you may be familiar with.
Typical approaches start with the text extracted from the document and form chunks based on
plain-text features, character sequences like ``"\n\n"`` or ``"\n"`` that might indicate a paragraph
boundary or list-item boundary.
Because ``unstructured`` uses specific knowledge about each document format to partition the
document into semantic units (document elements), we only need to resort to text-splitting when a
single element exceeds the desired maximum chunk size. Except in that case, all chunks contain one
or more whole elements, preserving the coherence of semantic units established during partitioning.
A few concepts about chunking are worth introducing before discussing the details.
- Chunking is performed on *document elements*. It is a separate step performed *after*
partitioning, on the elements produced by partitioning. (Although it can be combined with
partitioning in a single step.)
- In general, chunking *combines* consecutive elements to form chunks as large as possible without
exceeding the maximum chunk size.
- A single element that by itself exceeds the maximum chunk size is divided into two or more chunks
using text-splitting.
- Chunking produces a sequence of ``CompositeElement``, ``Table``, or ``TableChunk`` elements. Each
"chunk" is an instance of one of these three types.
``chunk_by_title``
------------------
Chunking Options
----------------
The ``chunk_by_title`` function combines elements into sections by looking
for the presence of titles. When a title is detected, a new section is created.
Tables and non-text elements (such as page breaks or images) are always their
own section.
The following options are available to tune chunking behaviors. These are keyword arguments that can
be used in a partitioning or chunking function call. All these options have defaults and need only
be specified when a non-default setting is required. Specific chunking strategies (such as
"by-title") may have additional options.
New sections are also created if changes in metadata occur. Examples of when
this occurs include when the section of the document or the page number changes
or when an element comes from an attachment instead of from the main document.
If you set ``multipage_sections=True``, ``chunk_by_title`` will allow for sections
that span between pages. This kwarg is ``True`` by default.
- ``max_characters: int (default=500)`` - the hard maximum size for a chunk. No chunk will exceed
this number of characters. A single element that by itself exceeds this size will be divided into
two or more chunks using text-splitting.
``chunk_by_title`` will start a new section if the length of a section exceeds
``new_after_n_chars``. The default value is ``1500``. Because ``chunk_by_title`` does
not split elements, it is possible for a section to exceed that length, for
example if a ``NarrativeText`` element exceeds ``1500`` characters on its own.
- ``new_after_n_chars: int (default=max_characters)`` - the "soft" maximum size for a chunk. A chunk
that already exceeds this number of characters will not be extended, even if the next element
would fit without exceeding the specified hard maximum. This can be used in conjunction with
``max_characters`` to set a "preferred" size, like "I prefer chunks of around 1000 characters, but
I'd rather have a chunk of 1500 (max_characters) than resort to text-splitting". This would be
specified with ``(..., max_characters=1500, new_after_n_chars=1000)``.
Similarly, sections under ``combine_text_under_n_chars`` will be combined if they
do not exceed the specified threshold, which defaults to ``500``. This will combine
a series of ``Title`` elements that occur one after another, which sometimes
happens in lists that are not detected as ``ListItem`` elements. Set
``combine_text_under_n_chars=0`` to turn off this behavior.
- ``overlap: int (default=0)`` - only when using text-splitting to break up an oversized chunk,
include this number of characters from the end of the prior chunk as a prefix on the next. This
can mitigate the effect of splitting the semantic unit represented by the oversized element at an
arbitrary position based on text length.
The following shows an example of how to use ``chunk_by_title``. You will
see the document chunked into sections instead of elements.
- ``overlap_all: bool (default=False)`` - also apply overlap between "normal" chunks, not just when
text-splitting to break up an oversized element. Because normal chunks are formed from whole
elements that each have a clean semantic boundary, this option may "pollute" normal chunks. You'll
need to decide based on your use-case whether this option is right for you.
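To make the ``overlap`` option concrete, here is a minimal, self-contained sketch of text-splitting with overlap. It is a toy model and not the library's splitter: the name ``split_with_overlap`` and the "cut at the last space inside the window" rule are assumptions made for illustration only.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Toy model: split `text` into pieces of at most `max_characters` characters.

    Hypothetical helper, not part of `unstructured`. Each piece after the first
    starts `overlap` characters before the point where the previous piece ended.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    pieces: List[str] = []
    remainder = text.strip()
    while len(remainder) > max_characters:
        # Prefer the last space inside the window so words stay whole.
        cut = remainder.rfind(" ", 0, max_characters)
        if cut <= overlap:
            # No usable word boundary: fall back to a hard cut at the limit.
            cut = max_characters
        pieces.append(remainder[:cut].rstrip())
        # The next piece re-uses the last `overlap` characters before the cut.
        remainder = remainder[cut - overlap:].lstrip()
    if remainder:
        pieces.append(remainder)
    return pieces


text = (
    "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
    " porta volutpat."
)
for piece in split_with_overlap(text, max_characters=64, overlap=5):
    print(repr(piece))
```

With ``overlap=0`` and ``max_characters=64`` this sketch yields the same two pieces that the unit test in this PR expects for the same sentence; with ``overlap=5`` the second piece begins with the last five characters of the first.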
Chunking elements
-----------------
Chunking can be performed as part of partitioning or as a separate step after
partitioning:
Specifying a chunking strategy while partitioning
+++++++++++++++++++++++++++++++++++++++++++++++++
Chunking can be performed as part of partitioning by specifying a value for the
``chunking_strategy`` argument. The current options are ``basic`` and ``by_title`` (described
below).
.. code:: python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title
chunks = partition_html(url=url, chunking_strategy="basic")
Calling a chunking function
+++++++++++++++++++++++++++
Chunking can also be performed separately from partitioning by calling a chunking function directly.
This may be convenient, for example, when tuning chunking parameters. Chunking is typically faster
than partitioning, especially when OCR or inference is used, so a faster feedback loop is possible
by doing these separately:
.. code:: python
from unstructured.chunking.basic import chunk_elements
from unstructured.partition.html import partition_html
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_elements(elements)
# -- OR --
from unstructured.chunking.title import chunk_by_title
chunks = chunk_by_title(elements)
for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
Chunking Strategies
-------------------
There are currently two chunking strategies, *basic* and *by_title*. The ``by_title`` strategy
shares most behaviors with the basic strategy so we'll describe the baseline strategy first:
"basic" chunking strategy
+++++++++++++++++++++++++
- The basic strategy combines sequential elements to maximally fill each chunk while respecting both
the specified ``max_characters`` (hard-max) and ``new_after_n_chars`` (soft-max) option values.
- A single element that by itself exceeds the hard-max is isolated (never combined with another
element) and then divided into two or more chunks using text-splitting.
- A ``Table`` element is always isolated and never combined with another element. A ``Table`` can be
oversized, like any other text element, and in that case is divided into two or more
``TableChunk`` elements using text-splitting.
- If specified, ``overlap`` is applied between split-chunks and is also applied between normal
chunks when ``overlap_all`` is ``True``.
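The fill rule described above can be sketched with plain strings standing in for elements. This is a simplified model under stated assumptions, not the library's implementation: ``combine_texts`` is a hypothetical name, the ``"\n\n"`` joiner is inferred from the test expectations in this PR, the soft-max comparison uses "this length or greater" as worded in the ``chunk_elements`` docstring, and oversized elements and tables are left out entirely.

```python
from typing import List, Optional

SEPARATOR = "\n\n"  # assumed joiner between element texts within one chunk


def combine_texts(
    texts: List[str],
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
) -> List[str]:
    """Toy model: greedily pack element texts into chunks (hypothetical helper)."""
    # When no soft max is given it falls back to the hard max.
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars

    chunks: List[str] = []
    current = ""
    for text in texts:
        if not current:
            current = text
            continue
        combined_length = len(current) + len(SEPARATOR) + len(text)
        if len(current) >= soft_max or combined_length > max_characters:
            # Close the chunk: it reached the soft max or the next text won't fit.
            chunks.append(current)
            current = text
        else:
            current = current + SEPARATOR + text
    if current:
        chunks.append(current)
    return chunks


texts = ["a" * 400, "b" * 400, "c" * 400]
print([len(c) for c in combine_texts(texts, max_characters=1500)])
print([len(c) for c in combine_texts(texts, max_characters=1500, new_after_n_chars=500)])
```

The second call shows the "preferred size" effect: the first chunk is closed once it has passed 500 characters, even though the third text would still fit under the 1500-character hard maximum.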
"by_title" chunking strategy
++++++++++++++++++++++++++++
The ``by_title`` chunking strategy preserves section boundaries and optionally page boundaries as
well. "Preserving" here means that a single chunk will never contain text that occurred in two
different sections. When a new section starts, the existing chunk is closed and a new one started,
even if the next element would fit in the prior chunk.
In addition to the behaviors of the ``basic`` strategy above, the ``by_title`` strategy has the
following behaviors:
- **Detect section headings.** A ``Title`` element is considered to start a new section. When a
``Title`` element is encountered, the prior chunk is closed and a new chunk started, even if the
``Title`` element would fit in the prior chunk. This implements the first aspect of the "preserve
section boundaries" contract.
- **Detect metadata.section change.** An element with a new value in ``element.metadata.section`` is
considered to start a new section. When a change in this value is encountered a new chunk is
started. This implements the second aspect of preserving section boundaries. This metadata is not
present in all document formats so is not used alone. An element having ``None`` for this metadata
field is considered to be part of the prior section; a section break is only detected on an
explicit change in value.
- **Respect page boundaries.** Page boundaries can optionally also be respected using the
``multipage_sections`` argument. This defaults to ``True`` meaning that a page break does *not*
start a new chunk. Setting this to ``False`` will separate elements that occur on different pages
into distinct chunks.
- **Combine small sections.** In certain documents, partitioning may identify a list-item or other
short paragraph as a ``Title`` element even though it does not serve as a section heading. This
can produce chunks substantially smaller than desired. This behavior can be mitigated using the
``combine_text_under_n_chars`` argument. This defaults to the same value as ``max_characters``
such that sequential small sections are combined to maximally fill the chunking window. Setting
this to ``0`` will disable section combining.
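The boundary rules above can be illustrated with a small stand-alone sketch. Everything here is hypothetical: ``Elem`` is a stand-in for a document element, ``iter_sections`` is an illustrative name, and the treatment of the first explicit ``section`` value (no break until a known value changes) is an assumption. Size limits and small-section combining are deliberately omitted.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Elem:
    """Hypothetical stand-in for a document element and the metadata used here."""

    kind: str
    text: str
    section: Optional[str] = None
    page_number: Optional[int] = None


def iter_sections(
    elements: List[Elem], multipage_sections: bool = True
) -> Iterator[List[Elem]]:
    """Group elements so that no group crosses a section (or page) boundary."""
    current: List[Elem] = []
    section: Optional[str] = None
    page: Optional[int] = None

    for e in elements:
        is_title = e.kind == "Title"
        # A `None` section is treated as "same section as before".
        section_changed = (
            e.section is not None and section is not None and e.section != section
        )
        page_changed = (
            not multipage_sections
            and e.page_number is not None
            and page is not None
            and e.page_number != page
        )
        if current and (is_title or section_changed or page_changed):
            yield current
            current = []
        current.append(e)
        if e.section is not None:
            section = e.section
        if e.page_number is not None:
            page = e.page_number

    if current:
        yield current


elements = [
    Elem("Title", "Intro", page_number=1),
    Elem("Text", "a", page_number=1),
    Elem("Text", "b", page_number=2),
    Elem("Title", "Methods", page_number=2),
    Elem("Text", "c", page_number=2),
]
print([[e.text for e in s] for s in iter_sections(elements)])
print([[e.text for e in s] for s in iter_sections(elements, multipage_sections=False)])
```

With the default ``multipage_sections=True`` only the ``Title`` elements start new groups; with ``False`` the element on page 2 is also separated from the text on page 1.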

View File

@ -0,0 +1,108 @@
"""Unit-test suite for the `unstructured.chunking.basic` module.
That module implements the baseline chunking strategy. The baseline strategy has all behaviors
shared by all chunking strategies and no extra rules such as preserving section or page boundaries.
"""
from __future__ import annotations
from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import CompositeElement, Text, Title
from unstructured.partition.docx import partition_docx
def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_partition_function():
    """Basic chunking can be combined with partitioning, exercising the decorator."""
    filename = "example-docs/handbook-1p.docx"
    chunks = partition_docx(filename, chunking_strategy="basic")
    assert chunks == [
        CompositeElement(
            "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 INTRODUCTION"
            "\n\nA.\tPURPOSE"
        ),
        CompositeElement(
            "The United States Trustee appoints and supervises standing trustees and monitors and"
            " supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
            " § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
            " establishes or clarifies the position of the United States Trustee Program (Program)"
            " on the duties owed by a standing trustee to the debtors, creditors, other parties in"
            " interest, and the United States Trustee. The Handbook does not present a full and"
        ),
        CompositeElement(
            "complete statement of the law; it should not be used as a substitute for legal"
            " research and analysis. The standing trustee must be familiar with relevant"
            " provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
            " any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
            " 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
            " identified in this Handbook but these are not considered mandatory."
        ),
        CompositeElement(
            "Nothing in this Handbook should be construed to excuse the standing trustee from"
            " complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
            " orders of the court. The standing trustee should notify the United States Trustee"
            " whenever the provision of the Handbook conflicts with the local rules or orders of"
            " the court. The standing trustee is accountable for all duties set forth in this"
            " Handbook, but need not personally perform any duty unless otherwise indicated. All"
        ),
        CompositeElement(
            "statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
            " et seq., unless otherwise indicated."
        ),
        CompositeElement(
            "This Handbook does not create additional rights against the standing trustee or"
            " United States Trustee in favor of other parties.\n\nB.\tROLE OF THE UNITED STATES"
            " TRUSTEE"
        ),
        CompositeElement(
            "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
            " responsibilities for daytoday administration of cases. Debtors, creditors, and"
            " third parties with adverse interests to the trustee were concerned that the court,"
            " which previously appointed and supervised the trustee, would not impartially"
            " adjudicate their rights as adversaries of that trustee. To address these concerns,"
            " judicial and administrative functions within the bankruptcy system were bifurcated."
        ),
        CompositeElement(
            "Many administrative functions formerly performed by the court were placed within the"
            " Department of Justice through the creation of the Program. Among the administrative"
            " functions assigned to the United States Trustee were the appointment and supervision"
            " of chapter 13 trustees./ This Handbook is issued under the authority of the"
            " Programs enabling statutes. \n\nC.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t"
        ),
        CompositeElement(
            "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
            " standing trustee is more than a mere disbursing agent. The standing trustee must"
            " be personally involved in the trustee operation. If the standing trustee is or"
            " becomes unable to perform the duties and responsibilities of a standing trustee,"
            " the standing trustee must immediately advise the United States Trustee."
            " 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
        ),
        CompositeElement(
            "Although this Handbook is not intended to be a complete statutory reference, the"
            " standing trustees primary statutory duties are set forth in 11 U.S.C. § 1302, which"
            " incorporates by reference some of the duties of chapter 7 trustees found in"
            " 11 U.S.C. § 704. These duties include, but are not limited to, the"
            " following:\n\nCopyright"
        ),
    ]
def test_it_chunks_elements_when_the_user_already_has_them():
    elements = [
        Title("Introduction"),
        Text(
            # --------------------------------------------------------- 64 -v
            "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
            " porta volutpat.",
        ),
    ]
    chunks = chunk_elements(elements, max_characters=64)
    assert chunks == [
        CompositeElement("Introduction"),
        # -- splits on even word boundary, not mid-"rhoncus" --
        CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing elit. In"),
        CompositeElement("rhoncus ipsum sed lectus porta volutpat."),
    ]

View File

@ -1145,14 +1145,16 @@ def test_add_chunking_strategy_on_partition_auto_respects_max_chars():
assert len(partitioned_table_elements_5_chars) != len(table_elements)
assert len(partitioned_table_elements_200_chars) != len(table_elements)
assert len(partitioned_table_elements_5_chars[0].text) == 5
# trailing whitespace is stripped from the first chunk, leaving only a checkbox character
assert len(partitioned_table_elements_5_chars[0].text) == 1
# but the second chunk is the full 5 characters
assert len(partitioned_table_elements_5_chars[1].text) == 5
assert len(partitioned_table_elements_5_chars[0].metadata.text_as_html) == 5
# the first table element is under 200 chars so doesn't get chunked!
assert table_elements[0] == partitioned_table_elements_200_chars[0]
assert len(partitioned_table_elements_200_chars[0].text) < 200
assert len(partitioned_table_elements_200_chars[1].text) == 200
assert len(partitioned_table_elements_200_chars[1].text) == 198
assert len(partitioned_table_elements_200_chars[1].metadata.text_as_html) == 200

View File

@ -11,6 +11,7 @@ from typing import Any, Callable, Dict, List
from typing_extensions import ParamSpec
from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import Element
@ -25,6 +26,10 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[
"""
def decorator(func: Callable[_P, List[Element]]) -> Callable[_P, List[Element]]:
# -- Patch the docstring of the decorated function to add chunking strategy and
# -- chunking-related argument documentation. This only applies when `chunking_strategy`
# -- is an explicit argument of the decorated function and "chunking_strategy" is not
# -- already mentioned in the docstring.
if func.__doc__ and (
"chunking_strategy" in func.__code__.co_varnames
and "chunking_strategy" not in func.__doc__
@ -32,16 +37,15 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[
func.__doc__ += (
"\nchunking_strategy"
+ "\n\tStrategy used for chunking text into larger or smaller elements."
+ "\n\tDefaults to `None` with optional arg of 'by_title'."
+ "\n\tDefaults to `None` with optional arg of 'basic' or 'by_title'."
+ "\n\tAdditional Parameters:"
+ "\n\t\tmultipage_sections"
+ "\n\t\t\tIf True, sections can span multiple pages. Defaults to True."
+ "\n\t\tcombine_text_under_n_chars"
+ "\n\t\t\tCombines elements (for example a series of titles) until a section"
+ "\n\t\t\treaches a length of n characters."
+ "\n\t\t\treaches a length of n characters. Only applies to 'by_title' strategy."
+ "\n\t\tnew_after_n_chars"
+ "\n\t\t\tCuts off new sections once they reach a length of n characters"
+ "\n\t\t\ta soft max."
+ "\n\t\t\tCuts off chunks once they reach a length of n characters; a soft max."
+ "\n\t\tmax_characters"
+ "\n\t\t\tChunks elements text and text_as_html (if present) into chunks"
+ "\n\t\t\tof length n characters, a hard max."
@ -49,20 +53,43 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[
@functools.wraps(func)
def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
"""The decorated function is replaced with this one."""
def get_call_args_applying_defaults() -> Dict[str, Any]:
"""Map both explicit and default arguments of decorated func call by param name."""
sig = inspect.signature(func)
call_args: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
for param in sig.parameters.values():
if param.name not in call_args and param.default is not param.empty:
call_args[param.name] = param.default
return call_args
# -- call the partitioning function to get the elements --
elements = func(*args, **kwargs)
sig = inspect.signature(func)
params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
for param in sig.parameters.values():
if param.name not in params and param.default is not param.empty:
params[param.name] = param.default
if params.get("chunking_strategy") == "by_title":
elements = chunk_by_title(
# -- look for a chunking-strategy argument and run the indicated chunker when present --
call_args = get_call_args_applying_defaults()
if call_args.get("chunking_strategy") == "by_title":
return chunk_by_title(
elements,
multipage_sections=params.get("multipage_sections", True),
combine_text_under_n_chars=params.get("combine_text_under_n_chars", 500),
new_after_n_chars=params.get("new_after_n_chars", 500),
max_characters=params.get("max_characters", 500),
combine_text_under_n_chars=call_args.get("combine_text_under_n_chars", 500),
max_characters=call_args.get("max_characters", 500),
multipage_sections=call_args.get("multipage_sections", True),
new_after_n_chars=call_args.get("new_after_n_chars", 500),
overlap=call_args.get("overlap", 0),
overlap_all=call_args.get("overlap_all", False),
)
if call_args.get("chunking_strategy") == "basic":
return chunk_elements(
elements,
max_characters=call_args.get("max_characters", 500),
new_after_n_chars=call_args.get("new_after_n_chars", 500),
overlap=call_args.get("overlap", 0),
overlap_all=call_args.get("overlap_all", False),
)
return elements
return wrapper
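The argument-mapping step inside the wrapper can be tried in isolation. The sketch below mirrors the approach shown in this hunk (zip positional arguments against the signature, then fill in defaults), but it is a stand-alone illustration: ``call_args_with_defaults`` and ``partition_stub`` are hypothetical names and not part of the library.

```python
import inspect
from typing import Any, Callable, Dict, List


def call_args_with_defaults(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> Dict[str, Any]:
    """Map explicit and default arguments of a call to `func` by parameter name."""
    sig = inspect.signature(func)
    # Positional arguments are paired with parameter names in declaration order.
    call_args: Dict[str, Any] = dict(zip(sig.parameters, args), **kwargs)
    # Any parameter not supplied by the caller takes its declared default.
    for param in sig.parameters.values():
        if param.name not in call_args and param.default is not param.empty:
            call_args[param.name] = param.default
    return call_args


def partition_stub(
    filename: str, chunking_strategy: Any = None, max_characters: int = 500
) -> List[str]:
    """Hypothetical stand-in for a partition function; returns no elements."""
    return []


print(call_args_with_defaults(partition_stub, "a.docx", chunking_strategy="basic"))
```

Because defaults are filled in from the signature, the wrapper can look up ``chunking_strategy`` and the chunking options by name regardless of whether the caller passed them positionally, by keyword, or not at all.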

View File

@ -77,6 +77,10 @@ class ChunkingOptions:
Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
where an oversized element is divided into multiple chunks by text-splitting.
overlap_all
Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
elements and not subject to text-splitting. Use this with caution as it entails a certain
level of "pollution" of otherwise clean semantic chunk boundaries.
text_splitting_separators
A sequence of strings like `("\n", " ")` to be used as target separators during
text-splitting. Text-splitting only applies to splitting an oversized element into two or
@ -95,7 +99,7 @@ class ChunkingOptions:
new_after_n_chars: Optional[int] = None,
overlap: int = 0,
overlap_all: bool = False,
text_splitting_separators: Sequence[str] = (),
text_splitting_separators: Sequence[str] = ("\n", " "),
):
self._combine_text_under_n_chars_arg = combine_text_under_n_chars
self._max_characters = max_characters
@ -114,7 +118,7 @@ class ChunkingOptions:
new_after_n_chars: Optional[int] = None,
overlap: int = 0,
overlap_all: bool = False,
text_splitting_separators: Sequence[str] = (),
text_splitting_separators: Sequence[str] = ("\n", " "),
) -> Self:
"""Construct validated instance.

View File

@ -0,0 +1,80 @@
"""Implementation of baseline chunking.
This is the "plain-vanilla" chunking strategy. All the fundamental chunking behaviors are present in
this strategy and also in all other strategies. Those are:
- Maximally fill each chunk with sequential elements.
- Isolate oversized elements and divide (only) those chunks by text-splitting.
- Overlap when requested.
"Fancier" strategies add higher-level semantic-unit boundaries to be respected. For example, in the
by-title strategy, section boundaries are respected, meaning a chunk never contains text from two
different sections. When a new section is detected the current chunk is closed and a new one
started.
"""
from __future__ import annotations
from typing import List, Optional, Sequence
from unstructured.chunking.base import BasePreChunker, ChunkingOptions
from unstructured.documents.elements import Element
def chunk_elements(
    elements: Sequence[Element],
    new_after_n_chars: Optional[int] = None,
    max_characters: int = 500,
    overlap: int = 0,
    overlap_all: bool = False,
) -> List[Element]:
    """Combine sequential `elements` into chunks, respecting specified text-length limits.

    Produces a sequence of `CompositeElement`, `Table`, and `TableChunk` elements (chunks).

    Parameters
    ----------
    elements
        A list of unstructured elements. Usually the output of a partition function.
    max_characters
        Hard maximum chunk length. No chunk will exceed this length. A single element that exceeds
        this length will be divided into two or more chunks using text-splitting.
    new_after_n_chars
        A chunk of this length or greater is not extended to include the next element, even if
        that element would fit without exceeding `max_characters`. A "soft max" length that can be
        used in conjunction with `max_characters` to limit most chunks to a preferred length while
        still allowing larger elements to be included in a single chunk without resorting to
        text-splitting. Defaults to `max_characters` when not specified, which effectively disables
        any soft window. Specifying 0 for this argument causes each element to appear in a chunk by
        itself (although an element with text longer than `max_characters` will still be split
        into two or more chunks).
    overlap
        Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
        next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
        where an oversized element is divided into multiple chunks by text-splitting.
    overlap_all
        Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
        elements and not subject to text-splitting. Use this with caution as it produces a certain
        level of "pollution" of otherwise clean semantic chunk boundaries.
    """
    # -- raises ValueError on invalid parameters --
    opts = ChunkingOptions.new(
        max_characters=max_characters,
        new_after_n_chars=new_after_n_chars,
        overlap=overlap,
        overlap_all=overlap_all,
    )
    return [
        chunk
        for pre_chunk in BasicPreChunker.iter_pre_chunks(elements, opts)
        for chunk in pre_chunk.iter_chunks()
    ]
class BasicPreChunker(BasePreChunker):
    """Produces pre-chunks from a sequence of document-elements using the "basic" rule-set.

    The "basic" rule-set is essentially "no-rules" other than `Table` is segregated into its own
    pre-chunk.
    """

View File

@ -26,6 +26,8 @@ def chunk_by_title(
combine_text_under_n_chars: Optional[int] = None,
new_after_n_chars: Optional[int] = None,
max_characters: int = 500,
overlap: int = 0,
overlap_all: bool = False,
) -> List[Element]:
"""Uses title elements to identify sections within the document for chunking.
@ -54,12 +56,22 @@ def chunk_by_title(
max_characters
Chunks elements text and text_as_html (if present) into chunks of length
n characters (hard max)
overlap
Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
where an oversized element is divided into multiple chunks by text-splitting.
overlap_all
Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
elements and not subject to text-splitting. Use this with caution as it entails a certain
level of "pollution" of otherwise clean semantic chunk boundaries.
"""
opts = ChunkingOptions.new(
combine_text_under_n_chars=combine_text_under_n_chars,
max_characters=max_characters,
multipage_sections=multipage_sections,
new_after_n_chars=new_after_n_chars,
overlap=overlap,
overlap_all=overlap_all,
)
pre_chunks = PreChunkCombiner(