mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2025-12-28 23:58:13 +00:00)
feature(chunking): add basic strategy and overlap (#2367)
This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.
parent a8a103bc5c
commit 23edf2e911
@ -2,6 +2,9 @@

### Enhancements

* **Add "basic" chunking strategy.** Add baseline chunking strategy that includes all shared chunking behaviors without breaking chunks on section or page boundaries.
* **Add overlap option for chunking.** Add option to overlap chunks. Intra-chunk and inter-chunk overlap are requested separately. Intra-chunk overlap is applied only to the second and later chunks formed by text-splitting an oversized chunk. Inter-chunk overlap may also be specified; this applies overlap between "normal" (not-oversized) chunks.

### Features

### Fixes

@ -2,50 +2,171 @@
Chunking
########

Chunking functions in ``unstructured`` use metadata and document elements
detected with ``partition`` functions to split a document into subsections
for use cases such as Retrieval Augmented Generation (RAG).
Chunking functions in ``unstructured`` use metadata and document elements detected with
``partition`` functions to split a document into smaller parts for use cases such as Retrieval
Augmented Generation (RAG).

Chunking Basics
---------------

Chunking in ``unstructured`` differs from other chunking mechanisms you may be familiar with.
Typical approaches start with the text extracted from the document and form chunks based on
plain-text features, character sequences like ``"\n\n"`` or ``"\n"`` that might indicate a paragraph
boundary or list-item boundary.

Because ``unstructured`` uses specific knowledge about each document format to partition the
document into semantic units (document elements), we only need to resort to text-splitting when a
single element exceeds the desired maximum chunk size. Except in that case, all chunks contain one
or more whole elements, preserving the coherence of semantic units established during partitioning.

A few concepts about chunking are worth introducing before discussing the details.

- Chunking is performed on *document elements*. It is a separate step performed *after*
  partitioning, on the elements produced by partitioning. (Although it can be combined with
  partitioning in a single step.)

- In general, chunking *combines* consecutive elements to form chunks as large as possible without
  exceeding the maximum chunk size.

- A single element that by itself exceeds the maximum chunk size is divided into two or more chunks
  using text-splitting.

- Chunking produces a sequence of ``CompositeElement``, ``Table``, or ``TableChunk`` elements. Each
  "chunk" is an instance of one of these three types.


``chunk_by_title``
------------------
Chunking Options
----------------

The ``chunk_by_title`` function combines elements into sections by looking
for the presence of titles. When a title is detected, a new section is created.
Tables and non-text elements (such as page breaks or images) are always their
own section.
The following options are available to tune chunking behaviors. These are keyword arguments that can
be used in a partitioning or chunking function call. All these options have defaults and need only
be specified when a non-default setting is required. Specific chunking strategies (such as
"by_title") may have additional options.

New sections are also created if changes in metadata occur. Examples of when
this occurs include when the section of the document or the page number changes
or when an element comes from an attachment instead of from the main document.
If you set ``multipage_sections=True``, ``chunk_by_title`` will allow for sections
that span between pages. This kwarg is ``True`` by default.
- ``max_characters: int (default=500)`` - the hard maximum size for a chunk. No chunk will exceed
  this number of characters. A single element that by itself exceeds this size will be divided into
  two or more chunks using text-splitting.

``chunk_by_title`` will start a new section if the length of a section exceeds
``new_after_n_chars``. The default value is ``1500``. ``chunk_by_title`` does
not split elements, so it is possible for a section to exceed that length, for
example if a ``NarrativeText`` element exceeds ``1500`` characters on its own.
- ``new_after_n_chars: int (default=max_characters)`` - the "soft" maximum size for a chunk. A chunk
  that already exceeds this number of characters will not be extended, even if the next element
  would fit without exceeding the specified hard maximum. This can be used in conjunction with
  ``max_characters`` to set a "preferred" size, like "I prefer chunks of around 1000 characters, but
  I'd rather have a chunk of 1500 (max_characters) than resort to text-splitting". This would be
  specified with ``(..., max_characters=1500, new_after_n_chars=1000)``.

Similarly, sections under ``combine_text_under_n_chars`` will be combined if they
do not exceed the specified threshold, which defaults to ``500``. This will combine
a series of ``Title`` elements that occur one after another, which sometimes
happens in lists that are not detected as ``ListItem`` elements. Set
``combine_text_under_n_chars=0`` to turn off this behavior.
- ``overlap: int (default=0)`` - only when using text-splitting to break up an oversized chunk,
  include this number of characters from the end of the prior chunk as a prefix on the next. This
  can mitigate the effect of splitting the semantic unit represented by the oversized element at an
  arbitrary position based on text length.

The following shows an example of how to use ``chunk_by_title``. You will
see the document chunked into sections instead of elements.
- ``overlap_all: bool (default=False)`` - also apply overlap between "normal" chunks, not just when
  text-splitting to break up an oversized element. Because normal chunks are formed from whole
  elements that each have a clean semantic boundary, this option may "pollute" normal chunks. You'll
  need to decide based on your use-case whether this option is right for you.
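
To make the ``overlap`` option concrete, here is a minimal stand-alone sketch of overlap applied
while text-splitting one oversized string. It is an illustration only, not the library's
implementation: ``split_with_overlap`` is a hypothetical helper, it cuts at fixed character
offsets instead of looking for a preferred separator, and it assumes the overlap prefix counts
toward ``max_characters``.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split `text` into windows of at most `max_characters` characters.

    Hypothetical helper for illustration. Each chunk after the first starts with
    the last `overlap` characters of the chunk before it.
    """
    if max_characters <= 0 or not 0 <= overlap < max_characters:
        raise ValueError("require max_characters > 0 and 0 <= overlap < max_characters")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = start + max_characters
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # -- step back by `overlap` so the tail of this chunk prefixes the next one --
        start = end - overlap
    return chunks


print(split_with_overlap("abcdefghij", max_characters=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

With ``overlap=0`` the same input yields ``['abcd', 'efgh', 'ij']``, so the overlap only adds
repeated context and never removes text.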


Chunking elements
-----------------

Chunking can be performed as part of partitioning or as a separate step after
partitioning:

Specifying a chunking strategy while partitioning
+++++++++++++++++++++++++++++++++++++++++++++++++

Chunking can be performed as part of partitioning by specifying a value for the
``chunking_strategy`` argument. The current options are ``basic`` and ``by_title`` (described
below).

.. code:: python

    from unstructured.partition.html import partition_html

    url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
    chunks = partition_html(url=url, chunking_strategy="basic")

Calling a chunking function
+++++++++++++++++++++++++++

Chunking can also be performed separately from partitioning by calling a chunking function directly.
This may be convenient, for example, when tuning chunking parameters. Chunking is typically faster
than partitioning, especially when OCR or inference is used, so a faster feedback loop is possible
by doing these separately:

.. code:: python

    from unstructured.chunking.basic import chunk_elements
    from unstructured.partition.html import partition_html

    url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
    elements = partition_html(url=url)
    chunks = chunk_elements(elements)

    # -- OR --

    from unstructured.chunking.title import chunk_by_title

    chunks = chunk_by_title(elements)

    for chunk in chunks:
        print(chunk)
        print("\n\n" + "-"*80)
        input()


Chunking Strategies
-------------------

There are currently two chunking strategies, *basic* and *by_title*. The ``by_title`` strategy
shares most behaviors with the basic strategy so we'll describe the baseline strategy first:

"basic" chunking strategy
+++++++++++++++++++++++++

- The basic strategy combines sequential elements to maximally fill each chunk while respecting both
  the specified ``max_characters`` (hard-max) and ``new_after_n_chars`` (soft-max) option values.

- A single element that by itself exceeds the hard-max is isolated (never combined with another
  element) and then divided into two or more chunks using text-splitting.

- A ``Table`` element is always isolated and never combined with another element. A ``Table`` can be
  oversized, like any other text element, and in that case is divided into two or more
  ``TableChunk`` elements using text-splitting.

- If specified, ``overlap`` is applied between split-chunks and is also applied between normal
  chunks when ``overlap_all`` is ``True``.
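
The interplay of the hard and soft maximums described above can be sketched on plain strings.
This is a simplified model under stated assumptions, not the library's code: ``combine_texts`` and
``SEPARATOR`` are hypothetical names, every input is assumed to fit within ``max_characters``
(oversized elements and tables are ignored), and elements are assumed to be joined with a blank
line.

```python
from typing import List, Optional, Sequence

SEPARATOR = "\n\n"  # assumed joiner between the texts of combined elements


def combine_texts(
    texts: Sequence[str],
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
) -> List[str]:
    """Greedily combine consecutive texts, honoring a hard and a soft maximum.

    Hypothetical helper for illustration; assumes no single text exceeds `max_characters`.
    """
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks: List[str] = []
    current: Optional[str] = None
    for text in texts:
        if current is None:
            current = text
        elif (
            len(current) >= soft_max  # chunk already reached the soft max
            or len(current) + len(SEPARATOR) + len(text) > max_characters  # would pass hard max
        ):
            chunks.append(current)
            current = text
        else:
            current = current + SEPARATOR + text
    if current is not None:
        chunks.append(current)
    return chunks


texts = ["a" * 600, "b" * 500, "c" * 300]
print([len(c) for c in combine_texts(texts, max_characters=1500)])
# [1404]
print([len(c) for c in combine_texts(texts, max_characters=1500, new_after_n_chars=1000)])
# [1102, 300]
```

In the second call the first chunk reaches 1102 characters, which is past the soft maximum of
1000, so the third text starts a new chunk even though it would still fit under the hard maximum.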


"by_title" chunking strategy
++++++++++++++++++++++++++++

The ``by_title`` chunking strategy preserves section boundaries and optionally page boundaries as
well. "Preserving" here means that a single chunk will never contain text that occurred in two
different sections. When a new section starts, the existing chunk is closed and a new one started,
even if the next element would fit in the prior chunk.

In addition to the behaviors of the ``basic`` strategy above, the ``by_title`` strategy has the
following behaviors:

- **Detect section headings.** A ``Title`` element is considered to start a new section. When a
  ``Title`` element is encountered, the prior chunk is closed and a new chunk started, even if the
  ``Title`` element would fit in the prior chunk. This implements the first aspect of the "preserve
  section boundaries" contract.

- **Detect metadata.section change.** An element with a new value in ``element.metadata.section`` is
  considered to start a new section. When a change in this value is encountered a new chunk is
  started. This implements the second aspect of preserving section boundaries. This metadata is not
  present in all document formats so is not used alone. An element having ``None`` for this metadata
  field is considered to be part of the prior section; a section break is only detected on an
  explicit change in value.

- **Respect page boundaries.** Page boundaries can optionally also be respected using the
  ``multipage_sections`` argument. This defaults to ``True`` meaning that a page break does *not*
  start a new chunk. Setting this to ``False`` will separate elements that occur on different pages
  into distinct chunks.

- **Combine small sections.** In certain documents, partitioning may identify a list-item or other
  short paragraph as a ``Title`` element even though it does not serve as a section heading. This
  can produce chunks substantially smaller than desired. This behavior can be mitigated using the
  ``combine_text_under_n_chars`` argument. This defaults to the same value as ``max_characters``
  such that sequential small sections are combined to maximally fill the chunking window. Setting
  this to ``0`` will disable section combining.
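
The boundary rules above can be sketched with a toy element type. This is a hedged illustration
only: ``ToyElement`` and ``group_into_sections`` are hypothetical names, the sketch covers just the
``Title`` and page-break rules, and it omits ``metadata.section`` changes, size limits and
small-section combining.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ToyElement:
    """Hypothetical stand-in for a document element."""

    category: str
    text: str
    page_number: Optional[int] = None


def group_into_sections(
    elements: Sequence[ToyElement], multipage_sections: bool = True
) -> List[List[ToyElement]]:
    """Start a new section at each Title and, optionally, at each page change."""
    sections: List[List[ToyElement]] = []
    current: List[ToyElement] = []
    for element in elements:
        starts_section = element.category == "Title"
        if not multipage_sections and current:
            prior_page = current[-1].page_number
            # -- only an explicit change of page counts as a break --
            if None not in (prior_page, element.page_number) and prior_page != element.page_number:
                starts_section = True
        if starts_section and current:
            sections.append(current)
            current = []
        current.append(element)
    if current:
        sections.append(current)
    return sections


elements = [
    ToyElement("Title", "Intro", 1),
    ToyElement("Text", "First paragraph.", 1),
    ToyElement("Text", "Second paragraph.", 2),
    ToyElement("Title", "Methods", 2),
    ToyElement("Text", "Third paragraph.", 2),
]
print([[e.text for e in s] for s in group_into_sections(elements)])
# [['Intro', 'First paragraph.', 'Second paragraph.'], ['Methods', 'Third paragraph.']]
```

Passing ``multipage_sections=False`` to the same call additionally breaks before
``"Second paragraph."`` because it sits on a different page than the element before it.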
108
test_unstructured/chunking/test_basic.py
Normal file
@ -0,0 +1,108 @@
"""Unit-test suite for the `unstructured.chunking.basic` module.

That module implements the baseline chunking strategy. The baseline strategy has all behaviors
shared by all chunking strategies and no extra rules such as preserving section or page boundaries.
"""

from __future__ import annotations

from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import CompositeElement, Text, Title
from unstructured.partition.docx import partition_docx


def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_partition_function():
    """Basic chunking can be combined with partitioning, exercising the decorator."""
    filename = "example-docs/handbook-1p.docx"

    chunks = partition_docx(filename, chunking_strategy="basic")

    assert chunks == [
        CompositeElement(
            "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 – INTRODUCTION"
            "\n\nA.\tPURPOSE"
        ),
        CompositeElement(
            "The United States Trustee appoints and supervises standing trustees and monitors and"
            " supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
            " § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
            " establishes or clarifies the position of the United States Trustee Program (Program)"
            " on the duties owed by a standing trustee to the debtors, creditors, other parties in"
            " interest, and the United States Trustee. The Handbook does not present a full and"
        ),
        CompositeElement(
            "complete statement of the law; it should not be used as a substitute for legal"
            " research and analysis. The standing trustee must be familiar with relevant"
            " provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
            " any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
            " 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
            " identified in this Handbook but these are not considered mandatory."
        ),
        CompositeElement(
            "Nothing in this Handbook should be construed to excuse the standing trustee from"
            " complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
            " orders of the court. The standing trustee should notify the United States Trustee"
            " whenever the provision of the Handbook conflicts with the local rules or orders of"
            " the court. The standing trustee is accountable for all duties set forth in this"
            " Handbook, but need not personally perform any duty unless otherwise indicated. All"
        ),
        CompositeElement(
            "statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
            " et seq., unless otherwise indicated."
        ),
        CompositeElement(
            "This Handbook does not create additional rights against the standing trustee or"
            " United States Trustee in favor of other parties.\n\nB.\tROLE OF THE UNITED STATES"
            " TRUSTEE"
        ),
        CompositeElement(
            "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
            " responsibilities for daytoday administration of cases. Debtors, creditors, and"
            " third parties with adverse interests to the trustee were concerned that the court,"
            " which previously appointed and supervised the trustee, would not impartially"
            " adjudicate their rights as adversaries of that trustee. To address these concerns,"
            " judicial and administrative functions within the bankruptcy system were bifurcated."
        ),
        CompositeElement(
            "Many administrative functions formerly performed by the court were placed within the"
            " Department of Justice through the creation of the Program. Among the administrative"
            " functions assigned to the United States Trustee were the appointment and supervision"
            " of chapter 13 trustees./ This Handbook is issued under the authority of the"
            " Program’s enabling statutes. \n\nC.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t"
        ),
        CompositeElement(
            "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
            " standing trustee is more than a mere disbursing agent. The standing trustee must"
            " be personally involved in the trustee operation. If the standing trustee is or"
            " becomes unable to perform the duties and responsibilities of a standing trustee,"
            " the standing trustee must immediately advise the United States Trustee."
            " 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
        ),
        CompositeElement(
            "Although this Handbook is not intended to be a complete statutory reference, the"
            " standing trustee’s primary statutory duties are set forth in 11 U.S.C. § 1302, which"
            " incorporates by reference some of the duties of chapter 7 trustees found in"
            " 11 U.S.C. § 704. These duties include, but are not limited to, the"
            " following:\n\nCopyright"
        ),
    ]


def test_it_chunks_elements_when_the_user_already_has_them():
    elements = [
        Title("Introduction"),
        Text(
            # --------------------------------------------------------- 64 -v
            "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
            " porta volutpat.",
        ),
    ]

    chunks = chunk_elements(elements, max_characters=64)

    assert chunks == [
        CompositeElement("Introduction"),
        # -- splits on even word boundary, not mid-"rhoncus" --
        CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing elit. In"),
        CompositeElement("rhoncus ipsum sed lectus porta volutpat."),
    ]
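
The word-boundary split asserted in the test above can be reproduced with a small stand-alone
sketch. It is an illustration under assumptions rather than the library's splitter:
`split_on_word_boundary` is a hypothetical helper that only treats a single space as a separator
and falls back to a hard cut when no space is available.

```python
from typing import List


def split_on_word_boundary(text: str, max_characters: int) -> List[str]:
    """Split `text` into pieces of at most `max_characters`, cutting at spaces when possible.

    Hypothetical helper for illustration only.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")

    chunks: List[str] = []
    remainder = text.strip()
    while len(remainder) > max_characters:
        # -- last space at or before the limit; a space exactly at the limit still fits --
        cut = remainder.rfind(" ", 0, max_characters + 1)
        if cut <= 0:
            cut = max_characters  # no usable space, so cut mid-word
        chunks.append(remainder[:cut].rstrip())
        remainder = remainder[cut:].lstrip()
    if remainder:
        chunks.append(remainder)
    return chunks


text = (
    "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
    " porta volutpat."
)
for piece in split_on_word_boundary(text, max_characters=64):
    print(repr(piece))
# 'Lorem ipsum dolor sit amet consectetur adipiscing elit. In'
# 'rhoncus ipsum sed lectus porta volutpat.'
```

The 64th character of the sample text falls inside "rhoncus", so the sketch backs up to the
preceding space, which gives the same two pieces the test expects.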
@ -1145,14 +1145,16 @@ def test_add_chunking_strategy_on_partition_auto_respects_max_chars():
    assert len(partitioned_table_elements_5_chars) != len(table_elements)
    assert len(partitioned_table_elements_200_chars) != len(table_elements)

    assert len(partitioned_table_elements_5_chars[0].text) == 5
    # trailing whitespace is stripped from the first chunk, leaving only a checkbox character
    assert len(partitioned_table_elements_5_chars[0].text) == 1
    # but the second chunk is the full 5 characters
    assert len(partitioned_table_elements_5_chars[1].text) == 5
    assert len(partitioned_table_elements_5_chars[0].metadata.text_as_html) == 5

    # the first table element is under 200 chars so doesn't get chunked!
    assert table_elements[0] == partitioned_table_elements_200_chars[0]
    assert len(partitioned_table_elements_200_chars[0].text) < 200
    assert len(partitioned_table_elements_200_chars[1].text) == 200
    assert len(partitioned_table_elements_200_chars[1].text) == 198
    assert len(partitioned_table_elements_200_chars[1].metadata.text_as_html) == 200



@ -11,6 +11,7 @@ from typing import Any, Callable, Dict, List

from typing_extensions import ParamSpec

from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import Element

@ -25,6 +26,10 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[
    """

    def decorator(func: Callable[_P, List[Element]]) -> Callable[_P, List[Element]]:
        # -- Patch the docstring of the decorated function to add chunking strategy and
        # -- chunking-related argument documentation. This only applies when `chunking_strategy`
        # -- is an explicit argument of the decorated function and "chunking_strategy" is not
        # -- already mentioned in the docstring.
        if func.__doc__ and (
            "chunking_strategy" in func.__code__.co_varnames
            and "chunking_strategy" not in func.__doc__
@ -32,16 +37,15 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[
            func.__doc__ += (
                "\nchunking_strategy"
                + "\n\tStrategy used for chunking text into larger or smaller elements."
                + "\n\tDefaults to `None` with optional arg of 'by_title'."
                + "\n\tDefaults to `None` with optional arg of 'basic' or 'by_title'."
                + "\n\tAdditional Parameters:"
                + "\n\t\tmultipage_sections"
                + "\n\t\t\tIf True, sections can span multiple pages. Defaults to True."
                + "\n\t\tcombine_text_under_n_chars"
                + "\n\t\t\tCombines elements (for example a series of titles) until a section"
                + "\n\t\t\treaches a length of n characters."
                + "\n\t\t\treaches a length of n characters. Only applies to 'by_title' strategy."
                + "\n\t\tnew_after_n_chars"
                + "\n\t\t\tCuts off new sections once they reach a length of n characters"
                + "\n\t\t\ta soft max."
                + "\n\t\t\tCuts off chunks once they reach a length of n characters; a soft max."
                + "\n\t\tmax_characters"
                + "\n\t\t\tChunks elements text and text_as_html (if present) into chunks"
                + "\n\t\t\tof length n characters, a hard max."
@ -49,20 +53,43 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[

        @functools.wraps(func)
        def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
            """The decorated function is replaced with this one."""

            def get_call_args_applying_defaults() -> Dict[str, Any]:
                """Map both explicit and default arguments of decorated func call by param name."""
                sig = inspect.signature(func)
                call_args: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
                for param in sig.parameters.values():
                    if param.name not in call_args and param.default is not param.empty:
                        call_args[param.name] = param.default
                return call_args

            # -- call the partitioning function to get the elements --
            elements = func(*args, **kwargs)
            sig = inspect.signature(func)
            params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
            for param in sig.parameters.values():
                if param.name not in params and param.default is not param.empty:
                    params[param.name] = param.default
            if params.get("chunking_strategy") == "by_title":
                elements = chunk_by_title(

            # -- look for a chunking-strategy argument and run the indicated chunker when present --
            call_args = get_call_args_applying_defaults()

            if call_args.get("chunking_strategy") == "by_title":
                return chunk_by_title(
                    elements,
                    multipage_sections=params.get("multipage_sections", True),
                    combine_text_under_n_chars=params.get("combine_text_under_n_chars", 500),
                    new_after_n_chars=params.get("new_after_n_chars", 500),
                    max_characters=params.get("max_characters", 500),
                    combine_text_under_n_chars=call_args.get("combine_text_under_n_chars", 500),
                    max_characters=call_args.get("max_characters", 500),
                    multipage_sections=call_args.get("multipage_sections", True),
                    new_after_n_chars=call_args.get("new_after_n_chars", 500),
                    overlap=call_args.get("overlap", 0),
                    overlap_all=call_args.get("overlap_all", False),
                )

            if call_args.get("chunking_strategy") == "basic":
                return chunk_elements(
                    elements,
                    max_characters=call_args.get("max_characters", 500),
                    new_after_n_chars=call_args.get("new_after_n_chars", 500),
                    overlap=call_args.get("overlap", 0),
                    overlap_all=call_args.get("overlap_all", False),
                )

            return elements

        return wrapper
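
The argument-resolution step used by the wrapper above can be tried in isolation. The sketch
below lifts the same `inspect.signature` logic into a free function; `call_args_applying_defaults`
and `partition_example` are hypothetical names introduced only for this illustration and are not
part of `unstructured`.

```python
import inspect
from typing import Any, Callable, Dict, List


def call_args_applying_defaults(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Dict[str, Any]:
    """Map explicit and default arguments of a call to `func` by parameter name."""
    sig = inspect.signature(func)
    # -- positional args are paired with parameter names in declaration order --
    call_args: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
    for param in sig.parameters.values():
        if param.name not in call_args and param.default is not param.empty:
            call_args[param.name] = param.default
    return call_args


def partition_example(filename: str, chunking_strategy: Any = None, max_characters: int = 500) -> List[str]:
    """Hypothetical stand-in for a decorated partition function."""
    return []


resolved = call_args_applying_defaults(partition_example, "doc.html", chunking_strategy="basic")
print(resolved)
# {'filename': 'doc.html', 'chunking_strategy': 'basic', 'max_characters': 500}
```

Because defaults are filled in from the signature, a later `call_args.get("max_characters", 500)`
only falls back to its second argument when the decorated function does not declare that
parameter at all.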

@ -77,6 +77,10 @@ class ChunkingOptions:
        Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
        next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
        where an oversized element is divided into multiple chunks by text-splitting.
    overlap_all
        Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
        elements and not subject to text-splitting. Use this with caution as it entails a certain
        level of "pollution" of otherwise clean semantic chunk boundaries.
    text_splitting_separators
        A sequence of strings like `("\n", " ")` to be used as target separators during
        text-splitting. Text-splitting only applies to splitting an oversized element into two or
@ -95,7 +99,7 @@ class ChunkingOptions:
        new_after_n_chars: Optional[int] = None,
        overlap: int = 0,
        overlap_all: bool = False,
        text_splitting_separators: Sequence[str] = (),
        text_splitting_separators: Sequence[str] = ("\n", " "),
    ):
        self._combine_text_under_n_chars_arg = combine_text_under_n_chars
        self._max_characters = max_characters
@ -114,7 +118,7 @@ class ChunkingOptions:
        new_after_n_chars: Optional[int] = None,
        overlap: int = 0,
        overlap_all: bool = False,
        text_splitting_separators: Sequence[str] = (),
        text_splitting_separators: Sequence[str] = ("\n", " "),
    ) -> Self:
        """Construct validated instance.


80
unstructured/chunking/basic.py
Normal file
@ -0,0 +1,80 @@
"""Implementation of baseline chunking.

This is the "plain-vanilla" chunking strategy. All the fundamental chunking behaviors are present in
this strategy and also in all other strategies. Those are:

- Maximally fill each chunk with sequential elements.
- Isolate oversized elements and divide (only) those chunks by text-splitting.
- Overlap when requested.

"Fancier" strategies add higher-level semantic-unit boundaries to be respected. For example, in the
by-title strategy, section boundaries are respected, meaning a chunk never contains text from two
different sections. When a new section is detected the current chunk is closed and a new one
started.
"""

from __future__ import annotations

from typing import List, Optional, Sequence

from unstructured.chunking.base import BasePreChunker, ChunkingOptions
from unstructured.documents.elements import Element


def chunk_elements(
    elements: Sequence[Element],
    new_after_n_chars: Optional[int] = None,
    max_characters: int = 500,
    overlap: int = 0,
    overlap_all: bool = False,
) -> List[Element]:
    """Combine sequential `elements` into chunks, respecting specified text-length limits.

    Produces a sequence of `CompositeElement`, `Table`, and `TableChunk` elements (chunks).

    Parameters
    ----------
    elements
        A list of unstructured elements. Usually the output of a partition function.
    max_characters
        Hard maximum chunk length. No chunk will exceed this length. A single element that exceeds
        this length will be divided into two or more chunks using text-splitting.
    new_after_n_chars
        A chunk of this length or greater is not extended to include the next element, even if
        that element would fit without exceeding `max_characters`. A "soft max" length that can be
        used in conjunction with `max_characters` to limit most chunks to a preferred length while
        still allowing larger elements to be included in a single chunk without resorting to
        text-splitting. Defaults to `max_characters` when not specified, which effectively disables
        any soft window. Specifying 0 for this argument causes each element to appear in a chunk by
        itself (although an element with text longer than `max_characters` will still be split
        into two or more chunks).
    overlap
        Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
        next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
        where an oversized element is divided into multiple chunks by text-splitting.
    overlap_all
        Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
        elements and not subject to text-splitting. Use this with caution as it produces a certain
        level of "pollution" of otherwise clean semantic chunk boundaries.
    """
    # -- raises ValueError on invalid parameters --
    opts = ChunkingOptions.new(
        max_characters=max_characters,
        new_after_n_chars=new_after_n_chars,
        overlap=overlap,
        overlap_all=overlap_all,
    )

    return [
        chunk
        for pre_chunk in BasicPreChunker.iter_pre_chunks(elements, opts)
        for chunk in pre_chunk.iter_chunks()
    ]


class BasicPreChunker(BasePreChunker):
    """Produces pre-chunks from a sequence of document-elements using the "basic" rule-set.

    The "basic" rule-set is essentially "no-rules" other than `Table` is segregated into its own
    pre-chunk.
    """
@ -26,6 +26,8 @@ def chunk_by_title(
    combine_text_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
    max_characters: int = 500,
    overlap: int = 0,
    overlap_all: bool = False,
) -> List[Element]:
    """Uses title elements to identify sections within the document for chunking.

@ -54,12 +56,22 @@ def chunk_by_title(
    max_characters
        Chunks elements text and text_as_html (if present) into chunks of length
        n characters (hard max)
    overlap
        Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
        next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
        where an oversized element is divided into multiple chunks by text-splitting.
    overlap_all
        Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
        elements and not subject to text-splitting. Use this with caution as it entails a certain
        level of "pollution" of otherwise clean semantic chunk boundaries.
    """
    opts = ChunkingOptions.new(
        combine_text_under_n_chars=combine_text_under_n_chars,
        max_characters=max_characters,
        multipage_sections=multipage_sections,
        new_after_n_chars=new_after_n_chars,
        overlap=overlap,
        overlap_all=overlap_all,
    )

    pre_chunks = PreChunkCombiner(
