feature(chunking): add basic strategy and overlap (#2367)

This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.
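
As a rough picture of what the new `overlap` option does during text-splitting, here is a minimal plain-Python sketch. It is not the implementation in `unstructured.chunking`: `split_with_overlap` is a hypothetical helper, and it assumes splitting happens only on spaces and that the overlap is simply the last `overlap` characters before each cut.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split `text` into pieces of at most `max_characters` characters.

    Hypothetical helper for illustration only; not part of `unstructured`.
    Each piece after the first starts `overlap` characters before the prior cut.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    text = text.strip()
    chunks: List[str] = []
    start = 0
    while len(text) - start > max_characters:
        end = start + max_characters
        # Prefer the last space inside the window so a word is not cut in half.
        # Searching only past the overlap region guarantees forward progress.
        space = text.rfind(" ", start + overlap + 1, end + 1)
        if space != -1:
            end = space
        chunks.append(text[start:end].strip())
        # The next piece begins `overlap` characters before the cut.
        start = end - overlap
        # Never begin a piece on whitespace.
        while start < len(text) and text[start] == " ":
            start += 1
    chunks.append(text[start:])
    return [chunk for chunk in chunks if chunk]


sample = (
    "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
    " porta volutpat."
)
for piece in split_with_overlap(sample, max_characters=64):
    print(repr(piece))

# With overlap, each later piece repeats the tail of the one before it.
print(split_with_overlap("aaaa bbbb cccc dddd", max_characters=9, overlap=4))
# -> ['aaaa bbbb', 'bbbb cccc', 'cccc dddd']
```

On the sample sentence with `max_characters=64` and no overlap, the sketch breaks before "rhoncus" rather than in the middle of it, which is the same split the new unit test in this PR expects from `chunk_elements`.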
2025-12-28 23:58:13 +00:00 · 2024-01-10 14:19:24 -08:00 · 2024-01-10 14:19:24 -08:00 · 23edf2e911
commit 23edf2e911
parent a8a103bc5c
8 changed files with 402 additions and 45 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,9 @@

 ### Enhancements

+* **Add "basic" chunking strategy.** Add baseline chunking strategy that includes all shared chunking behaviors without breaking chunks on section or page boundaries.
+* **Add overlap option for chunking.** Add option to overlap chunks. Intra-chunk and inter-chunk overlap are requested separately. Intra-chunk overlap is applied only to the second and later chunks formed by text-splitting an oversized chunk. Inter-chunk overlap may also be specified; this applies overlap between "normal" (not-oversized) chunks.
+
 ### Features

 ### Fixes
--- a/docs/source/core/chunking.rst
+++ b/docs/source/core/chunking.rst
@@ -2,50 +2,171 @@
 Chunking
 ########

-Chunking functions in ``unstructured`` use metadata and document elements
-detected with ``partition`` functions to split a document into subsections
-for uses cases such as Retrieval Augmented Generation (RAG).
+Chunking functions in ``unstructured`` use metadata and document elements detected with
+``partition`` functions to split a document into smaller parts for use cases such as Retrieval
+Augmented Generation (RAG).
+
+Chunking Basics
+---------------
+
+Chunking in ``unstructured`` differs from other chunking mechanisms you may be familiar with.
+Typical approaches start with the text extracted from the document and form chunks based on
+plain-text features, character sequences like ``"\n\n"`` or ``"\n"`` that might indicate a paragraph
+boundary or list-item boundary.
+
+Because ``unstructured`` uses specific knowledge about each document format to partition the
+document into semantic units (document elements), we only need to resort to text-splitting when a
+single element exceeds the desired maximum chunk size. Except in that case, all chunks contain one
+or more whole elements, preserving the coherence of semantic units established during partitioning.
+
+A few concepts about chunking are worth introducing before discussing the details.
+
+- Chunking is performed on *document elements*. It is a separate step performed *after*
+  partitioning, on the elements produced by partitioning. (Although it can be combined with
+  partitioning in a single step.)
+
+- In general, chunking *combines* consecutive elements to form chunks as large as possible without
+  exceeding the maximum chunk size.
+
+- A single element that by itself exceeds the maximum chunk size is divided into two or more chunks
+  using text-splitting.
+
+- Chunking produces a sequence of ``CompositeElement``, ``Table``, or ``TableChunk`` elements. Each
+  "chunk" is an instance of one of these three types.


-``chunk_by_title``
------------------
+Chunking Options
+----------------

-The ``chunk_by_title`` function combines elements into sections by looking
-for the presence of titles. When a title is detected, a new section is created.
-Tables and non-text elements (such as page breaks or images) are always their
-own section.
+The following options are available to tune chunking behaviors. These are keyword arguments that can
+be used in a partitioning or chunking function call. All these options have defaults and need only
+be specified when a non-default setting is required. Specific chunking strategies (such as
+"by_title") may have additional options.

-New sections are also created if changes in metadata occure. Examples of when
-this occurs include when the section of the document or the page number changes
-or when an element comes from an attachment instead of from the main document.
-If you set ``multipage_sections=True``, ``chunk_by_title`` will allow for sections
-that span between pages. This kwarg is ``True`` by default.
+- ``max_characters: int (default=500)`` - the hard maximum size for a chunk. No chunk will exceed
+  this number of characters. A single element that by itself exceeds this size will be divided into
+  two or more chunks using text-splitting.

-``chunk_by_title`` will start a new section if the length of a section exceed
-``new_after_n_chars``. The default value is ``1500``. ``chunk_by_title`` does
-not split elements, it is possible for a section to exceed that lenght, for
-example if a ``NarrativeText`` elements exceeds ``1500`` characters on its on.
+- ``new_after_n_chars: int (default=max_characters)`` - the "soft" maximum size for a chunk. A chunk
+  that already exceeds this number of characters will not be extended, even if the next element
+  would fit without exceeding the specified hard maximum. This can be used in conjunction with
+  ``max_characters`` to set a "preferred" size, like "I prefer chunks of around 1000 characters, but
+  I'd rather have a chunk of 1500 (max_characters) than resort to text-splitting". This would be
+  specified with ``(..., max_characters=1500, new_after_n_chars=1000)``.

-Similarly, sections under ``combine_text_under_n_chars`` will be combined if they
-do not exceed the specified threshold, which defaults to ``500``. This will combine
-a series of ``Title`` elements that occur one after another, which sometimes
-happens in lists that are not detected as ``ListItem`` elements. Set
-``combine_text_under_n_chars=0`` to turn off this behavior.
+- ``overlap: int (default=0)`` - only when using text-splitting to break up an oversized chunk,
+  include this number of characters from the end of the prior chunk as a prefix on the next. This
+  can mitigate the effect of splitting the semantic unit represented by the oversized element at an
+  arbitrary position based on text length.

-The following shows an example of how to use ``chunk_by_title``. You will
-see the document chunked into sections instead of elements.
+- ``overlap_all: bool (default=False)`` - also apply overlap between "normal" chunks, not just when
+  text-splitting to break up an oversized element. Because normal chunks are formed from whole
+  elements that each have a clean semantic boundary, this option may "pollute" normal chunks. You'll
+  need to decide based on your use-case whether this option is right for you.


+Chunking elements
+-----------------
+
+Chunking can be performed as part of partitioning or as a separate step after
+partitioning:
+
+Specifying a chunking strategy while partitioning
+++++++++++++++++++++++++++++++++++++++++++++++++
+
+Chunking can be performed as part of partitioning by specifying a value for the
+``chunking_strategy`` argument. The current options are ``basic`` and ``by_title`` (described
+below).
+
 .. code:: python

  from unstructured.partition.html import partition_html
-  from unstructured.chunking.title import chunk_by_title
+
+  chunks = partition_html(url=url, chunking_strategy="basic")
+
+Calling a chunking function
+++++++++++++++++++++++++++
+
+Chunking can also be performed separately from partitioning by calling a chunking function directly.
+This may be convenient, for example, when tuning chunking parameters. Chunking is typically faster
+than partitioning, especially when OCR or inference is used, so a faster feedback loop is possible
+by doing these separately:
+
+.. code:: python
+
+  from unstructured.chunking.basic import chunk_elements
+  from unstructured.partition.html import partition_html

  url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
  elements = partition_html(url=url)
+  chunks = chunk_elements(elements)
+
+  # -- OR --
+
+  from unstructured.chunking.title import chunk_by_title
+
  chunks = chunk_by_title(elements)

  for chunk in chunks:
      print(chunk)
      print("\n\n" + "-"*80)
      input()
+
+
+Chunking Strategies
+-------------------
+
+There are currently two chunking strategies, *basic* and *by_title*. The ``by_title`` strategy
+shares most behaviors with the basic strategy so we'll describe the baseline strategy first:
+
+"basic" chunking strategy
+++++++++++++++++++++++++
+
+- The basic strategy combines sequential elements to maximally fill each chunk while respecting both
+  the specified ``max_characters`` (hard-max) and ``new_after_n_chars`` (soft-max) option values.
+
+- A single element that by itself exceeds the hard-max is isolated (never combined with another
+  element) and then divided into two or more chunks using text-splitting.
+
+- A ``Table`` element is always isolated and never combined with another element. A ``Table`` can be
+  oversized, like any other text element, and in that case is divided into two or more
+  ``TableChunk`` elements using text-splitting.
+
+- If specified, ``overlap`` is applied between split-chunks and is also applied between normal
+  chunks when ``overlap_all`` is ``True``.
+
+
+"by_title" chunking strategy
++++++++++++++++++++++++++++
+
+The ``by_title`` chunking strategy preserves section boundaries and optionally page boundaries as
+well. "Preserving" here means that a single chunk will never contain text that occurred in two
+different sections. When a new section starts, the existing chunk is closed and a new one started,
+even if the next element would fit in the prior chunk.
+
+In addition to the behaviors of the ``basic`` strategy above, the ``by_title`` strategy has the
+following behaviors:
+
+- **Detect section headings.** A ``Title`` element is considered to start a new section. When a
+  ``Title`` element is encountered, the prior chunk is closed and a new chunk started, even if the
+  ``Title`` element would fit in the prior chunk. This implements the first aspect of the "preserve
+  section boundaries" contract.
+
+- **Detect metadata.section change.** An element with a new value in ``element.metadata.section`` is
+  considered to start a new section. When a change in this value is encountered a new chunk is
+  started. This implements the second aspect of preserving section boundaries. This metadata is not
+  present in all document formats so is not used alone. An element having ``None`` for this metadata
+  field is considered to be part of the prior section; a section break is only detected on an
+  explicit change in value.
+
+- **Respect page boundaries.** Page boundaries can optionally also be respected using the
+  ``multipage_sections`` argument. This defaults to ``True`` meaning that a page break does *not*
+  start a new chunk. Setting this to ``False`` will separate elements that occur on different pages
+  into distinct chunks.
+
+- **Combine small sections.** In certain documents, partitioning may identify a list-item or other
+  short paragraph as a ``Title`` element even though it does not serve as a section heading. This
+  can produce chunks substantially smaller than desired. This behavior can be mitigated using the
+  ``combine_text_under_n_chars`` argument. This defaults to the same value as ``max_characters``
+  such that sequential small sections are combined to maximally fill the chunking window. Setting
+  this to ``0`` will disable section combining.
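
The interplay between `max_characters` (hard max) and `new_after_n_chars` (soft max) described in the documentation above can be sketched in a few lines of plain Python. This is a simplified model, not the library's pre-chunker: `combine_texts` is a hypothetical helper that works on bare strings, assumes elements are joined with a blank line (as the expected values in the new unit test suggest), ignores the special handling of `Table` elements, and leaves an oversized element alone in its own chunk instead of text-splitting it.

```python
from typing import List, Optional


def combine_texts(
    texts: List[str],
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
    separator: str = "\n\n",
) -> List[str]:
    """Greedily pack consecutive texts into chunks (hypothetical, illustration only)."""
    # The soft max falls back to the hard max when it is not specified.
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars

    chunks: List[str] = []
    current = ""
    for text in texts:
        if current:
            would_be = len(current) + len(separator) + len(text)
            # Close the chunk if it has already reached the soft max, or if
            # adding the next text would push it past the hard max.
            if len(current) >= soft_max or would_be > max_characters:
                chunks.append(current)
                current = ""
        current = current + separator + text if current else text
    if current:
        chunks.append(current)
    return chunks


texts = ["a" * 600, "b" * 600, "c" * 200, "d" * 100]

# Hard max only: chunks are filled as far as 1500 characters allows.
print([len(c) for c in combine_texts(texts, max_characters=1500)])

# "Prefer about 1000, tolerate up to 1500": the chunk closes once it passes 1000.
print([len(c) for c in combine_texts(texts, max_characters=1500, new_after_n_chars=1000)])
```

In this model, passing `new_after_n_chars=0` places every text in its own chunk, which mirrors the behavior the `chunk_elements` docstring describes for that setting.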
--- a/test_unstructured/chunking/test_basic.py
+++ b/test_unstructured/chunking/test_basic.py
@@ -0,0 +1,108 @@
+"""Unit-test suite for the `unstructured.chunking.basic` module.
+
+That module implements the baseline chunking strategy. The baseline strategy has all behaviors
+shared by all chunking strategies and no extra rules like preserving section or page boundaries.
+"""
+
+from __future__ import annotations
+
+from unstructured.chunking.basic import chunk_elements
+from unstructured.documents.elements import CompositeElement, Text, Title
+from unstructured.partition.docx import partition_docx
+
+
+def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_partition_function():
+    """Basic chunking can be combined with partitioning, exercising the decorator."""
+    filename = "example-docs/handbook-1p.docx"
+
+    chunks = partition_docx(filename, chunking_strategy="basic")
+
+    assert chunks == [
+        CompositeElement(
+            "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 – INTRODUCTION"
+            "\n\nA.\tPURPOSE"
+        ),
+        CompositeElement(
+            "The United States Trustee appoints and supervises standing trustees and monitors and"
+            " supervises cases under chapter 13 of title 11 of the United States Code.  28 U.S.C."
+            " § 586(b).  The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
+            " establishes or clarifies the position of the United States Trustee Program (Program)"
+            " on the duties owed by a standing trustee to the debtors, creditors, other parties in"
+            " interest, and the United States Trustee.  The Handbook does not present a full and"
+        ),
+        CompositeElement(
+            "complete statement of the law; it should not be used as a substitute for legal"
+            " research and analysis.  The standing trustee must be familiar with relevant"
+            " provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
+            " any local bankruptcy rules, and case law.  11 U.S.C. § 321, 28 U.S.C. § 586,"
+            " 28 C.F.R. § 58.6(a)(3).  Standing trustees are encouraged to follow Practice Tips"
+            " identified in this Handbook but these are not considered mandatory."
+        ),
+        CompositeElement(
+            "Nothing in this Handbook should be construed to excuse the standing trustee from"
+            " complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
+            " orders of the court.  The standing trustee should notify the United States Trustee"
+            " whenever the provision of the Handbook conflicts with the local rules or orders of"
+            " the court.  The standing trustee is accountable for all duties set forth in this"
+            " Handbook, but need not personally perform any duty unless otherwise indicated.  All"
+        ),
+        CompositeElement(
+            "statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
+            " et seq., unless otherwise indicated."
+        ),
+        CompositeElement(
+            "This Handbook does not create additional rights against the standing trustee or"
+            " United States Trustee in favor of other parties.\n\nB.\tROLE OF THE UNITED STATES"
+            " TRUSTEE"
+        ),
+        CompositeElement(
+            "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
+            " responsibilities for daytoday administration of cases.  Debtors, creditors, and"
+            " third parties with adverse interests to the trustee were concerned that the court,"
+            " which previously appointed and supervised the trustee, would not impartially"
+            " adjudicate their rights as adversaries of that trustee. To address these concerns,"
+            " judicial and administrative functions within the bankruptcy system were bifurcated."
+        ),
+        CompositeElement(
+            "Many administrative functions formerly performed by the court were placed within the"
+            " Department of Justice through the creation of the Program.  Among the administrative"
+            " functions assigned to the United States Trustee were the appointment and supervision"
+            " of chapter 13 trustees./  This Handbook is issued under the authority of the"
+            " Program’s enabling statutes. \n\nC.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t"
+        ),
+        CompositeElement(
+            "The standing trustee has a fiduciary responsibility to the bankruptcy estate.  The"
+            " standing trustee is more than a mere disbursing agent.  The standing trustee must"
+            " be personally involved in the trustee operation.  If the standing trustee is or"
+            " becomes unable to perform the duties and responsibilities of a standing trustee,"
+            " the standing trustee must immediately advise the United States Trustee."
+            "  28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
+        ),
+        CompositeElement(
+            "Although this Handbook is not intended to be a complete statutory reference, the"
+            " standing trustee’s primary statutory duties are set forth in 11 U.S.C. § 1302, which"
+            " incorporates by reference some of the duties of chapter 7 trustees found in"
+            " 11 U.S.C. § 704.  These duties include, but are not limited to, the"
+            " following:\n\nCopyright"
+        ),
+    ]
+
+
+def test_it_chunks_elements_when_the_user_already_has_them():
+    elements = [
+        Title("Introduction"),
+        Text(
+            # --------------------------------------------------------- 64 -v
+            "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
+            " porta volutpat.",
+        ),
+    ]
+
+    chunks = chunk_elements(elements, max_characters=64)
+
+    assert chunks == [
+        CompositeElement("Introduction"),
+        # -- splits on even word boundary, not mid-"rhoncus" --
+        CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing elit. In"),
+        CompositeElement("rhoncus ipsum sed lectus porta volutpat."),
+    ]
--- a/test_unstructured/partition/test_auto.py
+++ b/test_unstructured/partition/test_auto.py
@@ -1145,14 +1145,16 @@ def test_add_chunking_strategy_on_partition_auto_respects_max_chars():
    assert len(partitioned_table_elements_5_chars) != len(table_elements)
    assert len(partitioned_table_elements_200_chars) != len(table_elements)

-    assert len(partitioned_table_elements_5_chars[0].text) == 5
+    # trailing whitespace is stripped from the first chunk, leaving only a checkbox character
+    assert len(partitioned_table_elements_5_chars[0].text) == 1
+    # but the second chunk is the full 5 characters
    assert len(partitioned_table_elements_5_chars[1].text) == 5
    assert len(partitioned_table_elements_5_chars[0].metadata.text_as_html) == 5

    # the first table element is under 200 chars so doesn't get chunked!
    assert table_elements[0] == partitioned_table_elements_200_chars[0]
    assert len(partitioned_table_elements_200_chars[0].text) < 200
-    assert len(partitioned_table_elements_200_chars[1].text) == 200
+    assert len(partitioned_table_elements_200_chars[1].text) == 198
    assert len(partitioned_table_elements_200_chars[1].metadata.text_as_html) == 200


--- a/unstructured/chunking/__init__.py
+++ b/unstructured/chunking/__init__.py
@@ -11,6 +11,7 @@ from typing import Any, Callable, Dict, List

 from typing_extensions import ParamSpec

+from unstructured.chunking.basic import chunk_elements
 from unstructured.chunking.title import chunk_by_title
 from unstructured.documents.elements import Element

@@ -25,6 +26,10 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[
    """

    def decorator(func: Callable[_P, List[Element]]) -> Callable[_P, List[Element]]:
+        # -- Patch the docstring of the decorated function to add chunking strategy and
+        # -- chunking-related argument documentation. This only applies when `chunking_strategy`
+        # -- is an explicit argument of the decorated function and "chunking_strategy" is not
+        # -- already mentioned in the docstring.
        if func.__doc__ and (
            "chunking_strategy" in func.__code__.co_varnames
            and "chunking_strategy" not in func.__doc__
@@ -32,16 +37,15 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[
            func.__doc__ += (
                "\nchunking_strategy"
                + "\n\tStrategy used for chunking text into larger or smaller elements."
-                + "\n\tDefaults to `None` with optional arg of 'by_title'."
+                + "\n\tDefaults to `None` with optional arg of 'basic' or 'by_title'."
                + "\n\tAdditional Parameters:"
                + "\n\t\tmultipage_sections"
                + "\n\t\t\tIf True, sections can span multiple pages. Defaults to True."
                + "\n\t\tcombine_text_under_n_chars"
                + "\n\t\t\tCombines elements (for example a series of titles) until a section"
-                + "\n\t\t\treaches a length of n characters."
+                + "\n\t\t\treaches a length of n characters. Only applies to 'by_title' strategy."
                + "\n\t\tnew_after_n_chars"
-                + "\n\t\t\tCuts off new sections once they reach a length of n characters"
-                + "\n\t\t\ta soft max."
+                + "\n\t\t\tCuts off chunks once they reach a length of n characters; a soft max."
                + "\n\t\tmax_characters"
                + "\n\t\t\tChunks elements text and text_as_html (if present) into chunks"
                + "\n\t\t\tof length n characters, a hard max."
@@ -49,20 +53,43 @@ def add_chunking_strategy() -> Callable[[Callable[_P, List[Element]]], Callable[

        @functools.wraps(func)
        def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
+            """The decorated function is replaced with this one."""
+
+            def get_call_args_applying_defaults() -> Dict[str, Any]:
+                """Map both explicit and default arguments of decorated func call by param name."""
+                sig = inspect.signature(func)
+                call_args: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
+                for param in sig.parameters.values():
+                    if param.name not in call_args and param.default is not param.empty:
+                        call_args[param.name] = param.default
+                return call_args
+
+            # -- call the partitioning function to get the elements --
            elements = func(*args, **kwargs)
-            sig = inspect.signature(func)
-            params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
-            for param in sig.parameters.values():
-                if param.name not in params and param.default is not param.empty:
-                    params[param.name] = param.default
-            if params.get("chunking_strategy") == "by_title":
-                elements = chunk_by_title(
+
+            # -- look for a chunking-strategy argument and run the indicated chunker when present --
+            call_args = get_call_args_applying_defaults()
+
+            if call_args.get("chunking_strategy") == "by_title":
+                return chunk_by_title(
                    elements,
-                    multipage_sections=params.get("multipage_sections", True),
-                    combine_text_under_n_chars=params.get("combine_text_under_n_chars", 500),
-                    new_after_n_chars=params.get("new_after_n_chars", 500),
-                    max_characters=params.get("max_characters", 500),
+                    combine_text_under_n_chars=call_args.get("combine_text_under_n_chars", 500),
+                    max_characters=call_args.get("max_characters", 500),
+                    multipage_sections=call_args.get("multipage_sections", True),
+                    new_after_n_chars=call_args.get("new_after_n_chars", 500),
+                    overlap=call_args.get("overlap", 0),
+                    overlap_all=call_args.get("overlap_all", False),
                )
+
+            if call_args.get("chunking_strategy") == "basic":
+                return chunk_elements(
+                    elements,
+                    max_characters=call_args.get("max_characters", 500),
+                    new_after_n_chars=call_args.get("new_after_n_chars", 500),
+                    overlap=call_args.get("overlap", 0),
+                    overlap_all=call_args.get("overlap_all", False),
+                )
+
            return elements

        return wrapper
--- a/unstructured/chunking/base.py
+++ b/unstructured/chunking/base.py
@@ -77,6 +77,10 @@ class ChunkingOptions:
        Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
        next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
        where an oversized element is divided into multiple chunks by text-splitting.
+    overlap_all
+        Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
+        elements and not subject to text-splitting. Use this with caution as it entails a certain
+        level of "pollution" of otherwise clean semantic chunk boundaries.
    text_splitting_separators
        A sequence of strings like `("\n", " ")` to be used as target separators during
        text-splitting. Text-splitting only applies to splitting an oversized element into two or
@@ -95,7 +99,7 @@ class ChunkingOptions:
        new_after_n_chars: Optional[int] = None,
        overlap: int = 0,
        overlap_all: bool = False,
-        text_splitting_separators: Sequence[str] = (),
+        text_splitting_separators: Sequence[str] = ("\n", " "),
    ):
        self._combine_text_under_n_chars_arg = combine_text_under_n_chars
        self._max_characters = max_characters
@@ -114,7 +118,7 @@ class ChunkingOptions:
        new_after_n_chars: Optional[int] = None,
        overlap: int = 0,
        overlap_all: bool = False,
-        text_splitting_separators: Sequence[str] = (),
+        text_splitting_separators: Sequence[str] = ("\n", " "),
    ) -> Self:
        """Construct validated instance.

--- a/unstructured/chunking/basic.py
+++ b/unstructured/chunking/basic.py
@@ -0,0 +1,80 @@
+"""Implementation of baseline chunking.
+
+This is the "plain-vanilla" chunking strategy. All the fundamental chunking behaviors are present in
+this strategy and also in all other strategies. Those are:
+
+- Maximally fill each chunk with sequential elements.
+- Isolate oversized elements and divide (only) those chunks by text-splitting.
+- Overlap when requested.
+
+"Fancier" strategies add higher-level semantic-unit boundaries to be respected. For example, in the
+by-title strategy, section boundaries are respected, meaning a chunk never contains text from two
+different sections. When a new section is detected the current chunk is closed and a new one
+started.
+"""
+
+from __future__ import annotations
+
+from typing import List, Optional, Sequence
+
+from unstructured.chunking.base import BasePreChunker, ChunkingOptions
+from unstructured.documents.elements import Element
+
+
+def chunk_elements(
+    elements: Sequence[Element],
+    new_after_n_chars: Optional[int] = None,
+    max_characters: int = 500,
+    overlap: int = 0,
+    overlap_all: bool = False,
+) -> List[Element]:
+    """Combine sequential `elements` into chunks, respecting specified text-length limits.
+
+    Produces a sequence of `CompositeElement`, `Table`, and `TableChunk` elements (chunks).
+
+    Parameters
+    ----------
+    elements
+        A list of unstructured elements. Usually the output of a partition function.
+    max_characters
+        Hard maximum chunk length. No chunk will exceed this length. A single element that exceeds
+        this length will be divided into two or more chunks using text-splitting.
+    new_after_n_chars
+        A chunk of this length or greater is not extended to include the next element, even if
+        that element would fit without exceeding `max_characters`. A "soft max" length that can be
+        used in conjunction with `max_characters` to limit most chunks to a preferred length while
+        still allowing larger elements to be included in a single chunk without resorting to
+        text-splitting. Defaults to `max_characters` when not specified, which effectively disables
+        any soft window. Specifying 0 for this argument causes each element to appear in a chunk by
+        itself (although an element with text longer than `max_characters` will still be split
+        into two or more chunks).
+    overlap
+        Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
+        next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
+        where an oversized element is divided into multiple chunks by text-splitting.
+    overlap_all
+        Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
+        elements and not subject to text-splitting. Use this with caution as it produces a certain
+        level of "pollution" of otherwise clean semantic chunk boundaries.
+    """
+    # -- raises ValueError on invalid parameters --
+    opts = ChunkingOptions.new(
+        max_characters=max_characters,
+        new_after_n_chars=new_after_n_chars,
+        overlap=overlap,
+        overlap_all=overlap_all,
+    )
+
+    return [
+        chunk
+        for pre_chunk in BasicPreChunker.iter_pre_chunks(elements, opts)
+        for chunk in pre_chunk.iter_chunks()
+    ]
+
+
+class BasicPreChunker(BasePreChunker):
+    """Produces pre-chunks from a sequence of document-elements using the "basic" rule-set.
+
+    The "basic" rule-set is essentially "no-rules" other than `Table` is segregated into its own
+    pre-chunk.
+    """
--- a/unstructured/chunking/title.py
+++ b/unstructured/chunking/title.py
@@ -26,6 +26,8 @@ def chunk_by_title(
    combine_text_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
    max_characters: int = 500,
+    overlap: int = 0,
+    overlap_all: bool = False,
 ) -> List[Element]:
    """Uses title elements to identify sections within the document for chunking.

@@ -54,12 +56,22 @@ def chunk_by_title(
    max_characters
        Chunks elements text and text_as_html (if present) into chunks of length
        n characters (hard max)
+    overlap
+        Specifies the length of a string ("tail") to be drawn from each chunk and prefixed to the
+        next chunk as a context-preserving mechanism. By default, this only applies to split-chunks
+        where an oversized element is divided into multiple chunks by text-splitting.
+    overlap_all
+        Default: `False`. When `True`, apply overlap between "normal" chunks formed from whole
+        elements and not subject to text-splitting. Use this with caution as it entails a certain
+        level of "pollution" of otherwise clean semantic chunk boundaries.
    """
    opts = ChunkingOptions.new(
        combine_text_under_n_chars=combine_text_under_n_chars,
        max_characters=max_characters,
        multipage_sections=multipage_sections,
        new_after_n_chars=new_after_n_chars,
+        overlap=overlap,
+        overlap_all=overlap_all,
    )

    pre_chunks = PreChunkCombiner(