unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-03 23:20:35 +00:00

Author	SHA1	Message	Date
Steve Canny	7a1e732aa1	feat(chunking): add inter-chunk overlap (#2309 ) Reviewer: This PR probably reviews faster commit-by-commit. Each of the commits is groomed and focuses on a separate clear aspect of this implementation. This PR adds inter-chunk overlap capability to chunking. It does not yet expose it via the API. Inter-chunk overlap is overlap between whole pre-chunks, prior to any text-splitting required for oversized chunks. Contrast with intra-chunk overlap implemented in the prior PR which implements overlap on these latter text-splitting boundaries. Inter-chunk overlap is disabled by default since a pre-chunk already has a "clean" semantic boundary (composed of whole elements) and adding overlap there introduces noise from the adjacent context. If the user wants inter-chunk overlap they must specify `overlap_all=True` in the options. Inter-chunk overlap uses the same `overlap` length value used by intra-chunk overlap and does not overlap when that value is 0.	2024-01-05 01:24:12 +00:00
Steve Canny	eb1b022ff8	feat(chunking): add overlap on chunk-splits (#2305 ) There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element. So if an oversized element had text "abcd efgh ijkl mnop qrst" and was split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl" and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and applied at the beginning of the string (before "abcd") is handled in a separate operation in the next PR.	2023-12-22 20:35:18 +00:00
Steve Canny	093a11d058	rfctr(chunking): split oversized chunks on word boundary (#2297 ) The text of an oversized chunk is split on an arbitrary character boundary (mid-word). The `chunk_by_character()` strategy introduces the idea of allowing the user to specify a separator to use for chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " "; blank-line, newline, or word boundaries respectively. Even if the user is allowed to specify a separator, we must provide fall-back for when a chunk contains no such character. This can be done incrementally, like blank-line is preferable to newline, newline is preferable to word, and word is preferable to arbitrary character. Further, there is nothing particular to `chunk_by_character()` in providing such a fall-back text-splitting strategy. It would be preferable for all strategies to split oversized chunks on even-word boundaries for example. Note that while a "blank-line" ("\n\n") may be common in plain text, it is unlikely to appear in the text of an element because it would have been interpreted as an element boundary during partitioning. Add _TextSplitter with basic separator preferences and fall-back and apply it to chunk-splitting for all strategies. The `by_character` chunking strategy may enhance this behavior by adding the option for a user to specify a particular separator suited to their use case.	2023-12-21 05:45:36 +00:00
Steve Canny	82714cad98	rfctr(chunking): extract BasePreChunker (#2294 ) The `_split_elements_by_title_and_table()` function fulfills the pre-chunker role for `chunk_by_title()`, but most of its operation is not strategy-specific and can be reused by other chunking strategies. Extract `BasePreChunker` and use it as the base class for `_ByTitlePreChunker` which now only needs to provide the boundary predicates specific to that strategy.	2023-12-20 06:30:21 +00:00
Steve Canny	4e2ba2c9b2	rfctr(chunking): extract boundary predicates (#2284 ) `chunk_by_title()` respects certain semantic boundaries while chunking. Those are sections introduced by a `Title` element, sections introduced by a `metadata.section` value change, and optionally page-breaks. "Respecting" in this context means that elements on opposite sides of a semantic boundary never appear in the same chunk. The `metadata_differs()` function used for this purpose is clumsy to use requiring the caller to maintain state (prior element). It also combines what are independent predicates such that they cannot be individually reused. Introduce the `BoundaryPredicate` type which takes an element and returns bool, indicating whether the element introduces a new semantic boundary. These can be reused by any chunking strategy that needs them and allows the pre-chunking operation to be generalized for use by any chunking strategy, which it will be in the following PR.	2023-12-19 18:20:05 +00:00
Steve Canny	0c7f64ecaa	rfctr(chunking): generalize PreChunkBuilder (#2283 ) To implement inter-pre-chunk overlap, we need a context that sees every pre-chunk both before and after it is accumulated (from elements). - We need access to the pre-chunk when it is completed so we can extract the "tail" overlap to be applied to the next chunk. - We need access to the as-yet-unpopulated pre-chunk so we can add the prior tail to it as a prefix. This "visibility" is split between `PreChunkBuilder` and the pre-chunker itself, which handles `TablePreChunk`s without the builder. Move `Table` element and TablePreChunk` formation into `PreChunkBuilder` such that _all_ element types (adding `Table` elements in particular) pass through it. Then `PreChunkBuilder` becomes the context we require. The actual overlap harvesting and application will come in a subsequent commit.	2023-12-18 22:21:34 +00:00
Steve Canny	36e81c3367	rfctr(chunking): extract general-purpose objects to base (#2281 ) Many of the classes defined in `unstructured.chunking.title` are applicable to any chunking strategy and will shortly be used for the "by-character" chunking strategy as well. Move these and their tests to `unstructured.chunking.base`. Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because it will be generalized in a subsequent PR to also take `Table` elements such that inter-pre-chunk overlap can be implemented. Otherwise, no logic changes, just moves.	2023-12-16 17:28:15 +00:00
Steve Canny	70cf141036	rfctr: extract ChunkingOptions (#2266 ) Chunking options for things like chunk-size are largely independent of chunking strategy. Further, validating the args and applying defaults based on call arguments is sophisticated to make its use easy for the caller. These details distract from what the chunker is actually doing and would need to be repeated for every chunking strategy if left where they are. Extract these settings and the rules governing chunking behavior based on options into its own immutable object that can be passed to any component that is subject to optional behavior (pretty much all of them).	2023-12-15 19:51:02 +00:00

8 Commits