2 Commits

Author SHA1 Message Date
Steve Canny
36e81c3367
rfctr(chunking): extract general-purpose objects to base (#2281)
Many of the classes defined in `unstructured.chunking.title` are
applicable to any chunking strategy and will shortly be used for the
"by-character" chunking strategy as well.

Move these and their tests to `unstructured.chunking.base`.

Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because
it will be generalized in a subsequent PR to also accept `Table` elements
so that inter-pre-chunk overlap can be implemented.

Otherwise, no logic changes, just moves.
2023-12-16 17:28:15 +00:00
Steve Canny
70cf141036
rfctr: extract ChunkingOptions (#2266)
Chunking options such as chunk size are largely independent of chunking
strategy. Further, validating the arguments and applying defaults based
on the call arguments takes fairly sophisticated logic, which is what
makes the options easy for the caller to use. These details distract
from what the chunker is actually doing and would need to be repeated
for every chunking strategy if left where they are.

Extract these settings, along with the rules that govern chunking
behavior based on options, into a separate immutable object that can be
passed to any component that is subject to optional behavior (pretty
much all of them).
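
A minimal sketch of the idea, not the code from this PR: the class name
`ChunkingOptionsSketch`, its fields, its default values, and the
`hard_max`/`soft_max` properties are all assumptions made for illustration.
It shows validation and default-resolution living in one frozen object that
any chunker component could receive.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkingOptionsSketch:
    """Hypothetical immutable options object shared by chunking components.

    Field names and defaults are illustrative assumptions only.
    """

    max_characters: int = 500
    new_after_n_chars: Optional[int] = None

    def __post_init__(self) -> None:
        # Validate once, at construction, so no chunker has to repeat it.
        if self.max_characters <= 0:
            raise ValueError("max_characters must be greater than zero")
        if self.new_after_n_chars is not None and self.new_after_n_chars < 0:
            raise ValueError("new_after_n_chars must not be negative")

    @property
    def hard_max(self) -> int:
        """Chunk length that must never be exceeded."""
        return self.max_characters

    @property
    def soft_max(self) -> int:
        """Preferred chunk length, never larger than the hard maximum."""
        if self.new_after_n_chars is None:
            # Default derived from another argument when not specified.
            return self.max_characters
        return min(self.new_after_n_chars, self.max_characters)
```

A component would read `opts.soft_max` rather than re-deriving the default
itself, and because the dataclass is frozen, no component can change the
options out from under another.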
2023-12-15 19:51:02 +00:00