mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-08-29 11:16:38 +00:00

### `chunk_by_title()` interface is "rude"

**Executive Summary.** Perhaps the most commonly specified option for `chunk_by_title()` is `max_characters` (default: 500), which specifies the chunk window size. When a user specifies a value below the default, they get an error message:

```python
>>> chunks = chunk_by_title(elements, max_characters=100)
ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.
```

A few of the things that might reasonably pass through a user's mind at such a moment are:

* "Is `100` not a valid value for `max_characters`? Why would that be?"
* "I didn't specify a value for `combine_text_under_n_chars` or `new_after_n_chars`. In fact, I don't know what they are because I haven't studied the documentation and would prefer not to; I just want smaller chunks! How could I supply an invalid value when I haven't supplied any value at all for these?"
* "Which of these values is the problem? Why are you making me figure that out for myself? I'm sure the code knows which one is not valid, so why doesn't it share that information with me? I'm busy here!"

In this particular case, the problem is that `combine_text_under_n_chars` (defaults to 500) is greater than `max_characters`, which means it would never take effect (which is actually not a problem in itself).

To fix this, once they had figured out that this was the problem (probably after opening an issue and maybe reading the source code), the user would need to specify:

```python
>>> chunks = chunk_by_title(
...     elements, max_characters=100, combine_text_under_n_chars=100
... )
```

This and other stressful user scenarios can be remedied by:

* Using "active" defaults for the `combine_text_under_n_chars` and `new_after_n_chars` options.
* Providing a specific error message for each way a constraint may be violated, so that the remedy is immediately clear to the user.
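As a rough illustration of constraint-specific error messages, the sketch below checks each option separately and names the offending argument and value. The function name `validate_chunking_options` and the exact set of constraints are assumptions made for this example, not the library's actual validation code; only the "`combine_text_under_n_chars` must not exceed `max_characters`" rule comes directly from the scenario above.

```python
def validate_chunking_options(
    max_characters: int,
    combine_text_under_n_chars: int,
    new_after_n_chars: int,
) -> None:
    """Raise a ValueError naming the specific option that is invalid.

    Hypothetical helper: the constraints below are assumed for illustration.
    """
    if max_characters <= 0:
        raise ValueError(
            f"'max_characters' argument must be > 0, got {max_characters}"
        )
    if combine_text_under_n_chars < 0:
        raise ValueError(
            "'combine_text_under_n_chars' argument must be >= 0,"
            f" got {combine_text_under_n_chars}"
        )
    if new_after_n_chars < 0:
        raise ValueError(
            f"'new_after_n_chars' argument must be >= 0, got {new_after_n_chars}"
        )
    # The case from the motivating example: the combine threshold is larger
    # than the chunk window, so say so and show both values.
    if combine_text_under_n_chars > max_characters:
        raise ValueError(
            "'combine_text_under_n_chars' argument must not exceed"
            f" 'max_characters' value, got {combine_text_under_n_chars}"
            f" > {max_characters}"
        )
```

With a check like this, the user in the motivating example would see which argument is at fault and which values conflict, instead of a list of three suspects.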
An *active default* is, for example:

* Make the default for `combine_text_under_n_chars: int | None = None` such that the code can detect when it has not been specified.
* When not specified, set its value to `max_characters`, the same as its current (static) default.

This particular change would avoid the behavior in the motivating example above. Another alternative for this argument is simply:

```python
combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
```

### Fix

1. Add constraint-specific error messages.
2. Use "active" defaults for `combine_text_under_n_chars` and `new_after_n_chars`.
3. Improve docstring to describe active defaults, and explain other argument behaviors, in particular identifying suppression options like `combine_text_under_n_chars = 0` to disable chunk combining.
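The active defaults in item 2 might be resolved along the following lines. This is a minimal sketch, not the library's implementation: `resolve_chunking_options` is a hypothetical helper, and defaulting `new_after_n_chars` to `max_characters` is an assumption, since the text above only states the intended default for `combine_text_under_n_chars`.

```python
from __future__ import annotations


def resolve_chunking_options(
    max_characters: int = 500,
    combine_text_under_n_chars: int | None = None,
    new_after_n_chars: int | None = None,
) -> tuple[int, int, int]:
    """Fill in unspecified options relative to `max_characters`.

    Hypothetical helper for illustration only.
    """
    # `None` means "not specified", so the default tracks `max_characters`
    # instead of being pinned to a static 500.
    if combine_text_under_n_chars is None:
        combine_text_under_n_chars = max_characters
    # Assumption: the same rule is applied to `new_after_n_chars`.
    if new_after_n_chars is None:
        new_after_n_chars = max_characters
    return max_characters, combine_text_under_n_chars, new_after_n_chars
```

Because an explicit `0` is distinct from `None`, a caller can still pass `combine_text_under_n_chars=0` to disable chunk combining, while a caller who only sets `max_characters=100` gets consistent values without knowing the other options exist.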