unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-04 23:52:23 +00:00

Author	SHA1	Message	Date
Yao You	b814ece39f	fix: properly handle the case when an element's text is None (#3995 ) Some elements, like `Image`, can have `None` as the value of their `text` attribute. In that case the current chunking logic fails because it expects the field to always have a length and to be splittable. The fix is to use `element.text or ""` when checking length and to add an early exit so that split is never called on `None`.	2025-05-05 18:08:11 +00:00
Steve Canny	4379d883a3	chunk: relax table segregation during chunking (#3812 ) Summary Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows. Additional Context Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-09 18:57:22 +00:00
Steve Canny	c85f29e6ca	fix(xlsx): XLSX emits std minified .text_as_html (#3558 ) Summary Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_xlsx()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context - XLSX `.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements). - `table.text` is clean-concatenated-text (CCT) of table. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-17 22:05:11 +00:00
Steve Canny	086b8d6f8a	rfctr(part): prepare for pluggable auto-partitioners 2 (#3657 ) Summary Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata` field from `ElementMetadata`. Additional Context - "regex-metadata" was an experimental feature that didn't pan out. - It's implemented by one of the post-partitioning metadata decorators, so get rid of it as part of the cleanup before consolidating those decorators.	2024-09-24 17:33:25 +00:00
Steve Canny	a861ed8fe7	feat(chunk): split tables on even row boundaries (#3504 ) Summary Use more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking to ensure element text and HTML are "synchronized" and HTML is always parseable. Additional Context Table splitting now has the following characteristics: - `TableChunk.metadata.text_as_html` is always a parseable HTML `<table>` subtree. - `TableChunk.text` is always the text in the HTML version of the table fragment in `.metadata.text_as_html`. Text and HTML are "synchronized". - The table is divided at a whole-row boundary whenever possible. - A row is broken at an even-cell boundary when a single row is larger than the chunking window. - A cell is broken at an even-word boundary when a single cell is larger than the chunking window. - `.text_as_html` is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.	2024-08-19 18:56:53 +00:00
Steve Canny	cbe1b35621	rfctr(chunk): prep for adding TableSplitter (#3510 ) Summary Mechanical refactoring in preparation for adding (pre-chunk) `TableSplitter` in a PR stacked on this one.	2024-08-12 18:04:49 +00:00
Steve Canny	05ff975081	fix: remove unused `ElementMetadata.section` (#2921 ) Summary The `.section` field in `ElementMetadata` is dead code, possibly a remainder from a prior iteration of `partition_epub()`. In any case, it is not populated by any partitioner. Remove it and any code that uses it.	2024-04-22 23:58:17 +00:00
Steve Canny	1af41d5f90	feat(chunking): add .orig_elements behavior to chunking (#2656 ) Summary Add the actual behavior to populate `.metadata.orig_elements` during chunking, when so instructed by the `include_orig_elements` option. Additional Context The underlying structures to support this, namely the `.metadata.orig_elements` field and the `include_orig_elements` chunking option, were added in closely prior PRs. This PR adds the behavior to actually populate that metadata field during chunking when the option is set.	2024-03-18 19:27:39 +00:00
Steve Canny	137ea67336	feat(chunking): add include_orig_elements chunking option (#2649 ) Summary Add `include_orig_elements: bool = True` as a new chunking option. This PR does not implement _adding_ original elements to chunks, only accepting this parameter as a chunking option and assigning `True` to it as a default value when it is omitted as a keyword argument. Note this will need to be added in other repositories as well in order to fully support this new option by all access methods. In particular it will need to be added in `unstructured-api` in order to become available via the SDKs.	2024-03-15 18:48:07 +00:00
Steve Canny	8ea203adf7	feat(chunking): composite text gets is_continuation (#2639 ) Summary Add `metadata.is_continuation = True` to metadata of second-and-later text-split chunks formed from an oversized non-table element. Previously this metadata was only present on text-split `TableChunk` elements. This enables downstream filtering of intentionally redundant metadata on chunk elements that may not be desired for all purposes. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-12 19:44:41 +00:00
Steve Canny	51cf6bf716	rfctr(chunking): extract strategy-specific chunking options (#2556 ) Summary A pluggable chunking strategy needs its own local set of chunking options that subclasses a base-class in `unstructured`. Extract distinct `_ByTitleChunkingOptions` and `_BasicChunkingOptions` for the existing two chunking strategies and move their strategy-specific option setting and validation to the respective subclass. This was also a good opportunity for us to clean up a few odds and ends we'd been meaning to. Might be worth looking at the commits individually as they are cohesive incremental steps toward the goal.	2024-02-23 18:22:44 +00:00
Steve Canny	1947375b2e	rfctr(chunking): preparation for plug-in chunkers, Part I (#2550 ) Summary In order to accommodate customized chunkers other than those directly provided by `unstructured`, some further modularization is necessary such that a new chunker can be added as a "plug-in" without modifying the `unstructured` library code. This PR is the straightforward refactoring required for this process like typing changes. There are also some other small changes we've been meaning to make like making all chunking options accept `None` to represent their default value so the broad field of callers (e.g. ingest, unstructured-api, SDK) don't need to determine and set default values for chunking arguments leading to diverging defaults. Isolating these "noisy" but easy to accept changes in this preparatory PR reduces the noise in the more substantive changes to follow.	2024-02-21 23:16:13 +00:00
Steve Canny	7a1e732aa1	feat(chunking): add inter-chunk overlap (#2309 ) Reviewer: This PR probably reviews faster commit-by-commit. Each of the commits is groomed and focuses on a separate clear aspect of this implementation. This PR adds inter-chunk overlap capability to chunking. It does not yet expose it via the API. Inter-chunk overlap is overlap between whole pre-chunks, prior to any text-splitting required for oversized chunks. Contrast with intra-chunk overlap implemented in the prior PR which implements overlap on these latter text-splitting boundaries. Inter-chunk overlap is disabled by default since a pre-chunk already has a "clean" semantic boundary (composed of whole elements) and adding overlap there introduces noise from the adjacent context. If the user wants inter-chunk overlap they must specify `overlap_all=True` in the options. Inter-chunk overlap uses the same `overlap` length value used by intra-chunk overlap and does not overlap when that value is 0.	2024-01-05 01:24:12 +00:00
Steve Canny	eb1b022ff8	feat(chunking): add overlap on chunk-splits (#2305 ) There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element. So if an oversized element had text "abcd efgh ijkl mnop qrst" and was split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl" and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and applied at the beginning of the string (before "abcd") is handled in a separate operation in the next PR.	2023-12-22 20:35:18 +00:00
Steve Canny	093a11d058	rfctr(chunking): split oversized chunks on word boundary (#2297 ) The text of an oversized chunk is split on an arbitrary character boundary (mid-word). The `chunk_by_character()` strategy introduces the idea of allowing the user to specify a separator to use for chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " "; blank-line, newline, or word boundaries respectively. Even if the user is allowed to specify a separator, we must provide fall-back for when a chunk contains no such character. This can be done incrementally, like blank-line is preferable to newline, newline is preferable to word, and word is preferable to arbitrary character. Further, there is nothing particular to `chunk_by_character()` in providing such a fall-back text-splitting strategy. It would be preferable for all strategies to split oversized chunks on even-word boundaries for example. Note that while a "blank-line" ("\n\n") may be common in plain text, it is unlikely to appear in the text of an element because it would have been interpreted as an element boundary during partitioning. Add _TextSplitter with basic separator preferences and fall-back and apply it to chunk-splitting for all strategies. The `by_character` chunking strategy may enhance this behavior by adding the option for a user to specify a particular separator suited to their use case.	2023-12-21 05:45:36 +00:00
Steve Canny	82714cad98	rfctr(chunking): extract BasePreChunker (#2294 ) The `_split_elements_by_title_and_table()` function fulfills the pre-chunker role for `chunk_by_title()`, but most of its operation is not strategy-specific and can be reused by other chunking strategies. Extract `BasePreChunker` and use it as the base class for `_ByTitlePreChunker` which now only needs to provide the boundary predicates specific to that strategy.	2023-12-20 06:30:21 +00:00
Steve Canny	4e2ba2c9b2	rfctr(chunking): extract boundary predicates (#2284 ) `chunk_by_title()` respects certain semantic boundaries while chunking. Those are sections introduced by a `Title` element, sections introduced by a `metadata.section` value change, and optionally page-breaks. "Respecting" in this context means that elements on opposite sides of a semantic boundary never appear in the same chunk. The `metadata_differs()` function used for this purpose is clumsy to use requiring the caller to maintain state (prior element). It also combines what are independent predicates such that they cannot be individually reused. Introduce the `BoundaryPredicate` type which takes an element and returns bool, indicating whether the element introduces a new semantic boundary. These can be reused by any chunking strategy that needs them and allows the pre-chunking operation to be generalized for use by any chunking strategy, which it will be in the following PR.	2023-12-19 18:20:05 +00:00
Steve Canny	0c7f64ecaa	rfctr(chunking): generalize PreChunkBuilder (#2283 ) To implement inter-pre-chunk overlap, we need a context that sees every pre-chunk both before and after it is accumulated (from elements). - We need access to the pre-chunk when it is completed so we can extract the "tail" overlap to be applied to the next chunk. - We need access to the as-yet-unpopulated pre-chunk so we can add the prior tail to it as a prefix. This "visibility" is split between `PreChunkBuilder` and the pre-chunker itself, which handles `TablePreChunk`s without the builder. Move `Table` element and `TablePreChunk` formation into `PreChunkBuilder` such that _all_ element types (adding `Table` elements in particular) pass through it. Then `PreChunkBuilder` becomes the context we require. The actual overlap harvesting and application will come in a subsequent commit.	2023-12-18 22:21:34 +00:00
Steve Canny	36e81c3367	rfctr(chunking): extract general-purpose objects to base (#2281 ) Many of the classes defined in `unstructured.chunking.title` are applicable to any chunking strategy and will shortly be used for the "by-character" chunking strategy as well. Move these and their tests to `unstructured.chunking.base`. Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because it will be generalized in a subsequent PR to also take `Table` elements such that inter-pre-chunk overlap can be implemented. Otherwise, no logic changes, just moves.	2023-12-16 17:28:15 +00:00
Steve Canny	70cf141036	rfctr: extract ChunkingOptions (#2266 ) Chunking options for things like chunk-size are largely independent of chunking strategy. Further, validating the args and applying defaults based on call arguments is sophisticated to make its use easy for the caller. These details distract from what the chunker is actually doing and would need to be repeated for every chunking strategy if left where they are. Extract these settings and the rules governing chunking behavior based on options into its own immutable object that can be passed to any component that is subject to optional behavior (pretty much all of them).	2023-12-15 19:51:02 +00:00
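
The intra-chunk overlap example given in #2305, combined with the word-boundary splitting from #2297, can be reproduced with a minimal sketch. This is not the library's `_TextSplitter`: `split_with_overlap` is a hypothetical helper, and it assumes the split happens at the last space inside the window and that the next piece is prefixed with the trailing `overlap` characters of the previous one.

```python
from typing import List


def split_with_overlap(text: str, maxlen: int, overlap: int = 0) -> List[str]:
    """Hypothetical helper, not the unstructured API.

    Splits `text` into pieces of at most `maxlen` characters, preferring
    word boundaries, and repeats the tail of each piece at the start of
    the next one.
    """
    if maxlen <= 0 or not 0 <= overlap < maxlen:
        raise ValueError("require maxlen > 0 and 0 <= overlap < maxlen")

    chunks: List[str] = []
    remainder = text.strip()
    while len(remainder) > maxlen:
        # Prefer the last space that keeps the piece within the window.
        cut = remainder.rfind(" ", 0, maxlen + 1)
        on_word_boundary = cut > 0
        if not on_word_boundary:
            cut = maxlen  # fall back to an arbitrary character boundary

        chunk = remainder[:cut].rstrip()
        chunks.append(chunk)

        rest = remainder[cut:].lstrip()
        tail = chunk[-overlap:].lstrip() if overlap > 0 else ""
        joiner = " " if on_word_boundary else ""
        candidate = tail + joiner + rest if tail else rest
        # Drop the overlap if keeping it would prevent progress.
        remainder = candidate if len(candidate) < len(remainder) else rest

    if remainder:
        chunks.append(remainder)
    return chunks


print(split_with_overlap("abcd efgh ijkl mnop qrst", 15, 5))
# prints ['abcd efgh ijkl', 'ijkl mnop qrst']
```

With `overlap=0` the same input yields `['abcd efgh ijkl', 'mnop qrst']`, which is the plain word-boundary split.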

20 Commits
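
The `None`-text fix described in #3995 follows a common defensive pattern, sketched below under stated assumptions. `Element`, `text_length`, and `split_words` are hypothetical stand-ins used only for illustration; they are not the classes or functions changed in that commit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for an element whose text may be None."""

    text: Optional[str] = None


def text_length(element: Element) -> int:
    # `element.text or ""` maps None to an empty string before len().
    return len(element.text or "")


def split_words(element: Element) -> List[str]:
    text = element.text or ""
    if not text:
        # Early exit: never call .split() on a missing text value.
        return []
    return text.split()
```

An `Element()` with no text then reports a length of 0 and splits into an empty list instead of raising `TypeError` or `AttributeError`.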