Mirror of https://github.com/Unstructured-IO/unstructured.git
Synced 2025-07-03 15:11:30 +00:00

# 21 Commits

## rfctr: extract ChunkingOptions (#2266)

Commit: 70cf141036
Chunking options for things like chunk-size are largely independent of chunking strategy. Further, validating the arguments and applying defaults based on call arguments is somewhat involved, in order to make use easy for the caller. These details distract from what the chunker is actually doing and would need to be repeated for every chunking strategy if left where they are. Extract these settings, and the rules governing chunking behavior based on them, into an immutable object that can be passed to any component that is subject to optional behavior (pretty much all of them).
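
A minimal sketch of the shape such an immutable options object can take, assuming a frozen dataclass that validates once at construction (field names beyond the documented chunking arguments are illustrative, not the library's actual implementation):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkingOptions:
    """Immutable bundle of chunking options, validated once at construction."""

    max_characters: int = 500  # -- hard-max chunk window size --
    new_after_n_chars: Optional[int] = None  # -- "soft" max; None means unspecified --

    def __post_init__(self) -> None:
        if self.max_characters <= 0:
            raise ValueError(f"'max_characters' must be > 0, got {self.max_characters}")
        if self.new_after_n_chars is None:
            # -- apply the "active" default: fall back to the hard max --
            object.__setattr__(self, "new_after_n_chars", self.max_characters)

    @property
    def text_separator(self) -> str:
        """Separator placed between element texts within a chunk."""
        return "\n\n"
```

Any component that needs an option then receives the single options object instead of re-validating loose keyword arguments.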

## rfctr: rename pre chunk (#2261)

Commit: cbeaed21ef
The original naming for the precursor to a chunk in `chunk_by_title()` was conflated with the idea of how those element subsequences were bounded (by document section) for that strategy. I mistakenly picked that up as a universal concept, but in fact no notion of section arises in the `by_character` or other chunking strategies. Fix this misconception by using the name *pre-chunk* for this concept throughout.

## rfctr: skip CheckBox elements during chunking (#2253)

Commit: 74d089d942
`CheckBox` elements get special treatment during chunking. `CheckBox` does not derive from `Text` and can contribute no text to a chunk. It is considered "non-combinable" and so is emitted as-is as a chunk of its own. A consequence of this is that it breaks an otherwise contiguous chunk into two wherever it occurs.

This is problematic, but becomes much more so when overlap is introduced. Each chunk accepts a "tail" text fragment from its preceding element and contributes its own tail fragment to the next chunk. These tails represent the "overlap" between chunks. However, a non-text chunk can neither accept nor provide a tail fragment and so interrupts the overlap. None of the possible solutions are terrific.

Give `Element` a `.text` attribute such that _all_ elements have a `.text` attribute, even though its value is the empty string for element types such as `CheckBox` and `PageBreak` which inherently have no text. As a consequence, several `cast()` wrappers are no longer required to satisfy strict type-checking. This also allows a `CheckBox` element to be combined with `Text` subtypes during chunking, essentially the same way `PageBreak` is, contributing no text to the chunk.

Also, remove the `_NonTextSection` object which previously wrapped a `CheckBox` element during pre-chunking, as it is no longer required.
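
A condensed sketch of the shape of that change (class bodies abbreviated; the real classes live in `unstructured.documents.elements`):

```python
from typing import List


class Element:
    """Base class: every element now has .text, empty for non-text types."""

    def __init__(self, text: str = "") -> None:
        self.text = text  # -- empty string, never None, for non-text elements --


class Text(Element):
    def __init__(self, text: str) -> None:
        super().__init__(text=text)


class CheckBox(Element):
    """Contributes no text to a chunk; .text is always the empty string."""


class PageBreak(Element):
    """Also inherently text-less; treated the same way during chunking."""


def joined_text(elements: List[Element]) -> str:
    # -- a chunker can now join text without isinstance() checks or cast() --
    return "\n\n".join(e.text for e in elements if e.text)
```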

## fix(chunk): #1985 mis-splits of Table chunks (#2076)

Commit: 7a741c9ae6
Closes #1985

**Summary.** Due to an interaction of coding errors, the HTML text in `TableChunk` splits of a `Table` element repeated the entire HTML for the table in each chunk.

**Technical Summary.** This behavior was fixed but not published in the last chunking PR of a series. Finish up that PR and submit it all here. This PR extracts chunking to the particular section type (each has its own distinct chunking behavior).

## Dynamic ElementMetadata implementation (#2043)

Commit: 252405c780
### Executive Summary

The structure of element metadata is currently static, meaning only predefined fields can appear in the metadata. We would like the flexibility for end-users, at their own discretion, to define and use additional metadata fields that make sense for their particular use-case.

### Concepts

A key concept for dynamic metadata is the _known field_. A known field is one of those explicitly defined on `ElementMetadata`. Each of these has a type and can be specified when _constructing_ a new `ElementMetadata` instance. This is in contrast to an _end-user defined_ (or _ad-hoc_) metadata field, one not known at "compile" time and added at the discretion of an end-user to suit the purposes of their application. An ad-hoc field can only be added by _assignment_ on an already-constructed instance.

### End-user ad-hoc metadata field behaviors

An ad-hoc field can be added to an `ElementMetadata` instance by assignment:

```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```

A field added in this way can be accessed by name:

```python
>>> metadata.coefficient
0.536
```

and that field will appear in the JSON/dict for that instance:

```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```

However, accessing a "user-defined" value that has _not_ been assigned on that instance raises `AttributeError`:

```python
>>> metadata.coeffcient  # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```

This makes "tagging" a metadata item with a value very convenient, but entails the proviso that if an end-user wants to add a metadata field to _some_ elements and not others (sparse population), AND they want to access that field by name on ANY element and receive `None` where it has not been assigned, they will need to use an expression like this:

```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
```

### Implementation Notes

- **Ad-hoc metadata fields** are discarded during consolidation (for chunking) because we don't have a consolidation strategy defined for them. We could consider using a default consolidation strategy like `FIRST`, or possibly allow a user to register a strategy (although that gets hairy in non-private and multiple-memory-space situations).
- Ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields that might appear in a JSON/dict loaded using `.from_dict()`, so unlike the original (which only loaded known fields), we'll rehydrate anything we find there.
- No real type-safety is possible on ad-hoc fields, but the type-checker does not complain because the type of all ad-hoc fields is `Any` (which is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc fields to "sub" metadata objects too, like `DataSourceMetadata` and conceivably `CoordinatesMetadata` (although I'm not immediately seeing a use-case for the second one).
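
For illustration, a rough sketch of dynamic-field mechanics along these lines; a simplification under stated assumptions, not the actual `ElementMetadata` source:

```python
from typing import Any, Dict


class ElementMetadata:
    _KNOWN_FIELDS = {"filename", "page_number", "languages"}  # -- abbreviated --

    def __init__(self, **known_fields: Any) -> None:
        for name, value in known_fields.items():
            if name not in self._KNOWN_FIELDS:
                raise TypeError(f"unexpected constructor argument {name!r}")
            setattr(self, name, value)

    def __setattr__(self, name: str, value: Any) -> None:
        # -- ad-hoc fields arrive here by plain assignment --
        if name.startswith("_"):
            raise AttributeError("ad-hoc field names cannot start with an underscore")
        super().__setattr__(name, value)

    def to_dict(self) -> Dict[str, Any]:
        # -- only fields actually assigned on this instance appear --
        return dict(self.__dict__)
```

Accessing an unassigned ad-hoc field raises `AttributeError` for free here, because nothing was ever stored under that name.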

## fix: flaky chunk metadata (#1947)

Commit: 51d07b6434
**Executive Summary.** When the elements in a _section_ are combined into a _chunk_, the metadata in each of the elements is _consolidated_ into a single `ElementMetadata` instance. There are two main problems with the current implementation:

1. The current algorithm simply uses the metadata of the first element as the metadata for the chunk. This produces:
   - **empty chunk metadata** when the first element has no metadata, such as a `PageBreak("")`
   - **missing chunk metadata** when the first element contains only partial metadata, such as a `Header()` or `Footer()`
   - **misleading metadata** when the first element contains values applicable only to that element, such as `category_depth`, `coordinates` (bounding-box), `header_footer_type`, or `parent_id`
2. List metadata such as `emphasized_text_content`, `emphasized_text_tags`, `link_texts`, and `link_urls` is only combined when it is unique within the combined list. These lists are "unzipped" pairs; for example, the first `link_texts` item corresponds to the first `link_urls` value. When an item is removed from one (because it matches a prior entry) and not the other (say, same text "here" but a different URL), the positional correspondence is broken and downstream processing will at best be wrong, at worst raise an exception.

### Technical Discussion

Element metadata cannot be determined in the general case simply by sampling that of the first element. At the same time, a simple union of all values is also not sufficient. To effectively consolidate the current variety of metadata fields we need five distinct strategies, selecting which to apply to each field based on that field's provenance and other characteristics. The five strategies are:

- `FIRST` - Select the first non-`None` value across all the elements. Several fields are determined by the document source (`filename`, `file_directory`, etc.) and will not change within the output of a single partitioning run. They might not appear in every element, but they will be the same whenever they do appear. This strategy takes the first one that appears, if any, as a proxy for the value for the entire chunk.
- `LIST` - Consolidate the four list fields like `emphasized_text_content` and `link_urls` by concatenating them in element order (no set semantics apply). All values from `elements[n]` appear before those from `elements[n+1]` and existing order is preserved.
- `LIST_UNIQUE` - Combine only unique items across the (list) values of the elements, preserving the order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the `start` and `end` offset of each match based on its new position in the concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For example, a chunk cannot be guaranteed to have a single `category_depth` or `parent_id`.

Other strategies such as `COORDINATES` could be added to consolidate the bounding box of the chunk from the coordinates of its elements, roughly `min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or `SUM`, depending on how metadata evolves.

The proposed strategy assignments are these:

- `attached_to_filename`: FIRST
- `category_depth`: DROP
- `coordinates`: DROP
- `data_source`: FIRST
- `detection_class_prob`: DROP  # -- ? confirm --
- `detection_origin`: DROP  # -- ? confirm --
- `emphasized_text_contents`: LIST
- `emphasized_text_tags`: LIST
- `file_directory`: FIRST
- `filename`: FIRST
- `filetype`: FIRST
- `header_footer_type`: DROP
- `image_path`: DROP
- `is_continuation`: DROP  # -- not expected, added by chunking, not before --
- `languages`: LIST_UNIQUE
- `last_modified`: FIRST
- `link_texts`: LIST
- `link_urls`: LIST
- `links`: DROP  # -- deprecated field --
- `max_characters`: DROP  # -- unused in code, probably remove from ElementMetadata --
- `page_name`: FIRST
- `page_number`: FIRST
- `parent_id`: DROP
- `regex_metadata`: REGEX
- `section`: FIRST  # -- section unconditionally breaks on new section --
- `sent_from`: FIRST
- `sent_to`: FIRST
- `subject`: FIRST
- `text_as_html`: DROP  # -- not expected, only occurs in TableSection --
- `url`: FIRST

**Assumptions:**

- Each .eml file is partitioned->chunked separately (not in batches), therefore `sent_from`, `sent_to`, and `subject` will not change within a section.

### Implementation

Implementation of this behavior requires two steps:

1. **Collect** all non-`None` values from all elements, each in a sequence keyed by field name. Fields not populated in any of the elements do not appear in the collection.

   ```python
   all_meta = {
       "filename": ["memo.docx", "memo.docx"],
       "link_texts": [["here", "here"], ["and here"]],
       "parent_id": ["f273a7cb", "808b4ced"],
   }
   ```

2. **Apply** the specified strategy to each item in the overall collection to produce the consolidated chunk meta (see implementation).

### Factoring

For the following reasons, the implementation of metadata consolidation is extracted from its current location in `chunk_by_title()` to a handful of collaborating methods in `_TextSection`:

- The current implementation of metadata consolidation "inline" in `chunk_by_title()` already has too many moving pieces to be understood without extended study. Adding strategies to it would make it worse.
- `_TextSection` is the only section type where metadata is consolidated (the other two types always have exactly one element, so already exactly one metadata).
- `_TextSection` is already the expert on all the information required to consolidate metadata, in particular the elements that make up the section and their text.

Some other problems were also fixed in that transition, such as mutation of elements during the consolidation process.

### Technical Risk: adding a new `ElementMetadata` field breaks metadata

If each metadata field requires a strategy assignment to be consolidated, and a developer adds a new `ElementMetadata` field without adding a corresponding strategy mapping, metadata consolidation could break or produce incorrect results. This risk can be mitigated multiple ways:

1. Add a test that verifies a strategy is defined for every field (recommended).
2. Define a default strategy: either `DROP` or `FIRST` for scalar types, `LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.

This PR implements option 1, such that a developer will be notified before merge if they add a new metadata field but do not define a strategy for it.

### Other Considerations

- If end-users can in future add arbitrary metadata fields _before_ chunking, then we'll need to define metadata-consolidation behavior for such fields. Depending on how we implement user-defined metadata fields, we might:
  - Require explicit definition of a new metadata field before use, perhaps with a method like `ElementMetadata.add_custom_field()` which requires a consolidation strategy to be defined (and/or has a default value).
  - Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the field is of type `list`.

### Further Context

Metadata is only consolidated for `TextSection` because the other two section types (`TableSection` and `NonTextSection`) can only contain a single element.

---

## Further discussion on consolidation strategy by field

### Document-static

These fields are very likely to be the same for all elements in a single document:

- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`

*Consolidation strategy:* `FIRST` - use the first one found, if any.

### Section-static

These fields are very likely to be the same for all elements in a single section, which is the scope we really care about for metadata consolidation:

- `section` - an EPUB document-section unconditionally starts a new section.

*Consolidation strategy:* `FIRST` - use the first one found, if any.

### Consolidated list-items

These `List` fields are consolidated by concatenating the lists from each element that has one:

- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.

*Consolidation strategy:* `LIST` - concatenate lists across elements.

### Dynamic

These fields are likely to hold unique data for each element:

- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`

*Consolidation strategy:*

- `DROP`, as likely misleading.
- A `COORDINATES` strategy could be added to compute the bounding box from all bounding boxes.
- Consider allowing these if they are all the same, perhaps with an `ALL` strategy.

### Slow-changing

These fields are somewhere in between, likely to be common between multiple elements but varied within a document:

- `header_footer_type` - *strategy:* drop as not consolidatable.
- `languages` - *strategy:* take first occurrence.
- `page_name` - *strategy:* take first occurrence.
- `page_number` - *strategy:* take first occurrence; these will all be the same when `multipage_sections` is `False`. Worst-case semantics are "this chunk began on this page".

### N/A

These field types do not figure in metadata consolidation:

- `detection_class_prob` - I'm thinking this is for debug and should not appear in chunks, but need confirmation.
- `detection_origin` - for debug only.
- `is_continuation` - is _produced_ by chunking, never by partitioning (not in our code anyway).
- `links` - deprecated, probably should be dropped.
- `max_characters` - is unused and unreferenced in source code as far as I can tell; should be removed from `ElementMetadata`.
- `text_as_html` - only appears in a `Table` element, each of which appears in its own section, so needs no consolidation. Never appears in `TextSection`.

*Consolidation strategy:* `DROP` any that appear (several never will).
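
A condensed sketch of the strategy mechanism described above (the mapping is abbreviated to a few fields, and the `REGEX` strategy is omitted since it also needs the concatenated text):

```python
from enum import Enum
from typing import Any, Dict, List


class CS(Enum):
    FIRST = "first"
    LIST = "list"
    LIST_UNIQUE = "list_unique"
    DROP = "drop"


# -- abbreviated; the real mapping covers every ElementMetadata field --
STRATEGIES: Dict[str, CS] = {
    "filename": CS.FIRST,
    "link_texts": CS.LIST,
    "languages": CS.LIST_UNIQUE,
    "parent_id": CS.DROP,
}


def consolidate(all_meta: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Apply the per-field strategy to the collected non-None values."""
    chunk_meta: Dict[str, Any] = {}
    for field, values in all_meta.items():  # -- values are in element order --
        strategy = STRATEGIES.get(field, CS.DROP)
        if strategy is CS.FIRST:
            chunk_meta[field] = values[0]
        elif strategy is CS.LIST:
            chunk_meta[field] = [item for v in values for item in v]
        elif strategy is CS.LIST_UNIQUE:
            # -- dict.fromkeys() keeps first-appearance order --
            chunk_meta[field] = list(dict.fromkeys(item for v in values for item in v))
        # -- CS.DROP: omit the field entirely --
    return chunk_meta
```

Applied to the `all_meta` example above, this yields `{"filename": "memo.docx", "link_texts": ["here", "here", "and here"]}`, with `parent_id` dropped.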

## feat: clean pdfminer elements inside tables (#1808)

Commit: 05c3cd1be2
This PR introduces `clean_pdfminer_inner_elements`, which deletes pdfminer elements found inside elements from other detection origins such as YoloX or detectron. The function returns the cleaned document. Also, the ingest-test fixtures were updated to reflect the new standard output.

The best way to check that this function is working properly is to check the new test `test_clean_pdfminer_inner_elements` in `test_unstructured/partition/utils/test_processing_elements.py`.

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
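
As a rough illustration of the idea, a simplified containment filter; the element shape here (a `bbox` 4-tuple and a `detection_origin` string) is an assumption for the sketch, not the function's actual signature:

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class LayoutEl:  # -- assumed stand-in for a layout element --
    bbox: Box
    detection_origin: str  # e.g. "pdfminer", "yolox", "detectron2"


def _is_inside(inner: Box, outer: Box) -> bool:
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer
    return ix1 >= ox1 and iy1 >= oy1 and ix2 <= ox2 and iy2 <= oy2


def clean_pdfminer_inner(elements: List[LayoutEl]) -> List[LayoutEl]:
    """Drop pdfminer elements nested inside an element from another origin."""
    detected = [e for e in elements if e.detection_origin != "pdfminer"]
    return [
        e
        for e in elements
        if e.detection_origin != "pdfminer"
        or not any(_is_inside(e.bbox, d.bbox) for d in detected)
    ]
```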

## fix: sectioner dissociated titles from their chunk (#1861)

Commit: 7373391aa4
### disassociated-titles

**Executive Summary.** Section titles are often combined with the prior section and then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks are combined into a single chunk that better fills the chunk window. Chunking can be, and by default is, configured to combine sequential small chunks that will together fit within the full chunk window (default 500 chars).

Combination is only valid for "whole" chunks. The current implementation attempts to combine at the element level (in the sectioner), meaning a small initial element (such as a `Title`) is combined with the prior section without considering the remaining length of the section that title belongs to. This frequently causes a title element to be removed from the chunk it belongs to and added to the prior, otherwise unrelated, chunk. Example:

```python
elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```

**Technical Summary.** Combination cannot be effectively performed at the element level, at least not without complicating things with arbitrary look-ahead into future elements. Much more straightforward is to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a stream of sections and emits a stream of sometimes-combined sections. The solution implemented in this PR builds upon introducing `_Section` objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length` property, which allows a combining client to produce a combined section (only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of combination, acting as a "filter": it accepts a stream of sections and emits the same type, just with some resulting from two or more combined input sections: `(Iterable[_Section]) -> Iterator[_Section]` (see the sketch below).
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes responsibility for repeatedly accumulating sections, characterizing their length, and doing the actual combining (calling `_Section.combine(other_section)`) when instructed. Very similar in concept to `_TextSectionBuilder`, just at the section level instead of the element level.
- Remove attempts to combine sections at the element level from `_split_elements_by_title_and_table()` and install `_SectionCombiner` as a filter between the sectioner and the chunker.
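
A simplified sketch of that combining filter, assuming `_TextSection` exposes the `.combine()` method and `.text_length` property described above (the real threshold rules are more nuanced):

```python
from typing import Iterable, Iterator, Optional


class _TextSection:
    """Stub carrying just the two members the combiner relies on."""

    def __init__(self, text: str) -> None:
        self.text = text

    @property
    def text_length(self) -> int:
        return len(self.text)

    def combine(self, other: "_TextSection") -> "_TextSection":
        return _TextSection(self.text + "\n\n" + other.text)


def combine_sections(
    sections: Iterable[object], combine_text_under_n_chars: int
) -> Iterator[object]:
    """(Iterable[_Section]) -> Iterator[_Section]: whole sections in, whole sections out."""
    accum: Optional[_TextSection] = None
    for section in sections:
        if not isinstance(section, _TextSection):
            if accum is not None:  # -- flush any pending text-section first --
                yield accum
                accum = None
            yield section  # -- table/non-text sections pass through unchanged --
        elif accum is None:
            accum = section
        elif accum.text_length + 2 + section.text_length <= combine_text_under_n_chars:
            accum = accum.combine(section)  # -- only whole sections are combined --
        else:
            yield accum
            accum = section
    if accum is not None:
        yield accum
```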

## fix: sectioner does not consider separator length (#1858)

Commit: f273a7cb83
### sectioner-does-not-consider-separator-length

**Executive Summary.** A primary responsibility of the sectioner is to minimize the number of chunks that need to be split mid-text. It does this by computing the text-length of the section being formed and "finishing" the section when adding another element would extend its text beyond the window size. When element text is consolidated into a chunk, the text of each element is joined, separated by a "blank line" (`"\n\n"`). The sectioner does not currently consider the added length of separators (2 chars each) and so forms sections that need to be split mid-text when chunked. Chunk-splitting should only be necessary when the text of a single element is longer than the chunking window.

**Example**

```python
elements: List[Element] = [
    Title("Chunking Priorities"),  # 19 chars
    ListItem("Divide text into manageable chunks"),  # 34 chars
    ListItem("Preserve semantic boundaries"),  # 28 chars
    ListItem("Minimize mid-text chunk-splitting"),  # 33 chars
]  # -- 114 chars total but 120 chars with separators --

chunks = chunk_by_title(elements, max_characters=115)
```

Want:

```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
    ),
    CompositeElement("Minimize mid-text chunk-splitting"),
]
```

Got:

```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
        "\n\nMinimize mid-text chunk-spli"
    ),
    CompositeElement("tting"),
]
```

### Technical Summary

Because the sectioner does not consider separator (`"\n\n"`) length when it computes the space remaining in the section, it over-populates the section, and when the chunker concatenates the element text (each separated by the separator) the text exceeds the window length and the chunk must be split mid-text, even though there was an even element boundary it could have been split on.

### Fix

Consider separator length in the space-remaining computation.

The solution here extracts both the `section.text_length` and `section.space_remaining` computations to a `_TextSectionBuilder` object, which removes the need for the sectioner (`_split_elements_by_title_and_table()`) to deal with primitives (`List[Element]`, running text length, separator length, etc.) and allows it to focus on the rules of when to start a new section. A sketch of the separator-aware arithmetic follows below.

This solution may seem like overkill at the moment, and indeed it would be, except that it forms the foundation for adding section-level chunk combination (fix: dissociated title elements) in the next PR. The objects introduced here will gain several additional responsibilities in the next few chunking PRs in the pipeline and will earn their place.
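
A minimal sketch of the separator-aware arithmetic such a builder performs (illustrative, not the actual `_TextSectionBuilder` source):

```python
from typing import List


class _TextSectionBuilder:
    SEPARATOR = "\n\n"

    def __init__(self, maxlen: int) -> None:
        self._maxlen = maxlen
        self._texts: List[str] = []

    @property
    def text_length(self) -> int:
        # -- n texts are joined by n-1 separators of 2 chars each --
        separator_chars = len(self.SEPARATOR) * max(len(self._texts) - 1, 0)
        return sum(len(t) for t in self._texts) + separator_chars

    @property
    def space_remaining(self) -> int:
        # -- adding another element also costs one separator, unless empty --
        separator_cost = len(self.SEPARATOR) if self._texts else 0
        return self._maxlen - self.text_length - separator_cost

    def add_text(self, text: str) -> None:
        self._texts.append(text)
```

With this in place, the sectioner's rule reduces to "start a new section when the next element's text is longer than `builder.space_remaining`".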

## build(deps): remove ebooklib (#1878)

Commit: 808b4ced7a
- **Removed `ebooklib` as a dependency.** `ebooklib` is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.

## fix: chunk_by_title() interface is rude (#1844)

Commit: 40a265d027
### `chunk_by_title()` interface is "rude"

**Executive Summary.** Perhaps the most commonly specified option for `chunk_by_title()` is `max_characters` (default: 500), which specifies the chunk window size. When a user specifies this value, they get an error message:

```python
>>> chunks = chunk_by_title(elements, max_characters=100)
ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.
```

A few of the things that might reasonably pass through a user's mind at such a moment are:

- "Is `100` not a valid value for `max_characters`? Why would that be?"
- "I didn't specify a value for `combine_text_under_n_chars` or `new_after_n_chars`; in fact I don't know what they are because I haven't studied the documentation and would prefer not to; I just want smaller chunks! How could I supply an invalid value when I haven't supplied any value at all for these?"
- "Which of these values is the problem? Why are you making me figure that out for myself? I'm sure the code knows which one is not valid, why doesn't it share that information with me? I'm busy here!"

In this particular case, the problem is that `combine_text_under_n_chars` (defaults to 500) is greater than `max_characters`, which means it would never take effect (which is actually not a problem in itself). To fix this, having figured out that this was the problem, probably after opening an issue and maybe reading the source code, the user would need to specify:

```python
>>> chunks = chunk_by_title(
...     elements, max_characters=100, combine_text_under_n_chars=100
... )
```

This and other stressful user scenarios can be remedied by:

- Using "active" defaults for the `combine_text_under_n_chars` and `new_after_n_chars` options.
- Providing a specific error message for each way a constraint may be violated, such that direction to remedy the problem is immediately clear to the user.

An *active default* is, for example:

- Make the default for `combine_text_under_n_chars: int | None = None` such that the code can detect when it has not been specified.
- When not specified, set its value to `max_characters`, the same as its current (static) default. This particular change would avoid the behavior in the motivating example above.

Another alternative for this argument is simply:

```python
combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
```

### Fix

1. Add constraint-specific error messages.
2. Use "active" defaults for `combine_text_under_n_chars` and `new_after_n_chars` (sketched below).
3. Improve the docstring to describe active defaults and explain other argument behaviors, in particular identifying suppression options like `combine_text_under_n_chars = 0` to disable chunk combining.
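
A minimal sketch of how active defaults plus constraint-specific messages can fit together (illustrative; the exact constraints and message wording in the PR may differ):

```python
from typing import Optional, Tuple


def resolve_chunking_args(
    max_characters: int = 500,
    combine_text_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
) -> Tuple[int, int, int]:
    # -- None means "not specified", so the active default can be derived --
    if combine_text_under_n_chars is None:
        combine_text_under_n_chars = max_characters  # same as the old static default
    if new_after_n_chars is None:
        new_after_n_chars = max_characters
    # -- one specific message per violated constraint --
    if combine_text_under_n_chars > max_characters:
        raise ValueError(
            f"'combine_text_under_n_chars' argument must not exceed 'max_characters',"
            f" got {combine_text_under_n_chars} > {max_characters}"
        )
    return max_characters, combine_text_under_n_chars, new_after_n_chars
```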

## chore: fix infer_table bug (#1833)

Commit: 0584e1d031
Carries `skip_infer_table_types` through to `infer_table_structure` in the partition flow. Now Table elements from PPT/X, DOC/X, etc. should not have a `text_as_html` field.

Note: I've continued to exclude this var from partitioners that go through the html flow. I think if we've already got the html it doesn't make sense to carry the infer variable along, since we're not 'infer-ing' the html table in these cases.
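
As a rough sketch of the decision being threaded through (names and the default skip-list here are assumptions for illustration, not the library's actual values):

```python
from typing import Tuple

ASSUMED_SKIP_INFER_TABLE_TYPES: Tuple[str, ...] = ("jpg", "png", "xls", "xlsx")


def should_infer_tables(
    filetype: str,
    infer_table_structure: bool,
    skip_infer_table_types: Tuple[str, ...] = ASSUMED_SKIP_INFER_TABLE_TYPES,
) -> bool:
    # -- table structure (and thus text_as_html) is only produced for
    # -- filetypes not on the skip list
    return infer_table_structure and filetype not in skip_infer_table_types
```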
TODO:
✅ add unit tests
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>

## fix: split-chunks appear out-of-order (#1824)

Commit: 82c8adba3f
**Executive Summary.** Code inspection in preparation for adding the chunk-overlap feature revealed a bug causing split-chunks to be inserted out-of-order. For example, elements like this:

```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```

should produce chunks:

```
CompositeElement("One ...")            # (400 chars)
CompositeElement("Two ...")            # (400 chars)
CompositeElement("Three ...")          # (500 chars)
CompositeElement("rest of Three ...")  # (100 chars)
CompositeElement("Four")               # (400 chars)
CompositeElement("Five ...")           # (500 chars)
CompositeElement("rest of Five ...")   # (100 chars)
```

but produced this instead:

```
CompositeElement("Five ...")           # (500 chars)
CompositeElement("rest of Five ...")   # (100 chars)
CompositeElement("Three ...")          # (500 chars)
CompositeElement("rest of Three ...")  # (100 chars)
CompositeElement("One ...")            # (400 chars)
CompositeElement("Two ...")            # (400 chars)
CompositeElement("Four")               # (400 chars)
```

This PR fixes that behavior, which was introduced on Oct 9 this year in commit f98d5e65 when adding chunk splitting.

**Technical Summary**

The essential transformation of chunking is:

```
elements      ->  sections             ->  chunks
List[Element] ->  List[List[Element]]  ->  List[CompositeElement]
```

1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_ semantically related elements into _sections_ (`List[Element]`); in the best case, that would be a title (heading) and the text that follows it (until the next title). A heading and its text is often referred to as a _section_ in publishing parlance, hence the name.
2. The _chunker_ (`chunk_by_title()` currently) does two things:
   1. First it _consolidates_ the elements of each section into a single `CompositeElement` object (a "chunk"). This includes both joining the element text into a single string and consolidating the metadata of the section elements.
   2. Then, if necessary, it _splits_ the chunk into two or more `CompositeElement` objects when the consolidated text is too long to fit in the specified window (`max_characters`).

Chunk splitting is only required when a single element (like a big paragraph) has text longer than the specified window. Otherwise a section, and the chunk that derives from it, reflects an even element boundary.

`chunk_by_title()` was elaborated in commit f98d5e65 to add this "chunk-splitting" behavior. At the time there was some notion of wanting to "split from the end backward" such that any small remainder chunk would appear first and could possibly be combined with a small prior chunk. To accomplish this, split chunks were _inserted_ at the beginning of the list instead of _appended_ to the end.

The `chunked_elements` variable (`List[CompositeElement]`) holds the sequence of chunks that result from the chunking operation and is the returned value for `chunk_by_title()`. This is the list that "split-from-the-end" chunks were inserted at the beginning of, and that is what produces the out-of-order behavior: the insertion was at the beginning of this "all-chunks-in-document" list, not a sublist just for the chunk being split.

Further, the "split-from-the-end" behavior can produce no benefit, because chunks are never combined; only _elements_ are combined (across semantic boundaries into a single section when a section is small), and sectioning occurs _prior_ to chunking.

The fix is to rework the chunk-splitting passage into a straightforward iterative algorithm that works both when a chunk must be split and when it doesn't. This algorithm is also very easily extended to implement split-chunk overlap, which is coming up in an immediately following PR:

```python
# -- split chunk into CompositeElement objects maxlen or smaller --
# -- (`overlap` belongs to the follow-on PR; overlap = 0 reproduces the
# -- non-overlapping behavior of this one)
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```

*Forensic analysis:* the out-of-order-chunks behavior was introduced in commit f98d5e65 on 10/09/2023, in the same PR in which chunk-splitting was introduced.

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>

## Chore (refactor): support table extraction with pre-computed ocr data (#1801)

Commit: ce40cdc55f
### Summary

Table OCR refactor: move the OCR part for the table model in the inference repo to the unst repo.

- Before this PR, the table model extracts OCR tokens with texts and bounding boxes and fills the tokens into the table structure in the inference repo. This means we need to do an additional OCR pass for tables.
- After this PR, we use the OCR data from the entire-page OCR and pass the OCR tokens to the inference repo, which means we only do one OCR pass for the entire document.

**Tech details:**

- Combined the env vars `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this means we use the same OCR agent for the entire page and for tables, since we only do one OCR pass.
- Bump the inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unst repo. Please check in [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
- All notebook lint changes were made by `make tidy`.
- This PR also fixes [issue](https://github.com/Unstructured-IO/unstructured/issues/1564); I've added a test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`.
- Add the same scaling logic to the image [similar to the previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to the entire image.

### Test

- Not much to manually test except that table extraction still works.
- But due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in table output. Here is a comparison of test outputs I found from the same test `test_partition_image_with_table_extraction`.

Screenshot for the table in `layout-parser-paper-with-table.jpg`:

<img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">

Before refactor:

<img width="709" alt="before" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">

After refactor:

<img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO (added as a ticket)

Still have some cleanup to do in the inference repo, since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR-related in inference, here are the items that are deprecated and can be removed:

- [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code)
- parameter `extract_tables` in inference
- [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
- [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
- env `TABLE_OCR`

### Note

If we want to fall back to an additional table OCR pass (we may need this for using paddle for tables), we need to:

- pass `infer_table_structure` to inference with the `extract_tables` parameter
- stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>

## fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)

Commit: d9c2516364
**Executive Summary.** Introducing strict type-checking as preparation for adding the chunk-overlap feature revealed a type mismatch for regex-metadata between the chunking tests and the (authoritative) `ElementMetadata` definition. The implementation of the regex-metadata aspects of chunking passed the tests but did not produce the appropriate behaviors in production, where the actual data structure was different. This PR fixes these two bugs:

1. **Over-chunking.** The presence of `regex-metadata` in an element was incorrectly being interpreted as a semantic boundary, leading to such elements being isolated in their own chunks.
2. **Discarded regex-metadata.** Regex-metadata present on the second or later elements in a section (chunk) was discarded.

**Technical Summary**

The type of `ElementMetadata.regex_metadata` is `Dict[str, List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text": "this matched", "start": 7, "end": 19}`. Multiple regexes can be specified, each with a name like "mail-stop", "version", etc. Each of those may produce its own set of matches, like:

```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```

*Forensic analysis*

- The regex-metadata feature was added by Matt Robinson on 06/16/2023 in commit 4ea71683. The `regex_metadata` data structure is the same as when it was added.
- The chunk-by-title feature was added by Matt Robinson on 08/29/2023 in commit f6a745a7. The mistaken regex-metadata data structure in the tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data structure and insufficient type-checking rigor (type-checker strictness level set too low) to warn of the mistake.

**Over-chunking Behavior**

The over-chunking looked like this: chunking three elements with regex metadata should combine them into a single chunk (`CompositeElement` object), subject to maximum size rules (default 500 chars).

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```

Observed behavior looked like this:

```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```

The fix changed the approach from breaking on any metadata field not in a specified group (`regex_metadata` was missing from this group) to only breaking on specified fields (whitelisting instead of blacklisting). This avoids over-chunking every time we add a new metadata field, and is also simpler and easier to understand. This change in approach is discussed in more detail in #1790.

**Dropping regex-metadata Behavior**

Chunking this section:

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```

..should produce this `regex_metadata` on the single produced chunk:

```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```

but instead produced this:

```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```

which is the regex-metadata from the first element only.

The fix was to remove the consolidation-and-adjustment process from inside the "list-attribute-processing" loop (because regex-metadata is not a list) and process regex metadata separately.
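
A simplified sketch of that separate consolidation-and-adjustment step, consistent with the offsets in the example above (e.g. "dolor" shifting from 12..17 to 25..30); the helper name and signature here are illustrative, not the PR's actual code:

```python
from typing import Dict, List


def consolidate_regex_metadata(
    texts: List[str],
    regex_metas: List[Dict[str, List[dict]]],  # one mapping per element, parallel to texts
    separator: str = "\n\n",
) -> Dict[str, List[dict]]:
    chunk_regex_meta: Dict[str, List[dict]] = {}
    offset = 0  # -- chars of chunk text preceding the current element's text --
    for text, regex_meta in zip(texts, regex_metas):
        for regex_name, matches in regex_meta.items():
            for match in matches:
                adjusted = dict(match, start=match["start"] + offset, end=match["end"] + offset)
                chunk_regex_meta.setdefault(regex_name, []).append(adjusted)
        offset += len(text) + len(separator)
    return chunk_regex_meta
```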

## chore: adding max_characters to other element type chunking (#1673)

Commit: f98d5e65ca
This PR adds the `max_characters` (hard max) param to non-table element chunking. Additionally updates the `num_characters` metadata to `max_characters` to make it clearer which param we're referencing.

To test:

```
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)

for chunk in chunk_elements:
    print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>

## fix: chunking fails with detection_class_prob in metadata (#1637)

Commit: 9960ce5f00

## chore: Table chunking (#1540)

Commit: 1fb464235a
This change adds to our `add_chunking_strategy` logic so that we are able to chunk Table elements' `text` and `text_as_html` params. In order to keep the functionality under the same `by_title` chunking strategy, we have renamed `combine_under_n_chars` to `max_characters`. It functions the same way for combining elements under Titles, as well as specifying a chunk size (in chars) for `TableChunk` elements.

*Renaming the variable to `max_characters` will also reflect the 'hard max' we will implement for large elements in follow-up PRs.

Additionally, some lint changes snuck in when I ran `make tidy`, hence the minor changes in unrelated files :)

TODO:
✅ add unit tests --> note: added where I could! In some unit tests I just clarified that the chunking strategy was now 'by_title', because we don't have a file example that has Table elements to test the 'by_num_characters' chunking strategy.
✅ update changelog

To manually test:

```
In [1]: filename="example-docs/example-10k.html"

In [2]: from unstructured.chunking.title import chunk_table_element

In [3]: from unstructured.partition.auto import partition

In [4]: elements = partition(filename)

# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)

# examine text and text_as_html params
In [6]: for c in chunks:
            print(c.text)
            print(c.metadata.text_as_html)
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>

## fix: coordinates metadata hinders chunking (#1374)

Commit: bd33a52ee0
Closes https://github.com/Unstructured-IO/unstructured/issues/1373

This PR:

- drops the `coordinates` metadata field in `chunk_by_title` to fix https://github.com/Unstructured-IO/unstructured/issues/1373 (read the issue for details)
- adds a relevant test that checks this particular case

## chunk_by_title decorator (#1304)

Commit: c58b261feb
### Summary

Partial solution to #1185. Related to #1222. Creates a decorator from the `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The **chunking function does not split individual elements**, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element.

Combines sections under this condition:

- Sections under `combine_under_n_chars` characters are combined. The default is 500.

### Testing

```python
from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```

## feat: chunk elements based on titles (#1222)

Commit: f6a745a74f
### Summary

An initial pass on smart chunking for RAG applications. Breaks a document into sections based on the presence of `Title` elements. Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters. The default is `1500`. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The default is `500`.

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```