fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** Regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
This looks like a mis-remembering of the regex-metadata data structure,
combined with insufficient type-checking rigor (type-checker strictness
level set too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```
Observed behavior looked like this:
```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids over-chunking every time we add a new metadata field, and is
also simpler and easier to understand. This change in approach is
discussed in more detail in #1790.
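A minimal sketch of the whitelist approach, assuming a hypothetical
`_SECTION_BREAK_FIELDS` tuple and `_metadata_differs()` helper (the actual field
names and function live in the chunker and may differ):
```python
from unstructured.documents.elements import ElementMetadata

# Hypothetical whitelist: only these fields signal a semantic boundary. Any field not
# listed here (including regex_metadata) is ignored when deciding to start a new section.
_SECTION_BREAK_FIELDS = ("section", "page_number")


def _metadata_differs(meta1: ElementMetadata, meta2: ElementMetadata) -> bool:
    """True when the two metadatas differ on a field that should start a new section."""
    return any(
        getattr(meta1, field, None) != getattr(meta2, field, None)
        for field in _SECTION_BREAK_FIELDS
    )
```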
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```
...should produce this `regex_metadata` on the single chunk produced:
```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
This is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
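A minimal sketch of that separate regex-metadata consolidation, assuming the
`"\n\n"` separator used when joining element text; the function and constant names
are illustrative, not the actual helpers:
```python
from typing import Dict, List

from unstructured.documents.elements import Element, RegexMetadata

TEXT_SEPARATOR = "\n\n"  # assumption: the blank-line separator used to join element text


def consolidate_regex_metadata(elements: List[Element]) -> Dict[str, List[RegexMetadata]]:
    """Combine regex-metadata from all elements, shifting each match's offsets.

    Offsets are shifted by the length of the text (and separators) that precede the
    element in the concatenated chunk text.
    """
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0
    for element in elements:
        for regex_name, matches in (element.metadata.regex_metadata or {}).items():
            for match in matches:
                consolidated.setdefault(regex_name, []).append(
                    RegexMetadata(
                        text=match["text"],
                        start=match["start"] + offset,
                        end=match["end"] + offset,
                    )
                )
        offset += len(element.text) + len(TEXT_SEPARATOR)
    return consolidated
```
Run over the three elements above, this yields the expected `regex_metadata` shown in
the assertion (for example, "dolor" shifts from 12-17 to 25-30).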
# pyright: reportPrivateUsage=false

from typing import List

import pytest

from unstructured.chunking.title import (
fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length
**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.
When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.
Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.
**Example**
```python
elements: List[Element] = [
    Title("Chunking Priorities"),  # 19 chars
    ListItem("Divide text into manageable chunks"),  # 34 chars
    ListItem("Preserve semantic boundaries"),  # 28 chars
    ListItem("Minimize mid-text chunk-splitting"),  # 33 chars
]  # 114 chars total but 120 chars with separators

chunks = chunk_by_title(elements, max_characters=115)
```
Want:
```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
    ),
    CompositeElement("Minimize mid-text chunk-splitting"),
]
```
Got:
```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
        "\n\nMinimize mid-text chunk-spli"
    ),
    CompositeElement("tting"),
]
```
### Technical Summary
Because the sectioner does not consider separator (`"\n\n"`) length when
it computes the space remaining in the section, it over-populates the
section. When the chunker then concatenates the element text (each
separated by the separator), the text exceeds the window length and the
chunk must be split mid-text, even though there was a clean element
boundary it could have been split on.
### Fix
Consider separator length in the space-remaining computation.
The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object.
This removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(`List[Element]`, running text length, separator length, etc.) and allows
it to focus on the rules of when to start a new section.
This solution may seem like overkill at the moment, and indeed it would
be, except that it forms the foundation for adding section-level chunk
combination (fix: dissociated title elements) in the next PR. The
objects introduced here will gain several additional responsibilities in
the next few chunking PRs in the pipeline and will earn their place.
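A minimal sketch of the separator-aware computation described above, assuming the
`"\n\n"` separator; this illustrates the idea rather than reproducing the actual
`_TextSectionBuilder`:
```python
from typing import List

from unstructured.documents.elements import Text


class _SeparatorAwareSectionBuilder:
    """Accumulates element text and reports remaining space, separators included."""

    def __init__(self, maxlen: int = 500, separator: str = "\n\n") -> None:
        self._maxlen = maxlen
        self._separator = separator
        self._texts: List[str] = []

    @property
    def text_length(self) -> int:
        """Length of the section text as it will appear in the chunk."""
        return len(self._separator.join(self._texts))

    @property
    def space_remaining(self) -> int:
        """Characters available before the section would need a mid-text split.

        Includes the separator the next element would bring with it.
        """
        separator_cost = len(self._separator) if self._texts else 0
        return self._maxlen - self.text_length - separator_cost

    def add(self, element: Text) -> None:
        self._texts.append(element.text)
```
The sectioner would then start a new section whenever
`len(next_element.text) > builder.space_remaining`; in the example above, the fourth
element (33 chars) no longer fits because only 115 - 85 - 2 = 28 characters remain.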
    _NonTextSection,
fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles
**Executive Summary**. Section titles are often combined with the prior
section and then missing from the section they belong to.
_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
Chunking can be, and by default is, configured to combine sequential
small chunks that together fit within the full chunk window (default 500
chars).
Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.
Example:
```python
elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]
chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```
**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.
**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.
The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:
- `_TextSection` gets the `.combine()` method and `.text_length`
property which allows a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter" that accepts a stream of sections and
emits the same type, just with some of them resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`
(a sketch follows this list).
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length and doing the actual combining (calling
`_Section.combine(other_section)`) when instructed. Very similar in
concept to `_TextSectionBuilder`, just at the section level instead of
element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
filter between sectioner and chunker.
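A minimal sketch of that filter, using a stand-in section class rather than the
library's actual `_Section` types:
```python
from typing import Iterable, Iterator, List, Optional


class _TextSectionSketch:
    """Stand-in for a text section: the run of element texts destined for one chunk."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = texts

    @property
    def text_length(self) -> int:
        return len("\n\n".join(self.texts))

    def combine(self, other: "_TextSectionSketch") -> "_TextSectionSketch":
        return _TextSectionSketch(self.texts + other.texts)


def combine_sections(
    sections: Iterable[_TextSectionSketch], combine_under_n_chars: int = 500
) -> Iterator[_TextSectionSketch]:
    """Emit sections, merging successive ones that together fit under the limit."""
    accumulated: Optional[_TextSectionSketch] = None
    for section in sections:
        if accumulated is None:
            accumulated = section
        elif accumulated.text_length + 2 + section.text_length <= combine_under_n_chars:
            # the two sections, joined by a "\n\n" separator, still fit in the window
            accumulated = accumulated.combine(section)
        else:
            yield accumulated
            accumulated = section
    if accumulated is not None:
        yield accumulated
```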
    _SectionCombiner,
    _split_elements_by_title_and_table,
    _TableSection,
    _TextSection,
    _TextSectionAccumulator,
    _TextSectionBuilder,
    chunk_by_title,
)
from unstructured.documents.coordinates import CoordinateSystem
from unstructured.documents.elements import (
    CheckBox,
    CompositeElement,
    CoordinatesMetadata,
    Element,
    ElementMetadata,
    ListItem,
fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. List metadata such as `emphasized_text_contents`,
`emphasized_text_tags`, `link_texts` and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say same text "here" but different
URL) the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need five distinct strategies,
selecting which to apply to each field based on that field's provenance
and other characteristics.
The five strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_contents` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offsets of each match based on its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP, # -- ? confirm --
- `detection_origin`: DROP, # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP, # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,
**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches),
therefore
sent-from, sent-to, and subject will not change within a section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
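A minimal sketch of the apply step, assuming a hypothetical `ConsolidationStrategy`
enum and a small illustrative slice of the field-to-strategy table above (the `REGEX`
strategy, which also adjusts offsets, is handled separately):
```python
from enum import Enum
from typing import Any, Dict, List


class ConsolidationStrategy(Enum):
    """Hypothetical names for the per-field strategies described above."""

    FIRST = "first"
    LIST = "list"
    LIST_UNIQUE = "list_unique"
    DROP = "drop"


# Illustrative slice of the field -> strategy assignments.
_STRATEGY_BY_FIELD = {
    "filename": ConsolidationStrategy.FIRST,
    "link_texts": ConsolidationStrategy.LIST,
    "languages": ConsolidationStrategy.LIST_UNIQUE,
    "parent_id": ConsolidationStrategy.DROP,
}


def consolidate(all_meta: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Apply each field's strategy to its collected sequence of non-None values."""
    chunk_meta: Dict[str, Any] = {}
    for field, values in all_meta.items():
        strategy = _STRATEGY_BY_FIELD.get(field, ConsolidationStrategy.DROP)
        if strategy is ConsolidationStrategy.FIRST:
            chunk_meta[field] = values[0]
        elif strategy is ConsolidationStrategy.LIST:
            chunk_meta[field] = [item for value in values for item in value]
        elif strategy is ConsolidationStrategy.LIST_UNIQUE:
            chunk_meta[field] = list(dict.fromkeys(item for value in values for item in value))
        # DROP: the field is simply omitted from the chunk metadata
    return chunk_meta
```
Applied to the `all_meta` example above, this would produce
`{"filename": "memo.docx", "link_texts": ["here", "here", "and here"]}` and drop
`parent_id`.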
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two section types always have exactly one element and so
already have exactly one metadata).
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding a new `ElementMetadata` field breaks metadata consolidation
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each field
(recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
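A hedged sketch of option 1, reusing the hypothetical `_STRATEGY_BY_FIELD` mapping
from the earlier sketch and assuming `ElementMetadata` can be introspected as a
dataclass; the real test would compare against the actual strategy table:
```python
import dataclasses

from unstructured.documents.elements import ElementMetadata


def test_every_metadata_field_has_a_consolidation_strategy():
    """Fail before merge when a new ElementMetadata field has no strategy assigned."""
    # Assumption: ElementMetadata is a dataclass; adjust the introspection if it is not.
    field_names = {field.name for field in dataclasses.fields(ElementMetadata)}
    unassigned = field_names - set(_STRATEGY_BY_FIELD)  # mapping from the sketch above
    assert not unassigned, f"no consolidation strategy defined for: {sorted(unassigned)}"
```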
### Other Considerations
- If end-users can in future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document-section unconditionally starts a new
section.
*Consolidation strategy:* `FIRST` - use first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.
*Consolidation strategy:* `LIST` - concatenate lists across elements.
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.
### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurrence
- `page_name` - *strategy:* take first occurrence
- `page_number` - *strategy:* take first occurrence; these will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".
### N/A
These field types do not figure in metadata-consolidation:
- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell, is unreferenced in
source code. Should be removed from `ElementMetadata` as far as I can
tell.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will)
    PageBreak,
    RegexMetadata,
    Table,
    TableChunk,
    Text,
    Title,
)
from unstructured.partition.html import partition_html
fix: chunk_by_title() interface is rude (#1844)
### `chunk_by_title()` interface is "rude"
**Executive Summary.** Perhaps the most commonly specified option for
`chunk_by_title()` is `max_characters` (default: 500), which specifies
the chunk window size.
When a user specifies this value, they get an error message:
```python
>>> chunks = chunk_by_title(elements, max_characters=100)
ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.
```
A few of the things that might reasonably pass through a user's mind at
such a moment are:
* "Is `110` not a valid value for `max_characters`? Why would that be?"
* "I didn't specify a value for `combine_text_under_n_chars` or
`new_after_n_chars`, in fact I don't know what they are because I
haven't studied the documentation and would prefer not to; I just want
smaller chunks! How could I supply an invalid value when I haven't
supplied any value at all for these?"
* "Which of these values is the problem? Why are you making me figure
that out for myself? I'm sure the code knows which one is not valid, why
doesn't it share that information with me? I'm busy here!"
In this particular case, the problem is that
`combine_text_under_n_chars` (defaults to 500) is greater than
`max_characters`, which means it would never take effect (which is
actually not a problem in itself).
To fix this, after figuring out that this was the problem (probably
after opening an issue and maybe reading the source code), the user
would need to specify:
```python
>>> chunks = chunk_by_title(
... elements, max_characters=100, combine_text_under_n_chars=100
... )
```
This and other stressful user scenarios can be remedied by:
* Using "active" defaults for the `combine_text_under_n_chars` and
`new_after_n_chars` options.
* Providing a specific error message for each way a constraint may be
violated, such that direction to remedy the problem is immediately clear
to the user.
An *active default* is for example:
* Make the default for `combine_text_under_n_chars: int | None = None`
such that the code can detect when it has not been specified.
* When not specified, set its value to `max_characters`, the same as its
current (static) default.
This particular change would avoid the behavior in the motivating
example above.
Another alternative for this argument is simply:
```python
combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
```
### Fix
1. Add constraint-specific error messages.
2. Use "active" defaults for `combine_text_under_n_ chars` and
`new_after_n_chars`.
3. Improve docstring to describe active defaults, and explain other
argument behaviors, in particular identifying suppression options like
`combine_text_under_n_chars = 0` to disable chunk combining.
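A minimal sketch of how active defaults and constraint-specific messages could fit
together; the function name and return shape are illustrative, not the library's
actual signature:
```python
from typing import Optional, Tuple


def _resolve_chunking_args(
    max_characters: int = 500,
    combine_text_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
) -> Tuple[int, int, int]:
    """Apply "active" defaults, then validate with constraint-specific messages."""
    if max_characters <= 0:
        raise ValueError(f"'max_characters' argument must be > 0, got {max_characters}")

    # Active default: when not specified, follow max_characters instead of a fixed 500.
    if combine_text_under_n_chars is None:
        combine_text_under_n_chars = max_characters
    if combine_text_under_n_chars < 0:
        raise ValueError(
            f"'combine_text_under_n_chars' argument must be >= 0,"
            f" got {combine_text_under_n_chars}"
        )
    # A value above the window is behaviorally the same as the window itself, so cap it.
    combine_text_under_n_chars = min(combine_text_under_n_chars, max_characters)

    if new_after_n_chars is None:
        new_after_n_chars = max_characters
    if new_after_n_chars < 0:
        raise ValueError(f"'new_after_n_chars' argument must be >= 0, got {new_after_n_chars}")
    new_after_n_chars = min(new_after_n_chars, max_characters)

    return max_characters, combine_text_under_n_chars, new_after_n_chars
```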
# == chunk_by_title() validation behaviors =======================================================


@pytest.mark.parametrize("max_characters", [0, -1, -42])
def test_it_rejects_max_characters_not_greater_than_zero(max_characters: int):
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    with pytest.raises(
        ValueError,
        match=f"'max_characters' argument must be > 0, got {max_characters}",
    ):
        chunk_by_title(elements, max_characters=max_characters)


def test_it_does_not_complain_when_specifying_max_characters_by_itself():
    """Caller can specify `max_characters` arg without specifying any others.

    In particular, when `combine_text_under_n_chars` is not specified it defaults to the value of
    `max_characters`; it has no fixed default value that can be greater than `max_characters` and
    trigger an exception.
    """
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    try:
        chunk_by_title(elements, max_characters=50)
    except ValueError:
        pytest.fail("did not accept `max_characters` as option by itself")


@pytest.mark.parametrize("n_chars", [-1, -42])
def test_it_rejects_combine_text_under_n_chars_for_n_less_than_zero(n_chars: int):
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    with pytest.raises(
        ValueError,
        match=f"'combine_text_under_n_chars' argument must be >= 0, got {n_chars}",
    ):
        chunk_by_title(elements, combine_text_under_n_chars=n_chars)


def test_it_accepts_0_for_combine_text_under_n_chars_to_disable_chunk_combining():
    """Specifying `combine_text_under_n_chars=0` is how a caller disables chunk-combining."""
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    chunks = chunk_by_title(elements, max_characters=50, combine_text_under_n_chars=0)

    assert chunks == [CompositeElement("Lorem ipsum dolor.")]


def test_it_does_not_complain_when_specifying_combine_text_under_n_chars_by_itself():
    """Caller can specify `combine_text_under_n_chars` arg without specifying any other options."""
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    try:
        chunk_by_title(elements, combine_text_under_n_chars=50)
    except ValueError:
        pytest.fail("did not accept `combine_text_under_n_chars` as option by itself")


def test_it_silently_accepts_combine_text_under_n_chars_greater_than_maxchars():
    """`combine_text_under_n_chars` > `max_characters` doesn't affect chunking behavior.

    So rather than raising an exception or warning, we just cap that value at `max_characters`
    which is the behavioral equivalent.
    """
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    try:
        chunk_by_title(elements, max_characters=500, combine_text_under_n_chars=600)
    except ValueError:
        pytest.fail("did not accept `combine_text_under_n_chars` greater than `max_characters`")


@pytest.mark.parametrize("n_chars", [-1, -42])
def test_it_rejects_new_after_n_chars_for_n_less_than_zero(n_chars: int):
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    with pytest.raises(
        ValueError,
        match=f"'new_after_n_chars' argument must be >= 0, got {n_chars}",
    ):
        chunk_by_title(elements, new_after_n_chars=n_chars)

def test_it_does_not_complain_when_specifying_new_after_n_chars_by_itself():
    """Caller can specify `new_after_n_chars` arg without specifying any other options.

    In particular, `combine_text_under_n_chars` value is adjusted down to the `new_after_n_chars`
    value when the default for `combine_text_under_n_chars` exceeds the value of
    `new_after_n_chars`.
    """
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    try:
        chunk_by_title(elements, new_after_n_chars=50)
    except ValueError:
        pytest.fail("did not accept `new_after_n_chars` as option by itself")

def test_it_accepts_0_for_new_after_n_chars_to_put_each_element_into_its_own_chunk():
    """Specifying `new_after_n_chars=0` places each element into its own section.

    This puts each element into its own chunk, although long chunks are still split.
    """
    elements: List[Element] = [
        Text("Lorem"),
        Text("ipsum"),
        Text("dolor"),
    ]

    chunks = chunk_by_title(elements, max_characters=50, new_after_n_chars=0)

    assert chunks == [
        CompositeElement("Lorem"),
        CompositeElement("ipsum"),
        CompositeElement("dolor"),
    ]

def test_it_silently_accepts_new_after_n_chars_greater_than_maxchars():
    """`new_after_n_chars` > `max_characters` doesn't affect chunking behavior.

    So rather than raising an exception or warning, we just cap that value at `max_characters`
    which is the behavioral equivalent.
    """
    elements: List[Element] = [Text("Lorem ipsum dolor.")]

    try:
        chunk_by_title(elements, max_characters=500, new_after_n_chars=600)
    except ValueError:
        pytest.fail("did not accept `new_after_n_chars` greater than `max_characters`")

# ================================================================================================
fix: split-chunks appear out-of-order (#1824)
**Executive Summary.** Code inspection in preparation for adding the
chunk-overlap feature revealed a bug causing split-chunks to be inserted
out-of-order. For example, elements like this:
```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```
Should produce chunks:
```
CompositeElement("One ...") # (400 chars)
CompositeElement("Two ...") # (400 chars)
CompositeElement("Three ...") # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("Four") # (400 chars)
CompositeElement("Five ...") # (500 chars)
CompositeElement("rest of Five ...") # (100 chars)
```
but produced this instead:
```
CompositeElement("Five ...") # (500 chars)
CompositeElement("rest of Five ...") # (100 chars)
CompositeElement("Three ...") # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("One ...") # (400 chars)
CompositeElement("Two ...") # (400 chars)
CompositeElement("Four") # (400 chars)
```
This PR fixes that behavior that was introduced on Oct 9 this year in
commit: f98d5e65 when adding chunk splitting.
**Technical Summary**
The essential transformation of chunking is:
```
elements sections chunks
List[Element] -> List[List[Element]] -> List[CompositeElement]
```
1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_
semantically-related elements into _sections_ (`List[Element]`), in the
best case, that would be a title (heading) and the text that follows it
(until the next title). A heading and its text is often referred to as a
_section_ in publishing parlance, hence the name.
2. The _chunker_ (`chunk_by_title()` currently) does two things:
1. first it _consolidates_ the elements of each section into a single
`CompositeElement` object (a "chunk"). This includes both joining the
element text into a single string as well as consolidating the metadata
of the section elements.
2. then if necessary it _splits_ the chunk into two or more
`CompositeElement` objects when the consolidated text is too long to
fit in the specified window (`max_characters`).
Chunk splitting is only required when a single element (like a big
paragraph) has text longer than the specified window. Otherwise a
section and the chunk that derives from it reflects an even element
boundary.
`chunk_by_title()` was elaborated in commit f98d5e65 to add this
"chunk-splitting" behavior.
At the time there was some notion of wanting to "split from the end
backward" such that any small remainder chunk would appear first, and
could possibly be combined with a small prior chunk. To accomplish this,
split chunks were _inserted_ at the beginning of the list instead of
_appended_ to the end.
The `chunked_elements` variable (`List[CompositeElement]`) holds the
sequence of chunks that result from the chunking operation and is the
returned value for `chunk_by_title()`. This was the list
"split-from-the-end" chunks were inserted at the beginning of and that
unfortunately produces this out-of-order behavior because the insertion
was at the beginning of this "all-chunks-in-document" list, not a
sublist just for this chunk.
Further, the "split-from-the-end" behavior can produce no benefit
because chunks are never combined, only _elements_ are combined (across
semantic boundaries into a single section when a section is small) and
sectioning occurs _prior_ to chunking.
The fix is to rework the chunk-splitting passage to a straightforward
iterative algorithm that works both when a chunk must be split and when
it doesn't. This algorithm is also very easily extended to implement
split-chunk-overlap which is coming up in an immediately following PR.
```python
# -- split chunk into CompositeElement objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len
while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```
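For illustration, here is a self-contained sketch of that append-based loop; the `split_chunk_text` name and the `overlap=0` default are assumptions for this example (overlap arrives in a later PR), not the library's API. Because each piece is appended, split chunks come out in document order:
```python
from typing import List


def split_chunk_text(text: str, max_characters: int = 500, overlap: int = 0) -> List[str]:
    """Split `text` into pieces no longer than `max_characters`, in document order.

    `overlap` is assumed to be smaller than `max_characters`.
    """
    pieces: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        pieces.append(text[start:end])  # -- appending keeps pieces in document order --
        start = end - overlap if end < len(text) else end
    return pieces


# a 600-char text with a 500-char window yields the 500-char head, then the 100-char remainder
assert split_chunk_text("x" * 600, max_characters=500) == ["x" * 500, "x" * 100]
```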
*Forensic analysis*
The out-of-order-chunks behavior was introduced in commit f98d5e65 on
10/09/2023 in the same PR in which chunk-splitting was introduced.
---------
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>

def test_it_splits_a_large_section_into_multiple_chunks():
    elements: List[Element] = [
        Title("Introduction"),
        Text(
            "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
            " porta volutpat.",
        ),
    ]

    chunks = chunk_by_title(elements, max_characters=50)

    assert chunks == [
        CompositeElement("Introduction"),
        CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing "),
        CompositeElement("elit. In rhoncus ipsum sed lectus porta volutpat."),
    ]


def test_split_elements_by_title_and_table():
    elements: List[Element] = [
        Title("A Great Day"),
        Text("Today is a great day."),
        Text("It is sunny outside."),
        Table("Heading\nCell text"),
        Title("An Okay Day"),
        Text("Today is an okay day."),
        Text("It is rainy outside."),
        Title("A Bad Day"),
        Text("Today is a bad day."),
        Text("It is storming outside."),
        CheckBox(),
    ]

fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length
**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.
When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.
Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.
**Example**
```python
elements: List[Element] = [
    Title("Chunking Priorities"),                    # 19 chars
    ListItem("Divide text into manageable chunks"),  # 34 chars
    ListItem("Preserve semantic boundaries"),        # 28 chars
    ListItem("Minimize mid-text chunk-splitting"),   # 33 chars
]  # 114 chars total but 120 chars with separators

chunks = chunk_by_title(elements, max_characters=115)
```
Want:
```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
    ),
    CompositeElement("Minimize mid-text chunk-splitting"),
]
```
Got:
```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
        "\n\nMinimize mid-text chunk-spli"
    ),
    CompositeElement("tting"),
]
```
### Technical Summary
Because the sectioner does not consider separator (`"\n\n"`) length when
it computes the space remaining in the section, it over-populates the
section and when the chunker concatenates the element text (each
separated by the separator) the text exceeds the window length and the
chunk must be split mid-text, even though there was an even element
boundary it could have been split on.
### Fix
Consider separator length in the space-remaining computation.
The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object
which removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(List[Element], running text length, separator length, etc.) and allows
it to focus on the rules of when to start a new section.
This solution may seem like overkill at the moment, and indeed it would
be, except that it forms the foundation for adding section-level chunk
combination (fix: dissociated title elements) in the next PR. The
objects introduced here will gain several additional responsibilities in
the next few chunking PRs in the pipeline and will earn their place.
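To make the separator accounting concrete, here is a minimal, self-contained sketch; the class and method names below are illustrative stand-ins, not the actual `_TextSectionBuilder` API, and it tracks plain strings rather than elements:
```python
from typing import List

TEXT_SEPARATOR = "\n\n"


class SectionAccumulator:
    """Tracks section text length including the separators that joining will add."""

    def __init__(self, maxlen: int = 500) -> None:
        self._maxlen = maxlen
        self._texts: List[str] = []

    @property
    def text_length(self) -> int:
        # -- joined length = element text lengths plus one separator between each pair --
        separators = len(TEXT_SEPARATOR) * max(len(self._texts) - 1, 0)
        return sum(len(t) for t in self._texts) + separators

    @property
    def space_remaining(self) -> int:
        # -- a newly added element also costs one separator when the section is not empty --
        used = self.text_length + (len(TEXT_SEPARATOR) if self._texts else 0)
        return self._maxlen - used

    def will_fit(self, text: str) -> bool:
        return len(text) <= self.space_remaining

    def add(self, text: str) -> None:
        self._texts.append(text)


# mirrors the example above: three texts (81 chars + 4 separator chars) fit in 115,
# but the fourth would not once its separator is counted, so a new section starts.
acc = SectionAccumulator(maxlen=115)
for text in [
    "Chunking Priorities",
    "Divide text into manageable chunks",
    "Preserve semantic boundaries",
]:
    assert acc.will_fit(text)
    acc.add(text)
assert not acc.will_fit("Minimize mid-text chunk-splitting")
```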
    sections = _split_elements_by_title_and_table(
        elements,
        multipage_sections=True,
        new_after_n_chars=500,
        max_characters=500,
    )

    section = next(sections)
    assert isinstance(section, _TextSection)
fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. List metadata such as `emphasized_text_contents`,
`emphasized_text_tags`, `link_texts`, and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` item corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say, the same text "here" but a different
URL), the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need five distinct strategies,
selecting which to apply to each field based on that field's provenance
and other characteristics.
The five strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_contents` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based on its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST
- `category_depth`: DROP
- `coordinates`: DROP
- `data_source`: FIRST
- `detection_class_prob`: DROP  # -- ? confirm --
- `detection_origin`: DROP  # -- ? confirm --
- `emphasized_text_contents`: LIST
- `emphasized_text_tags`: LIST
- `file_directory`: FIRST
- `filename`: FIRST
- `filetype`: FIRST
- `header_footer_type`: DROP
- `image_path`: DROP
- `is_continuation`: DROP  # -- not expected, added by chunking, not before --
- `languages`: LIST_UNIQUE
- `last_modified`: FIRST
- `link_texts`: LIST
- `link_urls`: LIST
- `links`: DROP  # -- deprecated field --
- `max_characters`: DROP  # -- unused in code, probably remove from ElementMetadata --
- `page_name`: FIRST
- `page_number`: FIRST
- `parent_id`: DROP
- `regex_metadata`: REGEX
- `section`: FIRST  # -- section unconditionally breaks on new section --
- `sent_from`: FIRST
- `sent_to`: FIRST
- `subject`: FIRST
- `text_as_html`: DROP  # -- not expected, only occurs in TableSection --
- `url`: FIRST
**Assumptions:**
- Each .eml file is partitioned->chunked separately (not in batches), therefore
  sent-from, sent-to, and subject will not change within a section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
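As a rough illustration of the collect-then-apply shape (not the actual implementation), the apply step might look something like this, shown for a hypothetical three-field strategy mapping and a minimal subset of the strategies:
```python
from enum import Enum
from typing import Any, Dict, List


class CS(Enum):
    """Minimal subset of the consolidation strategies described above."""
    FIRST = "first"
    LIST = "list"
    DROP = "drop"


# -- illustrative assignments only; see the full proposed mapping above --
STRATEGIES: Dict[str, CS] = {"filename": CS.FIRST, "link_texts": CS.LIST, "parent_id": CS.DROP}


def consolidate(all_meta: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Apply each field's strategy to the values collected from the section's elements."""
    chunk_meta: Dict[str, Any] = {}
    for field, values in all_meta.items():
        strategy = STRATEGIES[field]
        if strategy is CS.FIRST:
            chunk_meta[field] = values[0]
        elif strategy is CS.LIST:
            chunk_meta[field] = [item for value in values for item in value]
        # -- CS.DROP: omit the field from the chunk metadata entirely --
    return chunk_meta


all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
assert consolidate(all_meta) == {
    "filename": "memo.docx",
    "link_texts": ["here", "here", "and here"],
}
```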
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two types always have exactly one element and so already have
exactly one metadata).
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding new `ElementMetadata` field breaks metadata
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each
(Recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
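A sketch of the kind of guard test option 1 describes; the `FakeElementMetadata` dataclass and `CONSOLIDATION_STRATEGIES` mapping below are stand-ins for illustration, not the library's real names:
```python
import dataclasses
from typing import Dict


# -- hypothetical stand-ins: the real ElementMetadata and strategy mapping live elsewhere --
@dataclasses.dataclass
class FakeElementMetadata:
    filename: str = ""
    link_texts: list = dataclasses.field(default_factory=list)
    parent_id: str = ""


CONSOLIDATION_STRATEGIES: Dict[str, str] = {
    "filename": "FIRST",
    "link_texts": "LIST",
    "parent_id": "DROP",
}


def test_every_metadata_field_has_a_consolidation_strategy():
    """Fails at merge time if a new metadata field is added without a strategy."""
    field_names = {f.name for f in dataclasses.fields(FakeElementMetadata)}
    missing = field_names - set(CONSOLIDATION_STRATEGIES)
    assert not missing, f"no consolidation strategy defined for: {sorted(missing)}"
```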
### Other Considerations
- If end-users can in-future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document-section unconditionally starts a new
section.
*Consolidation strategy:* `FIRST` - use first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.
*Consolidation strategy:* `LIST` - concatenate lists across elements.
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.
### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurrence
- `page_name` - *strategy:* take first occurrence
- `page_number` - *strategy:* take first occurrence, will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".
### N/A
These field types do not figure in metadata-consolidation:
- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell, is unreferenced in
source code. Should be removed from `ElementMetadata` as far as I can
tell.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will)
    assert section._elements == [
Title("A Great Day"),
|
|
|
|
Text("Today is a great day."),
|
|
|
|
Text("It is sunny outside."),
|
|
|
|
]
|
|
|
|
# --
|
|
|
|
    section = next(sections)
    assert isinstance(section, _TableSection)
    assert section._table == Table("Heading\nCell text")
    # ==
    section = next(sections)
    assert isinstance(section, _TextSection)
fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. Second, list metadata such as `emphasized_text_content`,
`emphasized_text_tags`, `link_texts` and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say same text "here" but different
URL) the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need four distinct strategies,
selecting which to apply to each field based on that fields provenance
and other characteristics.
The four strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_content` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP, # -- ? confirm --
- `detection_origin`: DROP, # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP, # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,
**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches),
therefore
sent-from, sent-to, and subject will not change within a section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
"filename": ["memo.docx", "memo.docx"]
"link_texts": [["here", "here"], ["and here"]]
"parent_id": ["f273a7cb", "808b4ced"]
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two types always have exactly one element so already exactly
one metadata.)
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding new `ElementMetadata` field breaks metadata
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each
(Recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
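A sketch of that option-1 guard-rail test might look like the following; the `CONSOLIDATION_STRATEGIES` import is a hypothetical name for wherever the field-to-strategy mapping ends up living, and the test assumes `ElementMetadata` is a dataclass whose fields can be enumerated:
```python
import dataclasses

from unstructured.documents.elements import ElementMetadata

# -- hypothetical import; the real mapping lives wherever the chunker defines it --
from unstructured.chunking.title import CONSOLIDATION_STRATEGIES


def test_every_metadata_field_has_a_consolidation_strategy():
    # -- assumes ElementMetadata is a dataclass; adapt if fields are enumerated differently --
    field_names = {field.name for field in dataclasses.fields(ElementMetadata)}
    unmapped = field_names - set(CONSOLIDATION_STRATEGIES)
    assert not unmapped, f"no consolidation strategy defined for: {sorted(unmapped)}"
```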
### Other Considerations
- If end-users can, in future, add arbitrary metadata fields _before_ chunking, then we'll need to define metadata-consolidation behavior for such fields. Depending on how we implement user-defined metadata fields we might:
  - Require explicit definition of a new metadata field before use, perhaps with a method like `ElementMetadata.add_custom_field()` which requires a consolidation strategy to be defined (and/or has a default value); a hypothetical sketch follows this list.
  - Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the field is type `list`.
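Purely to make the first option concrete, a hypothetical registration API could look something like this (nothing like it exists in the library today):
```python
from typing import Dict

# -- purely hypothetical shape; no such registration API exists today --
CUSTOM_FIELD_STRATEGIES: Dict[str, str] = {}


def add_custom_field(name: str, consolidation_strategy: str = "DROP") -> None:
    """Declare an end-user metadata field along with how chunking should consolidate it."""
    CUSTOM_FIELD_STRATEGIES[name] = consolidation_strategy


add_custom_field("coefficient", consolidation_strategy="FIRST")
```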
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document-section unconditionally starts a new section.
*Consolidation strategy:* `FIRST` - use first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets its match offsets adjusted too (see the sketch below).
*Consolidation strategy:* `LIST` - concatenate lists across elements.
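A minimal sketch of that special case, working on plain dicts rather than real elements: each element's matches are appended in element order, and their `start`/`end` offsets are shifted by the length of the text (plus separator) that precedes that element in the chunk.
```python
from typing import Any, Dict, List, TypedDict


class RegexMetadata(TypedDict):
    text: str
    start: int
    end: int


def consolidate_regex_metadata(
    elements: List[Dict[str, Any]],  # each like {"text": ..., "regex_metadata": {...}}
    separator: str = "\n\n",
) -> Dict[str, List[RegexMetadata]]:
    """Concatenate per-element regex matches, shifting offsets into the chunk text."""
    chunk_regex: Dict[str, List[RegexMetadata]] = {}
    offset = 0
    for element in elements:
        for name, matches in (element.get("regex_metadata") or {}).items():
            for match in matches:
                chunk_regex.setdefault(name, []).append(
                    RegexMetadata(
                        text=match["text"],
                        start=match["start"] + offset,
                        end=match["end"] + offset,
                    )
                )
        offset += len(element["text"]) + len(separator)
    return chunk_regex
```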
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider keeping a value when it is the same across all elements, perhaps via an `ALL` strategy.
### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurrence
- `page_name` - *strategy:* take first occurrence
- `page_number` - *strategy:* take first occurrence; these will all be the same when `multipage_sections` is `False`. Worst-case semantics are "this chunk began on this page".
### N/A
These field types do not figure in metadata-consolidation:
- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - unused as far as I can tell; it is unreferenced in source code and should probably be removed from `ElementMetadata`.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will)
2023-11-01 18:49:20 -07:00
    assert section._elements == [
fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length
**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.
When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.
Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.
**Example**
```python
elements: List[Element] = [
    Title("Chunking Priorities"),  # 19 chars
    ListItem("Divide text into manageable chunks"),  # 34 chars
    ListItem("Preserve semantic boundaries"),  # 28 chars
    ListItem("Minimize mid-text chunk-splitting"),  # 33 chars
]  # 114 chars total but 120 chars with separators
chunks = chunk_by_title(elements, max_characters=115)
```
Want:
```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
    ),
    CompositeElement("Minimize mid-text chunk-splitting"),
]
```
Got:
```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
        "\n\nMinimize mid-text chunk-spli"
    ),
    CompositeElement("tting"),
]
```
### Technical Summary
Because the sectioner does not consider separator (`"\n\n"`) length when it computes the space remaining in the section, it over-populates the section. When the chunker then concatenates the element text (each separated by the separator), the text exceeds the window length and the chunk must be split mid-text, even though there was an even element boundary it could have been split on.
### Fix
Consider separator length in the space-remaining computation.
The solution here extracts both the `section.text_length` and `section.space_remaining` computations to a `_TextSectionBuilder` object (sketched below). This removes the need for the sectioner (`_split_elements_by_title_and_table()`) to deal with primitives (`List[Element]`, running text length, separator length, etc.) and allows it to focus on the rules of when to start a new section.
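A simplified sketch of the accounting such an object needs (not the actual `_TextSectionBuilder` code): the length of a would-be section includes one separator per element boundary, and the space remaining must also reserve the separator the next element would introduce.
```python
from typing import List

SEPARATOR = "\n\n"


class TextSectionBuilder:
    """Sketch only: accumulate element texts and report remaining space, separators included."""

    def __init__(self, maxlen: int = 500) -> None:
        self._maxlen = maxlen
        self._texts: List[str] = []

    def add_text(self, text: str) -> None:
        self._texts.append(text)

    @property
    def text_length(self) -> int:
        """Length of the text this section would produce, including one separator per boundary."""
        lengths = [len(text) for text in self._texts]
        if not lengths:
            return 0
        return sum(lengths) + len(SEPARATOR) * (len(lengths) - 1)

    @property
    def space_remaining(self) -> int:
        """Characters still available; the next element also costs one separator."""
        separator_cost = len(SEPARATOR) if self._texts else 0
        return self._maxlen - self.text_length - separator_cost
```
With `maxlen=115` and the example elements above, the builder reports 28 characters remaining after the first three texts, so the 33-character list item starts a new section and the "Want" output is produced without any mid-text split.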
This solution may seem like overkill at the moment, and indeed it would be, except that it forms the foundation for adding section-level chunk combination (fix: dissociated title elements) in the next PR. The objects introduced here will gain several additional responsibilities in the next few chunking PRs in the pipeline and will earn their place.
2023-10-26 14:34:15 -07:00
        Title("An Okay Day"),
        Text("Today is an okay day."),
        Text("It is rainy outside."),
    ]

    # --
    section = next(sections)
    assert isinstance(section, _TextSection)
    assert section._elements == [
        Title("A Bad Day"),
        Text("Today is a bad day."),
        Text("It is storming outside."),
    ]
    # --
    section = next(sections)
    assert isinstance(section, _NonTextSection)

    # --
    with pytest.raises(StopIteration):
        next(sections)


def test_chunk_by_title():
    elements: List[Element] = [
        Title("A Great Day", metadata=ElementMetadata(emphasized_text_contents=["Day"])),
        Text("Today is a great day.", metadata=ElementMetadata(emphasized_text_contents=["day"])),
        Text("It is sunny outside."),
        Table("Heading\nCell text"),
        Title("An Okay Day"),
        Text("Today is an okay day."),
        Text("It is rainy outside."),
        Title("A Bad Day"),
        Text(
            "Today is a bad day.",
            metadata=ElementMetadata(
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary
Table OCR refactor: move the OCR part of the table model from the inference repo to the unst repo.
* Before this PR, the table model extracts OCR tokens (text plus bounding box) and fills the tokens into the table structure in the inference repo. This means we need to do an additional OCR pass for tables.
* After this PR, we use the OCR data from the entire-page OCR and pass the OCR tokens to the inference repo, which means we only do one OCR pass for the entire document.
**Tech details:**
* Combined the env vars `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this means we use the same OCR agent for the entire page and for tables, since we only do one OCR pass.
* Bumped the inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unst repo. Please see the companion
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint fixes were made by `make tidy`.
* This PR also fixes this
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564);
I've added a test for it in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`.
* Added the same image-scaling logic as the [previous table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but scaling is now applied to the entire image.
### Test
* There is not much to test manually except that table extraction still works.
* Due to the scaling change and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in table output. Here is a comparison of test outputs from the same test,
`test_partition_image_with_table_extraction`:
screenshot of the table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">
### TODO
(added as a ticket) There is still some cleanup to do in the inference repo, since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR-related in inference, these items are deprecated and can be removed:
* [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code)
* the `extract_tables` parameter in inference
* [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
* [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* the `TABLE_OCR` env var
### Note
If we want to fall back to an additional table OCR pass (we may need this to use paddle for tables), we need to:
* pass `infer_table_structure` to inference with the `extract_tables` parameter
* stop passing `infer_table_structure` to `ocr.py`
---------
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-20 20:24:23 -04:00
                regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
            ),
        ),
        Text("It is storming outside."),
        CheckBox(),
    ]
Dynamic ElementMetadata implementation (#2043)
### Executive Summary
The structure of element metadata is currently static, meaning only
predefined fields can appear in the metadata. We would like the
flexibility for end-users, at their own discretion, to define and use
additional metadata fields that make sense for their particular
use-case.
### Concepts
A key concept for dynamic metadata is _known field_. A known-field is
one of those explicitly defined on `ElementMetadata`. Each of these has
a type and can be specified when _constructing_ a new `ElementMetadata`
instance. This is in contrast to an _end-user defined_ (or _ad-hoc_)
metadata field, one not known at "compile" time and added at the
discretion of an end-user to suit the purposes of their application.
An ad-hoc field can only be added by _assignment_ on an already
constructed instance.
### End-user ad-hoc metadata field behaviors
An ad-hoc field can be added to an `ElementMetadata` instance by
assignment:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```
A field added in this way can be accessed by name:
```python
>>> metadata.coefficient
0.536
```
and that field will appear in the JSON/dict for that instance:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```
However, accessing a "user-defined" value that has _not_ been assigned
on that instance raises `AttributeError`:
```python
>>> metadata.coeffcient # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```
This makes "tagging" a metadata item with a value very convenient, but
entails the proviso that if an end-user wants to add a metadata field to
_some_ elements and not others (sparse population), AND they want to
access that field by name on ANY element and receive `None` where it has
not been assigned, they will need to use an expression like this:
```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
```
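Assuming an unassigned ad-hoc field raises `AttributeError` as described above, an equivalent and slightly terser spelling is `getattr(metadata, "coefficient", None)`.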
### Implementation Notes
- **ad-hoc metadata fields** are discarded during consolidation (for
chunking) because we don't have a consolidation strategy defined for
those. We could consider using a default consolidation strategy like
`FIRST` or possibly allow a user to register a strategy (although that
gets hairy in non-private and multiple-memory-space situations.)
- ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields
that might appear in a JSON/dict loaded using `.from_dict()`, so unlike
the original (which only loaded known-fields), we'll rehydrate anything
that we find there.
- No real type-safety is possible on ad-hoc fields but the type-checker
does not complain because the type of all ad-hoc fields is `Any` (which
is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc
fields to "sub" metadata objects too, like `DataSourceMetadata` and
conceivably `CoordinatesMetadata` (although I'm not immediately seeing a
use-case for the second one).
2023-11-15 13:22:15 -08:00
    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)

    assert chunks == [
        CompositeElement(
            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
        ),
        Table("Heading\nCell text"),
        CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
        CompositeElement(
            "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
        ),
        CheckBox(),
    ]
    assert chunks[0].metadata == ElementMetadata(emphasized_text_contents=["Day", "day"])
    assert chunks[3].metadata == ElementMetadata(
        regex_metadata={"a": [RegexMetadata(text="A", start=11, end=12)]},
    )


def test_chunk_by_title_respects_section_change():
    elements: List[Element] = [
        Title("A Great Day", metadata=ElementMetadata(section="first")),
        Text("Today is a great day.", metadata=ElementMetadata(section="second")),
        Text("It is sunny outside.", metadata=ElementMetadata(section="second")),
        Table("Heading\nCell text"),
        Title("An Okay Day"),
        Text("Today is an okay day."),
        Text("It is rainy outside."),
        Title("A Bad Day"),
        Text(
            "Today is a bad day.",
|
|
|
            metadata=ElementMetadata(
regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
            ),
        ),
        Text("It is storming outside."),
        CheckBox(),
    ]
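
    # combine_text_under_n_chars=0 keeps small neighboring sections from being merged,
    # so each section boundary below produces its own chunk.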
    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
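    # Expected: the section-metadata change isolates the "A Great Day" title, each later Title
    # starts a new chunk, and the Table and CheckBox elements pass through as standalone chunks.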

    assert chunks == [
        CompositeElement(
            "A Great Day",
        ),
        CompositeElement(
            "Today is a great day.\n\nIt is sunny outside.",
        ),
        Table("Heading\nCell text"),
        CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
        CompositeElement(
            "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
        ),
        CheckBox(),
    ]
def test_chunk_by_title_separates_by_page_number():
    elements: List[Element] = [
        Title("A Great Day", metadata=ElementMetadata(page_number=1)),
        Text("Today is a great day.", metadata=ElementMetadata(page_number=2)),
        Text("It is sunny outside.", metadata=ElementMetadata(page_number=2)),
        Table("Heading\nCell text"),
        Title("An Okay Day"),
        Text("Today is an okay day."),
        Text("It is rainy outside."),
        Title("A Bad Day"),
        Text(
            "Today is a bad day.",
            metadata=ElementMetadata(
regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
            ),
        ),
        Text("It is storming outside."),
        CheckBox(),
    ]
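
    # multipage_sections=False makes a page-number change start a new section, and
    # combine_text_under_n_chars=0 keeps the resulting small chunks from being merged back together.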
    chunks = chunk_by_title(elements, multipage_sections=False, combine_text_under_n_chars=0)
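    # Expected: "A Great Day" (page 1) is separated from the page-2 text that follows it;
    # otherwise the chunks match the section-metadata case above.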

    assert chunks == [
        CompositeElement(
            "A Great Day",
        ),
        CompositeElement(
            "Today is a great day.\n\nIt is sunny outside.",
        ),
        Table("Heading\nCell text"),
        CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
        CompositeElement(
            "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
        ),
        CheckBox(),
    ]
def test_chunk_by_title_does_not_break_on_regex_metadata_change():
    """Sectioner is insensitive to regex-metadata changes.

    A regex-metadata match in an element does not signify a semantic boundary and a section should
    not be split based on such a difference.
    """
    elements: List[Element] = [
        Title(
            "Lorem Ipsum",
            metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]},
            ),
        ),
        Text(
            "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
            metadata=ElementMetadata(
regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]},
            ),
        ),
        Text(
            "In rhoncus ipsum sed lectus porta volutpat.",
            metadata=ElementMetadata(
|
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary
Table OCR refactor, move the OCR part for table model in inference repo
to unst repo.
* Before this PR, table model extracts OCR tokens with texts and
bounding box and fills the tokens to the table structure in inference
repo. This means we need to do an additional OCR for tables.
* After this PR, we use the OCR data from entire page OCR and pass the
OCR tokens to inference repo, which means we only do one OCR for the
entire document.
**Tech details:**
* Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` to `OCR_AGENT`, this
means we use the same OCR agent for entire page and tables since we only
do one OCR.
* Bump inference repo to `0.7.9`, which allow table model in inference
to use pre-computed OCR data from unst repo. Please check in
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebooks lint are made by `make tidy`
* This PR also fixes
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564),
I've added test for the issue in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`
* Add same scaling logic to image [similar to previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but now scaling is applied to entire image
### Test
* Not much to manually testing expect table extraction still works
* But due to change on scaling and use pre-computed OCR data from entire
page, there are some slight (better) changes on table output, here is an
comparison on test outputs i found from the same test
`test_partition_image_with_table_extraction`:
screen shot for table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">
### TODO
(added as a ticket) Still have some clean up to do in inference repo
since now unst repo have duplicate logic, but can keep them as a fall
back plan. If we want to remove anything OCR related in inference, here
are items that is deprecated and can be removed:
*
[`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
*
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
*
[`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR`
### Note
if we want to fallback for an additional table OCR (may need this for
using paddle for table), we need to:
* pass `infer_table_structure` to inference with `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`
---------
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-20 20:24:23 -04:00
                regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]},
            ),
        ),
    ]

    chunks = chunk_by_title(elements)

    assert chunks == [
        CompositeElement(
            "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
            " ipsum sed lectus porta volutpat.",
        ),
    ]
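    # Note: all three elements land in a single CompositeElement; a difference in
    # regex_metadata between consecutive elements is not treated as a section boundary.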


def test_chunk_by_title_consolidates_and_adjusts_offsets_of_regex_metadata():
    """ElementMetadata.regex_metadata of chunk is union of regex_metadatas of its elements.

    The `start` and `end` offsets of each regex-match are adjusted to reflect their new position in
    the chunk after element text has been concatenated.
    """
    elements: List[Element] = [
        Title(
            "Lorem Ipsum",
            metadata=ElementMetadata(
                regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]},
            ),
        ),
        Text(
            "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
            metadata=ElementMetadata(
                regex_metadata={
                    "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                    "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
                },
            ),
        ),
        Text(
            "In rhoncus ipsum sed lectus porta volutpat.",
            metadata=ElementMetadata(
                regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]},
            ),
        ),
    ]

    chunks = chunk_by_title(elements)

    assert len(chunks) == 1
    chunk = chunks[0]
    assert chunk == CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat.",
    )
    assert chunk.metadata.regex_metadata == {
        "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
        "ipsum": [
            RegexMetadata(text="Ipsum", start=6, end=11),
            RegexMetadata(text="ipsum", start=19, end=24),
            RegexMetadata(text="ipsum", start=81, end=86),
        ],
    }
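    # Worked example of the adjustment (comment added for clarity): the chunk text is
    # "Lorem Ipsum" (11 chars) + "\n\n" (2) + the 55-char second sentence + "\n\n" (2),
    # so the third element begins at offset 70 in the chunk and its "ipsum" match shifts
    # from [11, 16) to [81, 86); the second element begins at offset 13, shifting "dolor"
    # from [12, 17) to [25, 30).
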
def test_chunk_by_title_groups_across_pages():
    elements: List[Element] = [
Title("A Great Day", metadata=ElementMetadata(page_number=1)),
|
|
|
|
Text("Today is a great day.", metadata=ElementMetadata(page_number=2)),
|
|
|
|
Text("It is sunny outside.", metadata=ElementMetadata(page_number=2)),
|
2023-11-16 08:22:50 -08:00
|
|
|
Table("Heading\nCell text"),
|
2023-08-29 12:04:57 -04:00
|
|
|
Title("An Okay Day"),
|
|
|
|
Text("Today is an okay day."),
|
|
|
|
Text("It is rainy outside."),
|
|
|
|
Title("A Bad Day"),
|
|
|
|
Text(
|
|
|
|
"Today is a bad day.",
|
|
|
|
metadata=ElementMetadata(
|
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary
Table OCR refactor, move the OCR part for table model in inference repo
to unst repo.
* Before this PR, table model extracts OCR tokens with texts and
bounding box and fills the tokens to the table structure in inference
repo. This means we need to do an additional OCR for tables.
* After this PR, we use the OCR data from entire page OCR and pass the
OCR tokens to inference repo, which means we only do one OCR for the
entire document.
**Tech details:**
* Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` to `OCR_AGENT`, this
means we use the same OCR agent for entire page and tables since we only
do one OCR.
* Bump inference repo to `0.7.9`, which allow table model in inference
to use pre-computed OCR data from unst repo. Please check in
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebooks lint are made by `make tidy`
* This PR also fixes
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564),
I've added test for the issue in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`
* Add same scaling logic to image [similar to previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but now scaling is applied to entire image
### Test
* Not much to manually testing expect table extraction still works
* But due to change on scaling and use pre-computed OCR data from entire
page, there are some slight (better) changes on table output, here is an
comparison on test outputs i found from the same test
`test_partition_image_with_table_extraction`:
screen shot for table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">
### TODO
(added as a ticket) Still have some clean up to do in inference repo
since now unst repo have duplicate logic, but can keep them as a fall
back plan. If we want to remove anything OCR related in inference, here
are items that is deprecated and can be removed:
*
[`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
*
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
*
[`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR`
### Note
if we want to fallback for an additional table OCR (may need this for
using paddle for table), we need to:
* pass `infer_table_structure` to inference with `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`
---------
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-20 20:24:23 -04:00
|
|
|
regex_metadata={"a": [RegexMetadata(text="A", start=0, end=1)]},
|
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
chunks = chunk_by_title(elements)
assert chunks == [
CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
]
```
Observed behavior looked like this:
```python
chunks => [
CompositeElement('Lorem Ipsum')
CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail here #1790.
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={
"dolor": [RegexMetadata(text="dolor", start=12, end=17)],
"ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
```
..should produce this regex_metadata on the single produced chunk:
```python
assert chunk == CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
"dolor": [RegexMetadata(text="dolor", start=25, end=30)],
"ipsum": [
RegexMetadata(text="Ipsum", start=6, end=11),
RegexMetadata(text="ipsum", start=19, end=24),
RegexMetadata(text="ipsum", start=81, end=86),
],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
Which is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
2023-10-19 20:16:02 -07:00
|
|
|
),
|
2023-08-29 12:04:57 -04:00
|
|
|
),
|
|
|
|
Text("It is storming outside."),
|
|
|
|
CheckBox(),
|
|
|
|
]
|
2023-10-03 09:40:34 -07:00
|
|
|
chunks = chunk_by_title(elements, multipage_sections=True, combine_text_under_n_chars=0)
|
2023-08-29 12:04:57 -04:00
|
|
|
|
|
|
|
assert chunks == [
|
|
|
|
CompositeElement(
|
|
|
|
"A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
|
|
|
|
),
|
2023-11-16 08:22:50 -08:00
|
|
|
Table("Heading\nCell text"),
|
2023-08-29 12:04:57 -04:00
|
|
|
CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
|
|
|
|
CompositeElement(
|
|
|
|
"A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
|
|
|
|
),
|
|
|
|
CheckBox(),
|
|
|
|
]
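As a rough usage sketch of the single-pass OCR flow described in #1801 above: the OCR engine is selected once via the `OCR_AGENT` environment variable (replacing the separate `ENTIRE_PAGE_OCR` and `TABLE_OCR` settings), and table structure is filled from the page-level OCR tokens rather than from a second, table-only OCR pass. The agent value and the example file path below are assumptions for illustration only.
```python
import os

from unstructured.partition.pdf import partition_pdf

# One OCR agent for both page text and tables (value assumed; e.g. "tesseract" or "paddle").
os.environ["OCR_AGENT"] = "tesseract"

elements = partition_pdf(
    filename="example-docs/layout-parser-paper-with-table.pdf",  # hypothetical example file
    strategy="hi_res",
    infer_table_structure=True,  # table structure is populated from the page-level OCR tokens
)

tables = [element for element in elements if element.category == "Table"]
print(tables[0].metadata.text_as_html)
```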

def test_add_chunking_strategy_on_partition_html():
    filename = "example-docs/example-10k-1p.html"
    chunk_elements = partition_html(filename, chunking_strategy="by_title")
    elements = partition_html(filename)
    chunks = chunk_by_title(elements)
    assert chunk_elements != elements
    assert chunk_elements == chunks

def test_add_chunking_strategy_respects_max_characters():
    filename = "example-docs/example-10k-1p.html"
    chunk_elements = partition_html(
        filename,
        chunking_strategy="by_title",
        combine_text_under_n_chars=0,
        new_after_n_chars=50,
        max_characters=100,
    )
    elements = partition_html(filename)
    chunks = chunk_by_title(
        elements,
        combine_text_under_n_chars=0,
        new_after_n_chars=50,
        max_characters=100,
    )

    for chunk in chunks:
        assert isinstance(chunk, Text)
        assert len(chunk.text) <= 100

    for chunk_element in chunk_elements:
        assert isinstance(chunk_element, Text)
        assert len(chunk_element.text) <= 100

    assert chunk_elements != elements
    assert chunk_elements == chunks

def test_add_chunking_strategy_on_partition_html_respects_multipage():
    filename = "example-docs/example-10k-1p.html"
    partitioned_elements_multipage_false_combine_chars_0 = partition_html(
        filename,
        chunking_strategy="by_title",
        multipage_sections=False,
        combine_text_under_n_chars=0,
        new_after_n_chars=300,
        max_characters=400,
    )
    partitioned_elements_multipage_true_combine_chars_0 = partition_html(
        filename,
        chunking_strategy="by_title",
        multipage_sections=True,
        combine_text_under_n_chars=0,
        new_after_n_chars=300,
        max_characters=400,
    )
    elements = partition_html(filename)
    cleaned_elements_multipage_false_combine_chars_0 = chunk_by_title(
        elements,
        multipage_sections=False,
        combine_text_under_n_chars=0,
        new_after_n_chars=300,
        max_characters=400,
    )
    cleaned_elements_multipage_true_combine_chars_0 = chunk_by_title(
        elements,
        multipage_sections=True,
        combine_text_under_n_chars=0,
        new_after_n_chars=300,
        max_characters=400,
    )
    assert (
        partitioned_elements_multipage_false_combine_chars_0
        == cleaned_elements_multipage_false_combine_chars_0
    )
    assert (
        partitioned_elements_multipage_true_combine_chars_0
        == cleaned_elements_multipage_true_combine_chars_0
    )
    assert len(partitioned_elements_multipage_true_combine_chars_0) != len(
        partitioned_elements_multipage_false_combine_chars_0,
    )

def test_chunk_by_title_drops_detection_class_prob():
    elements: List[Element] = [
        Title(
            "A Great Day",
            metadata=ElementMetadata(
                detection_class_prob=0.5,
            ),
        ),
        Text(
            "Today is a great day.",
            metadata=ElementMetadata(
                detection_class_prob=0.62,
            ),
        ),
        Text(
            "It is sunny outside.",
            metadata=ElementMetadata(
                detection_class_prob=0.73,
            ),
        ),
        Title(
            "An Okay Day",
            metadata=ElementMetadata(
                detection_class_prob=0.84,
            ),
        ),
        Text(
            "Today is an okay day.",
            metadata=ElementMetadata(
                detection_class_prob=0.95,
            ),
        ),
    ]

    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)

    assert str(chunks[0]) == str(
        CompositeElement("A Great Day\n\nToday is a great day.\n\nIt is sunny outside."),
    )
    assert str(chunks[1]) == str(CompositeElement("An Okay Day\n\nToday is an okay day."))

def test_chunk_by_title_drops_extra_metadata():
    elements: List[Element] = [
        Title(
            "A Great Day",
            metadata=ElementMetadata(
                coordinates=CoordinatesMetadata(
                    points=(
                        (0.1, 0.1),
                        (0.2, 0.1),
                        (0.1, 0.2),
                        (0.2, 0.2),
                    ),
                    system=CoordinateSystem(width=0.1, height=0.1),
                ),
            ),
        ),
        Text(
            "Today is a great day.",
            metadata=ElementMetadata(
                coordinates=CoordinatesMetadata(
                    points=(
                        (0.2, 0.2),
                        (0.3, 0.2),
                        (0.2, 0.3),
                        (0.3, 0.3),
                    ),
                    system=CoordinateSystem(width=0.2, height=0.2),
                ),
            ),
        ),
        Text(
            "It is sunny outside.",
            metadata=ElementMetadata(
                coordinates=CoordinatesMetadata(
                    points=(
                        (0.3, 0.3),
                        (0.4, 0.3),
                        (0.3, 0.4),
                        (0.4, 0.4),
                    ),
                    system=CoordinateSystem(width=0.3, height=0.3),
                ),
            ),
        ),
        Title(
            "An Okay Day",
            metadata=ElementMetadata(
                coordinates=CoordinatesMetadata(
                    points=(
                        (0.3, 0.3),
                        (0.4, 0.3),
                        (0.3, 0.4),
                        (0.4, 0.4),
                    ),
                    system=CoordinateSystem(width=0.3, height=0.3),
                ),
            ),
        ),
        Text(
            "Today is an okay day.",
            metadata=ElementMetadata(
                coordinates=CoordinatesMetadata(
                    points=(
                        (0.4, 0.4),
                        (0.5, 0.4),
                        (0.4, 0.5),
                        (0.5, 0.5),
                    ),
                    system=CoordinateSystem(width=0.4, height=0.4),
                ),
            ),
        ),
    ]

    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)

    assert str(chunks[0]) == str(
        CompositeElement("A Great Day\n\nToday is a great day.\n\nIt is sunny outside."),
    )
    assert str(chunks[1]) == str(CompositeElement("An Okay Day\n\nToday is an okay day."))

fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length
**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.
When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.
Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.
**Example**
```python
elements: List[Element] = [
Title("Chunking Priorities"), # 19 chars
ListItem("Divide text into manageable chunks"), # 34 chars
ListItem("Preserve semantic boundaries"), # 28 chars
ListItem("Minimize mid-text chunk-splitting"), # 33 chars
] # 114 chars total but 120 chars with separators
chunks = chunk_by_title(elements, max_characters=115)
```
Want:
```python
[
CompositeElement(
"Chunking Priorities"
"\n\nDivide text into manageable chunks"
"\n\nPreserve semantic boundaries"
),
CompositeElement("Minimize mid-text chunk-splitting"),
]
```
Got:
```python
[
CompositeElement(
"Chunking Priorities"
"\n\nDivide text into manageable chunks"
"\n\nPreserve semantic boundaries"
"\n\nMinimize mid-text chunk-spli"),
)
CompositeElement("tting")
```
### Technical Summary
Because the sectioner does not consider separator (`"\n\n"`) length when
it computes the space remaining in the section, it over-populates the
section and when the chunker concatenates the element text (each
separated by the separator) the text exceeds the window length and the
chunk must be split mid-text, even though there was a clean element
boundary it could have been split on.
### Fix
Consider separator length in the space-remaining computation.
The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object
which removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(List[Element], running text length, separator length, etc.) and allows
it to focus on the rules of when to start a new section.
This solution may seem like overkill at the moment, and it would be, except
that it forms the foundation for adding section-level chunk combination (fix:
dissociated title elements) in the next PR. The objects introduced here will
gain several additional responsibilities over the next few chunking PRs in the
pipeline and will earn their place.
2023-10-26 14:34:15 -07:00
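A minimal sketch of the separator-aware accounting (illustrative only; the class and method names below are assumptions, and the real `_TextSectionBuilder` carries more responsibilities):
```python
from typing import List


class TextSectionBuilderSketch:
    """Accumulate element texts, accounting for the blank-line separators between them."""

    SEPARATOR = "\n\n"

    def __init__(self, maxlen: int) -> None:
        self._maxlen = maxlen
        self._texts: List[str] = []

    @property
    def text_length(self) -> int:
        """Length of the section text if the accumulated texts were joined now."""
        separators_len = len(self.SEPARATOR) * max(len(self._texts) - 1, 0)
        return sum(len(t) for t in self._texts) + separators_len

    @property
    def space_remaining(self) -> int:
        """Characters still available, counting the separator a new element would add."""
        if not self._texts:
            return self._maxlen
        return self._maxlen - self.text_length - len(self.SEPARATOR)

    def will_fit(self, text: str) -> bool:
        return len(text) <= self.space_remaining

    def add(self, text: str) -> None:
        self._texts.append(text)
```
With `max_characters=115` and the example elements above, the first three texts accumulate to 85 characters including separators; the fourth text (33 characters) no longer fits once the additional 2-character separator is counted, so it starts a new section instead of being split mid-word.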

def test_it_considers_separator_length_when_sectioning():
    """Sectioner includes length of separators when computing remaining space."""
    elements: List[Element] = [
        Title("Chunking Priorities"),  # 19 chars
        ListItem("Divide text into manageable chunks"),  # 34 chars
        ListItem("Preserve semantic boundaries"),  # 28 chars
        ListItem("Minimize mid-text chunk-splitting"),  # 33 chars
    ]  # 114 chars total but 120 chars with separators

    chunks = chunk_by_title(elements, max_characters=115)

    assert chunks == [
        CompositeElement(
            "Chunking Priorities"
            "\n\nDivide text into manageable chunks"
            "\n\nPreserve semantic boundaries",
        ),
        CompositeElement("Minimize mid-text chunk-splitting"),
    ]

# == Sections ====================================================================================


class Describe_NonTextSection:
    """Unit-test suite for `unstructured.chunking.title._NonTextSection` objects."""

    def it_iterates_its_element_as_the_sole_chunk(self):
        checkbox = CheckBox()
        section = _NonTextSection(checkbox)

        chunk_iter = section.iter_chunks(maxlen=500)

        chunk = next(chunk_iter)
        assert isinstance(chunk, CheckBox)
        with pytest.raises(StopIteration):
            next(chunk_iter)

class Describe_TableSection:
    """Unit-test suite for `unstructured.chunking.title._TableSection` objects."""

    def it_uses_its_table_as_the_sole_chunk_when_it_fits_in_the_window(self):
        html_table = (
            "<table>\n"
            "<thead>\n"
            "<tr><th>Header Col 1 </th><th>Header Col 2 </th></tr>\n"
            "</thead>\n"
            "<tbody>\n"
            "<tr><td>Lorem ipsum </td><td>adipiscing </td></tr>\n"
            "</tbody>\n"
            "</table>"
        )
        text_table = "Header Col 1 Header Col 2\n" "Lorem ipsum adipiscing"
        section = _TableSection(
            Table(text_table, metadata=ElementMetadata(text_as_html=html_table))
        )

        chunk_iter = section.iter_chunks(maxlen=175)

        chunk = next(chunk_iter)
        assert isinstance(chunk, Table)
        assert chunk.text == "Header Col 1 Header Col 2\nLorem ipsum adipiscing"
        assert chunk.metadata.text_as_html == (
            "<table>\n"
            "<thead>\n"
            "<tr><th>Header Col 1 </th><th>Header Col 2 </th></tr>\n"
            "</thead>\n"
            "<tbody>\n"
            "<tr><td>Lorem ipsum </td><td>adipiscing </td></tr>\n"
            "</tbody>\n"
            "</table>"
        )
        with pytest.raises(StopIteration):
            next(chunk_iter)

    def but_it_splits_its_table_into_TableChunks_when_the_table_text_exceeds_the_window(self):
        # fixed-overhead = 8+8+9+8+9+8 = 50
        # per-row overhead = 27
        html_table = (
            "<table>\n"  # 8
            "<thead>\n"  # 8
            "<tr><th>Header Col 1 </th><th>Header Col 2 </th></tr>\n"
            "</thead>\n"  # 9
            "<tbody>\n"  # 8
            "<tr><td>Lorem ipsum </td><td>A Link example</td></tr>\n"
            "<tr><td>Consectetur </td><td>adipiscing elit</td></tr>\n"
            "<tr><td>Nunc aliquam </td><td>id enim nec molestie</td></tr>\n"
            "<tr><td>Vivamus quis </td><td>nunc ipsum donec ac fermentum</td></tr>\n"
            "</tbody>\n"  # 9
            "</table>"  # 8
        )
        text_table = (
            "Header Col 1 Header Col 2\n"
            "Lorem ipsum dolor sit amet\n"
            "Consectetur adipiscing elit\n"
            "Nunc aliquam id enim nec molestie\n"
            "Vivamus quis nunc ipsum donec ac fermentum"
        )
        section = _TableSection(
            Table(text_table, metadata=ElementMetadata(text_as_html=html_table))
        )

        chunk_iter = section.iter_chunks(maxlen=100)

        chunk = next(chunk_iter)
        assert isinstance(chunk, TableChunk)
        assert chunk.text == (
            "Header Col 1 Header Col 2\n"
            "Lorem ipsum dolor sit amet\n"
            "Consectetur adipiscing elit\n"
            "Nunc aliqua"
        )
        assert chunk.metadata.text_as_html == (
            "<table>\n"
            "<thead>\n"
            "<tr><th>Header Col 1 </th><th>Header Col 2 </th></tr>\n"
            "</thead>\n"
            "<tbody>\n"
            "<tr><td>Lo"
        )
        # --
        chunk = next(chunk_iter)
        assert isinstance(chunk, TableChunk)
        assert (
            chunk.text == "m id enim nec molestie\nVivamus quis nunc ipsum donec ac fermentum"
        )
        assert chunk.metadata.text_as_html == (
            "rem ipsum </td><td>A Link example</td></tr>\n"
            "<tr><td>Consectetur </td><td>adipiscing elit</td><"
        )
        # -- note that text runs out but HTML continues because it's significantly longer. So two
        # -- of these chunks have HTML but no text.
        chunk = next(chunk_iter)
        assert isinstance(chunk, TableChunk)
        assert chunk.text == ""
        assert chunk.metadata.text_as_html == (
            "/tr>\n"
            "<tr><td>Nunc aliquam </td><td>id enim nec molestie</td></tr>\n"
            "<tr><td>Vivamus quis </td><td>"
        )
        # --
        chunk = next(chunk_iter)
        assert isinstance(chunk, TableChunk)
        assert chunk.text == ""
        assert chunk.metadata.text_as_html == (
            "nunc ipsum donec ac fermentum</td></tr>\n</tbody>\n</table>"
        )
        # --
        with pytest.raises(StopIteration):
            next(chunk_iter)
|
fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length
**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.
When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.
Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.
**Example**
```python
elements: List[Element] = [
Title("Chunking Priorities"), # 19 chars
ListItem("Divide text into manageable chunks"), # 34 chars
ListItem("Preserve semantic boundaries"), # 28 chars
ListItem("Minimize mid-text chunk-splitting"), # 33 chars
] # 114 chars total but 120 chars with separators
chunks = chunk_by_title(elements, max_characters=115)
```
Want:
```python
[
CompositeElement(
"Chunking Priorities"
"\n\nDivide text into manageable chunks"
"\n\nPreserve semantic boundaries"
),
CompositeElement("Minimize mid-text chunk-splitting"),
]
```
Got:
```python
[
CompositeElement(
"Chunking Priorities"
"\n\nDivide text into manageable chunks"
"\n\nPreserve semantic boundaries"
"\n\nMinimize mid-text chunk-spli"),
)
CompositeElement("tting")
```
### Technical Summary
Because the sectioner does not consider the separator (`"\n\n"`) length
when it computes the space remaining in the section, it over-populates
the section. When the chunker then concatenates the element text (each
piece separated by the separator), the text exceeds the window length
and the chunk must be split mid-text, even though there was a clean
element boundary it could have been split on.
### Fix
Consider separator length in the space-remaining computation.
The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object,
which removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(`List[Element]`, running text length, separator length, etc.) and allows
it to focus on the rules of when to start a new section.
This solution may seem like overkill at the moment, and it would be,
except that it forms the foundation for adding section-level chunk
combination (fix: dissociated title elements) in the next PR. The
objects introduced here will gain several additional responsibilities in
the next few chunking PRs in the pipeline and will earn their place.
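For illustration, a minimal sketch of the separator-aware computation.
The class and method names below are assumptions made for the sketch,
not the actual `_TextSectionBuilder` API; the point is only that each
element added after the first also costs a 2-character separator:
```python
from typing import List

SEPARATOR = "\n\n"  # -- 2 chars joined between element texts --


class SectionBuilderSketch:
    """Accumulates element text and charges each added element for its separator."""

    def __init__(self, maxlen: int = 500) -> None:
        self._maxlen = maxlen
        self._texts: List[str] = []

    @property
    def text_length(self) -> int:
        """Length of the chunk text this section would produce, separators included."""
        separators_len = len(SEPARATOR) * max(len(self._texts) - 1, 0)
        return sum(len(t) for t in self._texts) + separators_len

    @property
    def space_remaining(self) -> int:
        """Characters still available for the next element, including its separator."""
        separator_cost = len(SEPARATOR) if self._texts else 0
        return self._maxlen - self.text_length - separator_cost

    def will_fit(self, text: str) -> bool:
        return len(text) <= self.space_remaining

    def add_text(self, text: str) -> None:
        self._texts.append(text)
```
Walking the example above through this sketch with `maxlen=115`: after the
first three elements the builder holds 19 + 2 + 34 + 2 + 28 = 85 characters,
leaving 115 - 85 - 2 = 28 for the next element, so the 33-character
"Minimize mid-text chunk-splitting" starts a new section and the "Want"
output is produced.
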

class Describe_TextSection:
    """Unit-test suite for `unstructured.chunking.title._TextSection` objects."""

fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. List metadata such as `emphasized_text_contents`,
`emphasized_text_tags`, `link_texts`, and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say same text "here" but different
URL) the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need five distinct strategies,
selecting which to apply to each field based on that field's provenance
and other characteristics.
The five strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_contents` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based on its new position in the
concatenated text (a short sketch of this adjustment follows this list).
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
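A minimal sketch of the REGEX adjustment idea. The function below is
illustrative only, not the library implementation; the local
`RegexMetadata` definition mirrors the `TypedDict` shape described above,
and (as a simplification) every element is assumed to contribute text
followed by a separator:
```python
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    """Mirrors the shape of a single regex match: matched text plus offsets."""

    text: str
    start: int
    end: int


def consolidate_regex_metadata(
    texts: List[str],
    regex_metas: List[Dict[str, List[RegexMetadata]]],
) -> Dict[str, List[RegexMetadata]]:
    """Combine per-element regex metadata, shifting offsets into chunk coordinates."""
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0  # -- where this element's text begins in the concatenated chunk text --
    for text, regex_meta in zip(texts, regex_metas):
        for regex_name, matches in regex_meta.items():
            consolidated.setdefault(regex_name, []).extend(
                RegexMetadata(
                    text=m["text"], start=m["start"] + offset, end=m["end"] + offset
                )
                for m in matches
            )
        offset += len(text) + len("\n\n")  # -- next element follows a blank-line separator --
    return consolidated
```
For example, an "ipsum" match at `start=11` in a third element that is
preceded by 11 + 2 + 55 + 2 = 70 characters of chunk text shifts to
`start=81` in the consolidated regex-metadata.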
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP, # -- ? confirm --
- `detection_origin`: DROP, # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP, # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,
**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches),
therefore
sent-from, sent-to, and subject will not change within a section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
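A hedged sketch of the "apply" step. The `CONSOLIDATION_STRATEGIES`
mapping and `consolidate()` function are illustrative names, not the
library's actual API, and the `REGEX` special case (sketched separately
above) is omitted:
```python
from typing import Any, Dict, List

# -- illustrative subset of a field-name -> strategy mapping --
CONSOLIDATION_STRATEGIES: Dict[str, str] = {
    "filename": "FIRST",
    "link_texts": "LIST",
    "languages": "LIST_UNIQUE",
    "parent_id": "DROP",
}


def consolidate(all_meta: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Apply each field's strategy to its collected (non-None) values."""
    chunk_meta: Dict[str, Any] = {}
    for field_name, values in all_meta.items():
        strategy = CONSOLIDATION_STRATEGIES.get(field_name, "DROP")
        if strategy == "FIRST":
            chunk_meta[field_name] = values[0]
        elif strategy == "LIST":
            # -- concatenate the per-element lists in element order --
            chunk_meta[field_name] = [v for lst in values for v in lst]
        elif strategy == "LIST_UNIQUE":
            # -- keep only the first occurrence of each item, preserving order --
            seen: set = set()
            unique: List[Any] = []
            for lst in values:
                for v in lst:
                    if v not in seen:
                        seen.add(v)
                        unique.append(v)
            chunk_meta[field_name] = unique
        # -- "DROP" (and, in this sketch, unknown fields) contribute nothing --
    return chunk_meta
```
Applied to the `all_meta` example above, this sketch would produce
`{"filename": "memo.docx", "link_texts": ["here", "here", "and here"]}`;
`parent_id` is dropped.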
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two section types always contain exactly one element, so there
is already exactly one metadata instance).
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding new `ElementMetadata` field breaks metadata
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each metadata
field (recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
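As a sketch only of what such an "option 1" guard test could look like:
the strategy-mapping name and the assumption that `ElementMetadata`
fields can be enumerated with `dataclasses.fields()` are illustrative,
not the actual implementation.
```python
import dataclasses


def it_defines_a_consolidation_strategy_for_every_metadata_field():
    # -- ElementMetadata and CONSOLIDATION_STRATEGIES are assumed from context --
    field_names = {f.name for f in dataclasses.fields(ElementMetadata)}
    undeclared = field_names - set(CONSOLIDATION_STRATEGIES)
    assert not undeclared, (
        f"add a consolidation strategy for new ElementMetadata field(s): {undeclared}"
    )
```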
### Other Considerations
- If end-users can in-future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document-section unconditionally starts a new
section.
*Consolidation strategy:* `FIRST` - use first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.
*Consolidation strategy:* `LIST` - concatenate lists across elements.
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.
### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurrence
- `page_name` - *strategy:* take first occurrence
- `page_number` - *strategy:* take first occurrence, will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".
### N/A
These field types do not figure in metadata-consolidation:
- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell, is unreferenced in
source code. Should be removed from `ElementMetadata` as far as I can
tell.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will)

    def it_can_combine_itself_with_another_TextSection_instance(self):
        """.combine() produces a new section by appending the elements of `other_section`.

        Note that neither the original nor the other section is mutated.
        """
        section = _TextSection(
            [
                Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
                Text("In rhoncus ipsum sed lectus porta volutpat."),
            ]
        )
        other_section = _TextSection(
            [
                Text("Donec semper facilisis metus finibus malesuada."),
                Text("Vivamus magna nibh, blandit eu dui congue, feugiat efficitur velit."),
            ]
        )

        new_section = section.combine(other_section)

        assert new_section == _TextSection(
            [
                Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
                Text("In rhoncus ipsum sed lectus porta volutpat."),
                Text("Donec semper facilisis metus finibus malesuada."),
                Text("Vivamus magna nibh, blandit eu dui congue, feugiat efficitur velit."),
            ]
        )
        assert section == _TextSection(
            [
                Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
                Text("In rhoncus ipsum sed lectus porta volutpat."),
            ]
        )
        assert other_section == _TextSection(
            [
                Text("Donec semper facilisis metus finibus malesuada."),
                Text("Vivamus magna nibh, blandit eu dui congue, feugiat efficitur velit."),
            ]
        )

    def it_generates_a_single_chunk_from_its_elements_if_they_together_fit_in_window(self):
        section = _TextSection(
            [
                Title("Introduction"),
                Text(
                    "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed"
                    "lectus porta volutpat.",
                ),
            ]
        )

        chunk_iter = section.iter_chunks(maxlen=200)

        chunk = next(chunk_iter)
        assert chunk == CompositeElement(
            "Introduction\n\nLorem ipsum dolor sit amet consectetur adipiscing elit."
            " In rhoncus ipsum sedlectus porta volutpat.",
        )
        assert chunk.metadata is section._consolidated_metadata

    def but_it_generates_split_chunks_when_its_single_element_exceeds_window_size(self):
        # -- Chunk-splitting only occurs when a *single* element is too big to fit in the window.
        # -- The sectioner will isolate that element in a section of its own.
        section = _TextSection(
            [
                Text(
                    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod"
                    " tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim"
                    " veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea"
                    " commodo consequat."
                ),
            ]
        )

        chunk_iter = section.iter_chunks(maxlen=200)

        chunk = next(chunk_iter)
        assert chunk == CompositeElement(
            "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod"
            " tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim"
            " veniam, quis nostrud exercitation ullamco laboris nisi ut a"
        )
        assert chunk.metadata is section._consolidated_metadata
        # --
        chunk = next(chunk_iter)
        assert chunk == CompositeElement("liquip ex ea commodo consequat.")
        assert chunk.metadata is section._consolidated_metadata
        # --
        with pytest.raises(StopIteration):
            next(chunk_iter)

    def it_knows_the_length_of_the_combined_text_of_its_elements_which_is_the_chunk_size(self):
        """.text_length is the size of the chunk this section will produce (before any splitting)."""
        section = _TextSection([PageBreak(""), Text("foo"), Text("bar")])
        assert section.text_length == 8

    def it_extracts_all_populated_metadata_values_from_the_elements_to_help(self):
        section = _TextSection(
            [
                Title(
                    "Lorem Ipsum",
                    metadata=ElementMetadata(
                        category_depth=0,
                        filename="foo.docx",
                        languages=["lat"],
                        parent_id="f87731e0",
                    ),
                ),
                Text(
                    "'Lorem ipsum dolor' means 'Thank you very much' in Latin.",
                    metadata=ElementMetadata(
                        category_depth=1,
                        filename="foo.docx",
                        image_path="sprite.png",
                        languages=["lat", "eng"],
                    ),
                ),
            ]
        )

        assert section._all_metadata_values == {
            # -- scalar values are accumulated in a list in element order --
            "category_depth": [0, 1],
            # -- all values are accumulated, not only unique ones --
            "filename": ["foo.docx", "foo.docx"],
            # -- list-type fields produce a list of lists --
            "languages": [["lat"], ["lat", "eng"]],
            # -- fields that only appear in some elements are captured --
            "image_path": ["sprite.png"],
            "parent_id": ["f87731e0"],
            # -- A `None` value never appears, neither does a field-name with an empty list --
        }

Dynamic ElementMetadata implementation (#2043)
### Executive Summary
The structure of element metadata is currently static, meaning only
predefined fields can appear in the metadata. We would like the
flexibility for end-users, at their own discretion, to define and use
additional metadata fields that make sense for their particular
use-case.
### Concepts
A key concept for dynamic metadata is _known field_. A known-field is
one of those explicitly defined on `ElementMetadata`. Each of these has
a type and can be specified when _constructing_ a new `ElementMetadata`
instance. This is in contrast to an _end-user defined_ (or _ad-hoc_)
metadata field, one not known at "compile" time and added at the
discretion of an end-user to suit the purposes of their application.
An ad-hoc field can only be added by _assignment_ on an already
constructed instance.
### End-user ad-hoc metadata field behaviors
An ad-hoc field can be added to an `ElementMetadata` instance by
assignment:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```
A field added in this way can be accessed by name:
```python
>>> metadata.coefficient
0.536
```
and that field will appear in the JSON/dict for that instance:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```
However, accessing a "user-defined" value that has _not_ been assigned
on that instance raises `AttributeError`:
```python
>>> metadata.coeffcient # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```
This makes "tagging" a metadata item with a value very convenient, but
entails the proviso that if an end-user wants to add a metadata field to
_some_ elements and not others (sparse population), AND they want to
access that field by name on ANY element and receive `None` where it has
not been assigned, they will need to use an expression like this:
```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
```
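Equivalently, since the missing-field access raises `AttributeError` as
shown above, `getattr` with a default (a standard Python idiom, not a
library feature) expresses the same thing more compactly:
```python
coefficient = getattr(metadata, "coefficient", None)
```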
### Implementation Notes
- **ad-hoc metadata fields** are discarded during consolidation (for
chunking) because we don't have a consolidation strategy defined for
those. We could consider using a default consolidation strategy like
`FIRST` or possibly allow a user to register a strategy (although that
gets hairy in non-private and multiple-memory-space situations.)
- ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields
that might appear in a JSON/dict loaded using `.from_dict()`, so unlike
the original (which only loaded known-fields), we'll rehydrate anything
that we find there.
- No real type-safety is possible on ad-hoc fields but the type-checker
does not complain because the type of all ad-hoc fields is `Any` (which
is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc
fields to "sub" metadata objects too, like `DataSourceMetadata` and
conceivably `CoordinatesMetadata` (although I'm not immediately seeing a
use-case for the second one).

    def but_it_discards_ad_hoc_metadata_fields_during_consolidation(self):
        metadata = ElementMetadata(
            category_depth=0,
            filename="foo.docx",
            languages=["lat"],
            parent_id="f87731e0",
        )
        metadata.coefficient = 0.62
        metadata_2 = ElementMetadata(
            category_depth=1,
            filename="foo.docx",
            image_path="sprite.png",
            languages=["lat", "eng"],
        )
        metadata_2.quotient = 1.74

        section = _TextSection(
            [
                Title("Lorem Ipsum", metadata=metadata),
                Text("'Lorem ipsum dolor' means 'Thank you very much'.", metadata=metadata_2),
            ]
        )

        # -- ad-hoc fields "coefficient" and "quotient" do not appear --
        assert section._all_metadata_values == {
            "category_depth": [0, 1],
            "filename": ["foo.docx", "foo.docx"],
            "image_path": ["sprite.png"],
            "languages": [["lat"], ["lat", "eng"]],
            "parent_id": ["f87731e0"],
        }

    def it_consolidates_regex_metadata_in_a_field_specific_way(self):
        """regex_metadata of chunk is combined regex_metadatas of its elements.

        Also, the `start` and `end` offsets of each regex-match are adjusted to reflect their new
        position in the chunk after element text has been concatenated.
        """
        section = _TextSection(
            [
                Title(
                    "Lorem Ipsum",
                    metadata=ElementMetadata(
                        regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]},
                    ),
                ),
                Text(
                    "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
                    metadata=ElementMetadata(
                        regex_metadata={
                            "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                            "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
                        },
                    ),
                ),
                Text(
                    "In rhoncus ipsum sed lectus porta volutpat.",
                    metadata=ElementMetadata(
                        regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]},
                    ),
                ),
            ]
        )

        regex_metadata = section._consolidated_regex_meta

        assert regex_metadata == {
            "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
            "ipsum": [
                RegexMetadata(text="Ipsum", start=6, end=11),
                RegexMetadata(text="ipsum", start=19, end=24),
                RegexMetadata(text="ipsum", start=81, end=86),
            ],
        }

    def it_forms_ElementMetadata_constructor_kwargs_by_applying_consolidation_strategies(self):
        """._meta_kwargs is used like `ElementMetadata(**self._meta_kwargs)` to construct metadata.

        Only non-None fields should appear in the dict and each field value should be the
        consolidation of the values across the section elements.
        """
        section = _TextSection(
            [
                PageBreak(""),
                Title(
                    "Lorem Ipsum",
                    metadata=ElementMetadata(
                        filename="foo.docx",
                        # -- category_depth has DROP strategy so doesn't appear in result --
                        category_depth=0,
                        emphasized_text_contents=["Lorem", "Ipsum"],
                        emphasized_text_tags=["b", "i"],
                        languages=["lat"],
                        regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]},
                    ),
                ),
                Text(
                    "'Lorem ipsum dolor' means 'Thank you very much' in Latin.",
                    metadata=ElementMetadata(
                        # -- filename change doesn't happen IRL but demonstrates FIRST strategy --
                        filename="bar.docx",
                        # -- emphasized_text_contents has LIST_CONCATENATE strategy, so "Lorem"
                        # -- appears twice in consolidated-meta (as it should) and length matches
                        # -- that of emphasized_text_tags both before and after consolidation.
                        emphasized_text_contents=["Lorem", "ipsum"],
                        emphasized_text_tags=["i", "b"],
                        # -- languages has LIST_UNIQUE strategy, so "lat(in)" appears only once --
                        languages=["eng", "lat"],
                        # -- regex_metadata has its own dedicated consolidation-strategy (REGEX) --
                        regex_metadata={
                            "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                            "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
                        },
                    ),
                ),
            ]
        )

        meta_kwargs = section._meta_kwargs

        assert meta_kwargs == {
            "filename": "foo.docx",
            "emphasized_text_contents": ["Lorem", "Ipsum", "Lorem", "ipsum"],
            "emphasized_text_tags": ["b", "i", "i", "b"],
            "languages": ["lat", "eng"],
            "regex_metadata": {
                "ipsum": [
                    RegexMetadata(text="Ipsum", start=6, end=11),
                    RegexMetadata(text="ipsum", start=19, end=24),
                ],
                "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
            },
        }

    @pytest.mark.parametrize(
        ("elements", "expected_value"),
        [
            ([Text("foo"), Text("bar")], "foo\n\nbar"),
            ([Text("foo"), PageBreak(""), Text("bar")], "foo\n\nbar"),
            ([PageBreak(""), Text("foo"), Text("bar")], "foo\n\nbar"),
            ([Text("foo"), Text("bar"), PageBreak("")], "foo\n\nbar"),
        ],
    )
    def it_knows_the_concatenated_text_of_the_section(
        self, elements: List[Text], expected_value: str
    ):
        """._text is the "joined" text of the section elements.

        The text-segment contributed by each element is separated from the next by a blank line
        ("\n\n"). An element that contributes no text does not give rise to a separator.
        """
        section = _TextSection(elements)
        assert section._text == expected_value


fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles
**Executive Summary**. Section titles are often combined with the prior
section and then missing from the section they belong to.
_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
Chunking can be, and by default is, configured to combine sequential small
chunks that will together fit within the full chunk window (default 500
chars).
Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.
Example:
```python
elements: List[Element] = [
Title("Lorem Ipsum"), # 11
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."), # 55
Title("Rhoncus"), # 7
Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."), # 57
]
chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)
# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```
**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.
**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.
The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:
- `_TextSection` gets the `.combine()` method and `.text_length`
property which allows a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter", accepting a stream of sections and
emitting the same type, just with some resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length and doing the actual combining (calling
`_Section.combine(other_section)`) when instructed. Very similar in
concept to `_TextSectionBuilder`, just at the section level instead of
element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
filter between sectioner and chunker.
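Reduced to just the combining rule, the filter behaves roughly like the sketch below.
Sections are modelled as plain lists of text segments rather than `_Section` objects,
and table/checkbox sections (which are never combined, only flushed past) are left out;
the function stands in for `_SectionCombiner.iter_combined_sections()`.
```python
from typing import Iterator, List, Sequence


def _joined_length(texts: Sequence[str]) -> int:
    """Length of a section's text once joined with "\n\n" separators."""
    return len("\n\n".join(texts))


def iter_combined_sections(
    sections: Sequence[List[str]], maxlen: int, combine_text_under_n_chars: int
) -> Iterator[List[str]]:
    """Emit sections in order, combining runs of small ones that together fit the window."""
    accum: List[str] = []
    for texts in sections:
        if accum:
            accumulated_len = _joined_length(accum)
            combined_len = accumulated_len + 2 + _joined_length(texts)
            # -- start a fresh section when the accumulation is already big enough, or when
            # -- adding this section would overflow the hard window --
            if accumulated_len >= combine_text_under_n_chars or combined_len > maxlen:
                yield accum
                accum = []
        accum = accum + texts
    if accum:
        yield accum
```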
```python
class Describe_TextSectionBuilder:
    """Unit-test suite for `unstructured.chunking.title._TextSectionBuilder`."""

    def it_is_empty_on_construction(self):
        builder = _TextSectionBuilder(maxlen=50)

        assert builder.text_length == 0
        assert builder.remaining_space == 50

    def it_accumulates_elements_added_to_it(self):
        builder = _TextSectionBuilder(maxlen=150)

        builder.add_element(Title("Introduction"))
        assert builder.text_length == 12
        assert builder.remaining_space == 136

        builder.add_element(
            Text(
                "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed"
                "lectus porta volutpat.",
            ),
        )
        assert builder.text_length == 112
        assert builder.remaining_space == 36
```
fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. Second, list metadata such as `emphasized_text_content`,
`emphasized_text_tags`, `link_texts` and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say same text "here" but different
URL) the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need five distinct strategies,
selecting which to apply to each field based on that field's provenance
and other characteristics.
The five strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_content` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based on its new position in the
concatenated text (see the sketch just after this list).
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
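For the `REGEX` case, the adjustment amounts to shifting each match's `start`/`end` by
the length of chunk text that precedes its element. A rough sketch using the
`RegexMetadata` shape described above (the function name and argument layout here are
illustrative, not the library's API):
```python
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    text: str
    start: int
    end: int


def consolidate_regex_metadata(
    texts: List[str],
    regex_metas: List[Dict[str, List[RegexMetadata]]],
    separator: str = "\n\n",
) -> Dict[str, List[RegexMetadata]]:
    """Merge per-element regex-metadata, shifting offsets into the concatenated chunk text.

    `texts[i]` and `regex_metas[i]` belong to the i-th element of the section.
    """
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0
    for text, regex_meta in zip(texts, regex_metas):
        for field_name, matches in regex_meta.items():
            for match in matches:
                consolidated.setdefault(field_name, []).append(
                    RegexMetadata(
                        text=match["text"],
                        start=match["start"] + offset,
                        end=match["end"] + offset,
                    )
                )
        if text:
            # -- the next element's text starts after this text plus one "\n\n" separator;
            # -- an element contributing no text contributes no separator either --
            offset += len(text) + len(separator)
    return consolidated
```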
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP, # -- ? confirm --
- `detection_origin`: DROP, # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP, # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,
**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches), therefore
  sent-from, sent-to, and subject will not change within a section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
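Reduced to plain dictionaries, the apply step might look roughly like the sketch below.
The enum, the `STRATEGY_BY_FIELD` mapping, and the function are illustrative stand-ins,
not the names used in the implementation:
```python
from enum import Enum, auto
from typing import Any, Dict, List


class CS(Enum):
    """Consolidation strategies (illustrative subset, REGEX omitted)."""

    FIRST = auto()
    LIST = auto()
    LIST_UNIQUE = auto()
    DROP = auto()


STRATEGY_BY_FIELD = {
    "filename": CS.FIRST,
    "link_texts": CS.LIST,
    "link_urls": CS.LIST,
    "languages": CS.LIST_UNIQUE,
    "parent_id": CS.DROP,
}


def consolidate(all_meta: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Apply each field's strategy to the values collected in step 1."""
    chunk_meta: Dict[str, Any] = {}
    for field_name, values in all_meta.items():
        strategy = STRATEGY_BY_FIELD.get(field_name, CS.DROP)  # -- unknown fields dropped here --
        if strategy is CS.FIRST:
            chunk_meta[field_name] = values[0]
        elif strategy is CS.LIST:
            # -- each value is itself a list; concatenate them in element order --
            chunk_meta[field_name] = [item for value in values for item in value]
        elif strategy is CS.LIST_UNIQUE:
            unique: List[Any] = []
            for value in values:
                for item in value:
                    if item not in unique:
                        unique.append(item)
            chunk_meta[field_name] = unique
        # -- CS.DROP: the field simply does not appear in the chunk metadata --
    return chunk_meta
```
Run over the `all_meta` example above, this yields `{"filename": "memo.docx",
"link_texts": ["here", "here", "and here"]}`, with `parent_id` dropped.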
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two types always have exactly one element and so already have
exactly one metadata instance).
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding new `ElementMetadata` field breaks metadata
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each metadata field
(recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
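Such a test can be as small as comparing two sets of names; the constants below stand in
for `ElementMetadata`'s field names and the chunker's field-to-strategy mapping:
```python
# -- stand-ins for illustration: in the real test these come from ElementMetadata and
# -- from wherever the field -> strategy mapping is defined --
ELEMENT_METADATA_FIELD_NAMES = {"filename", "link_texts", "link_urls", "languages", "parent_id"}
STRATEGY_BY_FIELD = {
    "filename": "FIRST",
    "link_texts": "LIST",
    "link_urls": "LIST",
    "languages": "LIST_UNIQUE",
    "parent_id": "DROP",
}


def test_every_metadata_field_has_a_consolidation_strategy():
    missing = ELEMENT_METADATA_FIELD_NAMES - set(STRATEGY_BY_FIELD)
    assert not missing, f"no consolidation strategy defined for: {sorted(missing)}"
```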
### Other Considerations
- If end-users can in-future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document-section unconditionally starts new
section.
*Consolidation strategy:* `FIRST` - use first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.
*Consolidation strategy:* `LIST` - concatenate lists across elements.
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.
### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurrence
- `page_name` - *strategy:* take first occurrence
- `page_number` - *strategy:* take first occurrence, will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".
### N/A
These field types do not figure in metadata-consolidation:
- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell, is unreferenced in
source code. Should be removed from `ElementMetadata` as far as I can
tell.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will)
```python
    def it_generates_a_TextSection_when_flushed_and_resets_itself_to_empty(self):
        builder = _TextSectionBuilder(maxlen=150)
        builder.add_element(Title("Introduction"))
        builder.add_element(
            Text(
                "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed"
                "lectus porta volutpat.",
            ),
        )

        section = next(builder.flush())

        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Introduction"),
            Text(
                "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed"
                "lectus porta volutpat.",
            ),
        ]
        assert builder.text_length == 0
        assert builder.remaining_space == 150

    def but_it_does_not_generate_a_TextSection_on_flush_when_empty(self):
        builder = _TextSectionBuilder(maxlen=150)

        sections = list(builder.flush())

        assert sections == []
        assert builder.text_length == 0
        assert builder.remaining_space == 150

    def it_considers_separator_length_when_computing_text_length_and_remaining_space(self):
        builder = _TextSectionBuilder(maxlen=50)
        builder.add_element(Text("abcde"))
        builder.add_element(Text("fghij"))

        # -- .text_length includes a separator ("\n\n", len==2) between each text-segment,
        # -- so 5 + 2 + 5 = 12 here, not 5 + 5 = 10
        assert builder.text_length == 12
        # -- .remaining_space is reduced by the length (2) of the trailing separator which would go
        # -- between the current text and that of the next element if one was added.
        # -- So 50 - 12 - 2 = 36 here, not 50 - 12 = 38
        assert builder.remaining_space == 36

# == SectionCombiner =============================================================================


class Describe_SectionCombiner:
    """Unit-test suite for `unstructured.chunking.title._SectionCombiner`."""

    def it_combines_sequential_small_text_sections(self):
        sections = [
            _TextSection(
                [
                    Title("Lorem Ipsum"),  # 11
                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
                ]
            ),
            _TextSection(
                [
                    Title("Mauris Nec"),  # 10
                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),  # 59
                ]
            ),
            _TextSection(
                [
                    Title("Sed Orci"),  # 8
                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),  # 63
                ]
            ),
        ]

        section_iter = _SectionCombiner(
            sections, maxlen=250, combine_text_under_n_chars=250
        ).iter_combined_sections()

        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Lorem Ipsum"),
            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
            Title("Mauris Nec"),
            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
            Title("Sed Orci"),
            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
        ]
        with pytest.raises(StopIteration):
            next(section_iter)

    def but_it_does_not_combine_table_or_non_text_sections(self):
        sections = [
            _TextSection(
                [
                    Title("Lorem Ipsum"),
                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
                ]
            ),
            _TableSection(Table("Heading\nCell text")),
            _TextSection(
                [
                    Title("Mauris Nec"),
                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
                ]
            ),
            _NonTextSection(CheckBox()),
            _TextSection(
                [
                    Title("Sed Orci"),
                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
                ]
            ),
        ]

        section_iter = _SectionCombiner(
            sections, maxlen=250, combine_text_under_n_chars=250
        ).iter_combined_sections()

        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Lorem Ipsum"),
            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
        ]
        # --
        section = next(section_iter)
        assert isinstance(section, _TableSection)
        assert section._table == Table("Heading\nCell text")
        # --
        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Mauris Nec"),
            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
        ]
        # --
        section = next(section_iter)
        assert isinstance(section, _NonTextSection)
        # --
        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Sed Orci"),
            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
        ]
        # --
        with pytest.raises(StopIteration):
            next(section_iter)

    def it_respects_the_specified_combination_threshold(self):
        sections = [
            _TextSection(  # 68
                [
                    Title("Lorem Ipsum"),  # 11
                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
                ]
            ),
            _TextSection(  # 71
                [
                    Title("Mauris Nec"),  # 10
                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),  # 59
                ]
            ),
            # -- len == 139
            _TextSection(
                [
                    Title("Sed Orci"),  # 8
                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),  # 63
                ]
            ),
        ]

        section_iter = _SectionCombiner(
            sections, maxlen=250, combine_text_under_n_chars=80
        ).iter_combined_sections()

        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Lorem Ipsum"),
            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
            Title("Mauris Nec"),
            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
        ]
        # --
        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Sed Orci"),
            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
        ]
        # --
        with pytest.raises(StopIteration):
            next(section_iter)

    def it_respects_the_hard_maximum_window_length(self):
        sections = [
            _TextSection(  # 68
                [
                    Title("Lorem Ipsum"),  # 11
                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
                ]
            ),
            _TextSection(  # 71
                [
                    Title("Mauris Nec"),  # 10
                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),  # 59
                ]
            ),
            # -- len == 139
            _TextSection(
                [
                    Title("Sed Orci"),  # 8
                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),  # 63
                ]
            ),
            # -- len == 214
        ]

        section_iter = _SectionCombiner(
            sections, maxlen=200, combine_text_under_n_chars=200
        ).iter_combined_sections()

        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Lorem Ipsum"),
            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
            Title("Mauris Nec"),
            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
        ]
        # --
        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Sed Orci"),
            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
        ]
        # --
        with pytest.raises(StopIteration):
            next(section_iter)

    def it_accommodates_and_isolates_an_oversized_section(self):
        """Such as occurs when a single element exceeds the window size."""

        sections = [
            _TextSection([Title("Lorem Ipsum")]),
            _TextSection(  # 179
                [
                    Text(
                        "Lorem ipsum dolor sit amet consectetur adipiscing elit."  # 55
                        " Mauris nec urna non augue vulputate consequat eget et nisi."  # 60
                        " Sed orci quam, eleifend sit amet vehicula, elementum ultricies."  # 64
                    )
                ]
            ),
            _TextSection([Title("Vulputate Consequat")]),
        ]

        section_iter = _SectionCombiner(
            sections, maxlen=150, combine_text_under_n_chars=150
        ).iter_combined_sections()

        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [Title("Lorem Ipsum")]
        # --
        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Text(
                "Lorem ipsum dolor sit amet consectetur adipiscing elit."
                " Mauris nec urna non augue vulputate consequat eget et nisi."
                " Sed orci quam, eleifend sit amet vehicula, elementum ultricies."
            )
        ]
        # --
        section = next(section_iter)
        assert isinstance(section, _TextSection)
        assert section._elements == [Title("Vulputate Consequat")]
        # --
        with pytest.raises(StopIteration):
            next(section_iter)


class Describe_TextSectionAccumulator:
    """Unit-test suite for `unstructured.chunking.title._TextSectionAccumulator`."""

    def it_is_empty_on_construction(self):
        accum = _TextSectionAccumulator(maxlen=100)

        assert accum.text_length == 0
        assert accum.remaining_space == 100

    def it_accumulates_sections_added_to_it(self):
        accum = _TextSectionAccumulator(maxlen=500)

        accum.add_section(
            _TextSection(
                [
                    Title("Lorem Ipsum"),
                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
                ]
            )
        )
        assert accum.text_length == 68
        assert accum.remaining_space == 430

        accum.add_section(
            _TextSection(
                [
                    Title("Mauris Nec"),
                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
                ]
            )
        )
        assert accum.text_length == 141
        assert accum.remaining_space == 357

    def it_generates_a_TextSection_when_flushed_and_resets_itself_to_empty(self):
        accum = _TextSectionAccumulator(maxlen=150)
        accum.add_section(
            _TextSection(
                [
                    Title("Lorem Ipsum"),
                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
                ]
            )
        )
        accum.add_section(
            _TextSection(
                [
                    Title("Mauris Nec"),
                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
                ]
            )
        )
        accum.add_section(
            _TextSection(
                [
                    Title("Sed Orci"),
                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies quam."),
                ]
            )
        )

        section_iter = accum.flush()

        # -- iterator generates exactly one section --
        section = next(section_iter)
        with pytest.raises(StopIteration):
            next(section_iter)

        # -- and it is a _TextSection containing all the elements --
        assert isinstance(section, _TextSection)
        assert section._elements == [
            Title("Lorem Ipsum"),
            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
            Title("Mauris Nec"),
            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
            Title("Sed Orci"),
            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies quam."),
        ]
        assert accum.text_length == 0
        assert accum.remaining_space == 150

    def but_it_does_not_generate_a_TextSection_on_flush_when_empty(self):
        accum = _TextSectionAccumulator(maxlen=150)

        sections = list(accum.flush())

        assert sections == []
        assert accum.text_length == 0
        assert accum.remaining_space == 150

    def it_considers_separator_length_when_computing_text_length_and_remaining_space(self):
        accum = _TextSectionAccumulator(maxlen=100)
        accum.add_section(_TextSection([Text("abcde")]))
        accum.add_section(_TextSection([Text("fghij")]))

        # -- .text_length includes a separator ("\n\n", len==2) between each text-segment,
        # -- so 5 + 2 + 5 = 12 here, not 5 + 5 = 10
        assert accum.text_length == 12

        # -- .remaining_space is reduced by the length (2) of the trailing separator which would
        # -- go between the current text and that of the next section if one was added.
        # -- So 100 - 12 - 2 = 86 here, not 100 - 12 = 88
        assert accum.remaining_space == 86
```