fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. List metadata such as `emphasized_text_contents`,
`emphasized_text_tags`, `link_texts`, and `link_urls` is de-duplicated
when combined: an item is kept only when it is unique within the
combined list. These lists are "unzipped" pairs; for example, the first
`link_texts` item corresponds to the first `link_urls` value. When an
item is removed from one list (because it matches a prior entry) but not
the other (say, the same text "here" but a different URL), the
positional correspondence is broken and downstream processing will at
best be wrong and at worst raise an exception.
### Technical Discussion
Chunk metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is not sufficient either. To effectively consolidate the
current variety of metadata fields we need five distinct strategies,
selecting which to apply to each field based on that field's provenance
and other characteristics.
The five strategies (sketched as an enum below) are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first value that appears, if any, as a proxy for the value for the
entire chunk.
- `LIST` - Consolidate list fields such as `emphasized_text_contents`
and `link_urls` by concatenating them in element order (no set semantics
apply). All values from `elements[n]` appear before those from
`elements[n+1]` and the existing order within each list is preserved.
- `LIST_UNIQUE` - Combine only the unique items across the (list) values
of the elements, preserving the order in which each unique item first
appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offsets of each match based on its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
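A minimal sketch of these strategies as an enum; the `ConsolidationStrategy` name matches the object the test at the end of this file imports, while the enum base and member names are assumptions that follow this write-up:
```python
from enum import Enum


class ConsolidationStrategy(Enum):
    """How one metadata field is consolidated across a chunk's elements (sketch)."""

    DROP = "drop"                # omit the field from chunk metadata entirely
    FIRST = "first"              # first non-None value across the elements wins
    LIST = "list"                # concatenate list values in element order
    LIST_UNIQUE = "list_unique"  # concatenate, keeping each item's first occurrence
    REGEX = "regex"              # merge regex matches, adjusting start/end offsets
```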
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST
- `category_depth`: DROP
- `coordinates`: DROP
- `data_source`: FIRST
- `detection_class_prob`: DROP (to be confirmed)
- `detection_origin`: DROP (to be confirmed)
- `emphasized_text_contents`: LIST
- `emphasized_text_tags`: LIST
- `file_directory`: FIRST
- `filename`: FIRST
- `filetype`: FIRST
- `header_footer_type`: DROP
- `image_path`: DROP
- `is_continuation`: DROP (not expected before chunking; it is added by chunking)
- `languages`: LIST_UNIQUE
- `last_modified`: FIRST
- `link_texts`: LIST
- `link_urls`: LIST
- `links`: DROP (deprecated field)
- `max_characters`: DROP (unused in code; probably remove from `ElementMetadata`)
- `page_name`: FIRST
- `page_number`: FIRST
- `parent_id`: DROP
- `regex_metadata`: REGEX
- `section`: FIRST (a section unconditionally breaks on a new `section` value)
- `sent_from`: FIRST
- `sent_to`: FIRST
- `subject`: FIRST
- `text_as_html`: DROP (not expected; only occurs in `TableSection`)
- `url`: FIRST
**Assumptions:**
- Each .eml file is partitioned and chunked separately (not in batches),
so `sent_from`, `sent_to`, and `subject` will not change within a
section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
```
2. **Apply** the specified strategy to each field in the overall
collection to produce the consolidated chunk metadata (a sketch
follows).
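The following is a compact sketch of both steps together. The `consolidate_metadata()` helper and the abridged `_STRATEGIES` mapping are illustrative only; the real logic lives on `_TextSection` and covers every field in the table above.
```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence

from unstructured.documents.elements import Element

# Abridged field -> strategy mapping for illustration; see the full table above.
_STRATEGIES = {
    "filename": "FIRST",
    "link_texts": "LIST",
    "link_urls": "LIST",
    "languages": "LIST_UNIQUE",
    "parent_id": "DROP",
}


def consolidate_metadata(elements: Sequence[Element]) -> Dict[str, Any]:
    """Sketch of collect-then-apply metadata consolidation for one chunk."""
    # Step 1 -- collect: every non-None value, per field, in element order.
    all_meta: Dict[str, List[Any]] = defaultdict(list)
    for element in elements:
        for field_name, value in element.metadata.to_dict().items():
            if value is not None:
                all_meta[field_name].append(value)

    # Step 2 -- apply: reduce each field's collected values per its strategy.
    chunk_meta: Dict[str, Any] = {}
    for field_name, values in all_meta.items():
        strategy = _STRATEGIES.get(field_name, "DROP")  # fields not in this abridged map are dropped
        if strategy == "FIRST":
            chunk_meta[field_name] = values[0]
        elif strategy == "LIST":
            chunk_meta[field_name] = [v for lst in values for v in lst]
        elif strategy == "LIST_UNIQUE":
            unique: List[Any] = []
            for lst in values:
                unique.extend(v for v in lst if v not in unique)
            chunk_meta[field_name] = unique
        # "REGEX" (offset adjustment) and "DROP" handling omitted for brevity.
    return chunk_meta
```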
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two section types always contain exactly one element, so they
already have exactly one metadata instance).
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding a new `ElementMetadata` field breaks consolidation
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for every metadata
field (recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
### Other Considerations
- If end-users can, in future, add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use the first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document section unconditionally starts a new
section.
*Consolidation strategy:* `FIRST` - use the first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case: its match offsets (`start`/`end`) are adjusted too.
*Consolidation strategy:* `LIST` - concatenate lists across elements.
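A small, self-contained illustration of why plain concatenation (rather than de-duplication) is required for these paired lists; the URLs are made up for the example:
```python
# Two elements contribute link metadata; the repeated text "here" points at a different URL.
link_texts_per_element = [["here", "here"], ["and here"]]
link_urls_per_element = [
    ["https://a.example/one", "https://b.example/two"],
    ["https://c.example/three"],
]

# Concatenating both lists in element order keeps the text/URL pairs aligned.
chunk_link_texts = [text for texts in link_texts_per_element for text in texts]
chunk_link_urls = [url for urls in link_urls_per_element for url in urls]

assert list(zip(chunk_link_texts, chunk_link_urls)) == [
    ("here", "https://a.example/one"),
    ("here", "https://b.example/two"),
    ("and here", "https://c.example/three"),
]
```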
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP`, since any single value is likely to be misleading for the chunk.
- A `COORDINATES` strategy could be added to compute the chunk bounding
box from all the element bounding boxes (sketched below).
- Consider allowing these fields when all values are the same, perhaps
via an `ALL` strategy.
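A rough sketch of what such a `COORDINATES` merge could look like; the helper name and four-corner ordering are illustrative, and real coordinates would also need a common coordinate system:
```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def merge_bounding_boxes(element_points: Sequence[Sequence[Point]]) -> Tuple[Point, ...]:
    """Return the four corners of the box enclosing every element's points."""
    xs = [x for points in element_points for x, _ in points]
    ys = [y for points in element_points for _, y in points]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
    # Corner order follows the coordinate examples elsewhere in this file.
    return ((left, top), (left, bottom), (right, bottom), (right, top))


assert merge_bounding_boxes([[(1, 2), (1, 4), (3, 4), (3, 2)], [(2, 5), (6, 5)]]) == (
    (1, 2),
    (1, 5),
    (6, 5),
    (6, 2),
)
```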
### slow-changing
These fields are somewhere in between, likely to be common across
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not consolidatable
- `languages` - *strategy:* combine unique values (`LIST_UNIQUE`)
- `page_name` - *strategy:* take the first occurrence
- `page_number` - *strategy:* take the first occurrence; all values will
be the same when `multipage_sections` is `False`. Worst-case semantics
are "this chunk began on this page".
### N/A
These fields do not figure in metadata consolidation:
- `detection_class_prob` - appears to be for debugging and should not
appear in chunks, but this needs confirmation.
- `detection_origin` - for debugging only.
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` - deprecated; probably should be dropped.
- `max_characters` - as far as I can tell this is unreferenced in the
source code and should be removed from `ElementMetadata`.
- `text_as_html` - only appears on a `Table` element, each of which gets
its own section, so it needs no consolidation. It never appears in a
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will).
---

import dataclasses as dc
import json
from functools import partial

import pytest

from unstructured.cleaners.core import clean_prefix
from unstructured.cleaners.translate import translate_text
from unstructured.documents.coordinates import (
    CoordinateSystem,
    Orientation,
    RelativeCoordinateSystem,
)
from unstructured.documents.elements import (
    UUID,
    ConsolidationStrategy,
    CoordinatesMetadata,
    Element,
    ElementMetadata,
    NoID,
    RegexMetadata,
    Text,
)


def test_text_id():
    text_element = Text(text="hello there!")
    assert text_element.id == "c69509590d81db2f37f9d75480c8efed"


def test_text_uuid():
    text_element = Text(text="hello there!", element_id=UUID())
    assert len(text_element.id) == 36
    assert text_element.id.count("-") == 4
    # Test that the element is JSON serializable. This should run without an error
    json.dumps(text_element.to_dict())


def test_element_defaults_to_blank_id():
    element = Element()
    assert isinstance(element.id, NoID)


def test_element_uuid():
    element = Element(element_id=UUID())
    assert isinstance(element.id, UUID)


def test_text_element_apply_cleaners():
    text_element = Text(text="[1] A Textbook on Crocodile Habitats")

    text_element.apply(partial(clean_prefix, pattern=r"\[\d{1,2}\]"))
    assert str(text_element) == "A Textbook on Crocodile Habitats"


def test_text_element_apply_multiple_cleaners():
    cleaners = [
        partial(clean_prefix, pattern=r"\[\d{1,2}\]"),
        partial(translate_text, target_lang="ru"),
    ]
    text_element = Text(text="[1] A Textbook on Crocodile Habitats")
    text_element.apply(*cleaners)
    assert str(text_element) == "Учебник по крокодильным средам обитания"


def test_apply_raises_if_func_does_not_produce_string():
    text_element = Text(text="[1] A Textbook on Crocodile Habitats")
    with pytest.raises(ValueError):
        text_element.apply(lambda s: 1)


@pytest.mark.parametrize(
    ("coordinates", "orientation1", "orientation2", "expected_coords"),
    [
        (
            ((1, 2), (1, 4), (3, 4), (3, 2)),
            Orientation.CARTESIAN,
            Orientation.CARTESIAN,
            ((10, 20), (10, 40), (30, 40), (30, 20)),
        ),
        (
            ((1, 2), (1, 4), (3, 4), (3, 2)),
            Orientation.CARTESIAN,
            Orientation.SCREEN,
            ((10, 1980), (10, 1960), (30, 1960), (30, 1980)),
        ),
        (
            ((1, 2), (1, 4), (3, 4), (3, 2)),
            Orientation.SCREEN,
            Orientation.CARTESIAN,
            ((10, 1980), (10, 1960), (30, 1960), (30, 1980)),
        ),
        (
            ((1, 2), (1, 4), (3, 4), (3, 2)),
            Orientation.SCREEN,
            Orientation.SCREEN,
            ((10, 20), (10, 40), (30, 40), (30, 20)),
        ),
    ],
)
def test_convert_coordinates_to_new_system(
    coordinates,
    orientation1,
    orientation2,
    expected_coords,
):
    coord1 = CoordinateSystem(100, 200)
    coord1.orientation = orientation1
    coord2 = CoordinateSystem(1000, 2000)
    coord2.orientation = orientation2
    element = Element(coordinates=coordinates, coordinate_system=coord1)

    new_coords = element.convert_coordinates_to_new_system(coord2)

    for new_coord, expected_coord in zip(new_coords, expected_coords):
        assert new_coord == pytest.approx(expected_coord)

    element.convert_coordinates_to_new_system(coord2, in_place=True)

    for new_coord, expected_coord in zip(element.metadata.coordinates.points, expected_coords):
        assert new_coord == pytest.approx(expected_coord)

    assert element.metadata.coordinates.system == coord2


def test_convert_coordinate_to_new_system_none():
    element = Element(coordinates=None, coordinate_system=None)
    coord = CoordinateSystem(100, 200)
    coord.orientation = Orientation.SCREEN
    assert element.convert_coordinates_to_new_system(coord) is None


def test_element_constructor_coordinates_all_present():
    coordinates = ((1, 2), (1, 4), (3, 4), (3, 2))
    coordinate_system = RelativeCoordinateSystem()
    element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
    expected_coordinates_metadata = CoordinatesMetadata(
        points=coordinates,
        system=coordinate_system,
    )
    assert element.metadata.coordinates == expected_coordinates_metadata


def test_element_constructor_coordinates_points_absent():
    with pytest.raises(ValueError) as exc_info:
        Element(coordinate_system=RelativeCoordinateSystem())
    assert (
        str(exc_info.value)
        == "Coordinates points should not exist without coordinates system and vice versa."
    )


def test_element_constructor_coordinates_system_absent():
    with pytest.raises(ValueError) as exc_info:
        Element(coordinates=((1, 2), (1, 4), (3, 4), (3, 2)))
    assert (
        str(exc_info.value)
        == "Coordinates points should not exist without coordinates system and vice versa."
    )


def test_coordinate_metadata_serdes():
    coordinates = ((1, 2), (1, 4), (3, 4), (3, 2))
    coordinate_system = RelativeCoordinateSystem()
    coordinates_metadata = CoordinatesMetadata(points=coordinates, system=coordinate_system)
    expected_schema = {
        "layout_height": 1,
        "layout_width": 1,
        "points": ((1, 2), (1, 4), (3, 4), (3, 2)),
        "system": "RelativeCoordinateSystem",
    }
    coordinates_metadata_dict = coordinates_metadata.to_dict()
    assert coordinates_metadata_dict == expected_schema
    assert CoordinatesMetadata.from_dict(coordinates_metadata_dict) == coordinates_metadata


def test_element_to_dict():
    coordinates = ((1, 2), (1, 4), (3, 4), (3, 2))
    coordinate_system = RelativeCoordinateSystem()
    element = Element(
        element_id="awt32t1",
        coordinates=coordinates,
        coordinate_system=coordinate_system,
    )
    expected = {
        "metadata": {
            "coordinates": {
                "layout_height": 1,
                "layout_width": 1,
                "points": ((1, 2), (1, 4), (3, 4), (3, 2)),
                "system": "RelativeCoordinateSystem",
            },
        },
        "type": None,
        "element_id": "awt32t1",
    }
    assert element.to_dict() == expected

---
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
```
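Spelled out as a sketch consistent with that description:
```python
from typing import TypedDict


class RegexMetadata(TypedDict):
    """One regex match recorded in element metadata."""

    text: str   # the matched text
    start: int  # offset of the first matched character within the element text
    end: int    # offset just past the last matched character
```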
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
This looks like a mis-remembering of the regex-metadata data structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
chunks = chunk_by_title(elements)
assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```
Observed behavior looked like this:
```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids over-chunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail in #1790.
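A minimal sketch of the whitelisting approach; the field set and helper name here are illustrative, not the exact ones used in the code:
```python
from typing import Any, Dict

# Only a change in one of these explicitly listed fields starts a new section (sketch).
SECTION_BREAK_FIELDS = {"filename", "page_number", "section"}


def crosses_semantic_boundary(prev_meta: Dict[str, Any], next_meta: Dict[str, Any]) -> bool:
    """True when chunking should close the current section before the next element."""
    return any(prev_meta.get(f) != next_meta.get(f) for f in SECTION_BREAK_FIELDS)
```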
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```
...should produce this `regex_metadata` on the single produced chunk:
```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
This is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
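A minimal sketch of that separate regex-metadata consolidation step, consistent with the worked example above; the helper name and running-offset bookkeeping are illustrative, and the separator length assumes the `"\n\n"` used to join element texts into the chunk text:
```python
from typing import Dict, List

from unstructured.documents.elements import RegexMetadata, Text

TEXT_SEPARATOR = "\n\n"  # assumed join string between element texts within a chunk


def consolidate_regex_metadata(elements: List[Text]) -> Dict[str, List[RegexMetadata]]:
    """Merge per-element regex_metadata, shifting offsets into chunk-text coordinates."""
    chunk_regex_meta: Dict[str, List[RegexMetadata]] = {}
    offset = 0
    for element in elements:
        for regex_name, matches in (element.metadata.regex_metadata or {}).items():
            for match in matches:
                chunk_regex_meta.setdefault(regex_name, []).append(
                    RegexMetadata(
                        text=match["text"],
                        start=match["start"] + offset,
                        end=match["end"] + offset,
                    )
                )
        offset += len(element.text) + len(TEXT_SEPARATOR)
    return chunk_regex_meta
```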
def test_regex_metadata_round_trips_through_JSON():
    """metadata.regex_metadata should appear at full depth in JSON."""
    regex_metadata = {
        "mail-stop": [RegexMetadata(text="MS-107", start=18, end=24)],
        "version": [
            RegexMetadata(text="current=v1.7.2", start=7, end=21),
            RegexMetadata(text="supersedes=v1.7.2", start=22, end=40),
        ],
    }
    metadata = ElementMetadata(regex_metadata=regex_metadata)

    metadata_json = json.dumps(metadata.to_dict())
    deserialized_metadata = ElementMetadata.from_dict(json.loads(metadata_json))
    reserialized_metadata_json = json.dumps(deserialized_metadata.to_dict())

    assert reserialized_metadata_json == metadata_json


def test_metadata_from_dict_extra_fields():
    """
    Assert that the metadata classes ignore nonexistent fields.
    This can be an issue when elements_from_json gets a schema
    from the future.
    """
    element_metadata = {
        "new_field": "hello",
        "data_source": {
            "new_field": "world",
        },
        "coordinates": {
            "new_field": "foo",
        },
    }

    metadata = ElementMetadata.from_dict(element_metadata)
    metadata_dict = metadata.to_dict()

    assert "new_field" not in metadata_dict
    assert "new_field" not in metadata_dict["coordinates"]
    assert "new_field" not in metadata_dict["data_source"]


def test_there_is_a_consolidation_strategy_for_every_ElementMetadata_field():
    metadata_field_names = sorted(f.name for f in dc.fields(ElementMetadata))
    consolidation_strategies = ConsolidationStrategy.field_consolidation_strategies()

    for field_name in metadata_field_names:
        assert field_name in consolidation_strategies, (
            f"ElementMetadata field `.{field_name}` does not have a consolidation strategy."
            " Add one in `ConsolidationStrategy.field_consolidation_strategies()`."
        )