Fixes a bug where `TypeError: 'NoneType' object is not iterable` raises
due to variable `res` returning as None
Checks the existence of `res` before iteration
- parametrize the output folder paths and expected output folder paths
in comparison scripts
- now allow user to use env `OUTPUT_ROOT` to control where the output
and expected output is
- currently assumes output from test and expected output are in the same
directory; this may need separation later
## test
run
```bash
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
and it should show files changed but not able to show diff since there
is no expected output content at `OUTPUT_ROOT`.
Then run
```bash
cp -R test_unstructured_ingest/expected-* /tmp/
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
we can see (due to CI and local instance producing different results)
actual line by line diff
### Summary
Closes#2011
`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.
### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image
elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages
elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```
For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```
On this branch, `languages` is included in the metadata regardless of
strategy
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Closes#2050.
### Summary
- set zoom to `1` if zoom is less than `0` when parsing Tesseract OCR
data
- update `determine_pdf_auto_strategy` to return the `hi_res` strategy
if either `infer_table_structure` or `extract_images_in_pdf` is true
### Testing
PDF:
[getty_62-62.pdf](https://github.com/Unstructured-IO/unstructured/files/13322169/getty_62-62.pdf)
Run the following code in both the `main` branch and the `current`
branch.
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
filename="getty_62-62.pdf",
extract_images_in_pdf=True,
infer_table_structure=True,
chunking_strategy="by_title",
max_characters=4000,
new_after_n_chars=3800,
combine_text_under_n_chars=2000,
image_output_dir_path=path,
)
```
Closes#2027
Tables or pages that contain only numbers are returned as floats in a
pandas.DataFrame when the image or page is converted from
`.image_to_data()`. An AttributeError was raised downstream when trying
to `.strip()` the floats. This update converts those floats if needed
and otherwise strips the text.
Testing (note: the document used for testing is new, so you will have to
copy it to the main branch in order to see that this snippet raises an
AttributeError on the main branch, but works on this branch)
```
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/all-number-table.pdf"
partition_pdf(filename, strategy="ocr_only")
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Fixes the Makefile `ingest-` targets were broken in
https://github.com/Unstructured-IO/unstructured/pull/1799/files.
**Test Instructions**
for maketarget in $(grep .PHONY Makefile | grep install-ingest | perl -p
-e 's/.PHONY://' | tr -d '\n'); do
echo $maketarget; make $maketarget
done
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page-breaks inserted by the author and when present are redundant
to a `w:lastRenderedPageBreak` element so cause over-counting if used.
Use rendered page-breaks only.
### Summary
- add constants for element type
- replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the
`ElementType` constants
- replace element type strings using the constants
### Testing
CI should pass.
This metadata field is assumedly vestigial and is unused by any code in
the repo. `max_characters` is an optional argument to `chunk_by_title()`
and has meaning in that context, but is not written to the metadata.
Remove this unused field.
A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.
Use the newly-added `Document.iter_inner_content()` method added
upstream in `python-docx` to capture both paragraphs and tables from a
section-less DOCX document.
This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
### Summary
Previously, the holistic evaluation script was a copy of the ingest
evaluation function with some modifications to aggregate the data by
doctype. This refactor instead takes the result of the
`measure_text_edit_distance` function (used by ingest) and aggregates
the results by doctype. This pattern can also be followed by future
aggregations we may want to perform.
### Test
Confirm the doctype aggregation functionality of the
`aggregate_cct_data_by_doctype` function by calling it on the ingest
metrics result sheet:
(from the top level unstructured folder)
```
python -c 'from unstructured.metrics.doctype_aggregation import *; aggregate_cct_data_by_doctype("./test_unstructured_ingest/metrics")'
```
The aggregated result will be written to the same metrics folder.
<img width="680" alt="Screenshot 2023-11-03 at 2 56 20 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/7250191b-bdf7-4e9f-99ca-ddbe7ee74ac5">
Closes#1064
When using the `--partition-by-api` flag via unstructured-ingest, none
of the partition arguments are forwarded, meaning that these options are
disregarded. With this change, we now pass through all of the relevant
partition arguments to the api.
## Changes
* parse and pass relevant partition arguments to the api in
unstructured-ingest
* bonus: leverage an existing `partition.api` function to call out to
the api rather than including duplicative request logic in unstructured
ingest
* bonus: --pdf-infer-table-structure is now a flag not an arg (it
defaults false anyways, this is more succinct and consistent with
similar parameters)
* bonus: adds `hi_res_model_name` so a user can specify the model to
leverage when using a hi_res strategy.
## Testing
* update against_api.sh source test script to specify a partition
argument and validates that the response from the api respected the
argument
* manually ran a request and validated that it was processed with
chipper as specified (not sure if we want to bake a chipper request into
the ci tests) (validated that the response leveraged the chipper model):
```
PYTHONPATH=. ./unstructured/ingest/main.py \
local \
--output-dir /tmp/ingest-requests/chipper \
--verbose \
--reprocess \
--strategy hi_res \
--partition-by-api \
--hi-res-model-name chipper \
--api-key "$API_KEY" \
--input-path 'example-docs/layout-parser-paper-with-table.pdf'
```
### Description
Add a `check_connection` method to each connector to easily be able to
check it without running the full ingest process. As part of this PR,
some refactoring done to allow clients to be shared and populated across
the `check_connection` method and the `initialize` method, allowing for
the `check_connection` method to be called without having to rely on the
`initialize` one to be called first.
* bonus: fix the changelog
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Courtesy @cdpierse.
Adds a test to PR #1529 in accordance with feedback.
Description from original PR:
In python the default behaviour of `requests.get` without a `timeout`
being set is to hang indefinitely. We have a production use case where
the desired behaviour would be to raise a timeout error rather than have
the application just hang.
This PR adds a new optional keyword parameter `request_timeout` to
`partition` which is passed to `file_and_type_from_url` in the case
where we are fetching from a URL. This is then passed to `requests.get`
---------
Co-authored-by: Charles Pierse <charlespierse@gmail.com>
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.
When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table with captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.
This implements the solution described and spiked in PR #1952.
---------
Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
Intermittently the various destination test will fail with:
```
{noformat}--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:
ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.{noformat}
```
Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)
After some investigation it looks like this error is due to collisions
that occur because we’re assuming 1s date accuracy is sufficient when
generating (and deleting) "unique" test destination location names. The
likelihood is actually pretty high given that we run these tests against
a test matrix.
Instead we should just use a uuid for these unique destinations.
## Changes
- Use uuidgen instead of `date +%s` for unique destinations
### Summary
Click decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in measure_text_edit_distance or
measure_element_type_accuracy. This PR removes the click decoration and
separates it into a wrapper function purely to execute the command.
### Technical Details
- Changed as suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
response
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance
### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Summary:
Close: https://github.com/Unstructured-IO/unstructured/issues/1920
* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we failed to convert any iso languages from the `languages`
parameter, proceed OCR with `eng` as default
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
* in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
* check out to this branch
* run `make run-web-app` again in api repo
* the curl command return output and see warning in log
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Per @tabossert we're now using a link shortener behind which we can
rotate the link to keep it current. That way we (🤞 ) never have to
update this here again.
#### Testing:
Links should work. No more links should exist in the documentation
except this one.
Closes#1782
This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers
Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost
For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
We currently have a method to trigger the ingest fixture workflow by
commit message in addition to workflow dispatch (the trigger in gha
gui). The former requires that the workflow run on every push. Because
nobody uses the former, let's scrap it and save the time in CI.
### Description
* A full schema was introduced to map the type of all output content
from the json partition output and mapped to a flattened table structure
to leverage table-based destination connectors. The delta table
destination connector was updated at the moment to take advantage of
this.
* Existing method to convert to a dataframe was updated because it had a
bug in it. Object content in the metadata would have the key name
changed when flattened but then this would be omitted since it didn't
exist in the `_get_metadata_table_fieldnames` response.
* Unit test was added to make sure we handle all values possible in an
Element when converting to a table
* Delta table ingest test was split into a source and destination test
(looking ahead to split these up in CI)
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into seperate src and dest tests
* Fix azure cognitive search and add to dest tests being run (wasn't
being run originally)
Remove bullets not related to end-user consumption of the unstructured
library.
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
This PR resolves
[CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453):
- parametrizes the output folder so that ingest output files can be
saved other than the same place where the scripts are; this is set by
env `OUTPUT_ROOT`
- parametrize the python path `PYTHONPATH` to first check existing
definition before default to `.`, the current folder
- parametrize the run script that carries out ingest using `RUN_SCRIPT`,
default is still `./unstructured/ingest/main.py`
These changes allows us to run ingest test with more control. To test:
- run `OUTPUT_ROOT=/tmp
./test_unstructured_ingest/src/local-single-file.sh`: the output now
should be in `/tmp` instead of in the ingest test folder
- run `RUN_SCRIPT=/hope/you/do/not/have/this/folder
./test_unstructured_ingest/src/local-single-file.sh` would raise an
error because system can't find `/hope/you/do/not/have/this/folder`
- run `RUN_SCRIPT=./unstructured/ingest/main.py
./test_unstructured_ingest/src/local-single-file.sh` should run as
normal
- do the following
```bash
cp ./unstructured/ingest/main.py /tmp/main.py
OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh
```
This will run and generate output at `/tmp`
[CORE-2453]:
https://unstructured-ai.atlassian.net/browse/CORE-2453?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
### Description
To always support the latest changed to the partition method and the
possible kwargs it supports, the ingest CLI has been refactored to take
in a valid json string to represent those values to allow a user more
flexibility with controlling the partition method.
### Summary
To combine ingest and holistic metrics efforts, add the `doctype` field
to the results from the functions in evaluate.py for use in subsequent
aggregation functions.
### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh
text-extraction` and there will be a new doctype column with the file's
doctype extension.
<img width="508" alt="Screenshot 2023-11-01 at 2 23 11 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/44583da9-e7ef-4142-be72-c2247b954bcf">
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
- This PR adds a function to check if a piece of text only contains a
bullet (no text) to prevent creating an empty element.
- Also fixed a test that had a typo.
This PR resolves#1294 by adding a Makefile to compile requirements.
This makefile respects the dependencies between file and will compile
them in order. E.g., extra-*.txt will be compiled __after__ base.txt is
updated.
Test locally by simply running `make pip-compile` or `cd requirements &&
make clean && make all`
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
### Summary
We no longer use the "bricks" terminology for partioning functions, etc
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out, it'll likely be helpful to reorganize the docs a bit as well.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
*Reviewer:* May be quicker to review commit by commit as they are quite
distinct and well-groomed to each focus on a single clean-up task.
Clean up odds-and-ends in the docx partitioner in preparation for adding
nested-tables support in a closely following PR.
1. Remove obsolete TODOs now in GitHub issues, which is probably where
they belong in future anyway.
2. Remove local DOCX "workaround" code that has been implemented
upstream and is now obsolete.
3. "Clean" the docx tests, introducing strict typing, extracting a
fixture or two, and generally tightening things up.
4. Extract docx-local versions of
`unstructured.partition.common.convert_ms_office_table_to_text()` which
will be the base for adding nested-table support. More information on
why this is required in that commit.
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:
1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. Second, list metadata such as `emphasized_text_content`,
`emphasized_text_tags`, `link_texts` and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say same text "here" but different
URL) the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need four distinct strategies,
selecting which to apply to each field based on that fields provenance
and other characteristics.
The four strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_content` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
The proposed strategy assignments are these:
- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP, # -- ? confirm --
- `detection_origin`: DROP, # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP, # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,
**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches),
therefore
sent-from, sent-to, and subject will not change within a section.
### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
"filename": ["memo.docx", "memo.docx"]
"link_texts": [["here", "here"], ["and here"]]
"parent_id": ["f273a7cb", "808b4ced"]
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two types always have exactly one element so already exactly
one metadata.)
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.
Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.
### Technical Risk: adding new `ElementMetadata` field breaks metadata
If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.
This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each
(Recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.
This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
### Other Considerations
- If end-users can in-future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.
### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.
---
## Further discussion on consolidation strategy by field
### document-static
These fields are very likely to be the same for all elements in a single
document:
- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`
*Consolidation strategy:* `FIRST` - use first one found, if any.
### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:
- `section` - an EPUB document-section unconditionally starts new
section.
*Consolidation strategy:* `FIRST` - use first one found, if any.
### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:
- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.
*Consolidation strategy:* `LIST` - concatenate lists across elements.
### dynamic
These fields are likely to hold unique data for each element:
- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`
*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.
### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:
- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurence
- `page_name` - *strategy:* take first occurence
- `page_number` - *strategy:* take first occurence, will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".
### N/A
These field types do not figure in metadata-consolidation:
- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell, is unreferenced in
source code. Should be removed from `ElementMetadata` as far as I can
tell.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.
*Consolidation strategy:* `DROP` any that appear (several never will)
### Summary
Closes#1520
Partial solution to #1521
- Adds an abstraction layer between the user API and the partitioner
implementation
- Adds comments explaining paragraph chunking
- Makes edits to pass strict type-checking for both text.py and
test_text.py