Steve Canny a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use a more sophisticated algorithm to split oversized `Table` elements
into `TableChunk` elements during chunking, so that element text and
HTML stay "synchronized" and the HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always exactly the text of the HTML table fragment
in `.metadata.text_as_html`, so text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
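
The splitting rules above can be illustrated with a minimal, standard-library-only sketch. This is not the implementation from this commit: `split_table`, `split_row`, `split_cell`, `text_of`, and `html_of` are hypothetical names, the size check is a plain character count on the chunk text, a word longer than the window is kept whole, and empty cells are dropped from a row that has to be broken. Text and HTML are both rendered from the same list of cells, which is what keeps them synchronized.

```python
import html as html_lib
import xml.etree.ElementTree as ET
from typing import List, Tuple

Row = List[str]  # one row = the text of each of its cells


def text_of(rows: List[Row]) -> str:
    """Chunk text: all non-empty cell texts joined by single spaces."""
    return " ".join(cell for row in rows for cell in row if cell)


def html_of(rows: List[Row]) -> str:
    """Minified, parseable <table> built from the same cells as the text."""
    def td(cell: str) -> str:
        return f"<td>{html_lib.escape(cell, quote=False)}</td>" if cell else "<td/>"
    body = "".join("<tr>" + "".join(td(c) for c in row) + "</tr>" for row in rows)
    return f"<table>{body}</table>"


def split_cell(cell: str, limit: int) -> List[str]:
    """Break an oversized cell on word boundaries (long words stay whole)."""
    parts, current = [], ""
    for word in cell.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > limit:
            parts.append(current)
            current = word
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


def split_row(row: Row, limit: int) -> List[Row]:
    """Break an oversized row on cell boundaries; empty cells are dropped."""
    fragments: List[Row] = []
    current: Row = []
    for cell in filter(None, row):
        if len(cell) > limit:
            if current:
                fragments.append(current)
                current = []
            fragments.extend([part] for part in split_cell(cell, limit))
        elif current and len(text_of([current + [cell]])) > limit:
            fragments.append(current)
            current = [cell]
        else:
            current.append(cell)
    if current:
        fragments.append(current)
    return fragments


def split_table(table_html: str, max_characters: int) -> List[Tuple[str, str]]:
    """Return (text, text_as_html) pairs, preferring whole-row boundaries."""
    rows: List[Row] = [
        [" ".join("".join(cell.itertext()).split())
         for cell in tr if cell.tag in ("td", "th")]
        for tr in ET.fromstring(table_html).iter("tr")
    ]
    chunks: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if len(text_of([row])) > max_characters:
            # A single row exceeds the window: flush, then break the row.
            if current:
                chunks.append(current)
                current = []
            chunks.extend([fragment] for fragment in split_row(row, max_characters))
        elif current and len(text_of(current + [row])) > max_characters:
            chunks.append(current)
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append(current)
    return [(text_of(chunk), html_of(chunk)) for chunk in chunks]


if __name__ == "__main__":
    table = ("<table><tr><td>Notes</td><td/></tr>"
             "<tr><td>Important Links</td><td/></tr></table>")
    for text, fragment in split_table(table, max_characters=15):
        print(repr(text), fragment)
```

With the 15-character window used here, the two-row table from the fixture below is divided at the row boundary into one chunk per row. A window large enough for the whole table returns it as a single chunk whose HTML equals the input.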
2024-08-19 18:56:53 +00:00

[
{
"element_id": "8cf6f327a51bcafbe61f759da949eae4",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.226000",
"date_modified": "2023-07-09T12:54:45.226000",
"record_locator": {
"page_id": "1605942",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605942",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Copy and paste this section for each week.",
"type": "Title"
},
{
"element_id": "0e80063a2a29a298f8f761e73d131d6c",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.226000",
"date_modified": "2023-07-09T12:54:45.226000",
"record_locator": {
"page_id": "1605942",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605942",
"version": "1"
},
"emphasized_text_contents": [
"Win"
],
"emphasized_text_tags": [
"b"
],
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Win",
"type": "Title"
},
{
"element_id": "4669e057140b5179bcccddf030564faf",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.226000",
"date_modified": "2023-07-09T12:54:45.226000",
"record_locator": {
"page_id": "1605942",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605942",
"version": "1"
},
"emphasized_text_contents": [
"Needs input"
],
"emphasized_text_tags": [
"b"
],
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Needs input",
"type": "Title"
},
{
"element_id": "85e720f616ed2097311e2e844a1e9e37",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.226000",
"date_modified": "2023-07-09T12:54:45.226000",
"record_locator": {
"page_id": "1605942",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605942",
"version": "1"
},
"emphasized_text_contents": [
"Focus"
],
"emphasized_text_tags": [
"b"
],
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Focus",
"type": "Title"
},
{
"element_id": "6a881c411fad6112029b96cc9e477d1e",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.226000",
"date_modified": "2023-07-09T12:54:45.226000",
"record_locator": {
"page_id": "1605942",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605942",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"text_as_html": "<table><tr><td>Notes</td><td/></tr><tr><td>Important Links</td><td/></tr></table>"
},
"text": "Notes Important Links",
"type": "Table"
}
]