Steve Canny a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use a more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking, so that element
text and HTML are "synchronized" and the HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
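
The splitting rules listed above can be sketched roughly as follows. This is a minimal illustration only, not the library's implementation: `split_table`, `row_html`, and `max_chars` are hypothetical names, the chunking window is assumed to be measured in characters of text, and "minified" is simplified to emitting nothing but `<table>`, `<tr>`, and `<td>` tags.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]  # one table row as a list of cell texts (hypothetical model)


def row_text(cells: Row) -> str:
    """Text of a row: its non-empty cells joined by single spaces."""
    return " ".join(c for c in cells if c)


def row_html(cells: Row) -> str:
    """Minified HTML for a row; empty cells are written as `<td/>`."""
    tds = "".join(f"<td>{escape(c)}</td>" if c else "<td/>" for c in cells)
    return f"<tr>{tds}</tr>"


def split_words(text: str, max_chars: int) -> Iterator[str]:
    """Break an oversized cell on word boundaries.

    A single word longer than `max_chars` is left oversized in this sketch.
    """
    piece = ""
    for word in text.split():
        candidate = f"{piece} {word}" if piece else word
        if piece and len(candidate) > max_chars:
            yield piece
            piece = word
        else:
            piece = candidate
    if piece:
        yield piece


def fragments(row: Row, max_chars: int) -> Iterator[Row]:
    """Yield the whole row if it fits, else groups of whole cells,
    splitting a cell on words only when the cell alone is too large."""
    if len(row_text(row)) <= max_chars:
        yield row
        return
    group: Row = []
    for cell in row:
        parts = [cell] if len(cell) <= max_chars else list(split_words(cell, max_chars))
        for part in parts:
            if group and len(row_text(group + [part])) > max_chars:
                yield group
                group = []
            group.append(part)
    if group:
        yield group


def emit(chunk: List[Row]) -> Tuple[str, str]:
    """Build text and HTML from the same rows so they stay synchronized."""
    text = " ".join(t for t in map(row_text, chunk) if t)
    html = "<table>" + "".join(map(row_html, chunk)) + "</table>"
    return text, html


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Greedily pack row fragments into (text, html) chunks within the window."""
    chunk: List[Row] = []
    for row in rows:
        for frag in fragments(row, max_chars):
            if chunk and len(emit(chunk + [frag])[0]) > max_chars:
                yield emit(chunk)
                chunk = []
            chunk.append(frag)
    if chunk:
        yield emit(chunk)


rows = [["Must have:", ""], ["Nice to have:", ""], ["Not in scope:", ""]]
for text, html in split_table(rows, max_chars=25):
    print(repr(text), html)
```

Because each chunk's text and HTML are derived from the same list of rows, every emitted fragment is a complete `<table>` subtree whose cell text matches the chunk text, which is the "synchronized" property described above.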
2024-08-19 18:56:53 +00:00

[
{
"element_id": "87d54efb69679f52b8c22f98f5ee6008",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"text_as_html": "<table><tr><td>Driver</td><td/></tr><tr><td>Approver</td><td/></tr><tr><td>Contributors</td><td/></tr><tr><td>Informed</td><td/></tr><tr><td>Objective</td><td/></tr><tr><td>Due date</td><td/></tr><tr><td>Key outcomes</td><td/></tr><tr><td>Status</td><td>NOT STARTED / IN PROGRESS / COMPLETE</td></tr></table>"
},
"text": "Driver Approver Contributors Informed Objective Due date Key outcomes Status NOT STARTED / IN PROGRESS / COMPLETE",
"type": "Table"
},
{
"element_id": "c7fee3d3e71bbdd748c1f39a93896d82",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83E\\uDD14 Problem Statement",
"type": "Title"
},
{
"element_id": "945c34601bbf9b38b95a3f5c82c8fb80",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "🎯 Scope",
"type": "Title"
},
{
"element_id": "a67a84caf3e93f9d3c6ee9462f6ac7bb",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"text_as_html": "<table><tr><td>Must have:</td><td/></tr><tr><td>Nice to have:</td><td/></tr><tr><td>Not in scope:</td><td/></tr></table>"
},
"text": "Must have: Nice to have: Not in scope:",
"type": "Table"
},
{
"element_id": "b346f3f9a795eec4eacd313a745107a9",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83D\\uDDD3 Timeline",
"type": "Title"
},
{
"element_id": "7aa58f6123e145d68b491d3e735060f8",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Lane 1",
"type": "Title"
},
{
"element_id": "21354bac4c070eaa9722a971e6bdbfea",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Lane 2",
"type": "Title"
},
{
"element_id": "2fac077cc411e658746e76d86ea1ec37",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Feature 1",
"type": "Title"
},
{
"element_id": "ff8497516144be25a4c0922f14c6ee28",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Feature 2",
"type": "Title"
},
{
"element_id": "0f701ee7f075b9b83ca75e844ab8184a",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Feature 3",
"type": "Title"
},
{
"element_id": "edf428e92bdb9e94ac17f876cdf7c058",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Feature 4",
"type": "Title"
},
{
"element_id": "4d846dbdfa5783f976e41e1852ffb179",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "iOS app",
"type": "Title"
},
{
"element_id": "4fb022f234174c8dc5df55dd0c677833",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Android app",
"type": "Title"
},
{
"element_id": "6ad5138f5ba9db1b2cc4c68d97bd237a",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83D\\uDEA9 Milestones and deadlines",
"type": "Title"
},
{
"element_id": "15b3d2fa95017389c5c47d1c5fc64b4d",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"text_as_html": "<table><tr><td>Milestone</td><td>Owner</td><td>Deadline</td><td>Status</td></tr><tr><td/><td/><td/><td/></tr><tr><td/><td/><td/><td/></tr><tr><td/><td/><td/><td/></tr></table>"
},
"text": "Milestone Owner Deadline Status",
"type": "Table"
},
{
"element_id": "540148258482b576b4bc0f0a2f5ab76d",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:55:50.911000",
"date_modified": "2023-07-09T12:56:10.564000",
"record_locator": {
"page_id": "1540126",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1540126",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83D\\uDD17 Reference materials",
"type": "Title"
}
]
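
The relationship between `text` and `metadata.text_as_html` in the `Table` elements above can be checked with a short script. This is a sketch under assumptions: `CellText` and `html_to_text` are hypothetical helpers built on the standard-library `html.parser`, and the element literal is copied from the "Must have:" table in this fixture rather than loaded from the file.

```python
from html.parser import HTMLParser
from typing import List


class CellText(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)


def html_to_text(fragment: str) -> str:
    """Join the cell texts of a table fragment with single spaces."""
    parser = CellText()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.parts)


# One element copied from the fixture above (other fields omitted).
element = {
    "metadata": {
        "text_as_html": (
            "<table><tr><td>Must have:</td><td/></tr>"
            "<tr><td>Nice to have:</td><td/></tr>"
            "<tr><td>Not in scope:</td><td/></tr></table>"
        )
    },
    "text": "Must have: Nice to have: Not in scope:",
    "type": "Table",
}

# The text recovered from the HTML should equal the element's text.
assert html_to_text(element["metadata"]["text_as_html"]) == element["text"]
```

Empty `<td/>` cells contribute no text, so they do not affect the comparison.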