Steve Canny 22cbdce7ca
fix(html): unequal row lengths in HTMLTable.text_as_html (#2345)
Fixes #2339

Fixes to HTML partitioning introduced with v0.11.0 removed the use of
`tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This
had several benefits, but part of `tabulate`'s behavior was to make
row-length (cell-count) uniform across the rows of the table.

Lacking this prior uniformity produced a downstream problem reported in

On closer inspection, the method used to "harvest" cell-text was
producing more text-nodes than there were cells and was sensitive to
where whitespace was used to format the HTML. It also "moved" text to
different columns in certain rows.

Refine the cell-text gathering mechanism to get exactly one text string
for each row cell, eliminating whitespace formatting nodes and producing
strict correspondence between the number of cells in the original HTML
table row and that placed in `HTMLTable.text_as_html`.
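
A minimal sketch of the "one text string per cell" idea described above. It uses only the standard-library `html.parser` and is not the library's actual implementation; the names `RowCellTextParser` and `rows_of_cell_text` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional


class RowCellTextParser(HTMLParser):
    """Collect exactly one whitespace-normalized string per <td>/<th> cell."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        # None while the parser is outside a cell, a list of fragments inside one.
        self._cell_parts: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell_parts = []

    def handle_data(self, data):
        # Text between cells (indentation, newlines) is formatting only and
        # is dropped; only text inside a cell is kept.
        if self._cell_parts is not None:
            self._cell_parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell_parts is not None:
            # Join nested text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._cell_parts).split())
            self.rows[-1].append(text)
            self._cell_parts = None


def rows_of_cell_text(html: str) -> List[List[str]]:
    """Return the table as rows of cell strings (hypothetical helper)."""
    parser = RowCellTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table>\n"
    "  <tr>\n"
    "    <td> a </td>\n"
    "    <td><b>b</b> c</td>\n"
    "  </tr>\n"
    "  <tr><td></td><td>d</td></tr>\n"
    "</table>"
)
print(rows_of_cell_text(sample))
```

Because text is only collected between a cell's start and end tags, the indentation between `<tr>` and `<td>` never becomes a spurious text node, and an empty cell still contributes an empty string so each row keeps its original cell count.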

HTML tables that are uniform (every row has the same number of cells)
will produce a uniform table in `.text_as_html`. Merged cells may still
produce a non-uniform table in `.text_as_html` (because the source table
is non-uniform).
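
The uniformity property can be checked with a short sketch like the one below. It counts cells per row with the standard-library `html.parser`; `CellCounter`, `cells_per_row`, and `is_uniform` are hypothetical names, and the merged-cell table is an invented example, not data from the reported issue.

```python
from html.parser import HTMLParser
from typing import List


class CellCounter(HTMLParser):
    """Count <td>/<th> start tags seen within each <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: List[int] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.counts.append(0)
        elif tag in ("td", "th") and self.counts:
            self.counts[-1] += 1


def cells_per_row(html: str) -> List[int]:
    counter = CellCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


def is_uniform(html: str) -> bool:
    # A table is uniform when every row has the same number of cells.
    return len(set(cells_per_row(html))) <= 1


# The `text_as_html` value of the Table element in the fixture below.
uniform_table = (
    "<table><tr><td>Time</td><td>Item</td><td>Presenter</td><td>Notes</td></tr>"
    "<tr><td></td><td></td><td></td><td></td></tr>"
    "<tr><td></td><td></td><td></td><td></td></tr></table>"
)
# Invented example: a merged (colspan) cell makes the source itself non-uniform.
merged_table = (
    "<table><tr><td colspan='2'>Header</td></tr>"
    "<tr><td>a</td><td>b</td></tr></table>"
)

print(cells_per_row(uniform_table), is_uniform(uniform_table))
print(cells_per_row(merged_table), is_uniform(merged_table))
```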
2024-01-04 21:53:19 +00:00

245 lines
7.0 KiB
JSON

[
{
"element_id": "35054d4d1455c734e83a868656b4ad16",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "\\uD83D\\uDDD3 Date",
"type": "Title"
},
{
"element_id": "0126c1353ddd7c8dfdb29f252a64a344",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "\\uD83D\\uDC65 Participants",
"type": "Title"
},
{
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "",
"type": "ListItem"
},
{
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "",
"type": "ListItem"
},
{
"element_id": "fa64ff027cbc0c6929bc75d3c78c94c3",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "\\uD83E\\uDD45 Goals",
"type": "Title"
},
{
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "",
"type": "ListItem"
},
{
"element_id": "537ea1b14dcba1742bdbd4a5fbfb488c",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "\\uD83D\\uDDE3 Discussion topics",
"type": "Title"
},
{
"element_id": "37af06e8e75d96a448a00026754b7942",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1,
"text_as_html": "<table><tr><td>Time</td><td>Item</td><td>Presenter</td><td>Notes</td></tr><tr><td></td><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td></tr></table>"
},
"text": "Time Item Presenter Notes",
"type": "Table"
},
{
"element_id": "f158a8eaf72c7e9511d5e8ee03692652",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "✅ Action items",
"type": "Title"
},
{
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "",
"type": "ListItem"
},
{
"element_id": "addb0aa08f77b69fa754ba55c6600c8a",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"page_number": 1
},
"text": "⤴ Decisions",
"type": "Title"
}
]