Yuming Long 542d442699
chore CORE-4775: remove html page number metadata field (#2942)
### Summary

Remove the `page_number` metadata field until page counting works for all
kinds of HTML files (not just news articles with multiple `<article>`
tags).

### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
were removed since they rely on the `page_number` fields from the SEC HTML
file. The test has moved to a mock test for `chunk_by_title`; revisit those
tests when we find a test file for this.

Also changed the element IDs in partition outputs for HTML files: the IDs
change because the page number feeds into element ID hashing. TODO ticket:
update the other deterministic element ID tests per crag's comment.
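
To illustrate why dropping `page_number` changes every HTML element ID, here is a minimal sketch. It assumes a deterministic ID built by hashing the element text together with its page number and position; the function name `sketch_element_id` and the exact payload layout are hypothetical and are not taken from the library's actual hashing code.

```python
import hashlib
from typing import Optional


def sketch_element_id(text: str, page_number: Optional[int], index: int) -> str:
    """Hypothetical deterministic ID: hash of text, page number, and position."""
    # Any change to an input, including page_number becoming None,
    # produces a different digest and therefore a different ID.
    payload = f"{text}|{page_number}|{index}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Same text and position, with and without a page number.
with_page = sketch_element_id("Goals", page_number=1, index=0)
without_page = sketch_element_id("Goals", page_number=None, index=0)

print(with_page != without_page)
```

Under this assumption the IDs stay deterministic for a given input, but every expected-output fixture that stored the old IDs has to be regenerated.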

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2024-04-30 15:20:26 +00:00


[
{
"element_id": "ad87ce21734cf59c34b5a62cc0798c2f",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83D\\uDDD3 Date",
"type": "Title"
},
{
"element_id": "1c52504ada107bf7ccef5f78b58a2083",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83D\\uDC65 Participants",
"type": "Title"
},
{
"element_id": "8d5c737a91fb0e10371805abc1f89c79",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "",
"type": "ListItem"
},
{
"element_id": "1a46f51535f943b6dfa8746cd0f281ff",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "",
"type": "ListItem"
},
{
"element_id": "36b6c4ece8174d6fb3e516b7242f3c6c",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83E\\uDD45 Goals",
"type": "Title"
},
{
"element_id": "1cc1de2b35cec6a4adfe184157d2eb8e",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "",
"type": "ListItem"
},
{
"element_id": "b0293c68400b43db9bc8d5ef33be43df",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "\\uD83D\\uDDE3 Discussion topics",
"type": "Title"
},
{
"element_id": "29654ae71d95e217350645e33e219bb3",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
],
"text_as_html": "<table><tr><td>Time</td><td>Item</td><td>Presenter</td><td>Notes</td></tr><tr><td></td><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td></tr></table>"
},
"text": "Time Item Presenter Notes",
"type": "Table"
},
{
"element_id": "0f8fa6214fd823bf85c04e6395bac656",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "✅ Action items",
"type": "Title"
},
{
"element_id": "01776c72263f5080b363440bfa4501a2",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "",
"type": "ListItem"
},
{
"element_id": "92f7acca41806b5beb8bc39eea59ac21",
"metadata": {
"data_source": {
"date_created": "2023-07-09T12:54:45.162000",
"date_modified": "2023-07-09T12:54:45.162000",
"record_locator": {
"page_id": "1605928",
"url": "https://unstructured-ingest-test.atlassian.net"
},
"url": "https://unstructured-ingest-test.atlassian.net/wiki/rest/api/content/1605928",
"version": "1"
},
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "⤴ Decisions",
"type": "Title"
}
]
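
A fixture like the one above can be checked for a leftover `page_number` field with a short script. This is only a sketch: the inline `SAMPLE` is a trimmed stand-in for the fixture, and the helper name `elements_with_page_number` is hypothetical.

```python
import json

# Trimmed stand-in for one element of the fixture (assumption, not the full file).
SAMPLE = """
[
  {
    "element_id": "ad87ce21734cf59c34b5a62cc0798c2f",
    "metadata": {"filetype": "text/html", "languages": ["eng"]},
    "text": "Date",
    "type": "Title"
  }
]
"""


def elements_with_page_number(elements):
    """Return the IDs of elements whose metadata still carries page_number."""
    return [
        element["element_id"]
        for element in elements
        if "page_number" in element.get("metadata", {})
    ]


elements = json.loads(SAMPLE)
offenders = elements_with_page_number(elements)
print(offenders)
```

An empty list means no element in the checked data still reports a page number.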