Amanda Cameron a501d1d18f
Adding table extraction to partition_html (#1324)
Adding table extraction to HTML partitioning.

This PR uses `table` HTML elements to extract and parse HTML tables
and returns them as part of the partitioning output.

```
# check out this branch, then start an IPython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
You should see the table as a `Table` element in the `elements` list.
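
To confirm the table was picked up, one option is to filter the output for `Table` elements. The following is a minimal sketch that assumes the elements have already been serialized to dicts shaped like the expected-output fixture (`type`, `text`, `metadata.text_as_html`). The `find_tables` helper and the `SAMPLE` data are hypothetical and not part of the library.

```python
import json

# Hypothetical sample mirroring the serialized element shape used in the fixture.
SAMPLE = json.loads("""
[
  {"type": "Title", "metadata": {"filetype": "text/html"}, "text": "Focus"},
  {"type": "Table",
   "metadata": {"filetype": "text/html",
                "text_as_html": "<table><tr><td>Notes </td></tr><tr><td>Important Links</td></tr></table>"},
   "text": "Notes Important Links"}
]
""")


def find_tables(elements):
    """Return (text, html) pairs for every element whose type is "Table"."""
    return [
        (el["text"], el.get("metadata", {}).get("text_as_html"))
        for el in elements
        if el.get("type") == "Table"
    ]


tables = find_tables(SAMPLE)
for text, html in tables:
    print(text)  # plain-text rendering of the table cells
    print(html)  # HTML rendering kept in the element metadata
```

Working on the serialized dicts keeps the check independent of the element classes, so the same filter can be pointed at a saved JSON fixture.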
2023-09-11 11:14:11 -07:00


[
{
"type": "Title",
"element_id": "307afee17dac4c598e361c095338decd",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Copy and paste this section for each week."
},
{
"type": "Title",
"element_id": "b980a145c5e8c9e233a0643366ba520a",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"emphasized_text_contents": [
"Win"
],
"emphasized_text_tags": [
"strong"
]
},
"text": "Win"
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "Title",
"element_id": "aecc044c7725a6555114285dc28fe2d1",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"emphasized_text_contents": [
"Needs input"
],
"emphasized_text_tags": [
"strong"
]
},
"text": "Needs input"
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "Title",
"element_id": "9d3cab2b5efed4eaef42a707dbc813da",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"emphasized_text_contents": [
"Focus"
],
"emphasized_text_tags": [
"strong"
]
},
"text": "Focus"
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "ListItem",
"element_id": "e3b0c44298fc1c149afbf4c8996fb924",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": ""
},
{
"type": "Table",
"element_id": "a240e43c0ae70731c65ae5430d2dab7f",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"text_as_html": "<table><br><tbody><br><tr><td>Notes </td></tr><br><tr><td>Important Links</td></tr><br></tbody><br></table>"
},
"text": "Notes Important Links"
}
]