Amanda Cameron a501d1d18f
Adding table extraction to partition_html (#1324)
Adding table extraction to HTML partitioning.

This PR uses `<table>` HTML elements to extract and parse HTML tables
and returns them as `Table` elements in the partitioning output.

```
# check out this branch, then start an IPython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
You should see the table in the elements list!
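
To illustrate what extracting rows from `<table>` markup can look like, here is a minimal standard-library sketch. It is not the code from this PR and does not use `unstructured`. The names `TableRowCollector` and `extract_table_rows` are hypothetical, and the sample markup is a shortened table in the style of the fixture.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the cell text of every <tr> encountered in the input (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None outside <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table_rows(html):
    """Return table rows as lists of stripped cell strings."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table><tbody>"
    "<tr><td>Milestone</td><td>Owner</td></tr>"
    "<tr><td> </td><td> </td></tr>"
    "</tbody></table>"
)
rows = extract_table_rows(sample)
print(rows)
```

Whitespace-only cells come back as empty strings, which matches how the empty cells in the fixture contribute nothing to the flattened `text` field.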
2023-09-11 11:14:11 -07:00

165 lines
5.0 KiB
JSON

[
{
"type": "Table",
"element_id": "10b5ef18a3c7fb1d7436b2e1b256e5b9",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"text_as_html": "<table><br><tbody><br><tr><td>Driver </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td>Approver </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td>Contributors</td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td>Informed </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td>Objective </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td>Due date </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td>Key outcomes</td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td>Status </td><td>NOT STARTED</td><td>/</td><td>IN PROGRESS</td><td>/</td><td>COMPLETE</td></tr><br></tbody><br></table>"
},
"text": "Driver Approver Contributors Informed Objective Due date Key outcomes Status NOT STARTED / IN PROGRESS / COMPLETE"
},
{
"type": "Title",
"element_id": "4e2022d4483a407d85060675f64fbe17",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "\\uD83E\\uDD14 Problem Statement"
},
{
"type": "Title",
"element_id": "81163675915a75217e4116686fdca412",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "🎯 Scope"
},
{
"type": "Table",
"element_id": "f1f364fbde77afa0e99e8ea7ab4f7c3f",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"text_as_html": "<table><br><tbody><br><tr><td>Must have: </td></tr><br><tr><td>Nice to have:</td></tr><br><tr><td>Not in scope:</td></tr><br></tbody><br></table>"
},
"text": "Must have: Nice to have: Not in scope:"
},
{
"type": "Title",
"element_id": "e8b61a28d07e977379b42df455a1cde4",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "\\uD83D\\uDDD3 Timeline"
},
{
"type": "Title",
"element_id": "5043f71fbc70e35c0be413d4135be99f",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Lane 1"
},
{
"type": "Title",
"element_id": "d5a2e177c588bf0c4f914baa4fae85b6",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Lane 2"
},
{
"type": "Title",
"element_id": "c98ba1acbd22a15ddddfc244cbd8a2db",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Feature 1"
},
{
"type": "Title",
"element_id": "e04620c8b3b611b3fefecef89baa63a9",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Feature 2"
},
{
"type": "Title",
"element_id": "82e522a86692cc50ee5c020c8e6ce6a0",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Feature 3"
},
{
"type": "Title",
"element_id": "822f7c45ea725c535970aab819a8ff10",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Feature 4"
},
{
"type": "Title",
"element_id": "6e0f6eca4ff17d3377c1c3e8e1f73457",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "iOS app"
},
{
"type": "Title",
"element_id": "0b60fe04b3c5c3c76371b6eca8b19c8e",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Android app"
},
{
"type": "Title",
"element_id": "e1cc184f345d146586fb12527c4fa696",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "\\uD83D\\uDEA9 Milestones and deadlines"
},
{
"type": "Table",
"element_id": "3f4ea3840d79521680c89a91dcd883cf",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"text_as_html": "<table><br><tbody><br><tr><td>Milestone</td><td>Owner</td><td>Deadline</td><td>Status</td></tr><br><tr><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td> </td><td> </td><td> </td><td> </td></tr><br><tr><td> </td><td> </td><td> </td><td> </td></tr><br></tbody><br></table>"
},
"text": "Milestone Owner Deadline Status"
},
{
"type": "Title",
"element_id": "890c9b6d8d69ca1de5fd7a8b83fe78ff",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "\\uD83D\\uDD17 Reference materials"
}
]
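
A reader consuming output like the above might want only the table elements. The following is a small sketch using just the `json` module. The helper name `tables_from_elements` is hypothetical, and the inline data is a two-element excerpt copied from the listing above, with `data_source` omitted for brevity.

```python
import json

# Two elements excerpted from the expected output above.
elements_json = """
[
  {
    "type": "Table",
    "element_id": "f1f364fbde77afa0e99e8ea7ab4f7c3f",
    "metadata": {
      "filetype": "text/html",
      "page_number": 1,
      "text_as_html": "<table><br><tbody><br><tr><td>Must have: </td></tr><br><tr><td>Nice to have:</td></tr><br><tr><td>Not in scope:</td></tr><br></tbody><br></table>"
    },
    "text": "Must have: Nice to have: Not in scope:"
  },
  {
    "type": "Title",
    "element_id": "5043f71fbc70e35c0be413d4135be99f",
    "metadata": {"filetype": "text/html", "page_number": 1},
    "text": "Lane 1"
  }
]
"""


def tables_from_elements(elements):
    """Return (text, text_as_html) pairs for every element of type "Table"."""
    return [
        (el["text"], el.get("metadata", {}).get("text_as_html"))
        for el in elements
        if el.get("type") == "Table"
    ]


elements = json.loads(elements_json)
tables = tables_from_elements(elements)
for text, html in tables:
    print(text)
    print(html)
```

Only `Table` elements carry `metadata.text_as_html` in this output, so the helper uses `.get` and returns `None` for that field if it is ever missing.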