Mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2025-12-29 16:17:00 +00:00)
### Summary

Closes #1230. Updates `partition_html` to split on `<br>` tags that appear within text elements.

### Testing

The following code previously produced one giant element on `main`.

```python
from unstructured.partition.html import partition_html

filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)
len(elements)  # Should be 4

print("\n\n".join([str(el) for el in elements]))
```

The output should be:

```python
January 2023

(Someone fed my essays into GPT to make something that could answer questions based on them, then asked it where good ideas come from. The answer was ok, but not what I would have said. This is what I would have said.)

The way to get new ideas is to notice anomalies: what seems strange, or missing, or broken? You can see anomalies in everyday life (much of standup comedy is based on this), but the best place to look for them is at the frontiers of knowledge.

Knowledge grows fractally. From a distance its edges look smooth, but when you learn enough to get close to one, you'll notice it's full of gaps. These gaps will seem obvious; it will seem inexplicable that no one has tried x or wondered about y. In the best case, exploring such gaps yields whole new fractal buds.
```
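For intuition only, here is a minimal sketch of what splitting on `<br>` means for the text inside a single element. It is not the `partition_html` implementation: `split_on_br` and `BR_TAG` are hypothetical names, and the sketch works on a raw string with the standard library instead of a parsed HTML tree.

```python
import re
from typing import List

# Hypothetical helper for illustration; not part of the unstructured API.
# Matches <br>, <br/>, <br /> and upper-case variants.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def split_on_br(fragment: str) -> List[str]:
    """Split the inner text of one element on <br> tags, dropping empty pieces."""
    pieces = (piece.strip() for piece in BR_TAG.split(fragment))
    return [piece for piece in pieces if piece]


fragment = (
    "January 2023<br><br>"
    "The way to get new ideas is to notice anomalies.<br/>"
    "Knowledge grows fractally."
)
chunks = split_on_br(fragment)
print("\n\n".join(chunks))
```

Consecutive `<br>` tags produce empty strings after the split, so the sketch filters them out so that they do not become empty elements.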
88 lines
2.7 KiB
JSON
[
  {
    "type": "Title",
    "element_id": "56a9f768a0968be676f9addd5ec3032e",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1
    },
    "text": "Downloadify Example"
  },
  {
    "type": "Title",
    "element_id": "d551bbfc9477547e4dce6264d8196c7b",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1,
      "link_urls": [
        "http://github.com/dcneiner/Downloadify"
      ],
      "link_texts": [
        "Github Project Page"
      ]
    },
    "text": "More info available at the Github Project Page"
  },
  {
    "type": "Title",
    "element_id": "971b974235a86ca628dcc713d6e2e8d9",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1
    },
    "text": "Filename"
  },
  {
    "type": "Title",
    "element_id": "4112a488690bdbc1d39d5b78068eae9f",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1
    },
    "text": "File Contents"
  },
  {
    "type": "NarrativeText",
    "element_id": "f89c9cf63bd2e72f560ee043d942a1e7",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1
    },
    "text": "Whatever you put in this text box will be downloaded and saved in the file. If you leave it blank, no file will be downloaded"
  },
  {
    "type": "NarrativeText",
    "element_id": "53a4db70c6d40ed5206711ed8a255e03",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1
    },
    "text": "You must have Flash 10 installed to download this file."
  },
  {
    "type": "Title",
    "element_id": "839973fba0c850f1729fad098b031203",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1
    },
    "text": "Downloadify Invoke Script For This Page"
  },
  {
    "type": "NarrativeText",
    "element_id": "b7db0dffb05f01f3f13d34420b82c261",
    "metadata": {
      "data_source": {},
      "filetype": "text/html",
      "page_number": 1
    },
    "text": "Downloadify.create('downloadify',{\n filename: function(){\n return document.getElementById('filename').value;\n },\n data: function(){ \n return document.getElementById('data').value;\n },\n onComplete: function(){ \n alert('Your File Has Been Saved!'); \n },\n onCancel: function(){ \n alert('You have cancelled the saving of this file.');\n },\n onError: function(){ \n alert('You must put something in the File Contents or there will be nothing to save!'); \n },\n swf: 'media/downloadify.swf',\n downloadImage: 'images/download.png',\n width: 100,\n height: 30,\n transparent: true,\n append: false\n});"
  }
]