This PR:
- Fixes removing HTML tags that exist in <td> cells
- stripping function was in general problematic to implement in easy and
straightforward way (you can't modify `descendants` in-place). So I
decided instead of patching something in table cell I added stripping
everywhere in the same consistent way. This is why some tests needed
small edits with removing one white-space in each tag. I believe this
won't cause any problems for downstream tasks.
Tested HTML:
```html
<table class="Table">
<tbody>
<tr>
<td colspan="2">
Some text
</td>
<td>
<input checked="" class="Checkbox" type="checkbox"/>
</td>
</tr>
</tbody>
</table>
```
Before & After
```html
'<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>'
'<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>''
```
- the "value" attribute from <input/> tag will be taken into account and
processed as "text" in ontology
- the tables will now be parsed without any ids and classes - we have
different reasons behind that, for example, embeddings with ids and
classes can lose some semantic value. Also, more tokens = more expensive
LLM call
- cleaned to_html, created to_text for OntologyElement
> This is POC change; not everything is working correctly and code
quality could be improved significantly
This ticket add parsing HTML to unstructured element and back. How is it
working?
HTML has a tree structure, Unstructured Elements is a list.
HTML structure is traversed in DFS order, creating Elements and adding
them to list. So the reading order from HTML is preserved. To be able to
compose tree again all elements has IDs, and metadata.parent_id is
leveraged
How html is preserved if there are 'layout' without text, or there are
deeply nested HTMLs that are just text from the point of view of
Unstructured Element?
Each element is parsed back to HTML using metadata.text_as_html field.
For layout elements only html_tag are there, for long text elements
there is everything required to recreate HTML - you can see examples in
unit tests or .json file I attached.
Pros of solution:
- Nothing had to be changed in element types
Cons:
- There are elements without Text which may be confusing (they could be
replaced by some special type)
Core transformation logic can be found in 2 functions in
`unstructured/documents/transformations.py`
Knowns bugs (they are minor):
- sometimes html tag is changed incorrectly
- metadata.category_depth and metadata.page_number are not set
- page break is not added between pages
How to test. Generate HTML:
```python3
from pathlib import Path
from vlm_partitioner.src.partition import partition
if __name__ == "__main__":
doc_dir = Path("out_dir")
file_path = Path("example_doc.pdf")
partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```
Then parse to unstructured elements and back to html
```python3
from pathlib import Path
from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import parse_html_to_ontology, ontology_to_unstructured_elements, \
unstructured_elements_to_ontology
from unstructured.staging.base import elements_to_json
if __name__ == "__main__":
output_dir = Path("out_dir/")
output_dir.mkdir(exist_ok=True, parents=True)
doc_path = Path("out_dir/example_doc.html")
html_content = doc_path.read_text()
ontology = parse_html_to_ontology(html_content)
unstructured_elements = ontology_to_unstructured_elements(ontology)
elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))
parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
html_to_save = indent_html(parsed_ontology.to_html())
Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```
I attached example doc before and after running these scripts
[outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)