## Summary
This PR fixes an issue where header/footer content in HTML is not
partitioned as `unstructured` `Header` or `Footer` element types. Instead,
it is either partitioned as `UncategorizedText` or takes on the type of the
nested structure inside the header/footer. E.g.,
`<header class="Header"><h1 class="Title">Header Title</h1></header>`
would be partitioned as a `Title` instead of a `Header`.
## Bug description
This behavior occurs because we treat header and footer as layout, i.e.,
containers, in the ontology definition. As a result, during parsing we
[unwrap](ec209c6b5f/unstructured/partition/html/transformations.py (L361-L378))
the container and parse its contents as if they were from the main text,
even though they are still part of the header/footer.
The fix is to treat header/footer as text instead of layout in the ontology
so that all content inside them is properly gathered under the
`Header`/`Footer` element types.
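
To illustrate the difference between the two treatments, here is a minimal, self-contained sketch. It is not the actual `unstructured` parser or ontology: `make_ontology`, `partition`, and the `LAYOUT`/`TEXT` kinds are hypothetical names made up for this example, and it uses the standard library's `xml.etree.ElementTree` on a well-formed snippet rather than a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical element kinds: layout tags are containers that get unwrapped,
# text tags gather everything nested inside them.
LAYOUT, TEXT = "layout", "text"


def make_ontology(header_footer_kind):
    """Build a toy tag -> (element type, kind) mapping (illustrative only)."""
    return {
        "div": ("UncategorizedText", LAYOUT),
        "header": ("Header", header_footer_kind),
        "footer": ("Footer", header_footer_kind),
        "h1": ("Title", TEXT),
        "p": ("NarrativeText", TEXT),
    }


def partition(node, ontology):
    """Return a list of (element_type, text) pairs for a parsed node."""
    element_type, kind = ontology.get(node.tag, ("UncategorizedText", TEXT))
    if kind == LAYOUT:
        # Unwrap the container: children are partitioned as if they were
        # top-level content, so the container's own type is lost.
        elements = []
        if node.text and node.text.strip():
            elements.append(("UncategorizedText", node.text.strip()))
        for child in node:
            elements.extend(partition(child, ontology))
            if child.tail and child.tail.strip():
                elements.append(("UncategorizedText", child.tail.strip()))
        return elements
    # Text kind: gather all nested text under this element's own type.
    text = " ".join("".join(node.itertext()).split())
    return [(element_type, text)] if text else []


html = '<header class="Header"><h1 class="Title">Header Title</h1></header>'
root = ET.fromstring(html)

before = partition(root, make_ontology(LAYOUT))  # header treated as layout
after = partition(root, make_ontology(TEXT))     # header treated as text

print(before)  # [('Title', 'Header Title')]
print(after)   # [('Header', 'Header Title')]
```

In this sketch, switching the kind of `header`/`footer` from layout to text is the only change needed for the nested `<h1>` to be reported as a `Header` rather than a `Title`, which mirrors the intent of the fix described above.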