unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-08 17:46:54 +00:00

Author	SHA1	Message	Date
Yao You	aa332101ab	fix: fix header and footer not parsed as Header/Footer types (#4041) ## Summary This PR fixes an issue where header/footer content in HTML is not partitioned as the `unstructured` `Header` or `Footer` element types. Instead, it is partitioned either as `UncategorizedText` or as the type of the structure nested inside the header/footer. E.g., `<header class="Header"><h1 class="Title">Header Title</h1></header>` would be partitioned as a `Title` instead of a `Header`. ## Bug description This happens because the ontology definition treats header and footer as layout, i.e., containers. As a result, during parsing we [unwrap](`ec209c6b5f/unstructured/partition/html/transformations.py (L361-L378)`) the container and parse its contents as if they came from the main text, even though they are still part of the header/footer. The fix is to treat header/footer as text instead of layout in the ontology, so that all content inside them is properly gathered under the `Header`/`Footer` element types.	2025-07-01 21:58:43 +00:00
Pluto	ec209c6b5f	Remove IDs from HTML code (#4012) In this pull request, the parent-child relationship for elements generated with the v2 parser is based on actual element IDs instead of IDs embedded in the HTML itself. Together with some extra bug fixes, this allowed the json -> HTML script to be simplified significantly.	2025-06-11 11:55:02 +00:00
Pluto	5bb95b5841	Fix parsing table cells (#3904) This PR fixes HTML tags inside <td> cells being removed. The stripping function was generally hard to implement in an easy and straightforward way (you can't modify `descendants` in place), so instead of patching only the table cells I added stripping everywhere in the same consistent way. This is why some tests needed small edits removing one whitespace in each tag. I believe this won't cause any problems for downstream tasks. Tested HTML: ```html <table class="Table"> <tbody> <tr> <td colspan="2"> Some text </td> <td> <input checked="" class="Checkbox" type="checkbox"/> </td> </tr> </tbody> </table> ``` Before & After ```html '<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>' '<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>' ```	2025-02-05 15:28:49 +00:00
Pluto	e48d79eca1	image alt support (#3797)	2024-11-26 16:20:23 +00:00
Pluto	e1babf0660	Define default HTML to ontology mapping (#3784)	2024-11-20 13:01:28 +00:00
Pluto	1953b8699f	Ml 415/merge inline elements (#3749)	2024-10-31 12:17:25 +00:00
Maksymilian Operlejn	eb1b294b73	ML-405/ML-427 - OntologyElement improvements (#3758) - the "value" attribute of the <input/> tag is now taken into account and processed as "text" in the ontology - tables are now parsed without any ids and classes; there are several reasons for this, for example, embeddings with ids and classes can lose some semantic value, and more tokens mean a more expensive LLM call - cleaned up to_html and created to_text for OntologyElement	2024-10-31 01:30:53 +00:00
Pluto	03a3ed8d3b	Add parsing HTML to unstructured elements (#3732)

> This is a POC change; not everything works correctly, and code quality could be improved significantly.

This ticket adds parsing HTML to unstructured elements and back.

How does it work? HTML has a tree structure, while Unstructured Elements form a list. The HTML structure is traversed in DFS order, creating Elements and appending them to the list, so the reading order of the HTML is preserved. To make it possible to compose the tree again, all elements have IDs and `metadata.parent_id` is used.

How is HTML preserved when there are 'layout' elements without text, or deeply nested HTML that is just text from the point of view of an Unstructured Element? Each element is parsed back to HTML using the `metadata.text_as_html` field. For layout elements only the html_tag is stored there; for long text elements it holds everything required to recreate the HTML. You can see examples in the unit tests or in the .json file I attached.

Pros of the solution:
- Nothing had to be changed in element types

Cons:
- There are elements without text, which may be confusing (they could be replaced by some special type)

The core transformation logic can be found in 2 functions in `unstructured/documents/transformations.py`.

Known bugs (they are minor):
- sometimes the html tag is changed incorrectly
- metadata.category_depth and metadata.page_number are not set
- page break is not added between pages

How to test. Generate HTML:

```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    # Writes the generated HTML for the PDF into out_dir
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```

Then parse to unstructured elements and back to HTML:

```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import (
    ontology_to_unstructured_elements,
    parse_html_to_ontology,
    unstructured_elements_to_ontology,
)
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)

    doc_path = Path("out_dir/example_doc.html")
    html_content = doc_path.read_text()

    # HTML -> ontology -> flat list of unstructured elements, saved as JSON
    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)
    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))

    # Flat list of elements -> ontology -> HTML again
    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())
    (output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```

I attached an example doc before and after running these scripts: [outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)	2024-10-23 12:28:07 +00:00
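
The DFS flattening described in 03a3ed8d3b (tree in, list with parent IDs out, tree back again) can be sketched with the standard library alone. This is a minimal illustration and not the code in `unstructured`: `FlatElement`, `DfsFlattener`, `flatten` and `rebuild` are hypothetical names, integer IDs stand in for real element IDs, and the sketch assumes well-formed HTML in which every tag is closed and text precedes any child tags.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Dict, List, Optional


@dataclass
class FlatElement:
    """Hypothetical stand-in for an unstructured Element (not the library class)."""
    element_id: int
    parent_id: Optional[int]
    tag: str
    text: str = ""


class DfsFlattener(HTMLParser):
    """Records each tag in document order, which is DFS pre-order of the tree."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[FlatElement] = []
        self._open_ids: List[int] = []  # stack of ids of currently open tags

    def handle_starttag(self, tag, attrs):
        parent_id = self._open_ids[-1] if self._open_ids else None
        # The id equals the element's index in the flat list.
        element = FlatElement(len(self.elements), parent_id, tag)
        self.elements.append(element)
        self._open_ids.append(element.element_id)

    def handle_endtag(self, tag):
        if self._open_ids:
            self._open_ids.pop()

    def handle_data(self, data):
        # Attach non-blank text to the innermost open tag.
        if self._open_ids and data.strip():
            self.elements[self._open_ids[-1]].text += data.strip()


def flatten(html: str) -> List[FlatElement]:
    """Tree -> list: reading order is preserved, parent_id keeps the structure."""
    parser = DfsFlattener()
    parser.feed(html)
    parser.close()
    return parser.elements


def rebuild(elements: List[FlatElement]) -> str:
    """List -> tree: group elements by parent_id and render recursively."""
    children: Dict[Optional[int], List[FlatElement]] = {}
    for element in elements:
        children.setdefault(element.parent_id, []).append(element)

    def render(element: FlatElement) -> str:
        inner = element.text + "".join(
            render(child) for child in children.get(element.element_id, [])
        )
        return f"<{element.tag}>{inner}</{element.tag}>"

    return "".join(render(root) for root in children.get(None, []))


source_html = "<div><h1>Title</h1><p>Body</p></div>"
elements = flatten(source_html)
round_trip = rebuild(elements)
```

Here `elements` holds the `div`, `h1` and `p` in reading order, with the `div` as the parent of the other two, and `round_trip` equals `source_html`. Attributes are dropped in this sketch; the commit message says the real implementation carries what is needed to recreate the HTML in `metadata.text_as_html`.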

8 Commits
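
The layout-versus-text distinction behind aa332101ab can be shown on a toy tree. The sketch below is only an assumption-laden illustration of the idea in that commit message: `Node`, `CATEGORY`, `collect_text` and `partition_sketch` are hypothetical and do not mirror the ontology classes in `unstructured`. A tag listed as layout is unwrapped and its children are classified on their own; any other tag gathers all nested text under its own category.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    """Hypothetical minimal HTML node used only for this illustration."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Hypothetical tag -> element category mapping.
CATEGORY = {
    "header": "Header",
    "footer": "Footer",
    "h1": "Title",
    "p": "NarrativeText",
}


def collect_text(node: Node) -> str:
    """Join the node's own text with the text of all of its descendants."""
    parts = [node.text] + [collect_text(child) for child in node.children]
    return " ".join(part for part in parts if part)


def partition_sketch(node: Node, layout_tags: Set[str]) -> List[Tuple[str, str]]:
    """Return (category, text) pairs; layout tags are unwrapped, others are text."""
    if node.tag in layout_tags:
        results: List[Tuple[str, str]] = []
        for child in node.children:
            results.extend(partition_sketch(child, layout_tags))
        return results
    return [(CATEGORY.get(node.tag, "UncategorizedText"), collect_text(node))]


doc = Node("header", children=[Node("h1", "Header Title")])

# Before the fix: header is treated as layout, so the nested <h1> decides the type.
before = partition_sketch(doc, layout_tags={"div", "header", "footer"})

# After the fix: header is treated as text, so its content is gathered under Header.
after = partition_sketch(doc, layout_tags={"div"})
```

With these inputs `before` is `[("Title", "Header Title")]` and `after` is `[("Header", "Header Title")]`, matching the `<header class="Header"><h1 class="Title">Header Title</h1></header>` example in the commit message.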