unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-12 00:18:56 +00:00

Author	SHA1	Message	Date
Antonio Jose Jimeno Yepes	0fa5174bd7	Image within div or span with no text is annotated as Image (#3962 ) Ticket: https://unstructured-ai.atlassian.net/browse/ML-942 The following uncompressed HTML document can be used to test the transformation using the `partition_html` function from the VLM partitioner. [recalibrating-risk-report.pdf.json.html.zip](https://github.com/user-attachments/files/19330528/recalibrating-risk-report.pdf.json.html.zip)	2025-03-20 04:09:02 +00:00
ryannikolaidis	66bf4b0198	feat: support extracting image url in html (#3955 ) also removes mimetype when base64 is not included in image metadata --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-13 22:41:10 +00:00
ryannikolaidis	c0457c1cc3	feat: include images when partitioning html (#3945 ) Currently we [filter img tags](`2addb19473/unstructured/partition/html/partition.py (L226-L229)`) before tags are converted to Elements by the html partitioner. More importantly we also don’t currently have a defined “block” / mapping to support these. This adds these mappings and the logic to process them. It also respects `extract_image_block_types` and `extract_image_block_to_payload` (as we do with pdfs) to determine whether base64 is included in the metadata. The partitioned Image Elements set the text to the img tag’s alt text if available. The partitioned Image Elements include the [url in the metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209) (rather than image_base64) if the img tag src is a url. ## Testing Unit tests have been added for explicit coverage. Existing integration tests and other unit test fixtures have been updated to account for `Image` elements now present --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-08 01:25:21 +00:00
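The image handling described in #3945 can be illustrated with a minimal sketch. It uses only the standard-library `html.parser`, not the partitioner itself, and the `ImageCollector` class and the dictionary keys (`text`, `image_url`, `image_base64`, `image_mime_type`) are assumptions chosen for illustration rather than the library's actual code path.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical collector: one record per <img> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.images: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        # The alt text, when present, becomes the element text.
        record = {"text": attributes.get("alt") or ""}
        if src.startswith("data:"):
            # Inline image, e.g. "data:image/png;base64,AAAA".
            header, _, payload = src.partition(",")
            record["image_mime_type"] = header[len("data:"):].split(";")[0]
            record["image_base64"] = payload
        else:
            # Remote image: keep the URL instead of a base64 payload.
            record["image_url"] = src
        self.images.append(record)


collector = ImageCollector()
collector.feed(
    '<p><img src="https://example.com/a.png" alt="A chart">'
    '<img src="data:image/png;base64,AAAA"/></p>'
)
images = collector.images
```

Each record carries either a URL or a base64 payload, never both, which mirrors the "url rather than image_base64" behavior the commit message describes.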
Marek Połom	f333d7fe7f	feat: Json elements to HTML converter (#3936 ) ## NOTE `test_unstructured_ingest/expected-structured-output-html` contains all test HTML fixtures. Original JSON files, from which these HTML fixtures are generated, were taken from `test_unstructured_ingest/expected-structured-output`	2025-03-04 13:57:35 +00:00
Pluto	5bb95b5841	Fix parsing table cells (#3904 ) This PR: - Fixes the removal of HTML tags that exist in <td> cells - A stripping function was problematic to implement in a simple, straightforward way (you can't modify `descendants` in-place), so instead of patching only table cells I added stripping everywhere in the same consistent way. This is why some tests needed small edits removing one whitespace character in each tag. I believe this won't cause any problems for downstream tasks. Tested HTML: ```html <table class="Table"> <tbody> <tr> <td colspan="2"> Some text </td> <td> <input checked="" class="Checkbox" type="checkbox"/> </td> </tr> </tbody> </table> ``` Before & After ```html '<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>' '<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>' ```	2025-02-05 15:28:49 +00:00
Steve Canny	b3a2dd4755	fix: html incorrectly categorizing text (#3841 ) Fixes #3666 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-18 18:46:54 +00:00
Steve Canny	3b718ec89a	rfctr: prep for pluggable partitioners (#3806 ) Summary Prepare auto-partitioning for pluggable partitioners. Move toward a uniform partitioner call signature in `auto/partition()` such that a custom or override partitioner can be registered without requiring code changes. Additional Context The central job of `auto/partition()` is to detect the file-type of the given file and use that to dispatch partitioning to the corresponding partitioner function e.g. `partition_pdf()` or `partition_docx()`. In the existing code, each partitioner function is called with parameters "hand-picked" from the available parameters passed to the `partition()` function. This is unnecessary and couples those partitioners tightly with the dispatch function. The desired state is that all available arguments are passed as `kwargs` and the partitioner function "self-selects" the arguments it will be sensitive to, applies its own appropriate default values when the argument is omitted, and simply ignores any arguments it doesn't use. Note that achieving this requires no changes to partitioner functions because they already do precisely this. So the job is to pass all arguments (other than `filename` and `file`) to the partitioner as `kwargs`. This will allow additional or alternate partitioners to be registered at runtime and dispatched to, because as long as they have the signature `partition_x(filename, *, file, **kwargs) -> list[Element]` then they can be dispatched to without customization.	2024-12-10 20:44:34 +00:00
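The dispatch pattern described in #3806 can be sketched as follows. This is a minimal illustration, not the library's implementation: `register_partitioner`, `_REGISTRY`, `partition_txt`, and the extension-based detection are all hypothetical names and simplifications.

```python
from typing import Any, Callable, Optional

# Hypothetical registry; these names are illustrative, not unstructured's API.
Partitioner = Callable[..., list]
_REGISTRY: dict[str, Partitioner] = {}


def register_partitioner(file_type: str, fn: Partitioner) -> None:
    """Register (or override) the partitioner used for `file_type`."""
    _REGISTRY[file_type] = fn


def partition(filename: Optional[str] = None, *, file: Any = None, **kwargs: Any) -> list:
    """Detect the file type, then forward every remaining argument untouched."""
    # Toy detection by extension; real file-type detection is more involved.
    file_type = (filename or "").rsplit(".", 1)[-1].lower()
    partitioner = _REGISTRY[file_type]
    return partitioner(filename, file=file, **kwargs)


def partition_txt(
    filename: Optional[str] = None,
    *,
    file: Any = None,
    encoding: str = "utf-8",
    **kwargs: Any,
) -> list:
    """Self-selects `encoding`, applies its own default, ignores everything else."""
    return [f"{filename} decoded as {encoding}"]


register_partitioner("txt", partition_txt)

# `strategy` is irrelevant to partition_txt and is silently ignored.
result = partition("notes.txt", encoding="latin-1", strategy="fast")
```

Because the dispatcher never inspects `kwargs`, a partitioner registered at runtime needs no matching change in `partition()`.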
Pluto	e48d79eca1	image alt support (#3797 )	2024-11-26 16:20:23 +00:00
Pluto	e1babf0660	Define default HTML to ontology mapping (#3784 )	2024-11-20 13:01:28 +00:00
Pluto	c2d17b1ca4	Fix extracting value from field (#3774 )	2024-11-07 18:21:39 +00:00
Pluto	66d1e5a5cb	Add max recursion limit and fix to_text() method (#3773 )	2024-11-07 15:08:16 +00:00
Pluto	1953b8699f	Ml 415/merge inline elements (#3749 )	2024-10-31 12:17:25 +00:00
Maksymilian Operlejn	eb1b294b73	ML-405/ML-427 - OntologyElement improvements (#3758 ) - the "value" attribute from <input/> tag will be taken into account and processed as "text" in ontology - the tables will now be parsed without any ids and classes - we have different reasons behind that, for example, embeddings with ids and classes can lose some semantic value. Also, more tokens = more expensive LLM call - cleaned to_html, created to_text for OntologyElement	2024-10-31 01:30:53 +00:00
Pluto	5a91f0cda9	Fix layout parsing (#3754 )	2024-10-25 14:42:06 +00:00
Pluto	2417f8ed84	Fix when parent id is none for first element in v2 notion: (#3752 )	2024-10-25 09:43:36 +00:00
Pluto	03a3ed8d3b	Add parsing HTML to unstructured elements (#3732 ) > This is a POC change; not everything is working correctly and code quality could be improved significantly This ticket adds parsing HTML to unstructured elements and back. How does it work? HTML has a tree structure; Unstructured Elements is a list. The HTML structure is traversed in DFS order, creating Elements and adding them to the list, so the reading order from HTML is preserved. To be able to compose the tree again, all elements have IDs and metadata.parent_id is leveraged. How is HTML preserved if there are 'layout' elements without text, or deeply nested HTML that is just text from the point of view of an Unstructured Element? Each element is parsed back to HTML using the metadata.text_as_html field. For layout elements only the html_tag is there; for long text elements there is everything required to recreate the HTML - you can see examples in the unit tests or the .json file I attached. Pros of solution: - Nothing had to be changed in element types Cons: - There are elements without Text which may be confusing (they could be replaced by some special type) Core transformation logic can be found in 2 functions in `unstructured/documents/transformations.py` Known bugs (they are minor): - sometimes an html tag is changed incorrectly - metadata.category_depth and metadata.page_number are not set - page break is not added between pages How to test. 
Generate HTML:
```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```
Then parse to unstructured elements and back to html
```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import (
    ontology_to_unstructured_elements,
    parse_html_to_ontology,
    unstructured_elements_to_ontology,
)
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)
    doc_path = Path("out_dir/example_doc.html")
    html_content = doc_path.read_text()

    # HTML -> ontology -> flat list of unstructured elements, saved as JSON
    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)
    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))

    # Elements -> ontology -> HTML, saved for comparison with the input
    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())
    Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```
I attached example doc before and after running these scripts [outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)	2024-10-23 12:28:07 +00:00
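The DFS flattening and `parent_id`-based recomposition described in #3732 can be sketched with plain dataclasses. `Node`, `FlatElement`, `flatten`, and `rebuild` are hypothetical stand-ins for illustration; they are not the functions in `unstructured/documents/transformations.py`.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML node."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


@dataclass
class FlatElement:
    """Hypothetical stand-in for an Element carrying `metadata.parent_id`."""
    element_id: int
    parent_id: Optional[int]
    tag: str
    text: str


def flatten(root: Node) -> list[FlatElement]:
    """Pre-order DFS: parents precede children, siblings keep source order."""
    ids = count()
    out: list[FlatElement] = []

    def visit(node: Node, parent_id: Optional[int]) -> None:
        element_id = next(ids)
        out.append(FlatElement(element_id, parent_id, node.tag, node.text))
        for child in node.children:
            visit(child, element_id)

    visit(root, None)
    return out


def rebuild(elements: list[FlatElement]) -> Node:
    """Recompose the tree using only the ids stored on each element."""
    nodes = {e.element_id: Node(e.tag, e.text) for e in elements}
    root = None
    for e in elements:
        if e.parent_id is None:
            root = nodes[e.element_id]
        else:
            nodes[e.parent_id].children.append(nodes[e.element_id])
    assert root is not None, "expected one element without a parent"
    return root


doc = Node("div", children=[Node("h1", "Title"), Node("p", "Body", [Node("b", "bold")])])
flat = flatten(doc)
```

The `div` here is a layout node with no text of its own, which is the case the commit message flags as potentially confusing: it still needs an entry in the list so its children have a parent to point at.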
Steve Canny	9bd91a836e	rfctr(part): remove double-decoration 3 (#3687 ) Summary Install new `@apply_metadata()` on HTML and remove decorators from delegating partitioners EPUB, MD, ORG, RST, and RTF. Additional Context - All five of these delegating partitioners delegate to `partition_html()` so they're something of a matched set. EML and MSG also partially delegate to HTML but that's a harder problem (they also delegate to all other partitioners for attachments) that we'll address a couple PRs later . - Replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on `partition_html()`. - Remove all decorators from delegating partitioners; this removes the "double-decorating".	2024-10-02 21:04:37 +00:00
Steve Canny	44bad216f3	rfctr(part): prepare for pluggable auto-partitioners 3 (#3661 ) Summary Remove unused `include_metadata` parameter. Additional Context - The `include_metadata` parameter was originally added circa v0.7.12 as a mechanism for avoiding the "double-decorating" problem on delegating partitioners. - It turns out it doesn't fully address that problem, is now unused, and is unnecessary for the solution we'll be adding as part of pluggable partitioners. - Remove the unnecessary complexity introduced by this unused parameter.	2024-09-25 18:17:48 +00:00
Steve Canny	3bab9d93e6	rfctr(part): prepare for pluggable auto-partitioners 1 (#3655 ) Summary In preparation for pluggable auto-partitioners simplify metadata as discussed. Additional Context - Pluggable auto-partitioners require partitioners to have a consistent call signature. An arbitrary partitioner provided at runtime needs to have a call signature that is known and consistent. Basically `partition_x(filename, *, file, **kwargs)`. - The current `auto.partition()` is highly coupled to each distinct file-type partitioner, deciding which arguments to forward to each. - This is driven by the existence of "delegating" partitioners, those that convert their file-type and then call a second partitioner to do the actual partitioning. Both the delegating and proxy partitioners are decorated with metadata-post-processing decorators and those decorators are not idempotent. We call the situation where those decorators would run twice "double-decorating". For example, EPUB converts to HTML and calls `partition_html()` and both `partition_epub()` and `partition_html()` are decorated. - The way double-decorating has been avoided in the past is to avoid sending the arguments the metadata decorators are sensitive to to the proxy partitioner. This is very obscure, complex to reason about, error-prone, and just overall not a viable strategy. The better solution is to not decorate delegating partitioners and let the proxy partitioner handle all the metadata. - This first step in preparation for that is part of simplifying the metadata processing by removing unused or unwanted legacy parameters. - `date_from_file_object` is a misnomer because a file-object never contains last-modified data. - It can never produce useful results in the API where last-modified information must be provided by `metadata_last_modified`. - It is an undocumented parameter so not in use. - Using it can produce incorrect metadata.	2024-09-23 22:23:10 +00:00
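The "double-decorating" problem described in #3655 can be reproduced in a few lines. This is a toy sketch: `stamp_metadata` is a hypothetical, deliberately non-idempotent decorator, and the `*_sketch` functions only imitate the delegating/proxy relationship between `partition_epub()` and `partition_html()`.

```python
from functools import wraps


def stamp_metadata(func):
    """Hypothetical non-idempotent post-processor: each run changes the result."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        elements = func(*args, **kwargs)
        for element in elements:
            element["stamps"] = element.get("stamps", 0) + 1
        return elements
    return wrapper


@stamp_metadata
def partition_html_sketch(text: str) -> list[dict]:
    """Proxy partitioner: does the real work and owns metadata processing."""
    return [{"text": text}]


@stamp_metadata
def partition_epub_double(text: str) -> list[dict]:
    """Delegating partitioner that is also decorated: metadata is processed twice."""
    return partition_html_sketch(text)


def partition_epub_single(text: str) -> list[dict]:
    """Undecorated delegating partitioner: the proxy handles all metadata."""
    return partition_html_sketch(text)


double = partition_epub_double("chapter one")
single = partition_epub_single("chapter one")
```

Leaving the delegating function undecorated is the approach the commit message recommends, since it removes the need to withhold arguments from the proxy.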
Steve Canny	a861ed8fe7	feat(chunk): split tables on even row boundaries (#3504 ) Summary Use more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking to ensure element text and HTML are "synchronized" and HTML is always parseable. Additional Context Table splitting now has the following characteristics: - `TableChunk.metadata.text_as_html` is always a parseable HTML `<table>` subtree. - `TableChunk.text` is always the text in the HTML version of the table fragment in `.metadata.text_as_html`. Text and HTML are "synchronized". - The table is divided at a whole-row boundary whenever possible. - A row is broken at an even-cell boundary when a single row is larger than the chunking window. - A cell is broken at an even-word boundary when a single cell is larger than the chunking window. - `.text_as_html` is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.	2024-08-19 18:56:53 +00:00
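The whole-row splitting rule from #3504 can be sketched as below. This covers only the first case (dividing at row boundaries) and assumes no single row exceeds the window; the row-level and cell-level fallbacks are omitted. `split_rows`, `to_html`, and `to_text` are hypothetical helpers, not the chunker's own functions.

```python
from html import escape


def row_text(row: list[str]) -> str:
    """Text of one row: cell texts joined by a single space."""
    return " ".join(row)


def split_rows(rows: list[list[str]], max_chars: int) -> list[list[list[str]]]:
    """Greedily group whole rows so each group's text fits in `max_chars`."""
    chunks: list[list[list[str]]] = []
    current: list[list[str]] = []
    size = 0
    for row in rows:
        length = len(row_text(row))
        # One extra character for the space separating this row from the last.
        extra = length + (1 if current else 0)
        if current and size + extra > max_chars:
            chunks.append(current)
            current, size, extra = [], 0, length
        current.append(row)
        size += extra
    if current:
        chunks.append(current)
    return chunks


def to_html(rows: list[list[str]]) -> str:
    """Minified, always-parseable <table> fragment for one chunk."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def to_text(rows: list[list[str]]) -> str:
    """Chunk text derived from the same rows, so text and HTML stay in sync."""
    return " ".join(row_text(row) for row in rows)


table = [["a", "b"], ["cc", "dd"], ["e", "f"]]
chunks = split_rows(table, max_chars=9)
```

Deriving both `to_text` and `to_html` from the same row group is what keeps the two representations "synchronized" in this sketch.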
Steve Canny	c27e0d0062	rfctr(html): replace html parser (#3218 ) Summary Replace legacy HTML parser with recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster although that's just a happy side-effect. Additional Context The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains `lxml` as the performant and reliable base library but uses `lxml`'s custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors. This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, but also provides a clear domain model for reasoning about its behavior and reliably adjusting it to suit our existing and future purposes. The parser's operation is recursive, closely modeling the recursive structure of HTML itself. Its behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases. Fixes #2325 Fixes #2562 Fixes #2675 Fixes #3168 Fixes #3227 Fixes #3228 Fixes #3230 Fixes #3237 Fixes #3245 Fixes #3247 Fixes #3255 Fixes #3309 ### BEHAVIOR DIFFERENCES #### `emphasized_text_tags` encoding is changed: - `<strong>` is encoded as `"b"` rather than `"strong"`. - `<em>` is encoded as `"i"` rather than `"em"`. - `<span>` is no longer recorded in `emphasized_text_tags` (because without the CSS we can't tell whether it's used for emphasis or if so what kind). - nested emphasis (e.g. bold+italic) is encoded as multiple characters ("bi"). 
- `emphasized_text_contents` is broken on emphasis-change boundaries, like:
```html
`<p>foo <b>bar <i>baz</i> bada</b> bing</p>`
```
produces:
```json
{
  "emphasized_text_contents": ["bar", "baz", "bada"],
  "emphasized_text_tags": ["b", "bi", "b"]
}
```
whereas previously it would have produced:
```json
{
  "emphasized_text_contents": ["bar baz bada", "baz"],
  "emphasized_text_tags": ["b", "i"]
}
```
#### `<pre>` text is preserved as it appears in the html
Except that a leading newline is removed if present (has to be in position 0 of text). Also, a trailing newline is stripped but only if it appears in the very last position ([-1]) of the `<pre>` text. Old parser stripped all leading and trailing whitespace. Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way.
#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as it is removed in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that `\t`, `\n`, and ` ` are replaced with a regular space character. All text derived from elements is whitespace normalized except the text within a `<pre>` tag. Any leading or trailing newline is trimmed from `<pre>` element text; all other whitespace is preserved just as it appeared in the HTML source.
#### `link_start_indexes` metadata is no longer captured.
Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g. chunking) would be expensive and almost certainly lead to wrong values as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized, adding substantial complexity to the code and reducing readability and maintainability #### `<br/>` element is replaced with a single newline (`"\n"`) but that is usually replaced with a space in `Element.text` when it is normalized. The newline is preserved within a `<pre>` element. - Related: _No paragraph-break on `<br/><br/>`_ #### Empty `h1..h6` elements are dropped. HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a `Title` element) when they contain no text or contain only whitespace. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-07-11 00:14:28 +00:00
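The whitespace and `<pre>` rules listed among the behavior differences of #3218 can be sketched as two small functions. These are illustrative helpers written for this note (`normalize_whitespace` and `pre_text` are hypothetical names), not the parser's own code.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def pre_text(text: str) -> str:
    """Keep <pre> text verbatim except one leading and one trailing newline."""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


# Ordinary element text: tabs and newlines become single spaces.
paragraph = normalize_whitespace("  foo\t bar\nbaz ")

# Both <pre> spellings from the commit message yield the same text.
padded = pre_text("\nfoo\nbar\nbaz\n")
compact = pre_text("foo\nbar\nbaz")
```

Only a newline in the very first or very last position is trimmed from `<pre>` text; interior whitespace, including indentation, passes through unchanged.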
Steve Canny	00e1d5c05b	rfctr(html): refine HTML parser (#3351 ) Note This refines the new HTML parser but _does not install it_. This is why no changes to ingest test expectations or other unit-tests are required here. Installing the new parser will happen in the next PR #3218. Summary The initial version of the parser (purposely) raised on a block element nested inside a phrasing element. While such nesting is not valid according to the HTML Standard, it is accepted by the browser and does happen in the wild. The refinements here handle this situation similarly to how the browser does, breaking phrasing at the block element boundaries and starting it up again after the block element. Unfortunately this adds complexity to the parser, but it makes the parser robust against pretty much any HTML we're likely to encounter and partitions it consistent with how it would be rendered in the browser.	2024-07-09 01:10:03 +00:00
Steve Canny	6fe1c9980e	rfctr(html): prepare for new html parser (#3257 ) Summary Extract as much mechanical refactoring from the HTML parser change-over into the PR as possible. This leaves the next PR focused on installing the new parser and the ingest-test impact. Reviewers: Commits are well groomed and reviewing commit-by-commit is probably easier. Additional Context This PR introduces the rewritten HTML parser. Its general design is recursive, consistent with the recursive structure of HTML (tree of elements). It also adds the unit tests for that parser but it does not _install_ the parser. So the behavior of `partition_html()` is unchanged by this PR. The next PR in this series will do that and handle the ingest and other unit test changes required to reflect the dozen or so bug-fixes the new parser provides.	2024-06-21 20:59:48 +00:00

23 Commits