3 Commits

Author SHA1 Message Date
Yao You
909716f310
feat: keep input tag's class attr in table (#4064)
This change affects HTML partitioning.

Previously, when the HTML contained a table, we stripped the class and
id attributes from every tag inside the table, except for the class
attribute on `img` tags. This change also preserves the class attribute
on `input` tags inside a table. The difference shows up in a table
element's metadata.text_as_html attribute.
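
A minimal sketch of the rule described above, not the project's actual code. It assumes well-formed markup and uses the standard library's `xml.etree.ElementTree` as a stand-in parser; `KEEP_CLASS_TAGS` and `clean_table_attrs` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list: tags whose class attribute survives inside a table.
KEEP_CLASS_TAGS = {"img", "input"}


def clean_table_attrs(table_html: str) -> str:
    """Drop class/id from tags nested in a table, keeping class on img and input."""
    table = ET.fromstring(table_html)
    for el in table.iter():
        if el is table:
            continue  # only tags inside the table are cleaned, not the table itself
        el.attrib.pop("id", None)
        if el.tag not in KEEP_CLASS_TAGS:
            el.attrib.pop("class", None)
    return ET.tostring(table, encoding="unicode", method="html")


sample = (
    '<table class="Table"><tr><td class="cell" id="c1">'
    '<input class="Checkbox" type="checkbox"/></td></tr></table>'
)
print(clean_table_attrs(sample))
```

In this sketch the `td` loses both attributes while the `input` keeps its class, which is the behavior this commit adds on top of the existing `img` exception.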
2025-07-16 21:46:58 +00:00
Pluto
5bb95b5841
Fix parsing table cells (#3904)
This PR:
- Fixes the removal of HTML tags that exist inside <td> cells.
- Applies whitespace stripping everywhere in the same consistent way,
instead of patching only table cells. A cell-only stripping function
was hard to implement in an easy and straightforward way (you can't
modify `descendants` in-place). This is why some tests needed small
edits that remove one whitespace character in each tag. I believe this
won't cause any problems for downstream tasks.

Tested HTML:
```html
<table class="Table">
    <tbody>
        <tr>
            <td colspan="2">
                Some text                                        
            </td>
            <td>
                <input checked="" class="Checkbox" type="checkbox"/>
            </td>
        </tr>
    </tbody>
</table>
```
Before & After
```html
'<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>'
'<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>'
```
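
The uniform stripping described in this PR can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: it uses the standard library's `xml.etree.ElementTree` rather than the project's parser, and `strip_all_whitespace` is a hypothetical helper name. The node list is snapshotted before the walk, which mirrors the concern about not modifying `descendants` in-place.

```python
import xml.etree.ElementTree as ET


def strip_all_whitespace(html: str) -> str:
    """Strip leading/trailing whitespace from every text node in the tree."""
    root = ET.fromstring(html)
    # Snapshot the nodes first so the walk does not depend on a live view of the tree.
    for el in list(root.iter()):
        el.text = (el.text or "").strip() or None
        el.tail = (el.tail or "").strip() or None
    return ET.tostring(root, encoding="unicode", method="html")


tested_html = """
<table class="Table">
    <tbody>
        <tr>
            <td colspan="2">
                Some text
            </td>
            <td>
                <input checked="" class="Checkbox" type="checkbox"/>
            </td>
        </tr>
    </tbody>
</table>
"""
print(strip_all_whitespace(tested_html.strip()))
```

Because every node is treated the same way, the whitespace between tags disappears everywhere, not only inside table cells, which is why the expected strings in some tests lost one space per tag.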
2025-02-05 15:28:49 +00:00
Pluto
e48d79eca1
image alt support (#3797)
2024-11-26 16:20:23 +00:00