3 Commits

Author SHA1 Message Date
Steve Canny
b3a2dd4755
fix: html incorrectly categorizing text (#3841)
Fixes #3666

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-12-18 18:46:54 +00:00
Steve Canny
00e1d5c05b
rfctr(html): refine HTML parser (#3351)
**Note**
This refines the new HTML parser but _does not install it_. This is why
no changes to ingest test expectations or other unit-tests are required
here. Installing the new parser will happen in the next PR #3218.

**Summary**
The initial version of the parser (purposely) raised on a block element
nested inside a phrasing element. While such nesting is not valid
according to the HTML Standard, it is accepted by the browser and does
happen in the wild.

The refinements here handle this situation similarly to how the browser
does, breaking phrasing at the block element boundaries and starting it
up again after the block element.

Unfortunately this adds complexity to the parser, but it makes the
parser robust against pretty much any HTML we're likely to encounter and
partitions it consistent with how it would be rendered in the browser.
2024-07-09 01:10:03 +00:00
Steve Canny
6fe1c9980e
rfctr(html): prepare for new html parser (#3257)
**Summary**
Extract as much mechanical refactoring from the HTML parser change-over
into the PR as possible. This leaves the next PR focused on installing
the new parser and the ingest-test impact.

**Reviewers:** Commits are well groomed and reviewing commit-by-commit
is probably easier.

**Additional Context**
This PR introduces the rewritten HTML parser. Its general design is
recursive, consistent with the recursive structure of HTML (tree of
elements). It also adds the unit tests for that parser but it does not
_install_ the parser. So the behavior of `partition_html()` is unchanged
by this PR. The next PR in this series will do that and handle the
ingest and other unit test changes required to reflect the dozen or so
bug-fixes the new parser provides.
2024-06-21 20:59:48 +00:00