mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00

**Summary**

Replace the legacy HTML parser with a recursive version that captures all content and provides the flexibility to add new metadata. It is also substantially faster, although that is just a happy side effect.

**Additional Context**

The prior HTML parsing algorithm at the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains `lxml` as the performant and reliable base library but uses `lxml`'s custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly. It also provides a clear domain model for reasoning about the parser's behavior and reliably adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive structure of HTML itself. Its behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases.

Fixes #2325
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228
Fixes #3230
Fixes #3237
Fixes #3245
Fixes #3247
Fixes #3255
Fixes #3309

### BEHAVIOR DIFFERENCES

#### `emphasized_text_tags` encoding is changed:

- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because without the CSS we can't tell whether it's used for emphasis or, if so, what kind).
- Nested emphasis (e.g. bold+italic) is encoded as multiple characters (`"bi"`).
- `emphasized_text_contents` is broken on emphasis-change boundaries, like:

  ```html
  <p>foo <b>bar <i>baz</i> bada</b> bing</p>
  ```

  produces:

  ```json
  {
    "emphasized_text_contents": ["bar", "baz", "bada"],
    "emphasized_text_tags": ["b", "bi", "b"]
  }
  ```

  whereas previously it would have produced:

  ```json
  {
    "emphasized_text_contents": ["bar baz bada", "baz"],
    "emphasized_text_tags": ["b", "i"]
  }
  ```

#### `<pre>` text is preserved as it appears in the html

The exception is that a leading newline is removed if present (it must be at position 0 of the text). A trailing newline is also stripped, but only if it appears in the very last position (`[-1]`) of the `<pre>` text. The old parser stripped all leading and trailing whitespace.

The result is that:

```html
<pre>
foo
bar
baz
</pre>
```

parses to `"foo\nbar\nbaz"`, which is the same result produced for:

```html
<pre>foo
bar
baz</pre>
```

This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way.

#### Whitespace normalization

Leading and trailing whitespace is removed from element text, just as in the browser. Runs of whitespace within the element text are reduced to a single space character (as in the browser). Note this means that `\t`, `\n`, and ` ` are replaced with a regular space character.

All text derived from elements is whitespace-normalized except the text within a `<pre>` tag. Any leading or trailing newline is trimmed from `<pre>` element text; all other whitespace is preserved just as it appeared in the HTML source.

#### `link_start_indexes` metadata is no longer captured.

Rationale:

- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g. chunking) would be expensive and almost certainly lead to wrong values as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized, adding substantial complexity to the code and reducing readability and maintainability.

#### `<br/>` element is replaced with a single newline (`"\n"`)

That newline is usually replaced with a space in `Element.text` when the text is normalized. The newline is preserved within a `<pre>` element.

- Related: _No paragraph-break on `<br/><br/>`_

#### Empty `h1..h6` elements are dropped.

HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a `Title` element) when they contain no text or contain only whitespace.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
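The behavior differences above can be illustrated with a small, self-contained sketch. It is not the parser from this change: it uses the standard library's `html.parser` instead of `lxml` custom element classes, and the names `normalize_text`, `trim_pre_text`, and `EmphasisRuns` are hypothetical, chosen only for this example. It assumes "whitespace" means the ASCII whitespace characters of the HTML Standard.

```python
import re
from html.parser import HTMLParser

# Assumption: only ASCII whitespace (space, tab, LF, CR, FF) is collapsed.
_WHITESPACE_RUN = re.compile(r"[ \t\n\r\f]+")


def normalize_text(text):
    """Collapse each whitespace run to one space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip(" ")


def trim_pre_text(text):
    """Drop one newline at position 0 and one at position -1; keep the rest."""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


class EmphasisRuns(HTMLParser):
    """Collect emphasized text, split wherever the active emphasis changes."""

    # <strong> maps to "b" and <em> to "i"; <span> is deliberately ignored.
    _CODES = {"b": "b", "strong": "b", "i": "i", "em": "i"}

    def __init__(self):
        super().__init__()
        self._active = []   # emphasis codes currently open
        self.contents = []  # one entry per emphasized text run
        self.tags = []      # e.g. "b", "i", or "bi" for nested emphasis

    def handle_starttag(self, tag, attrs):
        code = self._CODES.get(tag)
        if code:
            self._active.append(code)

    def handle_endtag(self, tag):
        code = self._CODES.get(tag)
        if code and code in self._active:
            self._active.remove(code)

    def handle_data(self, data):
        text = normalize_text(data)
        if text and self._active:
            self.contents.append(text)
            self.tags.append("".join(sorted(set(self._active))))


parser = EmphasisRuns()
parser.feed("<p>foo <b>bar <i>baz</i> bada</b> bing</p>")
print(parser.contents)  # ['bar', 'baz', 'bada']
print(parser.tags)      # ['b', 'bi', 'b']
print(repr(trim_pre_text("\nfoo\nbar\nbaz\n")))  # 'foo\nbar\nbaz'
```

Splitting on every text node is what produces one entry per emphasis-change boundary, which is why the nested `<i>` yields `"bi"` instead of a separate overlapping `"i"` entry.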
114 lines
7.0 KiB
Plaintext
MIME-Version: 1.0
Date: Wed, 4 Oct 2023 09:27:45 -0500
Message-ID: <CABDvgF2Wpt9+eSO7zgMJZ2fQb=QZ__CS6N_Y+msnGwpKeg1a+A@mail.gmail.com>
Subject: Test email with multiple languages
From: John <johnjennings702@gmail.com>
To: John <johnjennings702@gmail.com>
Content-Type: multipart/alternative; boundary="000000000000a0666d0606e4cfb7"

--000000000000a0666d0606e4cfb7
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.
All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. "Todos los seres humanos nacen libres e iguales en
dignidad y derechos y, dotados como est=C3=A1n de raz=C3=B3n y conciencia, =
deben
comportarse fraternalmente los unos con los otros. Todos los seres humanos
nacen libres e iguales en dignidad y derechos y, dotados como est=C3=A1n de
raz=C3=B3n y conciencia, deben comportarse fraternalmente los unos con los
otros."

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.
All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.

"Todos los seres humanos nacen libres e iguales en dignidad y derechos y,
dotados como est=C3=A1n de raz=C3=B3n y conciencia, deben comportarse frate=
rnalmente
los unos con los otros. Todos los seres humanos nacen libres e iguales en
dignidad y derechos y, dotados como est=C3=A1n de raz=C3=B3n y conciencia, =
deben
comportarse fraternalmente los unos con los otros."

--000000000000a0666d0606e4cfb7
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">All human beings are born free and equal in dignity and ri=
ghts. They are endowed with reason and conscience and should act towards on=
e another in a spirit of brotherhood. All human beings are born free and eq=
ual in dignity and rights. They are endowed with reason and conscience and =
should act towards one another in a spirit of brotherhood. All human beings=
 are born free and equal in dignity and rights. They are endowed with reaso=
n and conscience and should act towards one another in a spirit of brotherh=
ood. All human beings are born free and equal in dignity and rights. They a=
re endowed with reason and conscience and should act towards one another in=
 a spirit of brotherhood. All human beings are born free and equal in digni=
ty and rights. They are endowed with reason and conscience and should act t=
owards one another in a spirit of brotherhood. All human beings are born fr=
ee and equal in dignity and rights. They are endowed with reason and consci=
ence and should act towards one another in a spirit of brotherhood. <p> =
All human beings are born free and equal in dignity and rights. They are en=
dowed with reason and conscience and should act towards one another in a sp=
irit of brotherhood. "Todos los seres humanos nacen libres e iguales e=
n dignidad y derechos y, dotados como est=C3=A1n de raz=C3=B3n y conciencia=
, deben comportarse fraternalmente los unos con los otros. Todos los seres =
humanos nacen libres e iguales en dignidad y derechos y, dotados como est=
=C3=A1n de raz=C3=B3n y conciencia, deben comportarse fraternalmente los un=
os con los otros."</p> <p>All human beings are born free and equal in =
dignity and rights. They are endowed with reason and conscience and should =
act towards one another in a spirit of brotherhood. All human beings are bo=
rn free and equal in dignity and rights. They are endowed with reason and c=
onscience and should act towards one another in a spirit of brotherhood. Al=
l human beings are born free and equal in dignity and rights. They are endo=
wed with reason and conscience and should act towards one another in a spir=
it of brotherhood. All human beings are born free and equal in dignity and =
rights. They are endowed with reason and conscience and should act towards =
one another in a spirit of brotherhood.</p> <p>All human beings are born fr=
ee and equal in dignity and rights. They are endowed with reason and consci=
ence and should act towards one another in a spirit of brotherhood. All hum=
an beings are born free and equal in dignity and rights. They are endowed w=
ith reason and conscience and should act towards one another in a spirit of=
 brotherhood. All human beings are born free and equal in dignity and right=
s. They are endowed with reason and conscience and should act towards one a=
nother in a spirit of brotherhood.</p> <p>"Todos los seres humanos nac=
en libres e iguales en dignidad y derechos y, dotados como est=C3=A1n de ra=
z=C3=B3n y conciencia, deben comportarse fraternalmente los unos con los ot=
ros. Todos los seres humanos nacen libres e iguales en dignidad y derechos =
y, dotados como est=C3=A1n de raz=C3=B3n y conciencia, deben comportarse fr=
aternalmente los unos con los otros."</p></div>

--000000000000a0666d0606e4cfb7--