mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-08-18 13:45:45 +00:00

Fixes #1958. `<style>` is invalid where it appears in the HTML of the WSJ page mentioned in that issue, but "invalid" has little meaning in the HTML world if Chrome accepts it. In any case, we have no use for the contents of a `<style>` tag wherever it appears, so it is safe enough for us to strip all of those tags. Note that we do not want to also strip the *tail text* (the text that follows the closing `</style>` tag), which can contain text we're interested in.
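
To illustrate the tail-text point, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`. It is not the code from this change: the helper name `strip_style_keep_tail` and the sample fragment are hypothetical, and the input is assumed to be well-formed markup. With lxml, the same behavior is available in one call via `etree.strip_elements(tree, "style", with_tail=False)`.

```python
import xml.etree.ElementTree as ET


def strip_style_keep_tail(root: ET.Element) -> None:
    """Remove every <style> element in place, preserving its tail text.

    Hypothetical helper for illustration only.
    """
    # Snapshot the elements first so removals don't disturb iteration.
    for parent in list(root.iter()):
        prev = None  # last surviving sibling seen before the current child
        for child in list(parent):
            if child.tag != "style":
                prev = child
                continue
            tail = child.tail or ""
            if prev is not None:
                # Tail text moves onto the preceding sibling's tail.
                prev.tail = (prev.tail or "") + tail
            else:
                # No preceding sibling: tail text joins the parent's text.
                parent.text = (parent.text or "") + tail
            parent.remove(child)


# Hypothetical fragment: a <style> block sitting inside a paragraph.
doc = ET.fromstring(
    "<div><p>Before <style>p { color: red; }</style>after the style tag.</p></div>"
)
strip_style_keep_tail(doc)
text = "".join(doc.itertext())
```

The CSS rule is gone, but `after the style tag.` survives because it is the tail of the `<style>` element rather than part of its contents. Dropping the element together with its tail would silently lose that text.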