qued d83df422a6
chore: switch to charset normalizer (#4060)
Closes
[SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible).

Removes `chardet` as a dependency, standardizing on
`charset-normalizer`.

This involved:
- Changing `chardet` to `charset-normalizer` in our base dependency file
- Updating the code (in only one place) where `chardet` was used
- pip-compiling to update our published dependency tree
- Updating one test: `charset-normalizer` misdetected the encoding of
a file used as a test fixture. My guess is that the ~10 characters in
the file were too few for `charset-normalizer` to make a reliable
inference, so I re-encoded another, slightly longer file that's also used
for encoding testing, and it detected that one correctly.
- Updating an ingest test fixture.
- Updating the ingest test fixture update workflow to also update the
expected markdown results (this was a task I missed when adding the
markdown ingest tests)
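
For context, the kind of detect-then-decode call that was swapped over can be sketched roughly as below. This is only an illustrative sketch, not the code changed in this PR: `detect_encoding` and `read_text` are hypothetical names, and the UTF-8 fallback is an assumption. The only library calls it relies on are `charset_normalizer.from_bytes(...)` and `.best()`.

```python
from typing import Callable, Optional, Tuple


def detect_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding of `data` with charset-normalizer, if it is installed."""
    try:
        from charset_normalizer import from_bytes
    except ImportError:
        # Library not available; let the caller fall back to its default.
        return None
    best = from_bytes(data).best()  # None when no plausible match is found
    return best.encoding if best is not None else None


def read_text(
    data: bytes,
    detector: Callable[[bytes], Optional[str]] = detect_encoding,
    default: str = "utf-8",  # assumed fallback, not taken from the PR
) -> Tuple[str, str]:
    """Decode `data` and return (text, encoding_used)."""
    encoding = detector(data) or default
    try:
        return data.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong or names an unknown codec: decode leniently instead.
        return data.decode(default, errors="replace"), default


# Why a wrong guess matters: the same bytes decode to different text.
sample = "très".encode("cp1252")
print(sample.decode("cp1252"))  # très
print(sample.decode("cp1250"))  # trčs
```

Because the detector is passed in as a parameter, a test can pin the encoding with a stub instead of depending on the library's inference for a very short input.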

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: qued <qued@users.noreply.github.com>
Co-authored-by: Maksymilian Operlejn <36171422+MaksOpp@users.noreply.github.com>
2025-07-22 19:02:40 +00:00


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>
</title>
</head>
<body>
<h1 class="Title" id="a59f117741c76dca0bc8f5ee72e2010b">
My First Heading
</h1>
<p class="UncategorizedText" id="82eda2671c5ead903683b67b0f8e3f29">
My first paragraph.
</p>
<p class="UncategorizedText" id="d536ba7636a9a4603a81b358d1fe2590">
Some text with CP1252-specific characters:
</p>
<p class="NarrativeText" id="3b8ca5305e52587b8fbbfcd994de0667">
Die schöne Frau hat einen Kaffee mit Kuchen gegessen. Sie sagte: "Das war köstlich!" und lächelte dabei. Der Preis betrug 15,50 €.
L'été était trčs chaud cette année. J'ai acheté un café au lait pour 3,50 €. C'était délicieux ! L'homme a dit : "C'est parfait !"
El nińo comió paella con ńoquis. La seńora dijo: "ˇQué rico!" y pagó 25,75 €. El restaurante tenía un menú del día.
Kvinnan ĺt köttbullar med lingonsylt. Hon sa: "Det var fantastiskt!" och betalade 45,90 €. Mannen frĺgade: "Vill du ha mer?"
O Joăo comprou um café por 2,50 €. Ele disse: "Está ótimo!" e sorriu. A mulher perguntou: "Quer mais alguma coisa?"
De vrouw dronk koffie met koekjes. Ze zei: "Het was heerlijk!" en betaalde 4,25 €. Het kind vroeg: "Mag ik ook wat?"
</p>
</body>
</html>