mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-24 09:26:08 +00:00

Closes [SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible). Removes `chardet` as a dependency, standardizing on `charset-normalizer`. This involved:

- Changing `chardet` to `charset-normalizer` in our base dependency file
- Updating the code (in only one place) where `chardet` was used
- pip-compiling to update our published dependency tree
- Updating one test: `charset-normalizer` misdiagnosed the encoding of a file used as a test fixture. My guess is that the ~10 characters in the file were not enough for `charset-normalizer` to do a proper inference, so I re-encoded another slightly longer file that's also used for encoding testing, and it got that one.
- Updating an ingest test fixture.
- Updating the ingest test fixture update workflow to also update the expected markdown results (this was a task I missed when adding the markdown ingest tests)

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: qued <qued@users.noreply.github.com>
Co-authored-by: Maksymilian Operlejn <36171422+MaksOpp@users.noreply.github.com>
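To make the effect of a misdiagnosed encoding concrete, here is a minimal standard-library sketch. It assumes, purely for illustration, that CP1252 bytes get decoded as CP1250; the commit does not say which encoding was actually guessed, and the sample sentence is hypothetical. It does not call `charset-normalizer` at all; it only shows what a wrong single-byte guess does to the text.

```python
# Illustrative sample text (hypothetical, not taken from the test suite).
text = "L'été était très chaud. El niño pagó 25,75 €."

# Bytes as a CP1252-encoded file would store them.
raw = text.encode("cp1252")

# Decoding with the correct codec round-trips the text.
right = raw.decode("cp1252")

# Decoding with a wrong single-byte codec (CP1250 is an assumption here)
# never raises for these bytes; it silently swaps some characters.
wrong = raw.decode("cp1250")

# Both codecs are single-byte, so the strings align character by character.
changed = sorted({(a, b) for a, b in zip(right, wrong) if a != b})

print(right)
print(wrong)
print(changed)
```

Because most byte values map to the same character in both code pages, only a few letters differ, which is why a short sample gives a detector very little evidence to work with.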
29 lines
1.3 KiB
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>
</title>
</head>
<body>
<h1 class="Title" id="a59f117741c76dca0bc8f5ee72e2010b">
My First Heading
</h1>
<p class="UncategorizedText" id="82eda2671c5ead903683b67b0f8e3f29">
My first paragraph.
</p>
<p class="UncategorizedText" id="d536ba7636a9a4603a81b358d1fe2590">
Some text with CP1252-specific characters:
</p>
<p class="NarrativeText" id="3b8ca5305e52587b8fbbfcd994de0667">
Die schöne Frau hat einen Kaffee mit Kuchen gegessen. Sie sagte: "Das war köstlich!" und lächelte dabei. Der Preis betrug 15,50 €.
L'été était très chaud cette année. J'ai acheté un café au lait pour 3,50 €. C'était délicieux ! L'homme a dit : "C'est parfait !"
El niño comió paella con ñoquis. La señora dijo: "¡Qué rico!" y pagó 25,75 €. El restaurante tenía un menú del día.
Kvinnan åt köttbullar med lingonsylt. Hon sa: "Det var fantastiskt!" och betalade 45,90 €. Mannen frågade: "Vill du ha mer?"
O João comprou um café por 2,50 €. Ele disse: "Está ótimo!" e sorriu. A mulher perguntou: "Quer mais alguma coisa?"
De vrouw dronk koffie met koekjes. Ze zei: "Het was heerlijk!" en betaalde 4,25 €. Het kind vroeg: "Mag ik ook wat?"
</p>
</body>
</html>