yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-10 15:37:58 +00:00

Commit Graph

Author

SHA1

Message

Date

qued

d83df422a6

chore: switch to charset normalizer (#4060)

Closes
[SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible).

Removes `chardet` as a dependency, standardizing on
`charset-normalizer`.

This involved:
- Changing `chardet` to `charset-normalizer` in our base dependency file
- Updating the code (in only one place) where `chardet` was used
- pip-compiling to update our published dependency tree
- Updating one test... `charset-normalizer` misdiagnosed the encoding of
a file used as a test fixture. My guess is that the ~10 characters in
the file were not enough for `charset-normalizer` to do a proper
inference, so I re-encoded another slightly longer file that's also used
for encoding testing, and it got that one.
- Updating an ingest test fixture.
- Updating the ingest test fixture update workflow to also update the
expected markdown results (this was a task I missed when adding the
markdown ingest tests)
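
The kind of call-site change described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper names `detect_encoding` and `decode_bytes` are hypothetical, and the UTF-8 fallback is an assumption. It uses `charset_normalizer.from_bytes(...).best()`, which returns the most plausible match or `None`. As the fixture note above suggests, very short inputs may not give the detector enough evidence, so a fallback is kept.

```python
def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return a best-guess encoding name for `data` (hypothetical helper)."""
    try:
        # Imported lazily so the sketch still runs without the package installed.
        from charset_normalizer import from_bytes
    except ImportError:
        return default
    best = from_bytes(data).best()  # best match, or None if nothing plausible
    return best.encoding if best is not None else default


def decode_bytes(data: bytes) -> str:
    """Decode `data` with the detected encoding, falling back to UTF-8."""
    encoding = detect_encoding(data)
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Assumed fallback: replace undecodable bytes rather than fail.
        return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = ("The quick brown fox jumps over the lazy dog. " * 4).encode("ascii")
    print(detect_encoding(sample))
    print(decode_bytes(sample)[:44])
```

Longer samples generally give the detector more to work with, which is consistent with the choice above to switch the test to a slightly longer fixture file.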

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: qued <qued@users.noreply.github.com>
Co-authored-by: Maksymilian Operlejn <36171422+MaksOpp@users.noreply.github.com>

2025-07-22 19:02:40 +00:00

Marek Połom

f333d7fe7f

feat: Json elements to HTML converter (#3936)

## NOTE
`test_unstructured_ingest/expected-structured-output-html` contains all
test HTML fixtures. Original JSON files, from which these HTML fixtures
are generated, were taken from
`test_unstructured_ingest/expected-structured-output`

2025-03-04 13:57:35 +00:00

2 Commits