Closes
[SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible).
Removes `chardet` as a dependency, standardizing on
`charset-normalizer`.
This involved:
- Changing `chardet` to `charset-normalizer` in our base dependency file
- Updating the code (in only one place) where `chardet` was used
- pip-compiling to update our published dependency tree
- Updating one test: `charset-normalizer` misdiagnosed the encoding of
a file used as a test fixture. My guess is that the ~10 characters in
the file were not enough for `charset-normalizer` to make a reliable
inference, so I re-encoded another, slightly longer file that is also
used for encoding testing, and it detected that one correctly.
- Updating an ingest test fixture.
- Updating the ingest test fixture update workflow to also update the
expected markdown results (this was a task I missed when adding the
markdown ingest tests)
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: qued <qued@users.noreply.github.com>
Co-authored-by: Maksymilian Operlejn <36171422+MaksOpp@users.noreply.github.com>
### Summary
Fixes the error `Error in chunk: 512: {"detail":"'NoneType' object has no
attribute 'strip'"}`. I found the logs under the same org (presumably
from the same job).
screenshot:

stack trace from the `utic-api` ES log doc:

### Notes
Longer term, we should make the partitioner (vlm + utic-api) stop
returning null text.
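
For context, a minimal sketch of the kind of defensive guard that avoids calling `.strip()` on `None`. The function name and the dict-shaped elements are assumptions made for illustration; this is not the actual patch.

```python
from typing import Any, Dict, Iterable, List


def clean_element_texts(elements: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Strip element text, treating a missing or null text as an empty string."""
    cleaned = []
    for element in elements:
        text = element.get("text")
        # Only call .strip() on real strings; None (JSON null) becomes "".
        normalized = text.strip() if isinstance(text, str) else ""
        cleaned.append({**element, "text": normalized})
    return cleaned
```

A guard like this only hides the symptom downstream, which is why the longer-term fix belongs in the partitioner.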
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Bump `unstructured-inference` to `1.0.5`, which includes a fix to ensure
model init is thread-safe.
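
To illustrate what thread-safe model initialization generally looks like, here is a sketch using a lock with a double check. It is not the code in `unstructured-inference`; `get_model`, `loader`, and the module-level cache are hypothetical names.

```python
import threading
from typing import Callable, Dict

_models: Dict[str, object] = {}
_models_lock = threading.Lock()


def get_model(name: str, loader: Callable[[str], object]) -> object:
    """Return a cached model, running the loader at most once per name."""
    model = _models.get(name)
    if model is None:
        with _models_lock:
            # Re-check inside the lock: another thread may have finished
            # loading while this one was waiting to acquire it.
            model = _models.get(name)
            if model is None:
                model = loader(name)
                _models[name] = model
    return model
```

Without the lock, two threads that both see an empty cache can each run the (expensive) loader and end up with different model instances.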
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Update requirements to resolve some open CVEs, and add
`ENV HF_HUB_OFFLINE=1` to the Dockerfile to stop the image from
reaching out to Hugging Face (HF). The outbound requests were an issue
for a gov customer.
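
A small sketch of a startup check that an air-gapped deployment could use to confirm offline mode is set. The helper name is hypothetical, and the set of truthy values is an assumption about how `huggingface_hub` parses the variable.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; "1" is the value the Dockerfile sets.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def hf_offline_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Report whether HF_HUB_OFFLINE is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES
```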
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
Currently we [filter img
tags](2addb19473/unstructured/partition/html/partition.py (L226-L229))
before tags are converted to Elements by the HTML partitioner. More
importantly, we also don’t have a defined “block” / mapping to support
them. This PR adds those mappings and the logic to process them.
It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.
The partitioned Image Elements set their text to the img tag’s alt text
if available.
The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.
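
The behavior described above can be sketched with the standard library alone. This is an illustration, not the partitioner's implementation: `extract_image_elements`, the `include_base64` flag, and the dict-shaped elements with their metadata keys are stand-ins chosen for the example.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List


class _ImgCollector(HTMLParser):
    """Collect <img> tags as Image-like dicts (illustrative shape only)."""

    def __init__(self, include_base64: bool = False) -> None:
        super().__init__()
        self.include_base64 = include_base64
        self.elements: List[Dict[str, Any]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        metadata: Dict[str, str] = {}
        if src.startswith(("http://", "https://")):
            # A URL src is recorded as a URL, not as base64 data.
            metadata["image_url"] = src
        elif self.include_base64 and src.startswith("data:image/") and ";base64," in src:
            header, payload = src.split(";base64,", 1)
            metadata["image_mime_type"] = header[len("data:"):]
            metadata["image_base64"] = payload
        # The alt text, when present, becomes the element text.
        self.elements.append(
            {"type": "Image", "text": attributes.get("alt") or "", "metadata": metadata}
        )


def extract_image_elements(html: str, include_base64: bool = False) -> List[Dict[str, Any]]:
    parser = _ImgCollector(include_base64=include_base64)
    parser.feed(html)
    return parser.elements
```

Here `include_base64` plays the role that `extract_image_block_to_payload` plays in the description: the base64 payload is only attached when the caller asks for it.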
## Testing
Unit tests have been added for explicit coverage.
Existing integration tests and other unit test fixtures have been
updated to account for the `Image` elements now present.
---------
Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
## NOTE
`test_unstructured_ingest/expected-structured-output-html` contains all
test HTML fixtures. Original JSON files, from which these HTML fixtures
are generated, were taken from
`test_unstructured_ingest/expected-structured-output`