mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-24 13:44:05 +00:00
### Summary

Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which highlights the risk of remote code execution when running `nltk.download`. Removes `nltk.download` in favor of downloading a `.tgz` file containing the appropriate NLTK data files and checking its SHA256 hash to validate the download. An error is now raised if `nltk.download` is invoked.

The logic for determining the NLTK download directory is borrowed from `nltk`, so users can still set `NLTK_DATA` as they did previously.

### Testing

1. Create a directory called `~/tmp/nltk_test`. Set `NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a Python interactive session, run:

   ```python
   from unstructured.nlp.tokenize import download_nltk_packages

   download_nltk_packages()
   ```

3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded data.

---------

Co-authored-by: Steve Canny <stcanny@gmail.com>
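The hash-checked download described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in `unstructured.nlp.tokenize`: the helper names `resolve_data_dir`, `sha256_of`, and `verify_and_extract` are hypothetical, and taking the first entry of `NLTK_DATA` is a simplifying assumption rather than `nltk`'s actual directory logic. Fetching the archive itself is left out; the sketch starts from a `.tgz` file already on disk.

```python
import hashlib
import os
import tarfile
from pathlib import Path


def resolve_data_dir(default: Path) -> Path:
    """Return the first path listed in NLTK_DATA, or `default` if it is unset.

    Assumption: NLTK_DATA may hold several paths joined by os.pathsep and
    only the first one is used here.
    """
    env = os.environ.get("NLTK_DATA", "")
    first = env.split(os.pathsep)[0] if env else ""
    return Path(first).expanduser() if first else default


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive: Path, expected_sha256: str, dest: Path) -> None:
    """Extract `archive` into `dest` only if its SHA256 matches."""
    actual = sha256_of(archive)
    if actual != expected_sha256:
        raise ValueError(
            f"SHA256 mismatch: expected {expected_sha256}, got {actual}"
        )
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Use the safer "data" extraction filter where the interpreter has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
```

A caller would pass the archive path, the published digest, and the directory returned by `resolve_data_dir`. Because the digest is compared before anything is unpacked, a tampered archive raises `ValueError` and leaves the destination untouched.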
29 lines
1.1 KiB
Docker
# Base image pinned by digest
FROM quay.io/unstructured-io/base-images:wolfi-base-e48da6b@sha256:8ad3479e5dc87a86e4794350cca6385c01c6d110902c5b292d1a62e231be711b AS base

USER root

WORKDIR /app

# Copy requirements, library source, tests, and example documents into the image
COPY ./requirements requirements/
COPY unstructured unstructured
COPY test_unstructured test_unstructured
COPY example-docs example-docs

# Fix ownership, install the Ubuntu font, refresh the font cache, and link python3 to python3.11
RUN chown -R notebook-user:notebook-user /app && \
  apk add font-ubuntu && \
  fc-cache -fv && \
  ln -s /usr/bin/python3.11 /usr/bin/python3

USER notebook-user

# Install Python dependencies, then download NLTK data and run model initialization at build time
RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';' && \
  pip3.11 install unstructured.paddlepaddle && \
  python3.11 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()" && \
  python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
  python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"

ENV PATH="${PATH}:/home/notebook-user/.local/bin"
ENV TESSDATA_PREFIX=/usr/local/share/tessdata

CMD ["/bin/bash"]