unstructured/Dockerfile

# NLTK assets are packaged directly into this image (PR #3853): the data is
# stored at /home/notebook-user/nltk_data and located through the NLTK_DATA
# environment variable set below, so tokenize.py no longer downloads NLTK
# data from S3 and the container runs self-contained regardless of where it
# is deployed.
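#
# A minimal build-and-verify sketch for the baked-in NLTK assets (the image
# tag is illustrative, not part of the repo's tooling):
#   docker build -t unstructured-local .
#   docker run --rm unstructured-local python3 -c \
#     "import nltk; nltk.data.find('tokenizers/punkt_tab'); print('nltk data ok')"
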
FROM quay.io/unstructured-io/base-images:wolfi-base-latest AS base
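# The interpreter is overridable at build time, e.g. (version shown is
# illustrative, not a tested configuration):
#   docker build --build-arg PYTHON=python3.12 .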
ARG PYTHON=python3.11
ARG PIP="${PYTHON} -m pip"
USER root
WORKDIR /app
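# Bring the dependency manifests, package source, tests, and example
# documents into /app.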
COPY ./requirements requirements/
COPY unstructured unstructured
COPY test_unstructured test_unstructured
COPY example-docs example-docs
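# Give the unprivileged notebook-user ownership of /app, install fonts and
# git, refresh the font cache, and make sure /usr/bin/python3 resolves to the
# interpreter selected via $PYTHON.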
RUN chown -R notebook-user:notebook-user /app && \
    apk add font-ubuntu git && \
    fc-cache -fv && \
    [ -e /usr/bin/python3 ] || ln -s /usr/bin/$PYTHON /usr/bin/python3
USER notebook-user
# Append to PATH before running pip install to avoid warning logs; it also
# avoids issues with packages that need compilation during installation.
ENV PATH="${PATH}:/home/notebook-user/.local/bin"
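# Point tesseract at its language data; the tessdata directory is assumed to
# be provided by the base image.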
ENV TESSDATA_PREFIX=/usr/local/share/tessdata
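# NLTK looks up corpora through the NLTK_DATA environment variable; the
# directory is created and populated in the RUN step below.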
ENV NLTK_DATA=/home/notebook-user/nltk_data
# Install Python dependencies, download the required NLTK packages, and
# initialize the layout-detection and table-structure models so their weights
# are cached in the image at build time.
RUN find requirements/ -type f -name "*.txt" -exec $PIP install --no-cache-dir --user -r '{}' ';' && \
    mkdir -p ${NLTK_DATA} && \
    $PYTHON -m nltk.downloader -d ${NLTK_DATA} punkt_tab averaged_perceptron_tagger_eng && \
    $PYTHON -c "from unstructured.partition.model_init import initialize; initialize()" && \
    $PYTHON -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"
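# Note: the two $PYTHON -c invocations above run at build time so that model
# weights land in the image; a container started from this image should not
# need network access to load them on first use.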
CMD ["/bin/bash"]
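# Usage sketch (image tag as above, illustrative): start an interactive shell
# with `docker run --rm -it unstructured-local`, or override the command, e.g.
#   docker run --rm unstructured-local python3 -c \
#     "from unstructured.partition.auto import partition; print('ok')"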