unstructured/Dockerfile
Yao You 8f2a719873
Feat/refactor layoutelement textregion to vectorized data structure (#3881)
This PR refactors the data structure for `list[LayoutElement]` and
`list[TextRegion]` used in partition pdf/image files.

- new data structure replaces a list of objects with one object with
`numpy` array to store data
- this only affects partition internal steps and it doesn't change input
or output signature of `partition` function itself, i.e., `partition`
still returns `list[Element]`
- internally `list[LayoutElement]` -> `LayoutElements`;
`list[TextRegion]` -> `TextRegions`
- current refactor stops before clean up pdfminer elements inside
inferred layout elements -> the algorithm of clean up needs to be
refactored before the data structure refactor can move forward. So
current refactor converts the array data structure into list data
structure with `element_array.as_list()` call. This is the last step
before turning `list[LayoutElement]` into `list[Element]` as return
- a future PR will update this last step so that we build
`list[Element]` from `LayoutElements` data structure instead.

The goal of this PR is to replace the data structure as much as possible
without changing underlying logic. There are a few places where the
slicing or filtering logic was simple enough to be converted into vector
data structure operations. Those are refactored to be vector based. As a
result there is some small improvements observed in ingest test. This is
likely because the vector operations cleaned up some previous
inconsistency in data types and operations.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-01-23 17:11:38 +00:00

35 lines
1.3 KiB
Docker

FROM quay.io/unstructured-io/base-images:wolfi-base-latest AS base
ARG PYTHON=python3.11
ARG PIP="${PYTHON} -m pip"
USER root
WORKDIR /app
COPY ./requirements requirements/
COPY unstructured unstructured
COPY test_unstructured test_unstructured
COPY example-docs example-docs
RUN chown -R notebook-user:notebook-user /app && \
apk add font-ubuntu git && \
fc-cache -fv && \
[ -e /usr/bin/python3 ] || ln -s /usr/bin/$PYTHON /usr/bin/python3
USER notebook-user
# append PATH before pip install to avoid warning logs; it also avoids issues with packages that needs compilation during installation
ENV PATH="${PATH}:/home/notebook-user/.local/bin"
ENV TESSDATA_PREFIX=/usr/local/share/tessdata
ENV NLTK_DATA=/home/notebook-user/nltk_data
# Install Python dependencies and download required NLTK packages
RUN find requirements/ -type f -name "*.txt" -exec $PIP install --no-cache-dir --user -r '{}' ';' && \
mkdir -p ${NLTK_DATA} && \
$PYTHON -m nltk.downloader -d ${NLTK_DATA} punkt_tab averaged_perceptron_tagger_eng && \
$PYTHON -c "from unstructured.partition.model_init import initialize; initialize()" && \
$PYTHON -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"
CMD ["/bin/bash"]