luke-kucing 1c519efef5
Security Fixes - CVE Remediation (#4115)
Main Changes:

  1. Removed Clarifai Dependency
  - Completely removed the clarifai dependency which is no longer used in
    the codebase
  - Removed clarifai from the unstructured-ingest extras list in
    requirements/ingest/ingest.txt:1
  - Removed clarifai test script reference from
    test_unstructured_ingest/test-ingest-dest.sh:23

  2. Updated Dependencies to Resolve CVEs
  - pypdf: Updated from 6.1.1 → 6.1.3 (fixes GHSA-vr63-x8vc-m265)
  - pip: Added explicit upgrade to >=25.3 in Dockerfile (fixes
    GHSA-4xh5-x5gv-qwph)
  - uv: Addressed GHSA-8qf3-x8v5-2pj8 and GHSA-pqhf-p39g-3x64

  3. Dockerfile Security Enhancements (Dockerfile:17,28-29)
  - Added Alpine package upgrade for py3.12-pip
  - Added explicit pip upgrade step before installing Python dependencies

  4. General Dependency Updates
  Ran pip-compile across all requirement files, resulting in updates to:
  - cryptography: 46.0.2 → 46.0.3
  - psutil: 7.1.0 → 7.1.3
  - rapidfuzz: 3.14.1 → 3.14.3
  - regex: 2025.9.18 → 2025.11.3
  - wrapt: 1.17.3 → 2.0.0
  - Plus many other transitive dependencies across all extra requirement
    files

  5. Version Bump
  - Updated version from 0.18.16 → 0.18.17 in
    unstructured/__version__.py:1
  - Updated CHANGELOG.md with security fixes documentation

  Impact:

This PR resolves 4 CVEs total without introducing breaking changes,
making it a pure security maintenance release.
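
The `>=25.3` pip requirement listed under item 2 could be checked with a short shell sketch like the one below. It is illustrative only and not taken from the Dockerfile: `version_ge` is a hypothetical helper name, and it assumes a `sort` that supports `-V` (for example GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  # Sort the two versions ascending; $1 >= $2 exactly when the smaller one is $2.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example with a literal version string standing in for the installed pip version.
if version_ge "25.3" "25.3"; then
  echo "pip is new enough"
fi
```

To check a real installation, the first argument could be taken from `python -m pip --version | awk '{print $2}'`.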

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-11-06 21:01:32 +00:00

#!/usr/bin/env bash
set -u -o pipefail
SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
SKIPPED_FILES_LOG=$SCRIPT_DIR/skipped-files.txt
# If the file already exists, reset it
if [ -f "$SKIPPED_FILES_LOG" ]; then
  rm "$SKIPPED_FILES_LOG"
fi
touch "$SKIPPED_FILES_LOG"
cd "$SCRIPT_DIR"/.. || exit 1
# NOTE(crag): sets number of tesseract threads to 1 which may help with more reproducible outputs
export OMP_THREAD_LIMIT=1
all_tests=(
  'astradb.sh'
  'azure.sh'
  'azure-cognitive-search.sh'
  'box.sh'
  'chroma.sh'
  'delta-table.sh'
  'dropbox.sh'
  'elasticsearch.sh'
  'gcs.sh'
  'kafka-local.sh'
  'mongodb.sh'
  'opensearch.sh'
  'pgvector.sh'
  'pinecone.sh'
  'qdrant.sh'
  's3.sh'
  'sharepoint-embed-cog-index.sh'
  'sqlite.sh'
  'vectara.sh'
  'singlestore.sh'
  'weaviate.sh'
  'databricks-volumes.sh'
)
full_python_matrix_tests=(
  'azure.sh'
  'gcs.sh'
  's3.sh'
)
CURRENT_TEST="none"
function print_last_run() {
  if [ "$CURRENT_TEST" != "none" ]; then
    echo "Last ran script: $CURRENT_TEST"
  fi
  echo "######## SKIPPED TESTS: ########"
  cat "$SKIPPED_FILES_LOG"
}
trap print_last_run EXIT
python_version=$(python --version 2>&1)
tests_to_ignore=(
  'notion.sh'
  'dropbox.sh'
  'sharepoint.sh'
  'databricks-volumes.sh'
  'vectara.sh'
)
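# NOTE: illustrative sketch only. This helper is not part of the original script and
# nothing in this file calls it. Matching a name against a space-joined array with an
# unanchored regex (=~) lets 'box.sh' match inside 'dropbox.sh'; comparing element by
# element checks whole names instead.
# `array_contains` is a hypothetical name. Usage: array_contains NEEDLE "${array[@]}"
function array_contains() {
  local needle=$1
  shift
  local item
  for item in "$@"; do
    # The quoted right-hand side makes this a literal string comparison, not a pattern match.
    if [[ "$item" == "$needle" ]]; then
      return 0
    fi
  done
  return 1
}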
for test in "${all_tests[@]}"; do
  CURRENT_TEST="$test"
  # IF: python_version is not 3.10 (wildcarded to match any patch version) AND the current test is not in full_python_matrix_tests
  # Note: the array is expanded to a space-separated string and padded with spaces so the glob matches whole names only
  # (an unanchored regex match would let 'box.sh' match inside 'dropbox.sh')
  if [[ "$python_version" != "Python 3.10"* ]] && [[ " ${full_python_matrix_tests[*]} " != *" $test "* ]]; then
    echo "--------- SKIPPING SCRIPT $test ---------"
    continue
  fi
  echo "--------- RUNNING SCRIPT $test ---------"
  echo "Running ./test_unstructured_ingest/dest/$test"
  ./test_unstructured_ingest/dest/"$test"
  rc=$?
  if [[ $rc -eq 8 ]]; then
    # Exit code 8 means the test script skipped itself because of a missing env var
    echo "$test (skipped due to missing env var)" | tee -a "$SKIPPED_FILES_LOG"
  elif [[ " ${tests_to_ignore[*]} " == *" $test "* ]]; then
    # Tests in tests_to_ignore are run, but their exit code does not fail the build
    echo "$test (skipped checking error code: $rc)" | tee -a "$SKIPPED_FILES_LOG"
    continue
  elif [[ $rc -ne 0 ]]; then
    exit "$rc"
  fi
  echo "--------- FINISHED SCRIPT $test ---------"
done