mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00

This pull request adds NLTK data to the Docker image by pre-packaging the data to ensure a more reliable and efficient deployment process, as the required NLTK resources are readily available within the container. **Current updated solution:** - Dockerfile Update: Integrated NLTK data directly into the Docker image, ensuring that the API can operate independently of external - data sources. The data is stored at /home/notebook-user/nltk_data. - Environment Variable Setup: Configured the NLTK_PATH environment variable, enabling Python scripts to automatically locate and use the embedded NLTK data. This eliminates the need for manual configuration in deployment environments. - Code Cleanup: Removed outdated code in tokenize.py and related scripts that previously downloaded NLTK data from S3. This streamlines the codebase and removes unnecessary dependencies. - Script Updates: Updated tokenize.py and test_tokenize.py to utilize the NLTK_PATH variable, ensuring consistent access to the embedded data across all environments. - Dependency Elimination: Fully eliminated reliance on the S3 bucket for NLTK data, mitigating risks from network failures or access changes. - Improved System Reliability: By embedding assets within the Docker image, the API now has a self-contained setup that ensures consistent behavior regardless of deployment location. - Updated the Dockerfile to copy the local NLTK data to the appropriate directory within the container. - Adjusted the application setup to verify the presence of NLTK assets during the container build process.