9 Commits

Author SHA1 Message Date
Christine Straub
8378c26035
Feat/contain nltk assets in docker image (#3853)
This pull request adds NLTK data to the Docker image by pre-packaging
the data to ensure a more reliable and efficient deployment process, as
the required NLTK resources are readily available within the container.

**Current updated solution:**
- Dockerfile Update: Integrated NLTK data directly into the Docker
image, ensuring that the API can operate independently of external -
data sources. The data is stored at /home/notebook-user/nltk_data.
- Environment Variable Setup: Configured the NLTK_PATH environment
variable, enabling Python scripts to automatically locate and use the
embedded NLTK data. This eliminates the need for manual configuration in
deployment environments.
- Code Cleanup: Removed outdated code in tokenize.py and related scripts
that previously downloaded NLTK data from S3. This streamlines the
codebase and removes unnecessary dependencies.
- Script Updates: Updated tokenize.py and test_tokenize.py to utilize
the NLTK_PATH variable, ensuring consistent access to the embedded data
across all environments.
- Dependency Elimination: Fully eliminated reliance on the S3 bucket for
NLTK data, mitigating risks from network failures or access changes.
- Improved System Reliability: By embedding assets within the Docker
image, the API now has a self-contained setup that ensures consistent
behavior regardless of deployment location.
- Updated the Dockerfile to copy the local NLTK data to the appropriate
directory within the container.
- Adjusted the application setup to verify the presence of NLTK assets
during the container build process.
2025-01-08 22:00:13 +00:00
Nathan Van Gheem
0fb814db61
Use native ntlk download (#3796)
This PR changes how we download NLTK data to use the native nltk
downloader.

We had moved to our own hosted NLTK dataset because of this CVE:
https://nvd.nist.gov/vuln/detail/CVE-2024-39705

Ref: https://github.com/Unstructured-IO/unstructured/pull/3361

Latest versions of NLTK have fixed this issue:
https://github.com/nltk/nltk/blob/develop/ChangeLog
2024-12-02 19:30:28 +00:00
Matt Robinson
7437f0a084
fix(CVE-2024-39705): update to latest nltk version (#3512)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-13 09:39:29 -04:00
Matt Robinson
7b25dfc337
fix(CVE-2024-39705): remove nltk download (#3361)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download` . Removes `nltk.download` in favor of a `.tgz` file with
the appropriate NLTK data files and checking the SHA256 hash to validate
the download. An error now raises if `nltk.download` is invoked.

The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.

### Testing

1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages

download_nltk_packages()
```
3. Run `ls /tmp/nltk_test/nltk_data`. You should see the downloaded
data.

---------

Co-authored-by: Steve Canny <stcanny@gmail.com>
2024-07-08 22:55:36 +00:00
Matt Robinson
e017e99b5b
fix: correct nltk download arg order (#991)
* fix: correct download order to nltk args

* add smoke test for tokenizers
2023-07-28 11:29:59 -04:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
2023-02-23 17:19:13 +00:00
Matt Robinson
08e091c5a9
chore: Reorganize partition bricks under partition directory (#76)
* move partition_pdf to partition folder

* move partition.py

* refactor partioning bricks into partition diretory

* import to nlp for backward compatibility

* update docs

* update version and bump changelog

* fix typo in changelog

* update readme reference
2022-11-21 22:27:23 +00:00
Matt Robinson
5f40c78f25 Initial Release 2022-09-26 14:55:20 -07:00