12 Commits

Author SHA1 Message Date
Christine Straub
8378c26035
Feat/contain nltk assets in docker image (#3853)
This pull request adds NLTK data to the Docker image by pre-packaging
the data to ensure a more reliable and efficient deployment process, as
the required NLTK resources are readily available within the container.

**Current updated solution:**
- Dockerfile Update: Integrated NLTK data directly into the Docker
image, ensuring that the API can operate independently of external -
data sources. The data is stored at /home/notebook-user/nltk_data.
- Environment Variable Setup: Configured the NLTK_PATH environment
variable, enabling Python scripts to automatically locate and use the
embedded NLTK data. This eliminates the need for manual configuration in
deployment environments.
- Code Cleanup: Removed outdated code in tokenize.py and related scripts
that previously downloaded NLTK data from S3. This streamlines the
codebase and removes unnecessary dependencies.
- Script Updates: Updated tokenize.py and test_tokenize.py to utilize
the NLTK_PATH variable, ensuring consistent access to the embedded data
across all environments.
- Dependency Elimination: Fully eliminated reliance on the S3 bucket for
NLTK data, mitigating risks from network failures or access changes.
- Improved System Reliability: By embedding assets within the Docker
image, the API now has a self-contained setup that ensures consistent
behavior regardless of deployment location.
- Updated the Dockerfile to copy the local NLTK data to the appropriate
directory within the container.
- Adjusted the application setup to verify the presence of NLTK assets
during the container build process.
2025-01-08 22:00:13 +00:00
Nathan Van Gheem
0fb814db61
Use native ntlk download (#3796)
This PR changes how we download NLTK data to use the native nltk
downloader.

We had moved to our own hosted NLTK dataset because of this CVE:
https://nvd.nist.gov/vuln/detail/CVE-2024-39705

Ref: https://github.com/Unstructured-IO/unstructured/pull/3361

Latest versions of NLTK have fixed this issue:
https://github.com/nltk/nltk/blob/develop/ChangeLog
2024-12-02 19:30:28 +00:00
Matt Robinson
7437f0a084
fix(CVE-2024-39705): update to latest nltk version (#3512)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-13 09:39:29 -04:00
Matt Robinson
7b25dfc337
fix(CVE-2024-39705): remove nltk download (#3361)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download` . Removes `nltk.download` in favor of a `.tgz` file with
the appropriate NLTK data files and checking the SHA256 hash to validate
the download. An error now raises if `nltk.download` is invoked.

The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.

### Testing

1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages

download_nltk_packages()
```
3. Run `ls /tmp/nltk_test/nltk_data`. You should see the downloaded
data.

---------

Co-authored-by: Steve Canny <stcanny@gmail.com>
2024-07-08 22:55:36 +00:00
Roman Isecke
63861f537e
Add check for duplicate click options (#1775)
### Description
Given that many of the options associated with the `Click` based cli
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries to prevent
new configs from overwriting already added options.

### Issues that were found and fixes:
* duplicate api-key option set on Notion command conflicts with api key
used for unstructured api. Added notion prefix.
* retry logic configs had duplicates in biomed. Removed since this is
not handled by the pipeline.
2023-10-20 14:00:19 +00:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory
but we support python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules so the entire repo
was set to be inspected when the linters are run. Along with that
autoflake was added as a linter which has a lot of added benefits such
as removing unused imports for you that would currently break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
Matt Robinson
e017e99b5b
fix: correct nltk download arg order (#991)
* fix: correct download order to nltk args

* add smoke test for tokenizers
2023-07-28 11:29:59 -04:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
2023-02-23 17:19:13 +00:00
Matt Robinson
08e091c5a9
chore: Reorganize partition bricks under partition directory (#76)
* move partition_pdf to partition folder

* move partition.py

* refactor partioning bricks into partition diretory

* import to nlp for backward compatibility

* update docs

* update version and bump changelog

* fix typo in changelog

* update readme reference
2022-11-21 22:27:23 +00:00
Sebastian Laverde Alfonso
baa15d0098
feat: new partitioning brick that calls the document image analysis API (#68)
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py

* feat: new partition to call the document image analysis API

* fix: remove duplicated dependency on partition.py

* fix: linting error due to line-lenght > 100

* test: add test to call partition_pdf brick

* chore: new short example-doc pdf for speed up in test X8

* fix: add missing return statement to _read to pass check

* feat: new partitioning brick to call doc parse API

* docs: version update fix in CHANGELOG

* refactor: no nested ifs

* docs: documentation for new brick partition_pdf

* refactor: made tidy

* docs: minor doc refactor

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2022-11-16 17:48:30 +01:00
Matt Robinson
5f40c78f25 Initial Release 2022-09-26 14:55:20 -07:00