Removed the dependencies contained in `test.txt`, `dev.txt`, and
`constraints.txt` from the things that get installed in the docker
image. In order to keep testing the image (running the tests), I added a
step to the `docker-test` make target to install `test.txt` and
`dev.txt`. Thus we presumably get a smaller image (probably not much
smaller), reduce the dependency chain of our images, and have less
exposure to vulnerabilities while still testing as robustly as before.
Incidentally, I removed the `Dockerfile` for our ubuntu image, since it
made reference to non-existent make targets, which tells me it's stale
and wasn't being used.
### Review:
- Reviewer should ensure the dev and test dependencies are not being
installed in the docker image. One way to verify this is to check the
CI logs and note, e.g., that
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700)
is the first reference to `pytest` in the docker build and test logs,
after the image build is completed.
- Reviewer should ensure docker image is still being tested in CI and is
passing.
This PR refactors the data structures for `list[LayoutElement]` and
`list[TextRegion]` used in partitioning pdf/image files.
- the new data structure replaces a list of objects with a single object
that stores the data in `numpy` arrays
- this only affects partition's internal steps and doesn't change the
input or output signature of the `partition` function itself, i.e.,
`partition` still returns `list[Element]`
- internally `list[LayoutElement]` -> `LayoutElements`;
`list[TextRegion]` -> `TextRegions`
- the current refactor stops before cleaning up pdfminer elements inside
inferred layout elements -> the cleanup algorithm needs to be refactored
before the data structure refactor can move further. For now the
refactor converts the array data structure back into a list with an
`element_array.as_list()` call; this is the last step before turning
`list[LayoutElement]` into `list[Element]` for the return value
- a future PR will update this last step so that we build
`list[Element]` from the `LayoutElements` data structure instead.
The goal of this PR is to replace the data structure as much as possible
without changing the underlying logic. In a few places the slicing or
filtering logic was simple enough to be converted into vectorized
operations, and those were refactored to be vector based. As a result
there are some small improvements observed in the ingest tests, likely
because the vector operations cleaned up some previous inconsistencies
in data types and operations.
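As a rough illustration of the struct-of-arrays idea behind `LayoutElements`/`TextRegions` (the class and field names below are hypothetical, not the actual implementation), a container like this stores all coordinates in a single `numpy` array so slicing and filtering become vectorized operations:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ElementArray:
    """Illustrative struct-of-arrays container: one array for all boxes instead of n objects."""

    coords: np.ndarray  # shape (n, 4): [x1, y1, x2, y2] per element
    texts: np.ndarray = field(default_factory=lambda: np.empty(0, dtype=object))

    def slice(self, mask: np.ndarray) -> "ElementArray":
        """Vectorized filtering: keep only the rows selected by a boolean mask."""
        return ElementArray(coords=self.coords[mask], texts=self.texts[mask])

    def as_list(self) -> list:
        """Convert back to a plain list of per-element records, mirroring the final step above."""
        return [(tuple(box), text) for box, text in zip(self.coords, self.texts)]


boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20]])
texts = np.array(["header", "body"], dtype=object)
elements = ElementArray(coords=boxes, texts=texts)
wide = elements.slice(boxes[:, 2] - boxes[:, 0] > 12)  # vectorized width filter
```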
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
This pull request adds NLTK data to the Docker image by pre-packaging
the data to ensure a more reliable and efficient deployment process, as
the required NLTK resources are readily available within the container.
**Current updated solution:**
- Dockerfile Update: Integrated NLTK data directly into the Docker
image, ensuring that the API can operate independently of external data
sources. The data is stored at /home/notebook-user/nltk_data.
- Environment Variable Setup: Configured the NLTK_PATH environment
variable, enabling Python scripts to automatically locate and use the
embedded NLTK data. This eliminates the need for manual configuration in
deployment environments (a sketch follows this list).
- Code Cleanup: Removed outdated code in tokenize.py and related scripts
that previously downloaded NLTK data from S3. This streamlines the
codebase and removes unnecessary dependencies.
- Script Updates: Updated tokenize.py and test_tokenize.py to utilize
the NLTK_PATH variable, ensuring consistent access to the embedded data
across all environments.
- Dependency Elimination: Fully eliminated reliance on the S3 bucket for
NLTK data, mitigating risks from network failures or access changes.
- Improved System Reliability: By embedding assets within the Docker
image, the API now has a self-contained setup that ensures consistent
behavior regardless of deployment location.
- Updated the Dockerfile to copy the local NLTK data to the appropriate
directory within the container.
- Adjusted the application setup to verify the presence of NLTK assets
during the container build process.
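As a hedged sketch of how a script can point `nltk` at the data baked into the image (the `NLTK_PATH` variable name follows this PR's description; the exact wiring in `tokenize.py` may differ):

```python
import os

import nltk

# Prefer the data directory baked into the image, e.g. /home/notebook-user/nltk_data.
nltk_path = os.environ.get("NLTK_PATH")
if nltk_path and nltk_path not in nltk.data.path:
    nltk.data.path.insert(0, nltk_path)

# With the path registered, tokenizers resolve against the embedded data.
print(nltk.word_tokenize("NLTK data ships inside the image."))
```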
### Summary
Updates to the latest `wolfi-base` base image to pull in more recent
package versions. A notable update is that upgrading to
`libreoffice==24.2.5.2` resolves several CVEs.
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
This PR removes the `unstructured.paddlepaddle` fork. Previously, we
used the `unstructured.paddlepaddle` fork to support
`unstructured.paddleocr` on the arm64 architecture. But currently,
`unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work
on `arm64`, while `unstructured.paddleocr` with the latest version of
the original `paddlepaddle` works on both `amd64` and `arm64`
architectures.
### Testing
```python
import os

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename=<file_path>,
    strategy="hi_res",
    infer_table_structure=True,
)
```
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download`. Removes `nltk.download` in favor of a `.tgz` file with
the appropriate NLTK data files, and checks the SHA256 hash to validate
the download. An error is now raised if `nltk.download` is invoked.
The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.
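For context, a minimal sketch of the download-and-verify pattern described above; the URL, hash, and function name are placeholders rather than the actual values used in `unstructured.nlp.tokenize`:

```python
import hashlib
import tarfile
import tempfile
import urllib.request

# Placeholder values for illustration only.
NLTK_DATA_URL = "https://example.com/nltk_data.tgz"
NLTK_DATA_SHA256 = "0123456789abcdef..."


def download_nltk_data(target_dir: str) -> None:
    """Download the NLTK data archive, verify its SHA256 hash, then extract it."""
    with tempfile.NamedTemporaryFile(suffix=".tgz") as tmp:
        urllib.request.urlretrieve(NLTK_DATA_URL, tmp.name)

        sha256 = hashlib.sha256()
        with open(tmp.name, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        if sha256.hexdigest() != NLTK_DATA_SHA256:
            raise ValueError("SHA256 mismatch for downloaded NLTK data archive")

        with tarfile.open(tmp.name, "r:gz") as tar:
            tar.extractall(path=target_dir)
```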
### Testing
1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages
download_nltk_packages()
```
3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded
data.
---------
Co-authored-by: Steve Canny <stcanny@gmail.com>
### Summary
Updates to the latest version of the `wolfi-base` image. Changes
include:
- Version bumps to address CVEs
- `libreoffice` is now included in the `arm64` image, so `.doc` files
are now supported on `arm64`. `.ppt` files do not work with the
`libreoffice` package currently available on `wolfi-os`; we have
follow-on work to look into that.
- Updates the location of the `tesseract` `tessdata` files on the
`arm64` build. Closes #3290.
- Closes #3319 and adds `psutil` to the base dependencies.
### Testing
- `test_dockerfile` should continue to pass with the updates.
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, OCR, and final), plus dumping the OD model output
as a json file for better analysis.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>
### Summary
Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.
### Testing
Run `make docker-run-dev` and then `python3.11` once you're in. At that
point, you can try:
```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```
Stop the container once you're done.
Propagating the openssl revert made in the base image:
https://github.com/Unstructured-IO/base-images/pull/13
Note that I messed up and wrote over the existing 9.2-9 image. Any
current PRs will need to rebase in order to get a working Dockerfile.
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
### Description
As we add more and more steps to the pipeline (i.e. chunking, embedding,
table manipulation), it would help separate the responsibility of each
of these into their own processes, running each in parallel using json
files to share data across. This will also help guarantee data is
serializable if this code was used in an actual pipeline. Following is a
flow diagram of the proposed changes. As part of this change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it, or
not. Possible nodes at this moment:
* Doc factory: creates all the ingest docs via the source connector
* Source: reads/downloads all of the content to process to the local
filesystem to the location set by the `download_dir` parameter.
* Partition: runs partition on all of the downloaded content in json
format.
* Any number of reformat nodes that modify the partitioned content. This
can include chunking, embedding, etc.
* Write: push the final json into the destination via the destination
connector
* This pipeline relies on the ingest docs' information being available
via their serialization. An optimization was introduced with the
`IngestDocJsonMixin` which adds all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: it made sense to move some of the config parameters
between the read and partition configs once I explicitly divided the
responsibility of downloading vs. partitioning the content in the
pipeline.
* The pipeline class only makes the doc factory, source, and partition
nodes required, in keeping with the logic that has been supported so
far. All reformatting nodes and the write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline specific configs, but for now
what exists was used to minimize changes in this PR.
* Final step to copy the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing the parameters of that step
(e.g. partition configs) along with the previous step's output filename.
This allows a step's output to be reused _if_ none of its parameters
have changed and the content so far is the same (see the sketch after
this list).
* The only data that is shared and written to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.manager.DictProxy` to make sure any interaction with it
is behind a lock.
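A minimal sketch of the step-hashing and shared-dict ideas from the last two bullets (names are illustrative, not the actual ingest pipeline classes): each step derives its output filename from a hash of its own config plus the upstream filename, and the shared ingest-doc dict is a `multiprocessing.Manager` proxy so writes are synchronized:

```python
import hashlib
import json
import multiprocessing


def step_output_name(step_config: dict, upstream_filename: str) -> str:
    """Hash this step's parameters together with the previous step's output filename."""
    payload = json.dumps(step_config, sort_keys=True) + upstream_filename
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16] + ".json"


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    ingest_docs = manager.dict()  # DictProxy: mutations are serialized through the manager

    partition_config = {"strategy": "hi_res", "ocr_languages": "eng"}
    name = step_output_name(partition_config, "downloaded/report.pdf")
    ingest_docs["report.pdf"] = name
    print(name)
```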
### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands.
* All writers moved to using the class approach which extracts a lot of
the common code so there's less copy-paste when new runners are added.
* Use `@property` for source metadata on the base ingest doc to add
logic that calls `update_source_metadata` if it's still `None` at the
time it's fetched (sketched below).
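A small sketch of that lazy `@property` pattern (simplified, hypothetical class):

```python
class ExampleIngestDoc:
    def __init__(self):
        self._source_metadata = None

    def update_source_metadata(self) -> None:
        # Stand-in for the connector-specific metadata fetch.
        self._source_metadata = {"source": "example"}

    @property
    def source_metadata(self) -> dict:
        # Lazily populate metadata the first time it is read.
        if self._source_metadata is None:
            self.update_source_metadata()
        return self._source_metadata
```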
### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls`
field. This was removed from the fields captured by the `@dataclass`
decorator and added in a `__post_init__` method instead (see the sketch
after this list).
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment so was never caught.
* Fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata` so it was never being called.
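An illustrative sketch of the `__post_init__` fix (hypothetical class names, not the actual fsspec connector): the non-serializable class reference is excluded from the generated dataclass fields and attached after construction:

```python
from dataclasses import dataclass, field


class ExampleIngestDoc:
    """Stand-in for the per-document class a connector instantiates."""


@dataclass
class ExampleFsspecConnector:
    remote_url: str
    # Excluded from __init__ and repr so serialization does not try to capture the class object.
    ingest_doc_cls: type = field(init=False, repr=False, default=ExampleIngestDoc)

    def __post_init__(self):
        # Attach (or re-attach) the class reference after construction/deserialization.
        self.ingest_doc_cls = ExampleIngestDoc
```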
### Flow Diagram

### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* constraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-compile
* make tidy
* make tidy
* build from Rocky linux unstructured base image
* add qemu for arm
* comment out push while testing
* remove quotes
* Add arch
* bump login action
* add ARCH env var to the push step
* run only subset of tests on arm image
Tests on emulated arm are extremely slow. Likelihood of something breaking in the arm image only is minimal. I say that knowing I likely just jinxed us.
* re-enable push from main
* add a dnf cleanup
* version bump
* move from dev to minor version bump
Updated to the latest version of unstructured-inference. detectron2 is now implemented with onnxruntime, yay!
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* docker works
* more epub tests
* changelog version
* support epub + odt + rtf
* update dockerfile
* revert..
* install pandoc on ci env
* pandoc docker grab based on arch
* move arch into image
* move back to base image
* Fixes issue where detectron2 would not install on OSX
Tested on an Apple silicon based MacBook Pro. This installs tensorboard, which is required on OSX and arm-based CPUs for detectron2.
* Improve Arch detection for tensorboard
* remove makefile from commands in readme
pin tensorboard version
This connector takes a Slack channel id, token, and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into the expected output.
The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this, but the upstream unstructured-api does.
Bonus: docx mimetype added in lookup table.
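For reference, Python's `mimetypes` module consults system files such as `/etc/mime.types` when it initializes, which is why installing `mailcap` matters; a quick way to see this:

```python
import mimetypes

# /etc/mime.types is among the files the mimetypes module reads on init when present.
print("/etc/mime.types" in mimetypes.knownfiles)

# Result depends on the mapping files available on the system.
print(mimetypes.guess_type("report.docx"))
```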