unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-30 20:39:54 +00:00

Author	SHA1	Message	Date
Roman Isecke	4802332de0	Roman/optimize ingest ci (#1799 ) ### Description Currently the CI caches the CI dependencies but uses the hash of all files in `requirements/`. This isn't completely accurate since the ingest dependencies are installed in a later step and don't affect the cached environment. As part of this PR: * ingest dependencies were isolated into their own folder in `requirements/ingest/` * A new cache setup was introduced in the CI to restore the base cache -> install ingest dependencies -> cache it with a new id * new make target created to install all ingest dependencies via `pip install -r ...` * updates to Dockerfile to use `find ...` to install all dependencies, avoiding the need to update this when new deps are added. * update to pip-compile script to run over all `*.in` files in `requirements/`	2023-10-24 14:54:00 +00:00
Roman Isecke	2e1404e02c	refactor: unstructured ingest as a pipeline (#1551 ) ### Description As we add more and more steps to the pipeline (i.e. chunking, embedding, table manipulation), it would help to separate the responsibility of each of these into their own processes, running each in parallel using json files to share data across. This will also help guarantee data is serializable if this code was used in an actual pipeline. Following is a flow diagram of the proposed changes. As part of this change: * A parent pipeline class will be responsible for running each `node`, which can optionally be run via multiprocessing if it supports it, or not. Possible nodes at this moment: * Doc factory: creates all the ingest docs via the source connector * Source: reads/downloads all of the content to process to the local filesystem to the location set by the `download_dir` parameter. * Partition: runs partition on all of the downloaded content in json format. * Any number of reformat nodes that modify the partitioned content. This can include chunking, embedding, etc. * Write: push the final json into the destination via the destination connector * This pipeline relies on the information of the ingest docs to be available via their serialization. An optimization was introduced with the `IngestDocJsonMixin` which adds in all the `@property` fields to the serialized json already being created via the `DataClassJsonMixin` * For all intermediate steps (partitioning, reformatting), the content is saved to a dedicated location on the local filesystem. Right now it's set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`. * Minor changes: made sense to move some of the config parameters between the read and partition configs when I explicitly divided the responsibility to download vs partition the content in the pipeline. * The pipeline class only makes the doc factory, source and partition nodes required, keeping with the logic that has been supported so far. All reformatting nodes and write node are optional. * Long term, there should also be some changes to the base configs supported by the CLI to support pipeline specific configs, but for now what exists was used to minimize changes in this PR. * Final step to copy the final output to the location designated by the `_output_filename` value of the ingest doc. * Hashing occurs at each step by hashing the parameters of that step (i.e. partition configs) along with the previous step via the filename used. This allows each step to be the same _if_ all the parameters for it have not changed and the content so far is the same. * The only data that is shared and written to across processes is the dictionary of ingest json data. This dict is created using the `multiprocessing.managers.DictProxy` to make sure any interaction with it is behind a lock. ### Minor refactors included: * Utility methods added to extract configs from the click options * Utility method to add common options to click commands. * All writers moved to using the class approach which extracts a lot of the common code so there's less copy-paste when new runners are added. * Use `@property` for source metadata on base ingest doc to add logic to call `update_source_metadata` if it's still `None` at the time it's fetched. ### Additional bug fixes included * Fsspec connectors were not serializable due to the `ingest_doc_cls`. This was removed from the fields captured by the `@dataclass` decorator and added in a `__post_init__` method. * Various reddit connector params were missing. This doesn't have an explicit ingest test at the moment so was never caught. * Fsspec connector had the parent `update_source_metadata` misnamed as `update_source_metadata_metadata` so it was never being called. ### Flow Diagram ![ingest_pipeline](https://github.com/Unstructured-IO/unstructured/assets/136338424/be485606-cfe0-4931-8b81-c2bf569cf1e2)	2023-10-06 18:49:29 +00:00
Trevor Bossert	fd79c5262c	Bump Dockerfile to use latest base image (#1553 ) New base image includes security fixes. This is an ongoing process to remediate security issues as they are identified.	2023-09-27 22:30:32 +00:00
Trevor Bossert	915e4adcbb	Updating deps from base image (#1360 ) Updated versions of: Tesseract Leptonica Pandoc Testing: `make docker-build` `make docker-test`	2023-09-09 10:47:16 -07:00
Trevor Bossert	30cdc19cba	set sha for base image (#1276 ) Provides more consistency and integrity to base image by including sha	2023-09-01 18:30:32 +00:00
cragwolfe	69c2c62978	build(image): patch-level base-image bump (#1265 )	2023-09-01 05:48:47 +00:00
cragwolfe	a4ec43a85f	build(image): bump to rockylinux 9 (#1254 )	2023-08-30 19:10:08 -07:00
Trevor Bossert	e4535d29ca	Set user for container to same as api image. (#1239 ) This is security best practice, a user can override this with their own Dockerfile if required.	2023-08-30 01:01:44 +00:00
cragwolfe	ba70828f4a	build(image): bump Dockerfile to python3.10 (#1214 )	2023-08-27 18:30:17 -07:00
Roman Isecke	db8af4f5de	Roman/notion tests (#1072 ) ### Description * Add ingest test for Notion docs * Update default cache dir for connectors to include connector name. Makes debugging the cached content easier. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-21 15:16:50 -04:00
John	f63a66dbef	Capture section and chapter in the metadata for epubs under `epub_section` (#1005 ) Capture section and chapter in the metadata for epubs under epub_section. Closes Github issue #459	2023-08-12 21:02:06 +00:00
Matt Robinson	331c7faf38	build(deps): split up dependencies by document type (#986 ) * split dependencies by document type * make pip-compile with new requirements * add extra requirements to setup.py * add in all docs; re pip-compile * extra for all docs * add pandas to xlsx * dependency requires for tsv and csv * handling for doc, docx and odt * dependency check for pypandoc * required dependencies for pandoc files * xml and html * markdown * msg * add in pdf * add in pptx * add in excel * add lxml as base req * extra all docs for local inference * local inference installs all * pin pillow version * fixes for plain text tests * fixes for doc * update make commands * changelog and version * add xlrd * update pip-compile * pin numpy for python 3.8 support * more constraints * constraint on scipy * update install docs * constrain ipython * add outlook to pip-compile * more ipython constraints * add extras to dockerfile * pin office365 client * few doc tweaks * types as strings * last pip-compile * re pip-compile * make tidy * make tidy	2023-08-01 11:31:13 -04:00
David Potter	1542607892	feat: adds Box connector (#996 )	2023-08-01 01:10:10 +00:00
Trevor Bossert	6249e1553e	New base image with security patches (#869 ) * New base image with security patches * Bump version * remove line from changelog not code related	2023-06-30 19:14:06 -07:00
Roman Isecke	61ea00a06f	Update Dockerfile to use multistage build and cache layers (#785 ) * Update Dockerfile to use multistage build and cache layers * Fix Dockerfile	2023-06-21 13:12:45 -04:00
cragwolfe	2989f53358	chore: bump to python 3.8.17 (#766 ) The images pushed to quay.io will now have python 3.8.17 rather than python 3.8.15.	2023-06-16 11:17:03 -07:00
Yuming Long	b354e8eec6	Chore: Allow passing kwargs to request data field (#716 ) * bump again :( * update to kwarg * add test case * rename to request_kwargs * remove install detectron2 * pip compile * add changelog for remove detectron2 install * resolve weaviate import issue on python 3.9	2023-06-12 12:39:58 -04:00
Yuming Long	533689196b	Chore: bump base image to update tesseract version (#680 ) * dockerfile * changelog version * version bump	2023-06-06 17:01:16 +00:00
Trevor Bossert	cf70c86574	Build from rocky base image (#665 ) * build from Rocky linux unstructured base image * add qemu for arm * comment out push while testing * remove quotes * Add arch * bump login action * add ARCH env var to the push step * run only subset of tests on arm image Tests on emulated arm are extremely slow. Likelihood of something breaking in the arm image only is minimal. I say that knowing I likely just jinxed us. * re-enable push from main * add a dnf cleanup * version bump * move from dev to minor version bump	2023-06-01 12:16:04 -07:00
qued	d3600dd5da	build(deps): update inference version (#662 ) Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay! --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 13:50:15 -05:00
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630 ) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab based on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
Trevor Bossert	a78719666a	Build using base image (#625 ) This should speed up the builds a lot	2023-05-22 11:13:24 -07:00
ryannikolaidis	2fc4d37454	chore: pin inference version, bump deps, and update openssl (#551 )	2023-05-08 17:02:55 -07:00
Trevor Bossert	1ac72c6ee8	Fixes issue where detectron2 would not install on OSX (#552 ) * Fixes issue where detectron2 would not install on OSX Tested on Apple silicon based MacBook Pro. This installs tensorboard which is required on OSX and arm based cpu’s for detectron2. * Improve Arch detection for tensorboard * remove makefile from commands in readme pin tensorboard version	2023-05-05 17:16:28 -07:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
cragwolfe	bd01af2bac	build: add mimetypes DB to docker image (#455 ) The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does. Bonus: docx mimetype added in lookup table.	2023-04-07 13:59:29 -07:00
qued	4211dda360	build: sync detectron version (#440 ) * Update detectron2 version in Dockerfile * Update detectron2 version in docs	2023-04-03 18:47:43 -05:00
ryannikolaidis	59785e4332	chore: install all extras in Dockerfile (#419 ) * Adds step to install all extras * Adds smoke test of wikipedia ingest to validate in CI	2023-03-30 13:23:30 -07:00
ryannikolaidis	77b6fb2792	ci: update dockerfile to also add models and nltk (#418 )	2023-03-29 20:48:06 -07:00
ryannikolaidis	65fec954ba	ci: publish amd and arm images (#404 )	2023-03-29 07:02:39 +00:00
ryannikolaidis	1e39e1ac2a	ci: Adds workflow to publish docker builds (#377 )	2023-03-19 21:53:05 +00:00
Amanda Cameron	edb847ce0b	adding Dockerfile (#359 )	2023-03-14 13:40:01 -07:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for email * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

34 Commits
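
The per-step hashing described in the pipeline refactor commit (#1551) can be illustrated with a minimal sketch. This is not the project's implementation: the helper `step_hash`, the step names, and the parameter dictionaries are all hypothetical, chosen only to show how hashing a step's parameters together with the previous step's filename lets an unchanged step resolve to the same output name.

```python
import hashlib
import json


def step_hash(step_name: str, params: dict, previous_filename: str) -> str:
    """Hypothetical helper: derive a step's cache key from its own
    parameters plus the filename produced by the previous step."""
    payload = json.dumps(
        {"step": step_name, "params": params, "previous": previous_filename},
        sort_keys=True,  # stable key order so equal params hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Chain three illustrative steps; each filename feeds the next hash.
# All parameter names and values here are assumptions for the example.
source_file = step_hash("source", {"download_dir": "/tmp/downloads"}, "doc-1") + ".json"
partition_file = step_hash("partition", {"strategy": "fast"}, source_file) + ".json"
chunk_file = step_hash("chunk", {"max_characters": 500}, partition_file) + ".json"
```

Because each filename is an input to the next hash, changing one step's parameters changes that step's filename and every filename after it, while earlier steps keep their names and can be reused.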