34 Commits

Author SHA1 Message Date
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
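
A minimal sketch of the cache-key idea described above, assuming the key is a content hash over the requirement files. The function name `requirements_hash` and the `*.txt` layout are illustrative assumptions, not the actual CI configuration:

```python
import hashlib
import tempfile
from pathlib import Path


def requirements_hash(root: Path, include_ingest: bool) -> str:
    """Hash requirement files under `root`, optionally skipping `ingest/`."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*.txt")):
        rel = path.relative_to(root)
        # Ingest deps are installed in a later step, so the base key skips them.
        if not include_ingest and rel.parts[0] == "ingest":
            continue
        digest.update(rel.as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Demo with a throwaway directory (file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "requirements"
    (root / "ingest").mkdir(parents=True)
    (root / "base.txt").write_text("requests\n")
    (root / "ingest" / "s3.txt").write_text("s3fs\n")
    base_before = requirements_hash(root, include_ingest=False)
    full_before = requirements_hash(root, include_ingest=True)

    # Changing only an ingest requirement leaves the base key untouched.
    (root / "ingest" / "s3.txt").write_text("s3fs\nfsspec\n")
    base_after = requirements_hash(root, include_ingest=False)
    full_after = requirements_hash(root, include_ingest=True)
```

With this split, a change to an ingest requirement would invalidate only the second (ingest) cache, not the base one.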
2023-10-24 14:54:00 +00:00
Roman Isecke
2e1404e02c
refactor: unstructured ingest as a pipeline (#1551)
### Description
As we add more and more steps to the pipeline (e.g. chunking, embedding,
table manipulation), it would help to separate the responsibility of each
of these into its own process, running each in parallel and using json
files to share data between them. This will also help guarantee data is
serializable if this code were used in an actual pipeline. A flow diagram
of the proposed changes is included below. As part of this change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it, or
not. Possible nodes at this moment:
  * Doc factory: creates all the ingest docs via the source connector
  * Source: reads/downloads all of the content to process to the local
    filesystem to the location set by the `download_dir` parameter.
  * Partition: runs partition on all of the downloaded content in json
    format.
  * Any number of reformat nodes that modify the partitioned content. This
    can include chunking, embedding, etc.
  * Write: push the final json into the destination via the destination
    connector
* This pipeline relies on the information of the ingest docs to be
available via their serialization. An optimization was introduced with
the `IngestDocJsonMixin` which adds in all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: made sense to move some of the config parameters
between the read and partition configs when I explicitly divided the
responsibility to download vs partition the content in the pipeline.
* The pipeline class only makes the doc factory, source and partition
nodes required, keeping with the logic that has been supported so far.
All reformatting nodes and write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline specific configs, but for now
what exists was used to minimize changes in this PR.
* A final step copies the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing the parameters of that step
(e.g. partition configs) along with the previous step via the filename
used. This keeps the hash of each step the same _if_ all the parameters for
it have not changed and the content so far is the same.
* The only data that is shared and written to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.managers.DictProxy` to make sure any interaction with it
is behind a lock.
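
The per-step hashing described in the list above could look roughly like the following. This is a sketch under assumptions: `step_hash`, the 16-character truncation, and the parameter names (`strategy`, `max_characters`) are hypothetical and only illustrate chaining each step's parameters with the previous step's filename:

```python
import hashlib
import json


def step_hash(previous_name: str, params: dict) -> str:
    """Derive a step's filename stem from its params and the prior step's name."""
    payload = json.dumps(
        {"previous": previous_name, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


source_name = "example-doc"  # hypothetical name of the downloaded file

# Each step folds in the name produced by the step before it.
partition_name = step_hash(source_name, {"strategy": "fast"})
chunk_name = step_hash(partition_name, {"max_characters": 500})

# Changing an upstream parameter changes every downstream name as well.
other_partition_name = step_hash(source_name, {"strategy": "hi_res"})
other_chunk_name = step_hash(other_partition_name, {"max_characters": 500})
```

Because each name depends on everything upstream, an unchanged name suggests the step's inputs and parameters are unchanged.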

### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands.
* All writers moved to using the class approach which extracts a lot of
the common code so there's less copy-paste when new runners are added.
* Use `@property` for source metadata on base ingest doc to add logic to
call `update_source_metadata` if it's still `None` at the time it's
fetched.
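
As an illustration of the `@property` approach for source metadata, here is a minimal sketch. The class names `ExampleIngestDoc` and `SourceMetadata`, the `fetch_count` counter, and the `"v1"` value are assumptions made for the example; only `update_source_metadata` is named in the description above:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceMetadata:
    version: Optional[str] = None


@dataclass
class ExampleIngestDoc:
    _source_metadata: Optional[SourceMetadata] = field(default=None, init=False)
    fetch_count: int = field(default=0, init=False)  # demo-only counter

    def update_source_metadata(self) -> None:
        # A real connector would query the source here; this is a stand-in.
        self.fetch_count += 1
        self._source_metadata = SourceMetadata(version="v1")

    @property
    def source_metadata(self) -> Optional[SourceMetadata]:
        # Populate lazily if nothing has been fetched yet.
        if self._source_metadata is None:
            self.update_source_metadata()
        return self._source_metadata


doc = ExampleIngestDoc()
first = doc.source_metadata
second = doc.source_metadata  # served from the stored value, no second fetch
```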


### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls`.
This was removed from the fields captured by the `@dataclass` decorator
and added in a `__post_init__` method.
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment so was never caught.
* Fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata` so it was never being called.
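
The serialization fix for `ingest_doc_cls` can be sketched as follows, assuming the problem was that a class reference stored as a dataclass field cannot be dumped to json. `ExampleConnector`, `ExampleIngestDoc`, and `remote_url` are hypothetical names, and plain `dataclasses.asdict` stands in for the json mixin used in the real code:

```python
import json
from dataclasses import asdict, dataclass
from typing import Type


class ExampleIngestDoc:
    """Placeholder for a connector-specific ingest doc class."""


@dataclass
class ExampleConnector:
    remote_url: str

    def __post_init__(self) -> None:
        # Set as a plain attribute rather than a dataclass field, so
        # field-based serialization never sees the class reference.
        self.ingest_doc_cls: Type[ExampleIngestDoc] = ExampleIngestDoc


connector = ExampleConnector(remote_url="s3://bucket/path")
serialized = json.dumps(asdict(connector))
```

The attribute is still available on the instance, but only `remote_url` ends up in the serialized output.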

### Flow Diagram


![ingest_pipeline](https://github.com/Unstructured-IO/unstructured/assets/136338424/be485606-cfe0-4931-8b81-c2bf569cf1e2)
2023-10-06 18:49:29 +00:00
Trevor Bossert
fd79c5262c
Bump Dockerfile to use latest base image (#1553)
New base image includes security fixes. This is an ongoing process to
remediate security issues as they are identified.
2023-09-27 22:30:32 +00:00
Trevor Bossert
915e4adcbb
Updating deps from base image (#1360)
Updated versions of:

Tesseract
Leptonica
Pandoc


Testing:
`make docker-build`
`make docker-test`
2023-09-09 10:47:16 -07:00
Trevor Bossert
30cdc19cba
set sha for base image (#1276)
Provides more consistency and integrity to the base image by including its sha
2023-09-01 18:30:32 +00:00
cragwolfe
69c2c62978
build(image): patch-level base-image bump (#1265) 2023-09-01 05:48:47 +00:00
cragwolfe
a4ec43a85f
build(image): bump to rockylinux 9 (#1254) 2023-08-30 19:10:08 -07:00
Trevor Bossert
e4535d29ca
Set user for container to same as api image. (#1239)
This is a security best practice. A user can override this with their own
Dockerfile if required.
2023-08-30 01:01:44 +00:00
cragwolfe
ba70828f4a
build(image): bump Dockerfile to python3.10 (#1214) 2023-08-27 18:30:17 -07:00
Roman Isecke
db8af4f5de
Roman/notion tests (#1072)
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-21 15:16:50 -04:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes GitHub issue #459
2023-08-12 21:02:06 +00:00
Matt Robinson
331c7faf38
build(deps): split up dependencies by document type (#986)
* split dependencies by document type

* make pip-compile with new requirements

* add extra requirements to setup.py

* add in all docs; re pip-compile

* extra for all docs

* add pandas to xlsx

* dependency requires for tsv and csv

* handling for doc, docx and odt

* dependency check for pypandoc

* required dependencies for pandoc files

* xml and html

* markdown

* msg

* add in pdf

* add in pptx

* add in excel

* add lxml as base req

* extra all docs for local inference

* local inference installs all

* pin pillow version

* fixes for plain text tests

* fixes for doc

* update make commands

* changelog and version

* add xlrd

* update pip-compile

* pin numpy for python 3.8 support

* more constraints

* constraint on scipy

* update install docs

* constrain ipython

* add outlook to pip-compile

* more ipython constraints

* add extras to dockerfile

* pin office365 client

* few doc tweaks

* types as strings

* last pip-compile

* re pip-compile

* make tidy

* make tidy
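
Once dependencies are split by document type, a partitioner needs to check that its optional packages are present. A minimal sketch of such a check follows; the decorator name `require_modules`, the `extra` argument, and the extra name in the error message are assumptions for illustration, not the library's actual helper:

```python
import importlib.util
from functools import wraps


def require_modules(*modules: str, extra: str):
    """Raise a helpful ImportError if any top-level module is not installed."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            missing = [m for m in modules if importlib.util.find_spec(m) is None]
            if missing:
                raise ImportError(
                    f"Missing {', '.join(missing)}; "
                    f'try: pip install "unstructured[{extra}]"'
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_modules("json", extra="example")  # json is stdlib, so this passes
def parse_text(text: str) -> str:
    return text.strip()
```

The check runs at call time, so importing the package itself does not fail when an optional dependency is absent.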
2023-08-01 11:31:13 -04:00
David Potter
1542607892
feat: adds Box connector (#996) 2023-08-01 01:10:10 +00:00
Trevor Bossert
6249e1553e
New base image with security patches (#869)
* New base image with security patches

* Bump version

* remove line from changelog

not code related
2023-06-30 19:14:06 -07:00
Roman Isecke
61ea00a06f
Update Dockerfile to use multistage build and cache layers (#785)
* Update Dockerfile to use multistage build and cache layers

* Fix Dockerfile
2023-06-21 13:12:45 -04:00
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed to quay.io will now have python 3.8.17 rather than python 3.8.15.
2023-06-16 11:17:03 -07:00
Yuming Long
b354e8eec6
Chore: Allow passing kwargs to request data field (#716)
* bump again :(

* update to kwarg

* add test case

* rename to request_kwargs

* remove install detectron2

* pip compile

* add changelog for remove detectron2 install

* resolve weaviate import issue on python 3.9
2023-06-12 12:39:58 -04:00
Yuming Long
533689196b
Chore: bump base image to update tesseract version (#680)
* dockerfile

* changelog version

* version bump
2023-06-06 17:01:16 +00:00
Trevor Bossert
cf70c86574
Build from rocky base image (#665)
* build from Rocky linux unstructured base image

* add qemu for arm

* comment out push while testing

* remove quotes

* Add arch

* bump login action

* add ARCH env var to the push step

* run only subset of tests on arm image

Tests on emulated arm are extremely slow.  Likelihood of something breaking in the arm image only is minimal.  I say that knowing I likely just jinxed us.

* re-enable push from main

* add a dnf cleanup

* version bump

* move from dev to minor version bump
2023-06-01 12:16:04 -07:00
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 13:50:15 -05:00
Yuming Long
fc59a043b7
Chore: Support epub tests in docker image (#630)
* docker works

* more epub tests

* changelog version

* support epub + odt + rtf

* update dockerfile

* revert..

* install pandoc on ci env

* pandoc docker grab based on arch

* move arch into image

* move back to base image
2023-05-26 15:38:48 -04:00
Trevor Bossert
a78719666a
Build using base image (#625)
This should speed up the builds a lot
2023-05-22 11:13:24 -07:00
ryannikolaidis
2fc4d37454
chore: pin inference version, bump deps, and update openssl (#551) 2023-05-08 17:02:55 -07:00
Trevor Bossert
1ac72c6ee8
Fixes issue where detectron2 would not install on OSX (#552)
* Fixes issue where detectron2 would not install on OSX

Tested on Apple silicon based MacBook Pro.  This installs tensorboard which is required on OSX and arm based CPUs for detectron2.

* Improve Arch detection for tensorboard

* remove makefile from commands in readme

pin tensorboard version
2023-05-05 17:16:28 -07:00
Trevor Bossert
cff7f4fd5a
Slack connector (#462)
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
2023-04-16 19:34:43 +00:00
cragwolfe
bd01af2bac
build: add mimetypes DB to docker image (#455)
The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does.

Bonus: docx mimetype added in lookup table.
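
For context, Python's standard `mimetypes` module lists `/etc/mime.types` among the files it consults, which is why providing that file in the image matters. The sketch below also registers a docx type with `mimetypes.add_type`; that call is an assumption used for illustration and is not necessarily how the lookup table in this commit was changed:

```python
import mimetypes

DOCX_MIME = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)

# /etc/mime.types is one of the well-known files the module looks for.
reads_etc_mime_types = "/etc/mime.types" in mimetypes.knownfiles

# Register the docx type explicitly so the lookup works even when the
# system database is absent (illustrative, see the note above).
mimetypes.add_type(DOCX_MIME, ".docx")
guessed_type, _ = mimetypes.guess_type("report.docx")
```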
2023-04-07 13:59:29 -07:00
qued
4211dda360
build: sync detectron version (#440)
* Update detectron2 version in Dockerfile
* Update detectron2 version in docs
2023-04-03 18:47:43 -05:00
ryannikolaidis
59785e4332
chore: install all extras in Dockerfile (#419)
* Adds step to install all extras
* Adds smoke test of wikipedia ingest to validate in CI
2023-03-30 13:23:30 -07:00
ryannikolaidis
77b6fb2792
ci: update dockerfile to also add models and nltk (#418) 2023-03-29 20:48:06 -07:00
ryannikolaidis
65fec954ba
ci: publish amd and arm images (#404) 2023-03-29 07:02:39 +00:00
ryannikolaidis
1e39e1ac2a
ci: Adds workflow to publish docker builds (#377) 2023-03-19 21:53:05 +00:00
Amanda Cameron
edb847ce0b
adding Dockerfile (#359) 2023-03-14 13:40:01 -07:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for email

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
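
The fallback from `application/octet-stream` to file-extension logic mentioned in the list above can be sketched like this. The function `detect_filetype` and both lookup tables are hypothetical and much smaller than anything a real detector would need; the MIME type is passed in directly so the sketch does not depend on libmagic:

```python
from pathlib import Path
from typing import Optional

# Illustrative tables only.
MIME_TO_TYPE = {
    "text/html": "html",
    "application/pdf": "pdf",
    "message/rfc822": "eml",
}
EXTENSION_TO_TYPE = {".docx": "docx", ".xlsx": "xlsx"}


def detect_filetype(mime_type: str, filename: Optional[str] = None) -> str:
    """Map a detected MIME type to a file type, using the extension as a fallback."""
    if mime_type in MIME_TO_TYPE:
        return MIME_TO_TYPE[mime_type]
    # Office files are often reported as a generic binary stream, so fall
    # back to the extension in that case.
    if mime_type == "application/octet-stream" and filename:
        return EXTENSION_TO_TYPE.get(Path(filename).suffix.lower(), "unknown")
    return "unknown"
```

For example, `detect_filetype("application/octet-stream", "report.DOCX")` resolves to `"docx"` through the extension table, while an unrecognized type falls through to `"unknown"`.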
2023-01-09 16:15:14 -05:00
Matt Robinson
5f40c78f25 Initial Release 2022-09-26 14:55:20 -07:00