10 Commits

Author SHA1 Message Date
qued
9a239fa18b
build: remove test and dev deps from docker image (#3969)
Removed the dependencies contained in `test.txt`, `dev.txt`, and
`constraints.txt` from the things that get installed in the docker
image. In order to keep testing the image (running the tests), I added a
step to the `docker-test` make target to install `test.txt` and
`dev.txt`. Thus we presumably get a smaller image (probably not much
smaller), reduce the dependency chain or our images, and have less
exposure to vulnerabilities while still testing as robustly as before.

Incidentally, I removed the `Dockerfile` for our ubuntu image, since it
made reference to non-existent make targets, which tells me it's stale
and wasn't being used.

### Review:
- Reviewer should ensure the dev and test dependencies are not being
installed in the docker image. One way to check is to check the logs in
CI, and note, e.g. that
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700)
is the first reference to `pytest` in the docker build and test logs,
after the image build is completed.
- Reviewer should ensure docker image is still being tested in CI and is
passing.
2025-03-27 18:41:11 +00:00
Christine Straub
d0211cc41f
build: downgrade nltk version (#3527)
This PR aims to roll back `nltk` to `3.8.1` which bumped to `3.8.2` in
https://github.com/Unstructured-IO/unstructured/pull/3512 because
`3.8.2` is no longer available in PyPI due to some
issues(https://github.com/nltk/nltk/issues/3301)
2024-08-15 16:35:21 -07:00
Matt Robinson
2d965fd65e
build: switch arm64 image to wolfi-base (#3268)
### Summary

Updates the `arm64` build to use the same `Dockerfile` as `amd64`, since
there are now upstream base images for `wolfi-base` for both
architectures. The legacy `rockylinux-9.4` is now stashed in a
subdirectory the `docker` subdirectory and is no longer built in CI, but
is available is users would like to build it themselves.

Additionally, this PR includes a fix to symlink `python3` to
`python3.11`, which had caused a CI failure
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9619486931/job/26535697755).

BREAKING CHANGE: the `arm64` image no longer supports `.doc`, `.pptx`,
or `.xls` because we do not yet have a `libreoffice` `apk` built for
`wolfi-base`. We intend to address that as a follow on. All other
filetypes work.

### Testing

Successfully docker builds, tests, and smoke tests for
[amd64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610735?pr=3268)
and
[arm64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610341?pr=3268)
on the feature branch (with publish disabled).
2024-06-22 05:10:29 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
Ahmet Melek
b7674fb97e
feat: confluence connector (cloud) (#906)
* Add confluence connector and an example script

* add test script, add dependency installations

* add authentication secret variables for ci tests and actions

* add dependency installation commands for workflows

* add dependency installation commands for workflows

* Update ingest test fixtures (#907)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add add ingest test fixtures update workflow for python 3.10, update example script with dummy values

* change workflow name to avoid confusion

* change workflow name to avoid confusion

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions

* Update ingest test fixtures (#911)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* revert back the test python version matrix

* recompile dependencies

* modifications for shellcheck

* update changelog and version

* changelog and version

* remove comments

* Update ingest test fixtures (#915)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add the option to state the number of spaces to be fetched

* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users

* add help message

* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test

* change test names

* rename connector arg

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* change arg name for connector

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* add comment to example

* change arg names

* add new tests to ingest test

* shellcheck remove redundant statement

* Update ingest test fixtures (#932)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* Update ingest test fixtures (#936)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* linting

* change file extensions to parse as html

* Update ingest test fixtures (#943)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* remove old fixtures

* update version to 0.8.2-dev3

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed quay.io will now have python 3.8.17 rather than python 3.8.15.
2023-06-16 11:17:03 -07:00
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh

Updated ingest tests procedure:

* Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name>
* Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00
ryannikolaidis
dabda67c8f
fix: ingest-test-fixtures-update script to pass env vars (#697) 2023-06-08 04:48:49 +00:00
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 13:50:15 -05:00
cragwolfe
7b44bcd6e0
build: script to update all ingest fixtures, add azure ingest fixtures (#367)
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 .
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
2023-04-11 00:11:50 -07:00