11 Commits

Author SHA1 Message Date
Matt Robinson
db8617872b
build: image and dependency updates; fix tesseract files locations (#3310)
### Summary

Updates to the latest version of the `wolfi-base` image. Changes
include:
- Version bumps to address CVEs
- `libreoffice` is now included in the `arm64`. `.doc` files are now
supported for `arm64`. `.ppt` do not work with the `libreoffice` package
currently available on `wolfi-os`. We have follow on work to look into
that.
- Updates the location of the `tesseract` `tessdata` files on the
`arm64` build. Closes #3290.
- Closes #3319 and addes `psutil` to the base dependencies.

### Testing

- `test_dockerfile` should continue to pass with the updates.
2024-07-01 19:39:32 +00:00
Matt Robinson
2d965fd65e
build: switch arm64 image to wolfi-base (#3268)
### Summary

Updates the `arm64` build to use the same `Dockerfile` as `amd64`, since
there are now upstream base images for `wolfi-base` for both
architectures. The legacy `rockylinux-9.4` is now stashed in a
subdirectory the `docker` subdirectory and is no longer built in CI, but
is available is users would like to build it themselves.

Additionally, this PR includes a fix to symlink `python3` to
`python3.11`, which had caused a CI failure
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9619486931/job/26535697755).

BREAKING CHANGE: the `arm64` image no longer supports `.doc`, `.pptx`,
or `.xls` because we do not yet have a `libreoffice` `apk` built for
`wolfi-base`. We intend to address that as a follow on. All other
filetypes work.

### Testing

Successfully docker builds, tests, and smoke tests for
[amd64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610735?pr=3268)
and
[arm64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610341?pr=3268)
on the feature branch (with publish disabled).
2024-06-22 05:10:29 +00:00
Matt Robinson
9cd0e706ab
fix: reenable arm64 builds for docker (#3045)
### Summary

Closes #3034 and reenables ARM64 in the docker build and publish job.
This was taken out in #3039 because we've only build `libreoffice` for
AMD64 and not ARM64. If Chainguard publishes an `apk` for `libreoffice`,
we can support a Chainguard image for both architectures. The smoke test
now differs for both architectures, to reflect differences in the
directory structure.

### Testing

Build and publish ran successfully for ARM64 (job
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907497))
and AMD64 (job
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907826)).
2024-05-17 19:27:20 +00:00
Matt Robinson
612905e311
build: wolfi base image for Dockerfile (#3016)
### Summary

Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.

### Testing

Run `make docker-run-dev` and then `python3.11` once you're in. And that
point, you can try:

```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```

Stop the container once you're done.
2024-05-15 22:53:15 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
ryannikolaidis
e08936b6fb
chore: update all bash scripts to use shebang: /usr/bin/env bash (#779) 2023-06-20 16:00:55 -07:00
cragwolfe
aaea6358f6
build(deps): bump pip (#558) 2023-05-08 23:08:10 -07:00
ryannikolaidis
ee52a749c3
fix: docker smoke test on build (#457) 2023-04-06 10:03:42 -07:00
ryannikolaidis
ef9fb79ed4
chore: build with registry as cache (#454) 2023-04-06 00:34:07 -07:00
Amanda Cameron
edb847ce0b
adding Dockerfile (#359) 2023-03-14 13:40:01 -07:00