PACKAGE_NAME := unstructured
PIP_VERSION := 23.2.1
CURRENT_DIR := $(shell pwd)
ARCH := $(shell uname -m)
PYTHON ?= python3
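# Usage sketch (an illustration, not part of the original file): because PYTHON is assigned
# with ?=, a value from the environment or the command line takes precedence, e.g.
#   make install-base PYTHON=python3.11
# The interpreter name python3.11 is only an example.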

.PHONY: help
help: Makefile
	@sed -n 's/^\(## \)\([a-zA-Z]\)/\2/p' $<


###########
# Install #
###########

## install-base: installs core requirements needed for text processing bricks
.PHONY: install-base
install-base: install-base-pip-packages install-nltk-models

## install: installs all test, dev, and experimental requirements
.PHONY: install
install: install-base-pip-packages install-dev install-nltk-models install-test install-huggingface install-all-docs

.PHONY: install-ci
install-ci: install-base-pip-packages install-nltk-models install-huggingface install-all-docs install-test install-pandoc install-paddleocr

.PHONY: install-base-ci
install-base-ci: install-base-pip-packages install-nltk-models install-test install-pandoc

.PHONY: install-base-pip-packages
install-base-pip-packages:
	${PYTHON} -m pip install pip==${PIP_VERSION}
	${PYTHON} -m pip install -r requirements/base.txt

.PHONY: install-huggingface
install-huggingface:
	${PYTHON} -m pip install pip==${PIP_VERSION}
	${PYTHON} -m pip install -r requirements/huggingface.txt

.PHONY: install-nltk-models
install-nltk-models:
	${PYTHON} -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()"

.PHONY: install-test
install-test:
	${PYTHON} -m pip install -r requirements/test.txt
	# NOTE(yao) - CI seems to always install tesseract to test, so it would make sense to also require
	# pytesseract installation into the virtual env for testing
	${PYTHON} -m pip install unstructured_pytesseract
	# ${PYTHON} -m pip install argilla==1.28.0 -c requirements/deps/constraints.txt
	# NOTE(robinson) - Installing weaviate-client separately here because the requests
	# version conflicts with label_studio_sdk
	${PYTHON} -m pip install weaviate-client -c requirements/deps/constraints.txt

.PHONY: install-dev
install-dev:
	${PYTHON} -m pip install -r requirements/dev.txt

.PHONY: install-build
install-build:
	${PYTHON} -m pip install -r requirements/build.txt

.PHONY: install-csv
install-csv:
	${PYTHON} -m pip install -r requirements/extra-csv.txt

.PHONY: install-docx
install-docx:
	${PYTHON} -m pip install -r requirements/extra-docx.txt

.PHONY: install-epub
install-epub:
	${PYTHON} -m pip install -r requirements/extra-epub.txt

.PHONY: install-odt
install-odt:
	${PYTHON} -m pip install -r requirements/extra-odt.txt

.PHONY: install-pypandoc
install-pypandoc:
	${PYTHON} -m pip install -r requirements/extra-pandoc.txt

.PHONY: install-paddleocr
install-paddleocr:
	${PYTHON} -m pip install -r requirements/extra-paddleocr.txt

.PHONY: install-markdown
install-markdown:
	${PYTHON} -m pip install -r requirements/extra-markdown.txt

.PHONY: install-pdf-image
install-pdf-image:
	${PYTHON} -m pip install -r requirements/extra-pdf-image.txt

.PHONY: install-pptx
install-pptx:
	${PYTHON} -m pip install -r requirements/extra-pptx.txt

.PHONY: install-xlsx
install-xlsx:
	${PYTHON} -m pip install -r requirements/extra-xlsx.txt

.PHONY: install-all-docs
install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-pdf-image install-pptx install-xlsx

.PHONY: install-ingest
install-ingest:
	${PYTHON} -m pip install -r requirements/ingest/ingest.txt

## install-local-inference: installs requirements for local inference
.PHONY: install-local-inference
install-local-inference: install install-all-docs

.PHONY: install-pandoc
install-pandoc:
	ARCH=${ARCH} ./scripts/install-pandoc.sh

## pip-compile: compiles all base/dev/test requirements
.PHONY: pip-compile
pip-compile:
	@scripts/pip-compile.sh

## install-project-local: install unstructured into your local python environment
.PHONY: install-project-local
install-project-local: install
	# MAYBE TODO: fail if already exists?
	${PYTHON} -m pip install -e .

## uninstall-project-local: uninstall unstructured from your local python environment
.PHONY: uninstall-project-local
uninstall-project-local:
	${PYTHON} -m pip uninstall ${PACKAGE_NAME}

#################
# Test and Lint #
#################

export CI ?= false
export UNSTRUCTURED_INCLUDE_DEBUG_METADATA ?= false

## test: runs all unittests
.PHONY: test
test:
	PYTHONPATH=. CI=$(CI) \
		UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) ${PYTHON} -m pytest -n auto test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing --durations=40

.PHONY: test-unstructured-api-unit
test-unstructured-api-unit:
	scripts/test-unstructured-api-unit.sh

.PHONY: test-no-extras
test-no-extras:
	PYTHONPATH=. CI=$(CI) \
		UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) ${PYTHON} -m pytest -n auto \
		test_${PACKAGE_NAME}/partition/test_text.py \
		test_${PACKAGE_NAME}/partition/test_email.py \
		test_${PACKAGE_NAME}/partition/html/test_partition.py \
		test_${PACKAGE_NAME}/partition/test_xml.py

.PHONY: test-extra-csv
test-extra-csv:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto \
		test_unstructured/partition/test_csv.py \
		test_unstructured/partition/test_tsv.py

.PHONY: test-extra-docx
test-extra-docx:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto \
		test_unstructured/partition/test_doc.py \
		test_unstructured/partition/test_docx.py

.PHONY: test-extra-epub
test-extra-epub:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto test_unstructured/partition/test_epub.py

.PHONY: test-extra-markdown
test-extra-markdown:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto test_unstructured/partition/test_md.py

.PHONY: test-extra-odt
test-extra-odt:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto test_unstructured/partition/test_odt.py

.PHONY: test-extra-pdf-image
test-extra-pdf-image:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto test_unstructured/partition/pdf_image

.PHONY: test-extra-pptx
test-extra-pptx:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto \
		test_unstructured/partition/test_ppt.py \
		test_unstructured/partition/test_pptx.py

.PHONY: test-extra-pypandoc
test-extra-pypandoc:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto \
		test_unstructured/partition/test_org.py \
		test_unstructured/partition/test_rst.py \
		test_unstructured/partition/test_rtf.py

.PHONY: test-extra-xlsx
test-extra-xlsx:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto test_unstructured/partition/test_xlsx.py

.PHONY: test-text-extraction-evaluate
test-text-extraction-evaluate:
	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest -n auto test_unstructured/metrics/test_text_extraction.py

## check: runs linters (includes tests)
.PHONY: check
check: check-ruff check-black check-flake8 check-version

.PHONY: check-shfmt
check-shfmt:
	shfmt -i 2 -d .

.PHONY: check-black
check-black:
	${PYTHON} -m black . --check --line-length=100

.PHONY: check-flake8
check-flake8:
	${PYTHON} -m flake8 .

.PHONY: check-licenses
check-licenses:
	@scripts/check-licenses.sh

.PHONY: check-ruff
check-ruff:
	# -- ruff options are determined by pyproject.toml --
	ruff check .

.PHONY: check-autoflake
check-autoflake:
	autoflake --check-diff .

## check-scripts: run shellcheck
.PHONY: check-scripts
check-scripts:
	# Fail if any of these files have warnings
	scripts/shellcheck.sh

## check-version: run check to ensure version in CHANGELOG.md matches version in package
.PHONY: check-version
check-version:
	# Fail if syncing version would produce changes
	scripts/version-sync.sh -c \
		-f "unstructured/__version__.py" semver

## tidy: run ruff --fix-only, autoflake, and black
.PHONY: tidy
tidy: tidy-python

.PHONY: tidy-shell
tidy-shell:
	shfmt -i 2 -l -w .

.PHONY: tidy-python
tidy-python:
	ruff check . --fix-only || true
	autoflake --in-place .
	black --line-length=100 .

## version-sync: update __version__.py with most recent version from CHANGELOG.md
.PHONY: version-sync
version-sync:
	scripts/version-sync.sh \
		-f "unstructured/__version__.py" semver

.PHONY: check-coverage
check-coverage:
	${PYTHON} -m coverage report --fail-under=90

## check-deps: check consistency of dependencies
.PHONY: check-deps
check-deps:
	scripts/consistent-deps.sh

.PHONY: check-extras
check-extras:
	scripts/check-extras.sh

##########
# Docker #
##########

# Docker targets are provided for convenience only and are not required in a standard development environment

DOCKER_IMAGE ?= unstructured:dev

.PHONY: docker-build
docker-build:
	PIP_VERSION=${PIP_VERSION} DOCKER_IMAGE_NAME=${DOCKER_IMAGE} ./scripts/docker-build.sh

.PHONY: docker-start-bash
docker-start-bash:
	docker run -ti --rm ${DOCKER_IMAGE}

.PHONY: docker-start-dev
docker-start-dev:
	docker run --rm \
	-v ${CURRENT_DIR}:/mnt/local_unstructued \
	-ti ${DOCKER_IMAGE}

.PHONY: docker-test
docker-test:
	docker run --rm \
	-v ${CURRENT_DIR}/test_unstructured:/home/notebook-user/test_unstructured \
	-v ${CURRENT_DIR}/test_unstructured_ingest:/home/notebook-user/test_unstructured_ingest \
	$(if $(wildcard uns_test_env_file),--env-file uns_test_env_file,) \
	$(DOCKER_IMAGE) \
	bash -c "pip install -r requirements/test.txt -r requirements/dev.txt && \
	CI=$(CI) \
	UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) \
	python3 -m pytest $(if $(TEST_FILE),$(TEST_FILE),test_unstructured)"
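# Usage sketch (not part of the original recipe): TEST_FILE is optional; when it is unset,
# the whole test_unstructured directory is run. Assuming the image has already been built,
# a single test module could be selected like this:
#   make docker-test TEST_FILE=test_unstructured/partition/test_text.py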

.PHONY: docker-smoke-test
docker-smoke-test:
	DOCKER_IMAGE=${DOCKER_IMAGE} ./scripts/docker-smoke-test.sh


###########
# Jupyter #
###########

.PHONY: docker-jupyter-notebook
docker-jupyter-notebook:
	docker run -p 8888:8888 --mount type=bind,source=$(realpath .),target=/home --entrypoint jupyter-notebook -t --rm ${DOCKER_IMAGE} --allow-root --port 8888 --ip 0.0.0.0 --NotebookApp.token='' --NotebookApp.password=''


.PHONY: run-jupyter
run-jupyter:
	PYTHONPATH=$(realpath .) JUPYTER_PATH=$(realpath .) jupyter-notebook --NotebookApp.token='' --NotebookApp.password=''


#########
# Other #
#########

.PHONY: html-fixtures-update
html-fixtures-update:
	rm -r test_unstructured_ingest/expected-structured-output-html && \
	test_unstructured_ingest/structured-json-to-html.sh test_unstructured_ingest/expected-structured-output-html