unstructured/Makefile

PACKAGE_NAME := unstructured
PIP_VERSION := 23.2.1
CURRENT_DIR := $(shell pwd)
ARCH := $(shell uname -m)
PYTHON ?= python3
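# Usage sketch: PYTHON is assigned with ?=, so a value from the environment or the
# make command line takes precedence over the python3 default. The interpreter name
# below is only an illustration:
#   make install-base PYTHON=python3.11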
.PHONY: help
help: Makefile
@sed -n 's/^\(## \)\([a-zA-Z]\)/\2/p' $<
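# NOTE: a sketch of the convention the sed expression above relies on, not an
# additional target. A comment of the form
#   ## <target>: <description>
# is echoed by `make help` with the leading "## " stripped; single-# comments and
# the ### banners do not match. The target name below is hypothetical:
#   ## my-target: one-line description shown in the help output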
###########
# Install #
###########
## install-base: installs core requirements needed for text processing bricks
.PHONY: install-base
install-base: install-base-pip-packages install-nltk-models
## install: installs all test, dev, and experimental requirements
.PHONY: install
install: install-base-pip-packages install-dev install-nltk-models install-test install-huggingface install-all-docs
.PHONY: install-ci
install-ci: install-base-pip-packages install-nltk-models install-huggingface install-all-docs install-test install-pandoc install-paddleocr
.PHONY: install-base-ci
install-base-ci: install-base-pip-packages install-nltk-models install-test install-pandoc
.PHONY: install-base-pip-packages
install-base-pip-packages:
${PYTHON} -m pip install pip==${PIP_VERSION}
${PYTHON} -m pip install -r requirements/base.txt
.PHONY: install-huggingface
install-huggingface:
${PYTHON} -m pip install pip==${PIP_VERSION}
${PYTHON} -m pip install -r requirements/huggingface.txt
.PHONY: install-nltk-models
install-nltk-models:
${PYTHON} -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()"
.PHONY: install-test
install-test:
${PYTHON} -m pip install -r requirements/test.txt
# NOTE(yao) - CI always seems to install tesseract for testing, so it makes sense to also
# require pytesseract to be installed into the virtual env for testing
${PYTHON} -m pip install unstructured_pytesseract
# ${PYTHON} -m pip install argilla==1.28.0 -c requirements/deps/constraints.txt
# NOTE(robinson) - Installing weaviate-client separately here because the requests
# version conflicts with label_studio_sdk
${PYTHON} -m pip install weaviate-client -c requirements/deps/constraints.txt
.PHONY: install-dev
install-dev:
${PYTHON} -m pip install -r requirements/dev.txt
.PHONY: install-build
install-build:
${PYTHON} -m pip install -r requirements/build.txt
.PHONY: install-csv
install-csv:
${PYTHON} -m pip install -r requirements/extra-csv.txt
.PHONY: install-docx
install-docx:
${PYTHON} -m pip install -r requirements/extra-docx.txt
.PHONY: install-epub
install-epub:
${PYTHON} -m pip install -r requirements/extra-epub.txt
.PHONY: install-odt
install-odt:
${PYTHON} -m pip install -r requirements/extra-odt.txt
.PHONY: install-pypandoc
install-pypandoc:
${PYTHON} -m pip install -r requirements/extra-pandoc.txt
.PHONY: install-paddleocr
install-paddleocr:
${PYTHON} -m pip install -r requirements/extra-paddleocr.txt
.PHONY: install-markdown
install-markdown:
${PYTHON} -m pip install -r requirements/extra-markdown.txt
.PHONY: install-pdf-image
install-pdf-image:
${PYTHON} -m pip install -r requirements/extra-pdf-image.txt
.PHONY: install-pptx
install-pptx:
${PYTHON} -m pip install -r requirements/extra-pptx.txt
.PHONY: install-xlsx
install-xlsx:
${PYTHON} -m pip install -r requirements/extra-xlsx.txt
.PHONY: install-all-docs
install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-pdf-image install-pptx install-xlsx
.PHONY: install-ingest
install-ingest:
${PYTHON} -m pip install -r requirements/ingest/ingest.txt
## install-local-inference: installs requirements for local inference
.PHONY: install-local-inference
install-local-inference: install install-all-docs
.PHONY: install-pandoc
install-pandoc:
ARCH=${ARCH} ./scripts/install-pandoc.sh
## pip-compile: compiles all base/dev/test requirements
.PHONY: pip-compile
pip-compile:
@scripts/pip-compile.sh
## install-project-local: install unstructured into your local python environment
.PHONY: install-project-local
install-project-local: install
# MAYBE TODO: fail if already exists?
${PYTHON} -m pip install -e .
## uninstall-project-local: uninstall unstructured from your local python environment
.PHONY: uninstall-project-local
uninstall-project-local:
${PYTHON} -m pip uninstall ${PACKAGE_NAME}
#################
# Test and Lint #
#################
export CI ?= false
export UNSTRUCTURED_INCLUDE_DEBUG_METADATA ?= false
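# Usage sketch: both variables use ?=, so they can be overridden per invocation, and
# `export` passes them into the environment of every recipe. For example:
#   make test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true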
## test: runs all unit tests
.PHONY: test
test:
PYTHONPATH=. CI=$(CI) \
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) ${PYTHON} -m pytest test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing --durations=40
.PHONY: test-unstructured-api-unit
test-unstructured-api-unit:
scripts/test-unstructured-api-unit.sh
.PHONY: test-no-extras
test-no-extras:
PYTHONPATH=. CI=$(CI) \
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) ${PYTHON} -m pytest \
test_${PACKAGE_NAME}/partition/test_text.py \
test_${PACKAGE_NAME}/partition/test_email.py \
test_${PACKAGE_NAME}/partition/html/test_partition.py \
test_${PACKAGE_NAME}/partition/test_xml.py
.PHONY: test-extra-csv
test-extra-csv:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest \
test_unstructured/partition/test_csv.py \
test_unstructured/partition/test_tsv.py
.PHONY: test-extra-docx
test-extra-docx:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest \
test_unstructured/partition/test_doc.py \
test_unstructured/partition/test_docx.py
.PHONY: test-extra-epub
test-extra-epub:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/partition/test_epub.py
.PHONY: test-extra-markdown
test-extra-markdown:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/partition/test_md.py
.PHONY: test-extra-odt
test-extra-odt:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/partition/test_odt.py
.PHONY: test-extra-pdf-image
test-extra-pdf-image:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/partition/pdf_image
.PHONY: test-extra-pptx
test-extra-pptx:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest \
test_unstructured/partition/test_ppt.py \
test_unstructured/partition/test_pptx.py
.PHONY: test-extra-pypandoc
test-extra-pypandoc:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest \
test_unstructured/partition/test_org.py \
test_unstructured/partition/test_rst.py \
test_unstructured/partition/test_rtf.py
.PHONY: test-extra-xlsx
test-extra-xlsx:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/partition/test_xlsx.py
.PHONY: test-text-extraction-evaluate
test-text-extraction-evaluate:
PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/metrics/test_text_extraction.py
## check: runs linters (ruff, black, flake8) over the whole repo, test code included, plus the version check
.PHONY: check
check: check-ruff check-black check-flake8 check-version
.PHONY: check-shfmt
check-shfmt:
shfmt -i 2 -d .
.PHONY: check-black
check-black:
${PYTHON} -m black . --check --line-length=100
.PHONY: check-flake8
check-flake8:
${PYTHON} -m flake8 .
.PHONY: check-licenses
check-licenses:
@scripts/check-licenses.sh
.PHONY: check-ruff
check-ruff:
# -- ruff options are determined by pyproject.toml --
ruff check .
.PHONY: check-autoflake
check-autoflake:
autoflake --check-diff .
## check-scripts: run shellcheck
.PHONY: check-scripts
check-scripts:
# Fail if any of these files have warnings
scripts/shellcheck.sh
## check-version: run check to ensure version in CHANGELOG.md matches version in package
.PHONY: check-version
check-version:
# Fail if syncing version would produce changes
scripts/version-sync.sh -c \
-f "unstructured/__version__.py" semver
## tidy: run ruff (fix-only), autoflake, and black
.PHONY: tidy
tidy: tidy-python
.PHONY: tidy-shell
tidy-shell:
shfmt -i 2 -l -w .
.PHONY: tidy-python
tidy-python:
ruff check . --fix-only || true
autoflake --in-place .
black --line-length=100 .
## version-sync: update __version__.py with most recent version from CHANGELOG.md
.PHONY: version-sync
version-sync:
scripts/version-sync.sh \
-f "unstructured/__version__.py" semver
.PHONY: check-coverage
check-coverage:
${PYTHON} -m coverage report --fail-under=90
## check-deps: check consistency of dependencies
.PHONY: check-deps
check-deps:
scripts/consistent-deps.sh
.PHONY: check-extras
check-extras:
scripts/check-extras.sh
##########
# Docker #
##########
# Docker targets are provided for convenience only and are not required in a standard development environment
DOCKER_IMAGE ?= unstructured:dev
.PHONY: docker-build
docker-build:
PIP_VERSION=${PIP_VERSION} DOCKER_IMAGE_NAME=${DOCKER_IMAGE} ./scripts/docker-build.sh
.PHONY: docker-start-bash
docker-start-bash:
docker run -ti --rm ${DOCKER_IMAGE}
.PHONY: docker-start-dev
docker-start-dev:
docker run --rm \
-v ${CURRENT_DIR}:/mnt/local_unstructued \
-ti ${DOCKER_IMAGE}
.PHONY: docker-test
docker-test:
docker run --rm \
-v ${CURRENT_DIR}/test_unstructured:/home/notebook-user/test_unstructured \
-v ${CURRENT_DIR}/test_unstructured_ingest:/home/notebook-user/test_unstructured_ingest \
$(if $(wildcard uns_test_env_file),--env-file uns_test_env_file,) \
$(DOCKER_IMAGE) \
bash -c "CI=$(CI) \
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) \
python3 -m pytest $(if $(TEST_FILE),$(TEST_FILE),test_unstructured)"
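# Usage sketch: TEST_FILE narrows the run to one path inside the container; when it is
# unset, the whole test_unstructured directory is run. The path below is borrowed from
# the test-no-extras target and assumes the image was already built (make docker-build):
#   make docker-test TEST_FILE=test_unstructured/partition/test_text.py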
.PHONY: docker-smoke-test
docker-smoke-test:
DOCKER_IMAGE=${DOCKER_IMAGE} ./scripts/docker-smoke-test.sh
###########
# Jupyter #
###########
.PHONY: docker-jupyter-notebook
docker-jupyter-notebook:
docker run -p 8888:8888 --mount type=bind,source=$(realpath .),target=/home --entrypoint jupyter-notebook -t --rm ${DOCKER_IMAGE} --allow-root --port 8888 --ip 0.0.0.0 --NotebookApp.token='' --NotebookApp.password=''
.PHONY: run-jupyter
run-jupyter:
PYTHONPATH=$(realpath .) JUPYTER_PATH=$(realpath .) jupyter-notebook --NotebookApp.token='' --NotebookApp.password=''
#########
# Other #
#########
.PHONY: html-fixtures-update
html-fixtures-update:
test_unstructured_ingest/structured-json-to-html.sh test_unstructured_ingest/expected-structured-output-html