unstructured/requirements/constraints.in

####################################################################################################
# This file can house global constraints that aren't *direct* requirements of the package or any
# extras. Putting a dependency here will only affect dependency sets that contain it -- in other
# words, a package listed here is not installed unless some requirement actually pulls it in.
####################################################################################################
# NOTE(alan): Pinning to avoid conflicts with downstream ingest-s3
urllib3<1.27, >=1.25.4
# consistency with local-inference-pin
protobuf<4.24
# NOTE(robinson) - Required pins for security scans
jupyter-core>=4.11.2
wheel>=0.38.1
# NOTE(robinson) - The following pins are to address
# vulnerabilities in dependency scans
certifi>=2023.7.22
# From pycocotools in local-inference
pyparsing<3.1.0
# NOTE(robinson) - Numpy dropped Python 3.8 support in 1.25.0
numpy<1.25.0
scipy<1.11.0
IPython<8.13
# NOTE(alan) Pinned to avoid error that occurs with 2.4.3:
# AttributeError: 'ResourcePath' object has no attribute 'collection'
Office365-REST-Python-Client<2.4.3
# NOTE(trevor) `unstructured-inference` is set in extra-pdf-image.in to allow
# unstructured-inference to be upgraded when unstructured library is upgraded
# https://github.com/Unstructured-IO/unstructured/issues/1458
# unstructured-inference
# NOTE(klaijan) - Moved pin from test.in
# pinning to avoid error in argilla library
pydantic<2
# unable to build wheel for arm on 0.3.3+
safetensors<=0.3.2
# use the known compatible versions of weaviate-client and unstructured.pytesseract
unstructured.pytesseract>=0.3.12
weaviate-client==3.23.2
# NOTE(yuming) - pinning to avoid conflict with paddle install
matplotlib==3.7.2
# NOTE(crag) - pin to available pandas for python 3.8 (at least in CI)
fsspec==2023.9.1
pandas<2.0.4
# langchain limits this to 3.1.7
anyio==3.1.7