unstructured/requirements/huggingface.txt

#
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
#    pip-compile requirements/huggingface.in
#
certifi==2023.7.22
    # via
    #   -c requirements/base.txt
    #   -c requirements/constraints.in
    #   requests
charset-normalizer==3.2.0
    # via
    #   -c requirements/base.txt
    #   requests
click==8.1.7
    # via
    #   -c requirements/base.txt
    #   sacremoses
filelock==3.12.4
    # via
    #   huggingface-hub
    #   torch
    #   transformers
fsspec==2023.9.1
    # via huggingface-hub
huggingface-hub==0.17.2
    # via transformers
idna==3.4
    # via
    #   -c requirements/base.txt
    #   requests
jinja2==3.1.2
    # via torch
joblib==1.3.2
    # via
    #   -c requirements/base.txt
    #   sacremoses
langdetect==1.0.9
    # via -r requirements/huggingface.in
markupsafe==2.1.3
    # via jinja2
mpmath==1.3.0
    # via sympy
networkx==3.1
    # via torch
numpy==1.24.4
    # via
    #   -c requirements/constraints.in
    #   transformers
packaging==23.1
    # via
    #   -c requirements/base.txt
    #   huggingface-hub
    #   transformers
pyyaml==6.0.1
    # via
    #   huggingface-hub
    #   transformers
regex==2023.8.8
    # via
    #   -c requirements/base.txt
    #   sacremoses
    #   transformers
requests==2.31.0
    # via
    #   -c requirements/base.txt
    #   huggingface-hub
    #   transformers
sacremoses==0.0.53
    # via -r requirements/huggingface.in
safetensors==0.3.2
    # via
    #   -c requirements/constraints.in
    #   transformers
sentencepiece==0.1.99
    # via -r requirements/huggingface.in
six==1.16.0
    # via
    #   langdetect
    #   sacremoses
sympy==1.12
    # via torch
tokenizers==0.13.3
    # via transformers
torch==2.0.1
    # via -r requirements/huggingface.in
tqdm==4.66.1
    # via
    #   -c requirements/base.txt
    #   huggingface-hub
    #   sacremoses
    #   transformers
transformers==4.33.2
    # via -r requirements/huggingface.in
typing-extensions==4.8.0
    # via
    #   -c requirements/base.txt
    #   huggingface-hub
    #   torch
urllib3==1.26.16
    # via
    #   -c requirements/base.txt
    #   -c requirements/constraints.in
    #   requests