unstructured/requirements/ingest/elasticsearch.txt

#
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile --output-file=ingest/elasticsearch.txt ingest/elasticsearch.in
#
certifi==2023.7.22
# via
# -c ingest/../base.txt
# -c ingest/../constraints.in
# elastic-transport
elastic-transport==8.10.0
# via elasticsearch
elasticsearch==8.11.0
# via -r ingest/elasticsearch.in
jq==1.6.0
# via -r ingest/elasticsearch.in
urllib3==1.26.18
# via
# -c ingest/../base.txt
# -c ingest/../constraints.in
# elastic-transport