Performance

This is a collection of tools helpful for inspecting and tracking performance of the Unstructured library.

The benchmarking script lets a user measure how long partitioning takes against a fixed set of test documents and store those results in S3, tagged with the architecture, instance type, and git hash.

The profiling script lets a user inspect how time and memory are spent across called functions when partitioning a given document.

Install

Benchmarking requires no additional dependencies and should work without any initial setup. Profiling has a few dependencies, which can be installed with:

pip install -r scripts/performance/requirements.txt
npm install -g speedscope

The second dependency, speedscope, provides a tool for viewing profiling results from py-spy locally. Alternatively, you can drop the resulting *.speedscope profile file into https://www.speedscope.app/ to view the results online.
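As a rough illustration of the local viewing option, the sketch below opens a profile with the speedscope command-line tool when it is available and otherwise falls back to a hint. The file name profile.speedscope and the PROFILE_FILE and VIEW_HINT variables are hypothetical placeholders, not names used by the scripts in this directory.

```shell
# Hypothetical profile path; substitute the *.speedscope file you actually produced.
PROFILE_FILE="${PROFILE_FILE:-profile.speedscope}"

if ! command -v speedscope >/dev/null 2>&1; then
  # The CLI is not installed: fall back to the hosted viewer.
  VIEW_HINT="speedscope not installed; drop $PROFILE_FILE into https://www.speedscope.app/"
elif [ ! -f "$PROFILE_FILE" ]; then
  # Nothing to view yet.
  VIEW_HINT="no profile found at $PROFILE_FILE; run the profiling script first"
else
  # Open the profile in the local speedscope viewer.
  VIEW_HINT="opening $PROFILE_FILE with speedscope"
  speedscope "$PROFILE_FILE"
fi
echo "$VIEW_HINT"
```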

Run

Benchmark

Export / assign desired environment variable settings:

  • DOCKER_TEST: Set to true to run benchmark inside a Docker container (default: false)
  • NUM_ITERATIONS: Number of iterations for benchmark (e.g., 100) (default: 3)
  • INSTANCE_TYPE: Type of benchmark instance (e.g., "c5.xlarge") (default: unspecified)
  • PUBLISH_RESULTS: Set to true to publish results to S3 bucket (default: false)

Usage: ./scripts/performance/benchmark.sh
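The settings above might be combined as in the following sketch, which is meant to be run from the repository root. The variable names and the script path come from this README; the specific values (10 iterations, "c5.xlarge") are only illustrative assumptions, and BENCHMARK_SCRIPT is a helper variable introduced for this example.

```shell
# Illustrative values only; each variable keeps any value already set in the environment.
export DOCKER_TEST="${DOCKER_TEST:-false}"          # run on the host rather than in Docker
export NUM_ITERATIONS="${NUM_ITERATIONS:-10}"       # assumed example value
export INSTANCE_TYPE="${INSTANCE_TYPE:-c5.xlarge}"  # label stored with the results
export PUBLISH_RESULTS="${PUBLISH_RESULTS:-false}"  # keep results local, skip the S3 upload

# Helper variable for this sketch; the path is the one given under Usage.
BENCHMARK_SCRIPT="./scripts/performance/benchmark.sh"

if [ -x "$BENCHMARK_SCRIPT" ]; then
  "$BENCHMARK_SCRIPT"
else
  echo "benchmark script not found; run this from the repository root" >&2
fi
```

Leaving PUBLISH_RESULTS at false is a reasonable choice for a first local run, since nothing is written to S3.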

Profile

Export / assign desired environment variable settings:

  • DOCKER_TEST: Set to true to run profiling inside a Docker container (default: false)

Usage:

on Linux: ./scripts/performance/profile.sh

on macOS: sudo -E ./scripts/performance/profile.sh (py-spy requires superuser privileges to run on macOS)

  • Run the script and choose the profiling mode: 'run' or 'view'.
  • In the 'run' mode, you can profile custom files or select existing test files.
  • In the 'view' mode, you can view previously generated profiling results.
  • The script supports time profiling with cProfile and memory profiling with memray.
  • Users can choose different visualization options such as flamegraphs, tables, trees, summaries, and statistics.
  • Test documents are synced from an S3 bucket to a local directory before running the profiles.
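The Linux and macOS invocations above differ only in the sudo -E prefix, so a small wrapper can choose between them. This is a minimal sketch that assumes uname -s reports Darwin on macOS; PROFILE_SCRIPT and PROFILE_CMD are helper names introduced here, and the command is printed rather than executed because the profiling script itself is interactive.

```shell
# Helper variable for this sketch; the path is the one given under Usage.
PROFILE_SCRIPT="./scripts/performance/profile.sh"

# Pick the invocation for the current platform.
case "$(uname -s)" in
  Darwin)
    # macOS: run with sudo, and use -E to preserve exported settings such as DOCKER_TEST.
    PROFILE_CMD="sudo -E $PROFILE_SCRIPT"
    ;;
  *)
    # Linux and other platforms: run the script directly.
    PROFILE_CMD="$PROFILE_SCRIPT"
    ;;
esac

echo "Run from the repository root: $PROFILE_CMD"
```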