mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-02 20:07:27 +00:00


Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.

2023-09-19 15:32:46 -07:00

Name               Last commit                                                              Date
docs               test: add benchmark script (#638)                                        2023-06-05 09:14:43 -07:00
warmup_docs        test: add benchmark script (#638)                                        2023-06-05 09:14:43 -07:00
.gitignore         test: add benchmark script (#638)                                        2023-06-05 09:14:43 -07:00
benchmark-local.sh chore: update all bash scripts to use shebang: /usr/bin/env bash (#779)  2023-06-20 16:00:55 -07:00
benchmark.sh       chore: update all bash scripts to use shebang: /usr/bin/env bash (#779)  2023-06-20 16:00:55 -07:00
get-stats-name.sh  chore: update all bash scripts to use shebang: /usr/bin/env bash (#779)  2023-06-20 16:00:55 -07:00
profile.sh         dev: add py-spy profiling (#1251)                                        2023-08-31 19:26:29 +00:00
README.md          dev: add py-spy profiling (#1251)                                        2023-08-31 19:26:29 +00:00
requirements.txt   dev: add py-spy profiling (#1251)                                        2023-08-31 19:26:29 +00:00
run_partition.py   rfctr: docx partitioning (#1422)                                         2023-09-19 15:32:46 -07:00
time_partition.py  test: add benchmark script (#638)                                        2023-06-05 09:14:43 -07:00

README.md

Performance

This is a collection of tools helpful for inspecting and tracking performance of the Unstructured library.

The benchmarking script lets a user measure partitioning time against a fixed set of test documents and store the results in S3, tagged with architecture, instance type, and git hash.
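To make the idea concrete, here is a minimal sketch of timing a workload over several iterations and tagging the result with architecture, instance type, and git hash. It is not the repository's `time_partition.py`; the helper names `time_callable` and `build_result_record`, the record fields, and the stand-in workload are all hypothetical and chosen only for illustration.

```python
import platform
import statistics
import time
from typing import Callable, Dict, List


def time_callable(func: Callable[[], object], num_iterations: int = 3) -> List[float]:
    """Run func num_iterations times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def build_result_record(
    doc_name: str,
    durations: List[float],
    instance_type: str = "unspecified",
    git_hash: str = "unknown",
) -> Dict[str, object]:
    """Assemble a result record (hypothetical layout, not the real S3 schema)."""
    return {
        "document": doc_name,
        "iterations": len(durations),
        "mean_seconds": statistics.mean(durations),
        "architecture": platform.machine(),
        "instance_type": instance_type,
        "git_hash": git_hash,
    }


if __name__ == "__main__":
    # Stand-in workload (assumption); a real benchmark would partition a test document here.
    durations = time_callable(lambda: sum(range(100_000)), num_iterations=3)
    print(build_result_record("example.docx", durations))
```

Passing the workload as a zero-argument callable keeps the timing loop independent of any particular partitioning function.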

The profiling script lets a user inspect how time and memory are spent across called functions when partitioning a given document.

Install

Benchmarking requires no additional dependencies and should work without any initial setup. Profiling has a few dependencies which can be installed with:

pip install -r scripts/performance/requirements.txt
npm install -g speedscope

The second dependency, speedscope, provides a tool to view py-spy profiling results locally. Alternatively, you can drop the resulting *.speedscope profile file into https://www.speedscope.app/ to view it online.

Run

Benchmark

Export / assign desired environment variable settings:

DOCKER_TEST: Set to true to run benchmark inside a Docker container (default: false)
NUM_ITERATIONS: Number of iterations for benchmark (e.g., 100) (default: 3)
INSTANCE_TYPE: Type of benchmark instance (e.g., "c5.xlarge") (default: unspecified)
PUBLISH_RESULTS: Set to true to publish results to S3 bucket (default: false)

Usage: ./scripts/performance/benchmark.sh
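The settings above and their documented defaults can be summarized in a short Python sketch. The actual script is bash, so this only illustrates the defaults; the `BenchmarkSettings` class, the `load_settings` helper, and the rule that only the string "true" (case-insensitive) enables a flag are assumptions, not the script's real parsing logic.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BenchmarkSettings:
    docker_test: bool
    num_iterations: int
    instance_type: str
    publish_results: bool


def _as_bool(value: str) -> bool:
    # Assumption: only "true" (any letter case) turns a flag on.
    return value.strip().lower() == "true"


def load_settings(env: Mapping[str, str] = os.environ) -> BenchmarkSettings:
    """Read the benchmark variables, falling back to the documented defaults."""
    return BenchmarkSettings(
        docker_test=_as_bool(env.get("DOCKER_TEST", "false")),
        num_iterations=int(env.get("NUM_ITERATIONS", "3")),
        instance_type=env.get("INSTANCE_TYPE", "unspecified"),
        publish_results=_as_bool(env.get("PUBLISH_RESULTS", "false")),
    )


if __name__ == "__main__":
    # With no variables exported, this prints the defaults listed above.
    print(load_settings())
```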

Profile

Export / assign desired environment variable settings:

DOCKER_TEST: Set to true to run profiling inside a Docker container (default: false)

Usage:

on Linux: ./scripts/performance/profile.sh

on macOS: sudo -E ./scripts/performance/profile.sh (py-spy requires root privileges to run on macOS)

Run the script and choose the profiling mode: 'run' or 'view'.
In the 'run' mode, you can profile custom files or select existing test files.
In the 'view' mode, you can view previously generated profiling results.
The script supports time profiling with cProfile and memory profiling with memray.
Users can choose different visualization options such as flamegraphs, tables, trees, summaries, and statistics.
Test documents are synced from an S3 bucket to a local directory before the profiles are run.
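As a rough illustration of the cProfile-based time profiling mentioned above, the following sketch profiles a callable with the standard library and returns a text report of the most expensive calls. It is not taken from `profile.sh`; `profile_callable` and the stand-in `workload` function are hypothetical, and a real run would profile partitioning of a document instead.

```python
import cProfile
import io
import pstats
from typing import Callable


def profile_callable(func: Callable[[], object], top_n: int = 5) -> str:
    """Profile func with cProfile and return the top_n entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top_n)
    return buffer.getvalue()


def workload() -> list:
    # Stand-in for partitioning a document (assumption for illustration only).
    return sorted(str(i) for i in range(50_000))


if __name__ == "__main__":
    print(profile_callable(workload))
```

Sorting by cumulative time surfaces the functions whose whole call trees dominate the run, which is usually the first thing to look at before drilling into a flamegraph.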