
# Performance
This is a collection of tools helpful for inspecting and tracking performance of the Unstructured library.

The benchmarking script measures partitioning time against a fixed set of test documents and can store the results in S3, tagged with the architecture, instance type, and git hash.

The profiling script allows a user to inspect how time and memory are spent across called functions when partitioning a given document.

## Install
Benchmarking requires no additional dependencies and should work without any initial setup.
Profiling has a few dependencies which can be installed with:

```bash
pip install -r scripts/performance/requirements.txt
npm install -g speedscope
```

The second dependency, `speedscope`, provides a tool for viewing `py-spy` profiling results locally. Alternatively, you can drop the `*.speedscope` profile result into https://www.speedscope.app/ to view it online.
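
As a rough sketch of the local viewing step, the snippet below opens a profile with the `speedscope` command installed above. The file name `my_profile.speedscope` is a hypothetical placeholder, and the guard is only there so the snippet exits cleanly when the tool or the file is missing.

```bash
# Hypothetical profile file name; replace with a file produced by a profiling run.
PROFILE_FILE="my_profile.speedscope"

# Open the profile locally only if the CLI is installed and the file exists.
if command -v speedscope >/dev/null 2>&1 && [ -f "$PROFILE_FILE" ]; then
  speedscope "$PROFILE_FILE"
else
  echo "speedscope or $PROFILE_FILE not found; upload the file to https://www.speedscope.app/ instead" >&2
fi
```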

## Run
### Benchmark
Export / assign desired environment variable settings:
- DOCKER_TEST: Set to true to run benchmark inside a Docker container (default: false)
- NUM_ITERATIONS: Number of iterations for benchmark (e.g., 100) (default: 3)
- INSTANCE_TYPE: Type of benchmark instance (e.g., "c5.xlarge") (default: unspecified)
- PUBLISH_RESULTS: Set to true to publish results to S3 bucket (default: false)

Usage: `./scripts/performance/benchmark.sh`
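
As a minimal sketch, the settings above can be exported before a local run. The values are illustrative only, and the existence check on the script is an addition so the snippet fails gracefully when it is not run from the repository root.

```bash
# Example values only; adjust to your environment.
export DOCKER_TEST=false          # run directly on the host
export NUM_ITERATIONS=5           # default is 3
export INSTANCE_TYPE="c5.xlarge"  # label stored with the results
export PUBLISH_RESULTS=false      # keep results local, do not publish to S3

BENCHMARK_SCRIPT="./scripts/performance/benchmark.sh"
if [ -x "$BENCHMARK_SCRIPT" ]; then
  "$BENCHMARK_SCRIPT"
else
  echo "benchmark script not found; run this from the repository root" >&2
fi
```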

### Profile

Export / assign desired environment variable settings:
- DOCKER_TEST: Set to true to run profiling inside a Docker container (default: false)

Usage:

**on Linux**: `./scripts/performance/profile.sh`

**on macOS**: `sudo -E ./scripts/performance/profile.sh` (`py-spy` requires superuser privileges on macOS)

- Run the script and choose the profiling mode: 'run' or 'view'.
- In the 'run' mode, you can profile custom files or select existing test files.
- In the 'view' mode, you can view previously generated profiling results.
- The script supports time profiling with cProfile and py-spy, and memory profiling with memray.
- Users can choose different visualization options such as flamegraphs, tables, trees, summaries, and statistics.
- Test documents are synced from an S3 bucket to a local directory before running the profiles.
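
The platform-specific invocations above can be wrapped in a small helper, sketched here under the assumption that `uname -s` reports `Darwin` on macOS. The function name `profile_command` is hypothetical and not part of the repository. It prints the command instead of running it, so it can be reviewed before `sudo` is involved.

```bash
# Print the profiling invocation for the given OS name.
# Defaults to the current OS as reported by `uname -s`.
profile_command() {
  local os_name="${1:-$(uname -s)}"
  if [ "$os_name" = "Darwin" ]; then
    # py-spy needs superuser privileges on macOS; -E preserves exported variables.
    echo "sudo -E ./scripts/performance/profile.sh"
  else
    echo "./scripts/performance/profile.sh"
  fi
}

# Show the command for this machine.
profile_command
```

To run the printed command directly from the repository root, `eval "$(profile_command)"` would work.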