mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-09-01 12:53:58 +00:00

**Executive Summary**

This PR adds evaluation metrics to the current workflow. It verifies that when code is pushed, it is evaluated against the gold standard and the results are written to `.tsv` files.

**Technical Details**

- Adds evaluation metrics to the test-ingest workflow.
- Uses the `structured-output` from `test-ingest` and compares it to the gold standard stored in S3, which is downloaded locally for the comparison. The folder currently in use is `s3://utic-dev-tech-fixtures/small-cct`; it can be changed in the shell script.
- With this PR, only one file from one connector is used for comparison.

**Misc**

- Few files currently overlap between test-ingest and the gold standard. More files will be added.

**Outputs**

Two `.tsv` files are saved under `test_unstructured_ingest/metrics/`.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
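The `.tsv` files described under Outputs can be checked quickly from the shell. The following is a minimal sketch and not part of this PR: `summarize_tsv` is a hypothetical helper, and it assumes each file starts with a single header row. The actual file names and columns are not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): print each .tsv file in a
# directory with its number of data rows, assuming one header line per file.
summarize_tsv() {
  local dir=$1 file rows
  for file in "$dir"/*.tsv; do
    # Skip the unexpanded glob when the directory has no .tsv files.
    [ -e "$file" ] || continue
    # Count every line after the header, even without a trailing newline.
    rows=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$file")
    printf '%s\t%s\n' "$(basename "$file")" "$rows"
  done
}
```

Calling `summarize_tsv test_unstructured_ingest/metrics` after a run would list each metrics file with its row count, which is a cheap way to confirm that the evaluation wrote something.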
32 lines
814 B
Bash
Executable File
#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1

# List all structured outputs to use in this evaluation
OUTPUT_DIR=$SCRIPT_DIR/structured-output-eval
mkdir -p "$OUTPUT_DIR"

# Download the CCT gold-standard files from S3
BUCKET_NAME=utic-dev-tech-fixtures
FOLDER_NAME=small-cct
CCT_DIR=$SCRIPT_DIR/gold-standard/$FOLDER_NAME
mkdir -p "$CCT_DIR"
aws s3 cp "s3://$BUCKET_NAME/$FOLDER_NAME" "$CCT_DIR" --recursive --no-sign-request --region us-east-2

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
  cleanup_dir "$OUTPUT_DIR"
  cleanup_dir "$CCT_DIR"
}
trap cleanup EXIT

EXPORT_DIR="$SCRIPT_DIR"/metrics
PYTHONPATH=. ./unstructured/ingest/evaluate.py \
  --output_dir "$OUTPUT_DIR" \
  --source_dir "$CCT_DIR" \
  --export_dir "$EXPORT_DIR"
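
The commit message notes that the S3 folder can be changed by editing this script. One alternative, shown here only as a sketch and not something the script currently does, is to read `BUCKET_NAME` and `FOLDER_NAME` from the environment and fall back to the current values. The `s3_uri` helper is hypothetical and exists only to make the resulting path easy to check.

```bash
#!/usr/bin/env bash
# Sketch only: environment overrides, with the script's current values as defaults.
BUCKET_NAME=${BUCKET_NAME:-utic-dev-tech-fixtures}
FOLDER_NAME=${FOLDER_NAME:-small-cct}

# Hypothetical helper: build the S3 URI that would be passed to `aws s3 cp`.
s3_uri() {
  printf 's3://%s/%s\n' "$1" "$2"
}

s3_uri "$BUCKET_NAME" "$FOLDER_NAME"
```

With this pattern, setting `FOLDER_NAME` in the environment before running the script would point the download at a different gold-standard folder without editing the file.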