unstructured/test_unstructured_ingest/evaluation-metrics.sh
Klaijan 6707cab250
build: text extraction evaluation metrics workflow added (#1757)
**Executive Summary**
This PR adds the evaluation metrics to our current workflow. It verifies
the flow that when the code is pushed, the code will gets evaluate
against our gold standard and output into `.tsv` file.

**Technical Details**
- Adds evaluation metrics to the test-ingest workflow
- Make use of `structured-output` from `test-ingest` and compare to the
gold-standard uploaded in s3, and download into local when make
comparison. The current folder in-use is
`s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the
shell script.
- With this PR, only one file from one connector is use to compare.

**Misc**
- Not many overlapped files between test-ingest and gold-standard. More
files will be added.

**Outputs**
2 `.tsv` files are saved under `test_unstructured_ingest/metrics/`.


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/222e437c-1a94-4d7c-9320-81696633b1ae)


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/5c840322-6739-4634-8868-eba04b4ebc96)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-10-23 21:39:22 +00:00

32 lines
814 B
Bash
Executable File

#!/usr/bin/env bash
set -e
SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
# List all structured outputs to use in this evaluation
OUTPUT_DIR=$SCRIPT_DIR/structured-output-eval
mkdir -p "$OUTPUT_DIR"
# Download cct test from s3
BUCKET_NAME=utic-dev-tech-fixtures
FOLDER_NAME=small-cct
CCT_DIR=$SCRIPT_DIR/gold-standard/$FOLDER_NAME
mkdir -p "$CCT_DIR"
aws s3 cp "s3://$BUCKET_NAME/$FOLDER_NAME" "$CCT_DIR" --recursive --no-sign-request --region us-east-2
# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
cleanup_dir "$CCT_DIR"
}
trap cleanup EXIT
EXPORT_DIR="$SCRIPT_DIR"/metrics
PYTHONPATH=. ./unstructured/ingest/evaluate.py \
--output_dir "$OUTPUT_DIR" \
--source_dir "$CCT_DIR" \
--export_dir "$EXPORT_DIR"