unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-31 12:23:49 +00:00

Author	SHA1	Message	Date
shreyanid	6db663e7bb	refactor: separate click wrappers from core evaluation functionality (#1981) ### Summary Click decorated functions cannot (properly) be called outside of the click interface. This makes it difficult to reuse the setup functionality in measure_text_edit_distance or measure_element_type_accuracy. This PR removes the click decoration and separates it into a wrapper function purely to execute the command. ### Technical Details - Changed as suggested in [this StackOverflow post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command) response - The locations of these now distinct functions are separate: the `_command` click-decorated functions stay in ingest/evaluate.py, and the core functions measure_text_edit_distance and measure_element_type_accuracy are moved into the unstructured/metrics/ folder (which is a more logical location for them). - Initial test added for measure_text_edit_distance ### Test `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` functionality is unchanged. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>	2023-11-07 19:54:22 +00:00
Klaijan	a06b151897	refactor: ci workflow refactor (#1907) Refactor the evaluation scripts, including `unstructured/ingest/evaluation.py` and `test_unstructured_ingest/evaluation-metrics.sh`, for more structured code and usage. - The script now uses only one python script call with a param - Adds a function to build the strings for output_args (`--output_dir --output_list`) and source_args (`--source_dir --source_args`) - Now accepts the evaluation to call as a param; currently only `text-extraction` and `element-type` are accepted Example to call the function: ```sh evaluation-metrics.sh text-extraction``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-01 15:58:23 +00:00
Klaijan	466255eec3	build: element type frequency evaluation metrics workflow in ci (#1862) Executive Summary Measures element type frequency accuracy of the current version of the code against the expected output. The performance is reported as a tsv file under `metrics`. Technical Details - The evaluation measures element type frequencies from `structured-output-eval` against `expected-structured-output` - `evaluation.py` has been edited to support function calling using `click.group()` and `command()` - `evaluation-ingest-cp.sh` is now added to all the `test-ingest-xx.sh` scripts Outputs 2 tsv files are saved ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/b4458094-a9fc-48f9-a0bd-2ccd6985440a) ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/6d785736-bcaf-4275-bf2d-ab511cdfb3f4) and the aggregated score is displayed. ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/9d42bd0c-a0dd-41c2-a2e5-b675a40f35cc) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-27 04:36:36 +00:00
Klaijan	6707cab250	build: text extraction evaluation metrics workflow added (#1757) Executive Summary This PR adds the evaluation metrics to our current workflow. It verifies the flow in which, when code is pushed, the code gets evaluated against our gold standard and the results are output to a `.tsv` file. Technical Details - Adds evaluation metrics to the test-ingest workflow - Makes use of `structured-output` from `test-ingest` and compares it to the gold standard uploaded to s3, which is downloaded locally when the comparison is made. The current folder in use is `s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the shell script. - With this PR, only one file from one connector is used for comparison. Misc - Not many files overlap between test-ingest and gold-standard. More files will be added. Outputs 2 `.tsv` files are saved under `test_unstructured_ingest/metrics/`. ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/222e437c-1a94-4d7c-9320-81696633b1ae) ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/5c840322-6739-4634-8868-eba04b4ebc96) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-10-23 21:39:22 +00:00

4 Commits
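
Commit 6db663e7bb describes splitting click-decorated commands from the core evaluation functions so the latter can be called directly. The following is a minimal sketch of that pattern, not the repository's actual code: the function body, the return value, and the `text-extraction` command name wiring are assumptions for illustration, and only the function names and the `--output_dir` / `--source_dir` options come from the commit messages above.

```python
import click


def measure_text_edit_distance(output_dir: str, source_dir: str) -> dict:
    """Core function: plain Python, so tests and other code can call it directly.

    Placeholder body (assumption): the real function computes edit-distance
    metrics; here we only echo the arguments back.
    """
    return {"output_dir": output_dir, "source_dir": source_dir}


@click.group()
def main():
    """Entry point that groups the evaluation commands."""


# Thin wrapper: its only job is to parse CLI options and delegate.
@main.command(name="text-extraction")  # command name wiring is an assumption
@click.option("--output_dir", type=str, required=True)
@click.option("--source_dir", type=str, required=True)
def measure_text_edit_distance_command(output_dir: str, source_dir: str):
    result = measure_text_edit_distance(output_dir, source_dir)
    click.echo(result)


if __name__ == "__main__":
    main()
```

With this split, a unit test can import and call `measure_text_edit_distance` without going through click at all, while the shell script keeps invoking the `_command` wrapper.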