unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-04 07:27:34 +00:00

Author	SHA1	Message	Date
Klaijan	fd8b682194	fix: mean group add param (#2684)	2024-03-22 15:16:23 +00:00
Yao You	2eb0b25e0d	Feat: single table structure eval metric (#2655) Creates a composite metric to represent the table structure score. It is an average of the existing row and column index and content scores. This PR adds a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element-level row and column index and content accuracy scores. This new metric is meant to offer a single number to represent the performance of table structure extraction models/algorithms. This PR also refactors the eval computation logic so it uses a constant `table_eval_metrics` instead of hard-coding the metric names in multiple places in the code. --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2024-03-19 15:15:32 +00:00
Klaijan	ccda40f750	feat: grouping eval takes list of filenames (#2635) Adds support in `get_mean_grouping` for passing the filenames as either a list of strings or a txt file. --------- Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-17 17:19:55 +00:00
John	fe300fe56d	fix: teardown fixture for tests and update pre-commit-config (#2565) Files were being created as a side effect from running tests in `test_unstructured/metrics/test_evaluate.py`. The updated decorator removes the created directory and its files after the tests run. Testing on the main branch, run `make test` or `pytest test_unstructured/metrics/test_evaluate.py` and files will be created. On this branch no files are created	2024-03-12 22:16:39 +00:00
Klaijan	3ff6de4f50	refactor: refactor var name for consistency (#2609) refactor variable name for consistency.	2024-03-05 09:08:25 +00:00
Klaijan	6a4b7a134b	feat: element type accuracy grouping (#2594) This PR allows grouping functionality in `evaluate.py`. To test: Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or call `get_mean_grouping(<doctype or connector>, <dataframe or path to tsv file>, <export directory>, "element_type")` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2024-03-01 15:18:37 +00:00
Klaijan	daaf1775b4	feat: separate evaluate grouping function (#2572) Separate the aggregating functionality of `text_extraction_accuracy` to a stand-alone function to avoid duplicated eval effort if the granular level eval is already available. To test: Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` locally	2024-02-23 05:45:20 +00:00
Pawel Kmiecik	ff9d46f9dc	feat(eval): table evaluation metrics (#2558) This PR adds new table evaluation metrics prepared by @leah1985 The metrics include: - `table count` (check) - `table_level_acc` - accuracy of table detection - `element_col_level_index_acc` - accuracy of cell detection in columns - `element_row_level_index_acc` - accuracy of cell detection in rows - `element_col_level_content_acc` - accuracy of content detected in columns - `element_row_level_content_acc` - accuracy of content detected in rows TODO in next steps: - create a minimal dataset and upload to s3 for ingest tests - generate and add metrics on the above dataset to `test_unstructured_ingest/metrics`	2024-02-22 16:35:46 +00:00
Klaijan	e65a44eabb	feat: update cct eval for text dir (#2299) This change edits the `measure_text_extraction_accuracy` function to accept a directory of txt files as well as json. The function also takes an `output_type` input, which must be either "json" or "txt", and checks whether the files under the given directory/list contain only the specified file type. To test this feature, run the following code: ```PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt```	2024-01-05 23:34:53 +00:00
John	04f4c3ab16	create teardown fixture for tests (#2269) Closes #2263 Files were being created as a side effect from running tests in `test_unstructured/metrics/test_evaluate.py`. The added decorator removes the created directory and its files after the tests run. Testing on the main branch, run `make test` or `pytest test_unstructured/metrics/test_evaluate.py` and files will be created. On this branch no files are created	2023-12-20 17:50:12 +00:00
Klaijan	0aae1faa54	feat: add visualize param to command and add test (#2178) - Add `visualize` parameter to the click command -- now callable using `--visualize` flag to show the progress bar. - Refactor the name.	2023-11-29 01:05:55 +00:00
Klaijan	2c2d5b65ca	refactor: measure_text_edit_distance function for aggregation (#2108) - Refactor `metrics/evaluation.py` to accept `grouping` as a parameter. - Switch to `DataFrame` for easier analysis and aggregation.	2023-11-22 13:30:16 -08:00
shreyanid	6db663e7bb	refactor: separate click wrappers from core evaluation functionality (#1981) ### Summary Click decorated functions cannot (properly) be called outside of the click interface. This makes it difficult to reuse the setup functionality in measure_text_edit_distance or measure_element_type_accuracy. This PR removes the click decoration and separates it into a wrapper function purely to execute the command. ### Technical Details - Changed as suggested in [this StackOverflow post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command) response - The locations of these now distinct functions are separate: the `_command` click-decorated functions stay in ingest/evaluate.py, and the core functions measure_text_edit_distance and measure_element_type_accuracy are moved into the unstructured/metrics/ folder (which is a more logical location for them). - Initial test added for measure_text_edit_distance ### Test `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` functionality is unchanged. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>	2023-11-07 19:54:22 +00:00

13 Commits
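
Commit 2eb0b25e0d describes `composite_structure_acc` as an average of the element-level row and column index and content accuracy scores. The sketch below illustrates that idea only. It assumes an unweighted mean of the four `element_*` metrics named in commit ff9d46f9dc; the `TableScores` class is hypothetical and is not the library's `TableEvaluation`, whose actual computation may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TableScores:
    """Hypothetical container for illustration; not unstructured's TableEvaluation."""

    element_col_level_index_acc: float
    element_row_level_index_acc: float
    element_col_level_content_acc: float
    element_row_level_content_acc: float

    @property
    def composite_structure_acc(self) -> float:
        # Assumption: an unweighted mean of the four element-level scores.
        return mean(
            [
                self.element_col_level_index_acc,
                self.element_row_level_index_acc,
                self.element_col_level_content_acc,
                self.element_row_level_content_acc,
            ]
        )


# Example values are made up for the sketch.
scores = TableScores(
    element_col_level_index_acc=1.0,
    element_row_level_index_acc=0.5,
    element_col_level_content_acc=0.5,
    element_row_level_content_acc=1.0,
)
print(scores.composite_structure_acc)
```

A single number like this is convenient for comparing table structure extraction runs, at the cost of hiding which of the four underlying scores moved.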