- Per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551),
there is a bug in the `unstructured` lib under `metrics/evaluate.py` that
incorrectly retrieves the original file extension (the extension before the
conversion to a CCT file) from paths like `*.pdf.txt` (see the screenshot below).
- The top example in the screenshot shows the current (incorrect) behavior.
- The bottom example shows the correct version.

- In addition, I also observed that the returned doctypes are not
consistent: some are returned as '.*' (with the leading dot) and some
without the dot.
- Therefore, I aligned them so that all doctypes are output in the same
form, '.*' (with the leading dot).
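For illustration, a minimal sketch of the intended extension lookup, assuming CCT files follow the `<name>.<original-ext>.txt` convention; this is not the library's actual implementation:

```python
from pathlib import Path

def get_doctype(filename: str) -> str:
    # Hypothetical helper for illustration only: return the original
    # document extension, with a leading dot, for CCT paths like
    # "example.pdf.txt" as well as plain paths like "example.pdf".
    suffixes = Path(filename).suffixes        # e.g. ['.pdf', '.txt']
    if len(suffixes) >= 2 and suffixes[-1] == ".txt":
        return suffixes[-2]                   # '.pdf' from 'example.pdf.txt'
    return suffixes[-1] if suffixes else ""   # '.pdf' from 'example.pdf'

assert get_doctype("example.pdf.txt") == ".pdf"
assert get_doctype("example.pdf") == ".pdf"
```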
This PR uses a weighted average (weighted by the number of ground truth
tables) instead of an unweighted average for table metrics:
- For pages with ground truth tables, the weight is proportional to the
number of ground truth tables on that page.
- Pages with no ground truth tables but with predicted tables (false
positives) are assigned one table's worth of weight for the whole page
when calculating the mean value of `table_level_acc`.
- Pages with false positive tables do not contribute to the table
structure or table content metrics.
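As an illustration of the weighting scheme above (the numbers here are made up, not taken from the test set):

```python
import numpy as np

# Per-page table_level_acc values and their weights: pages with ground
# truth tables are weighted by the number of ground truth tables, and a
# page with only false positive tables gets one table's worth of weight.
page_accuracies = [0.9, 0.5, 0.0]   # last page: false positives only
page_weights = [3, 1, 1]            # 3 GT tables, 1 GT table, 1 (FP page)

table_level_acc = np.average(page_accuracies, weights=page_weights)
print(table_level_acc)              # 0.64, vs. ~0.467 for the unweighted mean
```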
## Test
This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- tests that the weighted average is written to the report
The table metrics that consider spans are not used and they interfere
with the output, so I have removed that code. However, I have left
`table_as_cells` in the source code, since it may still be useful to users.
This pull request adds table detection metrics.
One case I considered:
Case: two tables are predicted and matched with one table in the ground
truth.
Question: is this matching correct in both cases, or just for one table?
There are two subcases:
- the table was predicted by the object detection (OD) model as two
sub-tables (split in half, i.e. two non-overlapping sub-tables) -> in my
opinion both are correct
- it is a false positive from the table matching script in
`get_table_level_alignment` -> 1 correct, 1 wrong
As we don't have bounding boxes, I followed the notebook calculation
script and assumed the pessimistic second subcase.
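A toy illustration of the pessimistic assumption (the numbers are hypothetical):

```python
# Two predicted tables were aligned to one ground truth table. Without
# bounding boxes we cannot confirm the "two sub-tables" interpretation,
# so only one of the two predictions is counted as correct.
predicted_tables_matched = 2
counted_as_correct = 1
page_table_level_acc = counted_as_correct / predicted_tables_matched  # 0.5
```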
Currently, CCT eval takes a long time for any of the test_metrics CI
runs. Documents in an eval set are evaluated sequentially, and it
appears that at most one CPU core is utilized. This implies there could
be a large speedup from running eval across multiple docs concurrently
(probably with multiprocessing).
Things done in this PR:
- [x] use `concurrent.futures.ProcessPoolExecutor` instead of a
sequential for-loop
- [x] refactor/reorganize redundant pieces of code without changing the
inner logic too much. Without that we'd have 3 places where documents
are processed. Take a look at the `BaseMetricsCalculator` class and the
classes that inherit from it.
- [x] string path manipulation is reworked to rely on `pathlib.Path()`
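A minimal sketch of the concurrency pattern, assuming a placeholder `evaluate_document` function; this is not the actual `BaseMetricsCalculator` API:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import Optional

def evaluate_document(doc_path: Path) -> dict:
    # Placeholder for the per-document eval work (e.g. CCT scoring).
    return {"filename": doc_path.name, "score": 1.0}

def evaluate_all(output_dir: str, max_workers: Optional[int] = None) -> list:
    # Fan the per-document work out over CPU cores instead of running a
    # sequential for-loop.
    doc_paths = sorted(Path(output_dir).glob("*.txt"))
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(evaluate_document, doc_paths))
```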
This pull request adds metrics that are calculated based on
`table_as_cells` instead of `text_as_html`. This change is required for
comprehensive metrics calculation, as previously every predicted colspan
or rowspan was considered an incorrect prediction (even if it was
correct).
This change has to be merged after
https://github.com/Unstructured-IO/unstructured/pull/2892, which
introduces the `table_as_cells` field.
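To illustrate the motivation (the cell schema below is hypothetical; the real `table_as_cells` format is defined in the PR linked above):

```python
# A predicted cell compared at the cell level rather than via serialized HTML.
predicted_cell = {"row_index": 0, "col_index": 1, "content": "Total"}
ground_truth_cell = {"row_index": 0, "col_index": 1, "content": "Total"}

# Cell-level comparison scores this prediction as correct, whereas a
# text_as_html comparison could mark a correct colspan/rowspan prediction
# as wrong simply because the serialized HTML differs.
assert predicted_cell == ground_truth_cell
```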
Creates a composite metric to represent the table structure score. It is
an average of the existing row and column index and content scores.
This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element-level row
and column index and content accuracy scores. This new metric is meant
to offer a single number representing the performance of the table
structure extraction model/algorithm.
This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard-coding the metric names in
multiple places in the code.
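A minimal sketch of how the composite score is formed (the values are illustrative; the actual property lives on `TableEvaluation`):

```python
from statistics import mean

element_level_scores = {
    "element_row_level_index_acc": 0.90,
    "element_col_level_index_acc": 0.80,
    "element_row_level_content_acc": 0.70,
    "element_col_level_content_acc": 0.60,
}
composite_structure_acc = mean(element_level_scores.values())  # 0.75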
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Add features to `get_mean_grouping` to allow the input to be a list of
filenames, provided either as a list of strings or as a txt file.
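A hedged usage sketch; the argument order follows the `get_mean_grouping(<grouping>, <tsv path>, <export dir>, <eval name>)` call shown elsewhere in these notes, and the filename-list behavior shown here is an assumption rather than the documented signature:

```python
from unstructured.metrics.evaluate import get_mean_grouping

# Group using an explicit list of filenames...
get_mean_grouping(["doc1.pdf", "doc2.pdf"], "metrics.tsv", "export_dir", "element_type")
# ...or using a txt file with one filename per line (assumed format).
get_mean_grouping("filenames.txt", "metrics.tsv", "export_dir", "element_type")
```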
---------
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Files were being created as a side effect of running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.
Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
This PR adds grouping functionality to `evaluate.py`.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Separate the aggregating functionality of `text_extraction_accuracy`
into a stand-alone function to avoid duplicated eval effort when the
granular-level eval is already available.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
This PR adds new table evaluation metrics prepared by @leah1985
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
This PR edits the `measure_text_extraction_accuracy` function to allow a
directory of txt files as well as json. The function also takes an
`output_type` input, which must be either "json" or "txt", and checks
whether the files under the given directory/list contain only the
specified file type.
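A minimal sketch of the kind of file-type check described above (not the library's actual implementation):

```python
from pathlib import Path

def check_output_type(directory: str, output_type: str) -> None:
    # Illustrative only: ensure every file under `directory` matches the
    # requested output_type ("json" or "txt").
    if output_type not in ("json", "txt"):
        raise ValueError('output_type must be "json" or "txt"')
    offending = [
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix != f".{output_type}"
    ]
    if offending:
        raise ValueError(f"unexpected file types found: {offending}")
```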
To test this feature, run the following command:
```
PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt
```
Closes #2263
Files were being created as a side effect of running tests in
`test_unstructured/metrics/test_evaluate.py`. The added decorator
removes the created directory and its files after the tests run.
Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in `measure_text_edit_distance` or
`measure_element_type_accuracy`. This PR removes the click decoration and
separates it into a wrapper function whose only purpose is to execute
the command.
### Technical Details
- Changed as suggested in the response to [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command).
- The locations of these now-distinct functions are separate: the
`_command` click-decorated functions stay in `ingest/evaluate.py`, and the
core functions `measure_text_edit_distance` and
`measure_element_type_accuracy` are moved into the `unstructured/metrics/`
folder (which is a more logical location for them).
- Initial test added for `measure_text_edit_distance`.
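A minimal sketch of the resulting pattern (the option names here are illustrative, not the exact CLI flags):

```python
import click

def measure_text_edit_distance(output_dir: str, source_dir: str) -> None:
    """Core logic: a plain function that tests and other code can call
    directly, without going through click."""
    # ... compute and write the text edit distance metrics ...

@click.command()
@click.option("--output_dir", type=str)
@click.option("--source_dir", type=str)
def measure_text_edit_distance_command(output_dir: str, source_dir: str) -> None:
    """Thin click wrapper whose only job is to parse CLI arguments and
    delegate to the core function."""
    measure_text_edit_distance(output_dir=output_dir, source_dir=source_dir)
```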
### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`;
functionality is unchanged.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>