- Per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551),
there is a bug in the `unstructured` lib under `metrics/evaluate.py` that
incorrectly retrieves the original file extension (the extension before the
conversion to a CCT file) from paths like `*.pdf.txt` (see the screenshot below).
- The top example in the screenshot shows the current (incorrect) behavior.
- The bottom example shows the correct version.

- In addition, I also observed that the returned doctypes are not
consistent: some are returned as '.*' (with the leading dot) and some
without the dot.
- Therefore, I aligned them so that all doctypes are output in the same
form, '.*' (with the leading dot).
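For illustration, a minimal sketch of the intended extension lookup, assuming CCT files follow the `<name>.<original-ext>.txt` convention; this is not the library's actual implementation:

```python
from pathlib import Path

def get_doctype(filename: str) -> str:
    # Hypothetical helper for illustration only: return the original
    # document extension, with a leading dot, for CCT paths like
    # "example.pdf.txt" as well as plain paths like "example.pdf".
    suffixes = Path(filename).suffixes        # e.g. ['.pdf', '.txt']
    if len(suffixes) >= 2 and suffixes[-1] == ".txt":
        return suffixes[-2]                   # '.pdf' from 'example.pdf.txt'
    return suffixes[-1] if suffixes else ""   # '.pdf' from 'example.pdf'

assert get_doctype("example.pdf.txt") == ".pdf"
assert get_doctype("example.pdf") == ".pdf"
```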
This PR uses a weighted average (weighted by the number of ground truth
tables) instead of an unweighted average for table metrics:
- For pages with ground truth tables, the weight is proportional to the
number of ground truth tables on that page.
- Pages with no ground truth tables but with predicted tables (false
positives) are assigned one table's worth of weight for the whole page
when calculating the mean value of `table_level_acc`.
- Pages with false positive tables do not contribute to the table
structure or table content metrics.
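As an illustration of the weighting scheme above (the numbers here are made up, not taken from the test set):

```python
import numpy as np

# Per-page table_level_acc values and their weights: pages with ground
# truth tables are weighted by the number of ground truth tables, and a
# page with only false positive tables gets one table's worth of weight.
page_accuracies = [0.9, 0.5, 0.0]   # last page: false positives only
page_weights = [3, 1, 1]            # 3 GT tables, 1 GT table, 1 (FP page)

table_level_acc = np.average(page_accuracies, weights=page_weights)
print(table_level_acc)              # 0.64, vs. ~0.467 for the unweighted mean
```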
## Test
This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- tests that the weighted average is written to the report
The table metrics that consider spans are not used and they interfere
with the output, so I have removed that code. However, I have left
`table_as_cells` in the source code, since it may still be useful to users.
This pull request adds table detection metrics.
One case I considered:
Case: two tables are predicted and matched with one table in the ground
truth.
Question: is this matching correct in both cases, or just for one table?
There are two subcases:
- the table was predicted by the object detection (OD) model as two
sub-tables (split in half, i.e. two non-overlapping sub-tables) -> in my
opinion both are correct
- it is a false positive from the table matching script in
`get_table_level_alignment` -> 1 correct, 1 wrong
As we don't have bounding boxes, I followed the notebook calculation
script and assumed the pessimistic second subcase.
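A toy illustration of the pessimistic assumption (the numbers are hypothetical):

```python
# Two predicted tables were aligned to one ground truth table. Without
# bounding boxes we cannot confirm the "two sub-tables" interpretation,
# so only one of the two predictions is counted as correct.
predicted_tables_matched = 2
counted_as_correct = 1
page_table_level_acc = counted_as_correct / predicted_tables_matched  # 0.5
```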
Currently, CCT eval takes a long time for any of the test_metrics CI
runs. Documents in an eval set are evaluated sequentially, and it
appears that at most one CPU core is utilized. This implies there could
be a large speedup from running eval across multiple docs concurrently
(probably with multiprocessing).
Things done in this PR:
- [x] use `concurrent.futures.ProcessPoolExecutor` instead of a
sequential for-loop
- [x] refactor/reorganize redundant pieces of code without changing the
inner logic too much. Without that we'd have 3 places where documents
are processed. Take a look at the `BaseMetricsCalculator` class and the
classes that inherit from it.
- [x] string path manipulation is reworked to rely on `pathlib.Path()`
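A minimal sketch of the concurrency pattern, assuming a placeholder `evaluate_document` function; this is not the actual `BaseMetricsCalculator` API:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import Optional

def evaluate_document(doc_path: Path) -> dict:
    # Placeholder for the per-document eval work (e.g. CCT scoring).
    return {"filename": doc_path.name, "score": 1.0}

def evaluate_all(output_dir: str, max_workers: Optional[int] = None) -> list:
    # Fan the per-document work out over CPU cores instead of running a
    # sequential for-loop.
    doc_paths = sorted(Path(output_dir).glob("*.txt"))
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(evaluate_document, doc_paths))
```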
This pull request adds metrics that are calculated based on
`table_as_cells` instead of `text_as_html`. This change is required for
comprehensive metrics calculation, as previously every predicted colspan
or rowspan was considered an incorrect prediction (even if it was
correct).
This change has to be merged after
https://github.com/Unstructured-IO/unstructured/pull/2892, which
introduces the `table_as_cells` field.
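To illustrate the motivation (the cell schema below is hypothetical; the real `table_as_cells` format is defined in the PR linked above):

```python
# A predicted cell compared at the cell level rather than via serialized HTML.
predicted_cell = {"row_index": 0, "col_index": 1, "content": "Total"}
ground_truth_cell = {"row_index": 0, "col_index": 1, "content": "Total"}

# Cell-level comparison scores this prediction as correct, whereas a
# text_as_html comparison could mark a correct colspan/rowspan prediction
# as wrong simply because the serialized HTML differs.
assert predicted_cell == ground_truth_cell
```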
Creates a composite metric to represent the table structure score. It is
an average of the existing row and column index and content scores.
This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element-level row
and column index and content accuracy scores. This new metric is meant
to offer a single number representing the performance of the table
structure extraction model/algorithm.
This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard-coding the metric names in
multiple places in the code.
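A minimal sketch of how the composite score is formed (the values are illustrative; the actual property lives on `TableEvaluation`):

```python
from statistics import mean

element_level_scores = {
    "element_row_level_index_acc": 0.90,
    "element_col_level_index_acc": 0.80,
    "element_row_level_content_acc": 0.70,
    "element_col_level_content_acc": 0.60,
}
composite_structure_acc = mean(element_level_scores.values())  # 0.75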
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Add features to `get_mean_grouping` to allow the input to be a list of
filenames, provided either as a list of strings or as a txt file.
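A hedged usage sketch; the argument order follows the `get_mean_grouping(<grouping>, <tsv path>, <export dir>, <eval name>)` call shown elsewhere in these notes, and the filename-list behavior shown here is an assumption rather than the documented signature:

```python
from unstructured.metrics.evaluate import get_mean_grouping

# Group using an explicit list of filenames...
get_mean_grouping(["doc1.pdf", "doc2.pdf"], "metrics.tsv", "export_dir", "element_type")
# ...or using a txt file with one filename per line (assumed format).
get_mean_grouping("filenames.txt", "metrics.tsv", "export_dir", "element_type")
```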
---------
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Files were being created as a side effect of running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.
Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
This PR adds grouping functionality to `evaluate.py`.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or
call `get_mean_grouping(<doctype or connector>, <dataframe or path to
tsv file>, <export directory>, "element_type")`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Separate the aggregating functionality of `text_extraction_accuracy`
into a stand-alone function to avoid duplicated eval effort when the
granular-level eval is already available.
To test:
Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py`
locally
This PR adds new table evaluation metrics prepared by @leah1985
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`
This PR edits the `measure_text_extraction_accuracy` function to allow a
directory of txt files as well as json. The function also takes an
`output_type` input, which must be either "json" or "txt", and checks
whether the files under the given directory/list contain only the
specified file type.
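A minimal sketch of the kind of file-type check described above (not the library's actual implementation):

```python
from pathlib import Path

def check_output_type(directory: str, output_type: str) -> None:
    # Illustrative only: ensure every file under `directory` matches the
    # requested output_type ("json" or "txt").
    if output_type not in ("json", "txt"):
        raise ValueError('output_type must be "json" or "txt"')
    offending = [
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix != f".{output_type}"
    ]
    if offending:
        raise ValueError(f"unexpected file types found: {offending}")
```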
To test this feature, run the following command:
```
PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt
```
Closes #2263
Files were being created as a side effect of running tests in
`test_unstructured/metrics/test_evaluate.py`. The added decorator
removes the created directory and its files after the tests run.
Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in `measure_text_edit_distance` or
`measure_element_type_accuracy`. This PR removes the click decoration and
separates it into a wrapper function whose only purpose is to execute
the command.
### Technical Details
- Changed as suggested in the response to [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command).
- The locations of these now-distinct functions are separate: the
`_command` click-decorated functions stay in `ingest/evaluate.py`, and the
core functions `measure_text_edit_distance` and
`measure_element_type_accuracy` are moved into the `unstructured/metrics/`
folder (which is a more logical location for them).
- Initial test added for `measure_text_edit_distance`.
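A minimal sketch of the resulting pattern (the option names here are illustrative, not the exact CLI flags):

```python
import click

def measure_text_edit_distance(output_dir: str, source_dir: str) -> None:
    """Core logic: a plain function that tests and other code can call
    directly, without going through click."""
    # ... compute and write the text edit distance metrics ...

@click.command()
@click.option("--output_dir", type=str)
@click.option("--source_dir", type=str)
def measure_text_edit_distance_command(output_dir: str, source_dir: str) -> None:
    """Thin click wrapper whose only job is to parse CLI arguments and
    delegate to the core function."""
    measure_text_edit_distance(output_dir=output_dir, source_dir=source_dir)
```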
### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`;
functionality is unchanged.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>