This PR uses a weighted average, weighted by the number of ground truth
tables, instead of an unweighted average for table metrics (a sketch of
the weighting scheme follows the list below):
- for pages with ground truth tables, the weight is proportional to the
number of ground truth tables on that page
- pages with no ground truth tables but with predicted tables (false
positives) are assigned one table's worth of weight for the whole page
when calculating the mean value of `table_level_acc`
- pages with false positive tables do not contribute to table structure
or table content metrics
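A minimal sketch of the weighting scheme, assuming a per-page evaluation
record (the class and field names below are hypothetical, not the actual
`unstructured` API):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageTableEval:
    n_ground_truth_tables: int  # ground truth tables on this page
    n_predicted_tables: int     # predicted tables on this page
    table_level_acc: float      # per-page table detection accuracy


def weighted_table_level_acc(pages: List[PageTableEval]) -> Optional[float]:
    """Mean `table_level_acc` weighted by ground truth table counts.

    Pages with no ground truth tables but at least one predicted table
    (false positives) get one table's worth of weight for the whole page.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for page in pages:
        if page.n_ground_truth_tables > 0:
            weight = float(page.n_ground_truth_tables)
        elif page.n_predicted_tables > 0:
            weight = 1.0  # false-positive page counts as one table
        else:
            continue  # no tables at all: page does not affect this metric
        weighted_sum += weight * page.table_level_acc
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None
```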
## Test
This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table alongside the existing file with
2 tables
- tests that the weighted average is written to the report (see the toy
calculation below)
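As a toy calculation of the expected report value (the per-file
accuracies below are made up for illustration, not the real fixture
values):

```python
# Made-up per-file accuracies; the real values come from the test fixtures.
acc_two_table_file = 0.5  # file containing 2 ground truth tables
acc_one_table_file = 1.0  # file containing 1 ground truth table

# Weighted by table counts (2 and 1), not a plain mean of the two files.
expected = (2 * acc_two_table_file + 1 * acc_one_table_file) / (2 + 1)
assert abs(expected - 2 / 3) < 1e-9
```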
This pull request adds metrics that are calculated based on
`table_as_cells` instead of `text_as_html`. This change is required for
comprehensive metrics calculation: previously, every predicted colspan
or rowspan was counted as an incorrect prediction, even when the
prediction was correct.
This change has to be merged after
https://github.com/Unstructured-IO/unstructured/pull/2892, which
introduces the `table_as_cells` field.
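For intuition, here is a sketch contrasting the two representations; the
exact schema of `table_as_cells` introduced in PR #2892 may differ from
the assumed shape shown here:

```python
# With an HTML string, a correctly predicted colspan can still fail a
# naive comparison against ground truth HTML that serializes differently.
text_as_html = (
    "<table>"
    "<tr><td colspan='2'>Total</td></tr>"
    "<tr><td>a</td><td>b</td></tr>"
    "</table>"
)

# A cell-level representation records each cell's position and span
# explicitly, so spanning cells can be matched cell by cell.
# (Key names here are illustrative, not the actual schema.)
table_as_cells = [
    {"x": 0, "y": 0, "w": 2, "h": 1, "content": "Total"},
    {"x": 0, "y": 1, "w": 1, "h": 1, "content": "a"},
    {"x": 1, "y": 1, "w": 1, "h": 1, "content": "b"},
]
```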
This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include (a rough sketch of the row-level accuracy idea
follows the list):
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in
columns
- `element_row_level_content_acc` - accuracy of content detected in rows
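A minimal sketch of the idea behind the row-level content metric,
assuming exact row matching (the helper name and matching rule are
assumptions; the real metric may align rows and match content more
loosely):

```python
def row_level_content_acc(
    gt_rows: list[list[str]], pred_rows: list[list[str]]
) -> float:
    """Fraction of ground truth rows whose cell contents are fully matched.

    The column-level variant applies the same idea across columns,
    e.g. after transposing both tables.
    """
    if not gt_rows:
        return 0.0
    matched = sum(1 for gt, pred in zip(gt_rows, pred_rows) if gt == pred)
    return matched / len(gt_rows)
```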
TODO in next steps:
- create a minimal dataset and upload it to S3 for ingest tests
- generate and add metrics on the above dataset to
`test_unstructured_ingest/metrics`