This PR adds new table evaluation metrics prepared by @leah1985.
The metrics include:
- `table count` (check)
- `table_level_acc` - accuracy of table detection
- `element_col_level_index_acc` - accuracy of cell detection in columns
- `element_row_level_index_acc` - accuracy of cell detection in rows
- `element_col_level_content_acc` - accuracy of content detected in columns
- `element_row_level_content_acc` - accuracy of content detected in rows
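
To make the difference between the index-level and content-level metrics concrete, here is a rough sketch of what each could measure. It is not the implementation added in this PR: the `Cell` type, the function names `index_accuracy` and `content_accuracy`, the exact-content alignment rule, and the use of `difflib` for text similarity are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell representation, not the PR's data model."""
    row: int
    col: int
    content: str


def index_accuracy(predicted, expected, axis):
    """Share of expected cells whose same-content predicted cell has the
    same index along `axis` ("row" or "col").

    Assumption: cells are aligned by exact content match, so duplicate
    contents are not handled (the last predicted cell wins).
    """
    if not expected:
        return 0.0
    predicted_index = {cell.content: getattr(cell, axis) for cell in predicted}
    hits = sum(
        predicted_index.get(cell.content) == getattr(cell, axis)
        for cell in expected
    )
    return hits / len(expected)


def content_accuracy(predicted, expected, axis):
    """Mean text similarity between predicted and expected content,
    grouped by the index along `axis`."""

    def group(cells):
        groups = defaultdict(list)
        for cell in sorted(cells, key=lambda c: (c.row, c.col)):
            groups[getattr(cell, axis)].append(cell.content)
        return {index: " ".join(parts) for index, parts in groups.items()}

    expected_groups, predicted_groups = group(expected), group(predicted)
    if not expected_groups:
        return 0.0
    ratios = [
        SequenceMatcher(None, text, predicted_groups.get(index, "")).ratio()
        for index, text in expected_groups.items()
    ]
    return sum(ratios) / len(ratios)
```

Under these assumptions, a prediction that swaps two cells within a row keeps a perfect row index score while the column index score drops, which is the kind of distinction the separate row and column metrics are meant to surface.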
TODO in next steps:
- create a minimal dataset and upload to s3 for ingest tests
- generate and add metrics on the above dataset to `test_unstructured_ingest/metrics`
The sample docs directory contains the following files:
- `example-10k.html` - A 10-K SEC filing in HTML format
- `layout-parser-paper.pdf` - A PDF copy of the layout parser paper
- `factbook.xml`/`factbook.xsl` - Example XML/XSL files that you can use to test stylesheets
These documents can be used to test out the parsers in the library. In addition, here are
instructions for pulling in some sample docs that are too big to store in the repo.
XBRL 10-K
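You can get an example 10-K in inline XBRL format using the following curl command. Note that
you need to set the user agent in the header, or the SEC site will reject your request. Replace
`${organization}` and `${email}` with your own organization name and contact email.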
    curl -O \
      -A '${organization} ${email}' \
      https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
You can parse this document using the HTML parser.
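
For scripted setups, the same download can be sketched in Python using only the standard library. This is a minimal sketch, not part of the repo: the helper names `build_sec_request` and `download` are hypothetical, and the organization and email values are placeholders you supply yourself.

```python
import urllib.request

# Same filing URL as in the curl command above.
SEC_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_sec_request(organization, email, url=SEC_URL):
    """Build a GET request whose User-Agent header identifies the caller.

    Hypothetical helper: mirrors curl's `-A '<organization> <email>'`.
    """
    user_agent = f"{organization} {email}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(request, destination):
    """Send the request and write the response body to `destination`."""
    with urllib.request.urlopen(request) as response:
        body = response.read()
    with open(destination, "wb") as out:
        out.write(body)
```

Calling `download(build_sec_request("Example Org", "contact@example.com"), "0001171843-21-001344.txt")` with your own values in place of the placeholders would save the filing under the same name that `curl -O` uses.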