From b3c3a13e03c27c0aa3a76aa84d7140b92f652c94 Mon Sep 17 00:00:00 2001
From: Jake Poznanski
Date: Thu, 10 Apr 2025 16:06:17 -0700
Subject: [PATCH] Update README.md

---
 olmocr/bench/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/olmocr/bench/README.md b/olmocr/bench/README.md
index 47ded44..5b47d44 100644
--- a/olmocr/bench/README.md
+++ b/olmocr/bench/README.md
@@ -67,4 +67,5 @@ Several categories of tests have been made so far:
 2. headers_footers -> We sampled documents from our internal crawled PDF repository. (The same from which olmOCR-mix was derived, though the likelyhood of duplicates is low, as there are 200M+ pdfs in this set). Then we used [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) to identify regions of the pages which were marked as headers/footers using the abandon category. We then got the text of those headers/footers regions by extracting them out and prompting Gemini, and we added them as test cases which should be absent. Manual review was then performed to remove mistakenly filtered text, and to set conditions such as limiting the search area to the first N or last N characters. Ex. if a page number "5" appears on the bottom a page, you want to test that your OCR system does not output a "5" in the last 20 characters of the page, but "5" could apepar earlier if in the actual body text.
 3. table_tests -> We sampled documents from our internal crawled PDF repository, and found those which had tables using gemini-flash-2.0. https://github.com/allenai/olmocr/blob/main/olmocr/bench/miners/mine_tables_gemini.py On pages that had tables, we then further asked gemini-flash-2.0 to tell us the relationships between randomly chosen cells. Those tests were then manually checked.
 4. multi_column -> We sampled documents from our internal crawled PDF repository manually, to find documents which had multi-column layouts and multiple articles on one page. Then, we used claude-sonnet-3.7 to render those pages to html, and from that html, we extracted text segments which were before/after one another. Then we manually reviewed each entry.
-5. old_scans -> We sampled documents from the library of congress which contained handwriting or typewritten text.
+5. old_scans -> We sampled documents from the library of congress which contained handwriting or typewritten text. (TODO)
+6. book_math -> We found old math textbooks in the public domain from the Internet Archive. We then extracted random pages from them, OCRed them, filtered down to pages which contained equations, and picked several random equations from each page to use as test cases. We then manually checked each test case to see that it was accurate capturing what was on the page. (TODO)
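
Review note: the headers_footers description (an absence test restricted to the first N or last N characters) may be easier to follow with a small example. The following is a minimal sketch only, not the benchmark's actual implementation; the function name `is_absent` and its parameters `first_n` and `last_n` are hypothetical and chosen purely for illustration.

```python
from typing import Optional


def is_absent(
    ocr_text: str,
    needle: str,
    first_n: Optional[int] = None,
    last_n: Optional[int] = None,
) -> bool:
    """Return True if `needle` does not occur in the searched region(s).

    Hypothetical helper: with no limits the whole text is searched;
    `first_n` / `last_n` restrict the search to the start / end of the text.
    """
    regions = []
    if first_n is not None:
        regions.append(ocr_text[:first_n])
    if last_n is not None:
        # ocr_text[-0:] would return the whole string, so guard last_n == 0.
        regions.append(ocr_text[-last_n:] if last_n > 0 else "")
    if not regions:
        regions.append(ocr_text)
    return all(needle not in region for region in regions)


# A page whose body mentions "5" and whose footer is the page number "5".
body = "Chapter 5 begins here. " + "body text " * 10
with_footer = body + "\n5"

# OCR output that leaked the page number fails the last-20-characters check.
leaked = is_absent(with_footer, "5", last_n=20)
# OCR output that dropped the footer passes, even though "5" is in the body.
clean = is_absent(body, "5", last_n=20)
```

Restricting the search window is what lets the body text keep its legitimate "5" while the stray page number at the bottom is still caught.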