olmOCR-Bench
Dataset Link: https://huggingface.co/datasets/allenai/olmOCR-bench
We developed olmOCR-Bench to automatically and effectively evaluate document-level OCR across a variety of tools.
olmOCR-Bench works by testing various "facts" about document pages at the PDF level. Our intention is that each "fact" is simple, unambiguous, and machine-checkable, similar to a unit test. For example, once your document has been OCRed, we may check that a particular sentence appears, character for character, somewhere on the page.
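As an illustration, here is a minimal sketch of what such a unit-test-style check might look like in code. The field names, file names, and helper below are hypothetical and are not the benchmark's actual schema; the real test definitions live in the JSONL annotation files shipped with the dataset.

```python
import unicodedata

# A hypothetical "fact" about one page, similar in spirit to a unit test.
# The field names here are illustrative, not olmOCR-Bench's actual JSON schema.
fact = {
    "pdf": "some_page.pdf",
    "type": "present",   # the sentence must appear somewhere in the OCR output
    "text": "The quick brown fox jumps over the lazy dog.",
}

def check_fact(fact: dict, ocr_output: str) -> bool:
    """Pass/fail: does the expected text appear anywhere in the OCR output?"""
    normalized = unicodedata.normalize("NFC", ocr_output)
    return fact["text"] in normalized

# Run the check against the Markdown/plain text your OCR tool produced for the page.
with open("some_page.md", encoding="utf-8") as f:
    print(check_fact(fact, f.read()))
```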
We stay away from soft metrics like edit distance, because they may penalize parses of the document that differ from the reference but are in fact still correct. For example, on a page containing multiple distinct articles, you want the text of each article to be grouped together, but the relative order of the articles may not be critical. Conversely, some errors that matter a great deal, such as switching x and y in an equation, would register as only a single-character edit under an edit-distance metric.
olmOCR-bench operates directly on single-page PDFs. We make this choice because PDFs preserve digital metadata and information that may be helpful to some OCR systems. Almost any other format can be converted to a PDF, but not the reverse, so we try to preserve the original documents where possible.
We have run the benchmark against some contemporary OCR pipelines, but it is really easy to run it against your own OCR tools. Your tool just needs to support Markdown or plain text output.

Results
Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall |
---|---|---|---|---|---|---|---|---|---|
GOT OCR | 52.7 | 52.0 | 0.20 | 22.1 | 93.6 | 42.0 | 29.9 | 94.0 | 48.3 ± 1.1 |
Marker v1.7.5 (base, force_ocr) | 76.0 | 57.9 | 57.6 | 27.8 | 84.9 | 72.9 | 84.6 | 99.1 | 70.1 ± 1.1 |
MinerU v1.3.10 | 75.4 | 47.4 | 60.9 | 17.3 | 96.6 | 59.0 | 39.1 | 96.6 | 61.5 ± 1.1 |
Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
Nanonets OCR | 67.0 | 68.6 | 77.7 | 39.5 | 40.7 | 69.9 | 53.4 | 99.3 | 64.5 ± 1.1 |
GPT-4o (No Anchor) | 51.5 | 75.5 | 69.1 | 40.9 | 94.2 | 68.9 | 54.1 | 96.7 | 68.9 ± 1.1 |
GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 |
Gemini Flash 2 (No Anchor) | 32.1 | 56.3 | 61.4 | 27.8 | 48.0 | 58.7 | 84.4 | 94.0 | 57.8 ± 1.1 |
Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 |
Qwen 2 VL (No Anchor) | 19.7 | 31.7 | 24.2 | 17.1 | 88.9 | 8.3 | 6.8 | 55.5 | 31.5 ± 0.9 |
Qwen 2.5 VL (No Anchor) | 63.1 | 65.7 | 67.3 | 38.6 | 73.6 | 68.3 | 49.1 | 98.3 | 65.5 ± 1.2 |
olmOCR v0.1.75 (No Anchor) | 71.5 | 71.4 | 71.4 | 42.8 | 94.1 | 77.7 | 71.0 | 97.8 | 74.7 ± 1.1 |
olmOCR v0.1.75 (Anchored) | 74.9 | 71.2 | 71.0 | 42.2 | 94.5 | 78.3 | 73.3 | 98.3 | 75.5 ± 1.0 |
olmOCR v0.2.0 | 78.8 | 77.5 | 71.9 | 45.4 | 94.2 | 78.6 | 81.4 | 99.8 | 78.5 ± 1.1 |
olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | 95.1 | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 |
There was a small drop in scores from olmOCR v0.1.68 (77.4), which is due to two factors. First, we adjusted our benchmark code to not include any "fallback" mechanism when measuring benchmark scores (the fallback still exists when you run olmocr.pipeline). Second, scores dropped slightly when we updated from sglang 0.4.2 to vllm 0.9.1. On net, we think the upgrade to vllm is the right choice: sglang 0.4.6 scored a further point lower, and vllm comes with a small performance boost and good support for quantization.
Sourcing Documents and Tests
We define 7 distinct document types that olmOCR (or its earlier iterations) often struggled to process, and we designed a custom acquisition strategy for each (described below). We removed documents that both contained PII and were not meant for public dissemination. We also decontaminated against documents that appear in olmOCR-Mix via URL-level deduplication. To scale the creation of test cases over these documents, we combined manual design and review with prompting GPT-4o.
Document Types
- arXiv Math (AR): We downloaded a recent set of papers from the math subset of arXiv, selecting manuscripts with a single TeX source file and a corresponding rendered PDF. To select a candidate LaTeX expression from a page to use in a test, we (1) ran olmOCR to identify candidate pages with TeX, (2) matched pages back to the original TeX source, and (3) validated that the matched TeX renders with KaTeX. We manually verified the final set of test cases to exclude instances where custom macros produce renderings that deviate from standard LaTeX and to split multi-part equations into smaller test cases.
- Old Scans Math (OSM): We crawled old, public domain math textbooks from the Internet Archive, extracting random pages from these documents. We similarly used olmOCR to find candidate pages with formulas, but this time manually annotated each formula on the page to use as a test case.
- Tables (TA): We sampled more documents from the same internal crawled PDF repository used to create olmOCR-Mix and filtered to those which had tables using a simple prompt with Gemini-Flash-2.0. On pages with tables, we prompted Gemini-Flash-2.0 for the relationships between randomly chosen cells. We manually reviewed those tests for accuracy.
- Old Scans (OS): We sampled historical letters and typewritten documents with existing human transcriptions from the Library of Congress digital archives. We then wrote a small script to generate Natural Reading Order cases consisting of sentences that naturally appear before or after one another in the original human transcriptions (a sketch of this idea appears after this list). We manually added test cases to cover headers/footers that should have been excluded from any OCR version of these documents. All of the test cases then underwent a second pass of human review for accuracy.
- Headers Footers (HF): We sampled documents from the same internally crawled PDF repository as olmOCR-Mix. We used DocLayout-YOLO to identify page regions labeled as headers or footers via its abandon category. To extract the text from these header/footer regions, we visually masked out the rest of the document and prompted Gemini-Flash-2.0 for the content. These extracted snippets were added as test cases whose text should be absent from the linearized output. We manually reviewed the tests to remove text that was mistakenly flagged and to set conditions such as limiting the search area to the first N or last N characters.
- Multi Column (MC): We visually sampled documents from our internal crawled PDF repository to find documents with multi-column layouts and multiple articles on one page. We used Claude-Sonnet-3.7 to render those pages to HTML, and from that HTML we extracted pairs of text segments that appear before/after one another. We manually reviewed each entry for accuracy. We purposely selected simple text blocks from coherent regions of the document, avoiding math formulas, superscripts, or subscripts in these tests.
- Long Tiny Text (LTT): We crawled documents from the Internet Archive containing a large amount of dense, small print on a single page. Such documents include pages from a dictionary or pages of references from academic papers. We then generated test cases using Gemini-Flash-2.0 and verified them manually.
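As a rough illustration of the Natural Reading Order cases mentioned above, the sketch below shows one way adjacent sentence pairs could be pulled from a human transcription. It is not the actual script we used, and the naive sentence-splitting heuristic is an assumption for demonstration purposes only.

```python
import re

def reading_order_pairs(transcription: str, max_pairs: int = 5):
    """Illustrative only: derive (before, after) sentence pairs from a human
    transcription, where each pair must keep its relative order in OCR output.
    The naive regex sentence splitter is an assumption, not the real script."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcription) if s.strip()]
    return list(zip(sentences, sentences[1:]))[:max_pairs]

# Example: each pair asserts that the first sentence appears before the second.
for before, after in reading_order_pairs("First sentence. Second sentence. Third one."):
    print(f"ASSERT order: {before!r} appears before {after!r}")
```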
Benchmark Principles
As we created olmOCR-bench, we also kept a few general rules in mind:
- We expect your OCR system to output a plain-text Unicode document in a reading order that would be considered natural.
- Documents from the benchmark should fit on a standard A4 piece of paper and still be readable to a human.
- Markdown syntax is allowed, but ignored. Ex. if we are looking for the word "enlightenment" to appear on a page, and your system outputs "**enlightenment**" in Markdown bold, that still counts.
- olmOCR-bench is not position sensitive, ex. we check that a sentence or math equation appears anywhere on a page. The exception to this is header/footer tests where we want to find simple page numbers appearing in the first or last few characters of a page.
- Tables can be in either Markdown syntax or as an HTML <table>.
- Math equations must render with KaTeX and be delimited with $, $$, \(, or \[.
- Math equations are not position sensitive either, so if we are checking for 3x^2 to appear on a page, then outputting \int_a^b{3x^2 dx} counts.
- We normalize all Unicode to NFC before running the benchmark, so whether your OCR model outputs é or e + ◌́, your benchmark score is not affected.
- We normalize all the different variants of hyphens to the ASCII -, all variants of double quotes to the ASCII ", and all variants of single quotes/apostrophes to the ASCII '. You should score the same on the benchmark whether you output - or — (see the normalization sketch after this list).
- All facts checked about documents are either pass/fail. We want it to be very clear if your OCR system fails a test, and if so, what output would make it pass.
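To make the normalization rules above concrete, here is a rough sketch of the kind of preprocessing involved. The exact character sets and order of operations in the real benchmark may differ; treat this as illustrative only.

```python
import unicodedata

# Illustrative normalization, in the spirit of the rules above; the real
# benchmark's character sets and details may differ.
HYPHENS = "‐‑‒–—―"          # various dash/hyphen code points -> ASCII "-"
DOUBLE_QUOTES = "“”„«»"      # curly/angled double quotes     -> ASCII '"'
SINGLE_QUOTES = "‘’‚‹›"      # curly/angled single quotes     -> ASCII "'"

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # é and e + ◌́ become identical
    for ch in HYPHENS:
        text = text.replace(ch, "-")
    for ch in DOUBLE_QUOTES:
        text = text.replace(ch, '"')
    for ch in SINGLE_QUOTES:
        text = text.replace(ch, "'")
    return text

print(normalize("“Café” — naïve") == '"Café" - naïve')  # True
```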
olmOCR-Bench Test classes
- Text presence
- This task makes sure that a given small piece of text (ex. 1-3 sentence level) is present within a parsed document. Soft/fuzzy matching is allowed, as well as specifying if the text must be in the first N or last N characters of the document. Case sensitive by default. (A sketch of these presence/absence checks appears after this list.)
- Text absence
- This task makes sure that a given piece of text does NOT appear in the OCR'ed version of a document. We generally want our OCR systems to filter out content like headers/footers/page numbers from documents. The same fuzzy matching as in Text Presence tests is allowed.
- Natural Reading Order
- This task ensures that blocks of text which are present have a defined order relative to one another. For example, on a document that contains multiple news articles on one page, you'd want to see that the first sentence of the first article appears after the heading of that article. But, you may be okay with swapping the order of those two articles.
- Table Accuracy
- Both Markdown and HTML based tables are supported. These tests check that a cell with a given text exists somewhere in the table, and that its neighbors have certain properties. Ex. A cell exists on this page with text "4.5%" and above that is a cell with the text "2.4%". However, it's important to note that some tests depend on rowspan and colspan information being present in the table, which is only available with HTML based tables. This means that a model outputting only markdown tables cannot achieve a max score on this section.
- Math Formula Accuracy
- We render a given LaTeX-style equation using KaTeX in a headless browser and then check whether it exists anywhere in the final OCRed document. Matching is performed at a relative symbol level, ex. in "\f\relax{x} = \int_{-\infty}^\infty x^2dx" we check that a ∫ appears to the left of an x, an x appears to the left of dx, etc.
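To give a feel for how the simpler test classes behave, the sketch below implements presence/absence checks with an optional first-N/last-N constraint. It mirrors the descriptions above but is not the benchmark's actual implementation (for that, see the olmocr/bench code in the repository); exact string matching stands in for the fuzzy matching the real tests support.

```python
# Illustrative pass/fail checks mirroring the Text presence / Text absence
# descriptions above; not the benchmark's actual implementation.
def text_present(output: str, needle: str, first_n: int | None = None,
                 last_n: int | None = None) -> bool:
    """Check that `needle` appears in the output, optionally restricted to
    the first N or last N characters of the document."""
    region = output
    if first_n is not None:
        region = output[:first_n]
    elif last_n is not None:
        region = output[-last_n:]
    return needle in region

def text_absent(output: str, needle: str, **kwargs) -> bool:
    return not text_present(output, needle, **kwargs)

ocr_output = "Page 3\nThe treaty was signed in 1648.\nFooter text"
print(text_present(ocr_output, "signed in 1648"))    # True: sentence is present
print(text_absent(ocr_output, "Page 3", first_n=20))  # False: page number leaked into the first 20 chars
```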
Downloading and running the benchmark
Currently the full benchmark data is located here: https://huggingface.co/datasets/allenai/olmOCR-bench
To run a benchmark, first install the bench requirements
conda create -n olmocr python=3.11
conda activate olmocr
git clone https://github.com/allenai/olmocr.git
cd olmocr
# Install olmocr and the requirements needed to run the benchmark
pip install -e .[bench]
# Configure playwright headless browser to run the math rendering tests
playwright install chromium
# Now download the benchmark data from Hugging Face; this includes the PDFs and JSON annotation data
huggingface-cli download --repo-type dataset --resume-download allenai/olmOCR-bench --local-dir ./olmOCR-bench
Convert your documents
# You will need to install the [gpu] subset of olmocr dependencies to run gpu inference
pip install olmocr[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
# Convert using the same engine as olmOCR's pipeline.py uses; see the olmocr/bench/runners directory for options
python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data
# Or use convert_all.sh to run OCR with many common frameworks all at once (API keys will be required)
./olmocr/bench/scripts/convert_all.sh
Now run the benchmark
python -m olmocr.bench.benchmark --dir ./olmOCR-bench/bench_data
Previewing the benchmark questions
We have an internal data annotation tool that can be used to review the questions in the benchmark and make edits.
python -m olmocr.bench.review_app --port 5000 --debug ./olmOCR-bench/bench_data/multi_column.jsonl --force