diff --git a/README.md b/README.md
index 6ee0095..fc79abe 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ A toolkit for training language models to work with PDF documents in the wild.
 
 Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
 
-What is included:
+What is included here:
 - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
 - An side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
 - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
@@ -35,6 +35,11 @@ What is included:
 - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
 - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
 
+See also:
+
+[**olmOCR-Bench**](https://github.com/allenai/olmocr/tree/main/olmocr/bench):
+A comprehensive benchmark suite covering over 1,400 documents to help measure performance of OCR systems
+
 ### Installation
 
 Requirements: