diff --git a/README.md b/README.md
index fc79abe..a26051a 100644
--- a/README.md
+++ b/README.md
@@ -23,22 +23,90 @@
-A toolkit for training language models to work with PDF documents in the wild.
+A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.
Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
-What is included here:
- - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- - An side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- - Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
+Features:
+ - Converts PDF, PNG, and JPEG documents into clean Markdown
+ - Supports equations, tables, handwriting, and complex formatting
+ - Automatically removes headers and footers
+ - Converts to text in a natural reading order, even in the presence of
+   figures, multi-column layouts, and insets
+ - Efficient: less than $200 USD per million pages converted
+ - Based on a 7B-parameter VLM, so it requires a GPU
-See also:
+### Benchmark
[**olmOCR-Bench**](https://github.com/allenai/olmocr/tree/main/olmocr/bench):
-A comprehensive benchmark suite covering over 1,400 documents to help measure performance of OCR systems
+We also ship a comprehensive benchmark suite covering over 7,000 test cases across 1,400 documents to help measure the performance of OCR systems.
+
+
+| Model | AR | OSM | TA | OS | HF | MC | LTT | Base | Overall |
+|-------|----|-----|----|----|----|----|-----|------|---------|
+| Marker v1.6.2 | 24.3 | 22.1 | 69.8 | 24.3 | 87.1 | 71.0 | 76.9 | 99.5 | 59.4 ± 1.1 |
+| MinerU v1.3.10 | 75.4 | 47.4 | 60.9 | 17.3 | 96.6 | 59.0 | 39.1 | 96.6 | 61.5 ± 1.1 |
+| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
+| olmOCR v0.1.68 (pipeline.py) | 75.6 | 75.1 | 70.2 | 44.5 | 93.4 | 79.4 | 81.7 | 99.0 | 77.4 ± 1.0 |
+
+*Categories: AR = arXiv math, OSM = old scans math, TA = tables, OS = old scans, HF = headers and footers, MC = multi-column, LTT = long tiny text.*
+
### Installation
@@ -54,7 +122,8 @@ sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```
-Set up a conda environment and install olmocr
+Set up a conda environment and install olmocr. Its requirements can be difficult
+to install into an existing Python environment, so we recommend creating a clean environment to install into.
```bash
conda create -n olmocr python=3.11
conda activate olmocr
@@ -72,6 +141,7 @@ pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flas
### Local Usage Example
For quick testing, try the [web demo](https://olmocr.allenai.org/). To run locally, a GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood.
+
Convert a Single PDF:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
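# Results are written as Dolma-style JSONL under the workspace; this
# default output path is an assumption worth verifying for your version:
cat localworkspace/results/output_*.jsonl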
@@ -186,6 +256,15 @@ options:
Beaker priority level for the job
```
+## Code overview
+
+Some reusable pieces of the codebase may be useful for your own projects:
+ - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
+ - A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
+ - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
+ - Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
+ - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
+ - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
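
For example, a minimal sketch of consuming the Dolma-format JSONL that these tools produce, assuming only the standard Dolma `id` and `text` fields (`iter_dolma_texts` is a hypothetical helper, not part of olmocr):

```python
import json

def iter_dolma_texts(path):
    """Yield (id, text) pairs from a Dolma-format JSONL file,
    one JSON document per line."""
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            yield doc["id"], doc["text"]
```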
## Team
diff --git a/olmocr/data/buildsilver.py b/olmocr/data/buildsilver.py
index f5879d7..db4f908 100644
--- a/olmocr/data/buildsilver.py
+++ b/olmocr/data/buildsilver.py
@@ -29,29 +29,6 @@ def build_page_query(local_pdf_path: str, pretty_pdf_path: str, page: int) -> di
image_base64 = render_pdf_to_base64png(local_pdf_path, page, TARGET_IMAGE_DIM)
anchor_text = get_anchor_text(local_pdf_path, page, pdf_engine="pdfreport")
- # DEBUG crappy temporary code here that does the actual api call live so I can debug it a bit
- # from openai import OpenAI
- # client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
-
- # response = client.chat.completions.create(
- # model="gpt-4o-2024-08-06",
- # messages= [
- # {
- # "role": "user",
- # "content": [
- # {"type": "text", "text": build_openai_silver_data_prompt(anchor_text)},
- # {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
- # ],
- # }
- # ],
- # temperature=0.1,
- # max_tokens=3000,
- # logprobs=True,
- # top_logprobs=5,
- # response_format=openai_response_format_schema()
- # )
- # print(response)
-
    # Construct the OpenAI Batch API request format.
    # There are a few tricks to know when doing data processing with OpenAI's APIs:
    # First off, use the batch query system: it's half the price with exactly the same performance
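
As a sketch of what the comments above describe, one line of an OpenAI Batch API input file looks like this. The `custom_id`/`method`/`url`/`body` fields are the documented batch request format; `build_batch_request` is a hypothetical helper whose body mirrors the parameters from the removed debug code:

```python
import json

def build_batch_request(custom_id: str, prompt: str, image_base64: str) -> str:
    """Build one JSONL line for an OpenAI Batch API input file."""
    request = {
        "custom_id": custom_id,           # used to match responses back to pages
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-2024-08-06",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
                    ],
                }
            ],
            "temperature": 0.1,
            "max_tokens": 3000,
        },
    }
    return json.dumps(request)
```

Each call produces one line; concatenating the lines yields the `.jsonl` file uploaded to the batch endpoint.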