diff --git a/README.md b/README.md
index fc79abe..a26051a 100644
--- a/README.md
+++ b/README.md
@@ -23,22 +23,90 @@

-A toolkit for training language models to work with PDF documents in the wild.
+A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.
 
 Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
 
-What is included here:
- - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- - An side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- - Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
+Features:
+ - Convert PDF, PNG, and JPEG based documents into clean Markdown
+ - Support for equations, tables, handwriting, and complex formatting
+ - Automatically removes headers and footers
+ - Convert into text with a natural reading order, even in the presence of
+   figures, multi-column layouts, and insets
+ - Efficient, less than $200 USD per million pages converted
+ - (Based on a 7B parameter VLM, so it requires a GPU)
 
-See also:
+### Benchmark
 [**olmOCR-Bench**](https://github.com/allenai/olmocr/tree/main/olmocr/bench):
-A comprehensive benchmark suite covering over 1,400 documents to help measure performance of OCR systems
+We also ship a comprehensive benchmark suite covering over 7,000 test cases across 1,400 documents to help measure performance of OCR systems.
+
+| Model                        | AR   | OSM  | TA   | OS   | HF   | MC   | LTT  | Base | Overall    |
+|------------------------------|------|------|------|------|------|------|------|------|------------|
+| Marker v1.6.2                | 24.3 | 22.1 | 69.8 | 24.3 | 87.1 | 71.0 | 76.9 | 99.5 | 59.4 ± 1.1 |
+| MinerU v1.3.10               | 75.4 | 47.4 | 60.9 | 17.3 | 96.6 | 59.0 | 39.1 | 96.6 | 61.5 ± 1.1 |
+| Mistral OCR API              | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
+| olmOCR v0.1.68 (pipeline.py) | 75.6 | 75.1 | 70.2 | 44.5 | 93.4 | 79.4 | 81.7 | 99.0 | 77.4 ± 1.0 |
+
 ### Installation
 
@@ -54,7 +122,8 @@
 sudo apt-get update
 sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
 ```
-Set up a conda environment and install olmocr
+Set up a conda environment and install olmocr. The requirements for running olmOCR
+are difficult to install in an existing python environment, so please do make a clean python environment to install into.
 ```bash
 conda create -n olmocr python=3.11
 conda activate olmocr
@@ -72,6 +141,7 @@ pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flas
 ### Local Usage Example
 
 For quick testing, try the [web demo](https://olmocr.allen.ai/). To run locally, a GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood.
+
 Convert a Single PDF:
 ```bash
 python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
@@ -186,6 +256,15 @@
 options:
                         Beaker priority level for the job
 ```
 
+## Code overview
+
+There are some nice reusable pieces of the code that may be useful for your own projects:
+ - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
+ - A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
+ - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
+ - Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
+ - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
+ - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
+
 ## Team
diff --git a/olmocr/data/buildsilver.py b/olmocr/data/buildsilver.py
index f5879d7..db4f908 100644
--- a/olmocr/data/buildsilver.py
+++ b/olmocr/data/buildsilver.py
@@ -29,29 +29,6 @@ def build_page_query(local_pdf_path: str, pretty_pdf_path: str, page: int) -> di
     image_base64 = render_pdf_to_base64png(local_pdf_path, page, TARGET_IMAGE_DIM)
     anchor_text = get_anchor_text(local_pdf_path, page, pdf_engine="pdfreport")
 
-    # DEBUG crappy temporary code here that does the actual api call live so I can debug it a bit
-    # from openai import OpenAI
-    # client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
-
-    # response = client.chat.completions.create(
-    #     model="gpt-4o-2024-08-06",
-    #     messages= [
-    #         {
-    #             "role": "user",
-    #             "content": [
-    #                 {"type": "text", "text": build_openai_silver_data_prompt(anchor_text)},
-    #                 {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
-    #             ],
-    #         }
-    #     ],
-    #     temperature=0.1,
-    #     max_tokens=3000,
-    #     logprobs=True,
-    #     top_logprobs=5,
-    #     response_format=openai_response_format_schema()
-    # )
-    # print(response)
-
     # Construct OpenAI Batch API request format#
     # There are a few tricks to know when doing data processing with OpenAI's apis
     # First off, use the batch query system, it's 1/2 the price and exactly the same performance
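For context on the comments left in `build_page_query`: the function targets OpenAI's Batch API, where each request is one JSON object per line of an uploaded `.jsonl` file and, per the comment above, runs at roughly half the price of the synchronous API. Below is a minimal sketch of what one such request line can look like; the `custom_id`, prompt text, and image string are placeholder values for illustration, not values produced by olmocr.

```python
# Minimal sketch of a single OpenAI Batch API request line (one JSON object per
# line of the uploaded .jsonl file). The custom_id, prompt text, and image data
# below are placeholders, not values produced by olmocr.
import json

prompt_text = "Convert this page to natural text."   # placeholder prompt
image_base64 = "<base64-encoded PNG of the page>"    # placeholder image data

request_line = {
    "custom_id": "example.pdf-page-1",   # lets you match each response back to its page
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-2024-08-06",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt_text},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
                ],
            }
        ],
        "temperature": 0.1,
        "max_tokens": 3000,
    },
}

# Each line of the batch file is one request like this; the file is then
# uploaded and submitted as a batch job.
print(json.dumps(request_line))
```

The body mirrors the synchronous chat-completions call shown in the deleted debug block, just wrapped with `custom_id`, `method`, and `url` so responses can be matched back to their source pages.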