mirror of https://github.com/allenai/olmocr.git (synced 2025-06-27 04:00:02 +00:00)
Update README.md
parent f0768bba3e
commit 4b4ba454ba
@@ -27,7 +27,7 @@ A toolkit for training language models to work with PDF documents in the wild.
Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
-What is included:
+What is included here:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
@@ -35,6 +35,11 @@ What is included:
- Processing millions of PDFs through a fine-tuned model using SGLang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
+See also:
+
+[**olmOCR-Bench**](https://github.com/allenai/olmocr/tree/main/olmocr/bench):
+A comprehensive benchmark suite covering over 1,400 documents to help measure performance of OCR systems
### Installation
Requirements: