More readme improvements

Jake Poznanski 2025-01-29 11:23:04 -08:00
parent f16acec296
commit 4c35105bd4
2 changed files with 22 additions and 2 deletions


@@ -4,8 +4,9 @@ Toolkit for training language models to work with PDF documents in the wild.
<img src="https://github.com/user-attachments/assets/d70c8644-3e64-4230-98c3-c52fddaeccb6" alt="olmOCR Logo" width="300"/>
<br/>
Online demo: [https://olmocr.allen.ai/](https://olmocr.allen.ai/)
What is included:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
@@ -37,6 +38,12 @@ pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
## BETA TESTER NOTE:
If you are a beta tester, you will need to log in using the Hugging Face CLI
to make sure you have access to https://huggingface.co/allenai/olmocr-preview:
`huggingface-cli login`
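If you prefer to verify access from Python rather than the shell, a rough sketch using the `huggingface_hub` library (this is not part of the olmOCR instructions; `login` and `model_info` are standard `huggingface_hub` calls) might look like:

```python
from huggingface_hub import login, model_info

# Prompts for a Hugging Face access token; equivalent in effect to
# running `huggingface-cli login` in a shell.
login()

# Raises an HTTP error (401/403) if your account cannot see the gated
# preview repository, so a successful print means you have access.
info = model_info("allenai/olmocr-preview")
print(f"Access OK: {info.id}")
```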
### Local Usage Example
The easiest way to try out olmOCR on one or two PDFs is to check out the [web demo](https://olmocr.allen.ai/).
@@ -44,7 +51,7 @@ The easiest way to try out olmOCR on one or two PDFs is to check out the [web de
Once you are ready to run locally, a local GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang)
under the hood.
This command will convert one PDF into a directory called `localworkspace`:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```
@@ -54,6 +61,19 @@ You can also bulk convert many PDFs with a glob pattern:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```
Once that finishes, output is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside the `./localworkspace/results` directory.
```bash
cat localworkspace/results/output_*.jsonl
```
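Each results file is newline-delimited JSON, one document per line. As a minimal sketch of inspecting the output from Python (the `text` and `id` field names follow the general Dolma convention and are assumed here rather than taken from this README):

```python
import glob
import json

# Iterate over the Dolma-style JSONL files written by the pipeline
# (path taken from the example above).
for path in sorted(glob.glob("localworkspace/results/output_*.jsonl")):
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # "text" holds the extracted document text in the Dolma convention;
            # this field name is assumed here, adjust if your output differs.
            preview = doc.get("text", "")[:200].replace("\n", " ")
            print(doc.get("id", path), "->", preview)
```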
You can view your documents side-by-side with the original PDF renders using the `dolmaviewer` command.
```bash
# Example invocation; the module path is assumed from the repository layout
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
```
### Multi-node / Cluster Usage
If you want to convert millions of PDFs, using multiple nodes running in parallel, then olmOCR supports
