Mirror of https://github.com/allenai/olmocr.git, synced 2025-11-08 14:40:24 +00:00
More readme improvements
This commit is contained in:
parent f16acec296
commit 4c35105bd4
24 README.md
@@ -4,8 +4,9 @@ Toolkit for training language models to work with PDF documents in the wild.
<img src="https://github.com/user-attachments/assets/d70c8644-3e64-4230-98c3-c52fddaeccb6" alt="olmOCR Logo" width="300"/>
<br/>
-View the online demo here: [https://olmocr.allen.ai/](https://olmocr.allen.ai/)
+Online demo: [https://olmocr.allen.ai/](https://olmocr.allen.ai/)
What is included:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
@@ -37,6 +38,12 @@ pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
## BETA TESTER NOTE:
If you are a beta tester, you will need to log in using the Hugging Face CLI
to make sure you have access to https://huggingface.co/allenai/olmocr-preview
`huggingface-cli login`
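
For non-interactive setups (for example a cluster job), the same login can be scripted. A minimal sketch, assuming an access token with preview access is exported in an `HF_TOKEN` environment variable (the variable name is just an illustration):

```bash
# Hypothetical non-interactive login; assumes HF_TOKEN holds a valid Hugging Face access token.
huggingface-cli login --token "$HF_TOKEN"
```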
### Local Usage Example
The easiest way to try out olmOCR on one or two PDFs is to check out the [web demo](https://olmocr.allen.ai/).
@@ -44,7 +51,7 @@ The easiest way to try out olmOCR on one or two PDFs is to check out the [web de
Once you are ready to run locally, a local GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang)
under the hood.
-This command will convert one PDF into a local workspace:
+This command will convert one PDF into a directory called `localworkspace`:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```
@@ -54,6 +61,19 @@ You can also bulk convert many PDFs with a glob pattern:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```
Once that finishes, output is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside of the `./localworkspace/results` directory.
```bash
cat localworkspace/results/output_*.jsonl
```
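
Each line of those files is a single JSON record. A minimal sketch for pulling out just the extracted text, assuming the Dolma-style records keep it in a `text` field (a field name taken from the Dolma convention, not confirmed by this diff) and that `jq` is installed:

```bash
# Sketch only: prints the "text" field of every record, assuming Dolma-style output.
jq -r '.text' localworkspace/results/output_*.jsonl | head -n 40
```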
You can view your documents side-by-side with the original PDF renders using the `dolmaviewer` command.
```python
```
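
A hedged sketch of how the `dolmaviewer` command might be invoked, assuming it is exposed as a module under the `olmocr.viewer` package added in this commit (the module path and arguments are assumptions, not confirmed by the diff):

```bash
# Hypothetical invocation; exact entry point and arguments are assumptions.
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
```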
### Multi-node / Cluster Usage
If you want to convert millions of PDFs, using multiple nodes running in parallel, then olmOCR supports
0 olmocr/viewer/__init__.py Normal file