diff --git a/README.md b/README.md index 3ac9b5d..002642c 100644 --- a/README.md +++ b/README.md @@ -4,8 +4,9 @@ Toolkit for training language models to work with PDF documents in the wild. olmOCR Logo +
-View the online demo here: [https://olmocr.allen.ai/](https://olmocr.allen.ai/) +Online demo: [https://olmocr.allen.ai/](https://olmocr.allen.ai/) What is included: - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py) @@ -37,6 +38,12 @@ pip install sgl-kernel --force-reinstall --no-deps pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/ ``` +## BETA TESTER NOTE: +If you are a beta tester, you will need to log in using the hugging-face CLI +to make sure you have access to https://huggingface.co/allenai/olmocr-preview + +`huggingface-cli login` + ### Local Usage Example The easiest way to try out olmOCR on one or two PDFs is to check out the [web demo](https://olmocr.allen.ai/). @@ -44,7 +51,7 @@ The easiest way to try out olmOCR on one or two PDFs is to check out the [web de Once you are ready to run locally, a local GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood. -This command will convert one PDF into a local workspace: +This command will convert one PDF into a directory called `localworkspace`: ```bash python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf ``` @@ -54,6 +61,19 @@ You can also bulk convert many PDFS with a glob pattern: python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf ``` +Once that finishes, output is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside of the `./localworkspace/results` directory. + +```bash +cat localworkspace/results/output_*.jsonl +``` + +You can view your documents side-by-side with the original PDF renders using the `dolmaviewer` command. 
+ +```python + +``` + + ### Multi-node / Cluster Usage If you want to convert millions of PDFs, using multiple nodes running in parallel, then olmOCR supports diff --git a/olmocr/viewer/__init__.py b/olmocr/viewer/__init__.py new file mode 100644 index 0000000..e69de29
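
The README change above says results land as Dolma-style JSONL under `./localworkspace/results`, and shows `cat` as the way to inspect them. As a rough sketch of reading the same files programmatically, the snippet below parses each line as JSON. The helper name `read_results` is hypothetical and not part of olmocr, and the `id` and `text` field names are an assumption based on the usual Dolma document layout, so the actual output should be checked before relying on them.

```python
import glob
import json


def read_results(results_dir="localworkspace/results"):
    """Yield one parsed JSON object per non-empty line of every output_*.jsonl file.

    Hypothetical helper for illustration; it is not part of the olmocr package.
    """
    for path in sorted(glob.glob(f"{results_dir}/output_*.jsonl")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    for doc in read_results():
        # "id" and "text" are assumed field names; .get avoids a KeyError if they are absent.
        print(doc.get("id"), len(doc.get("text", "")))
```

Using a generator keeps memory use flat when the results directory holds many large files, since only one line is parsed at a time.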