diff --git a/README.md b/README.md
index 3ac9b5d..002642c 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,9 @@ Toolkit for training language models to work with PDF documents in the wild.
+
-View the online demo here: [https://olmocr.allen.ai/](https://olmocr.allen.ai/)
+Online demo: [https://olmocr.allen.ai/](https://olmocr.allen.ai/)
What is included:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
@@ -37,6 +38,12 @@ pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
+## BETA TESTER NOTE:
+If you are a beta tester, you will need to log in using the Hugging Face CLI
+to make sure you have access to https://huggingface.co/allenai/olmocr-preview
+
+`huggingface-cli login`
+
### Local Usage Example
The easiest way to try out olmOCR on one or two PDFs is to check out the [web demo](https://olmocr.allen.ai/).
@@ -44,7 +51,7 @@ The easiest way to try out olmOCR on one or two PDFs is to check out the [web de
Once you are ready to run locally, a local GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang)
under the hood.
-This command will convert one PDF into a local workspace:
+This command will convert one PDF into a directory called `localworkspace`:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```
@@ -54,6 +61,19 @@ You can also bulk convert many PDFS with a glob pattern:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```
+Once that finishes, output is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside the `./localworkspace/results` directory.
+
+```bash
+cat localworkspace/results/output_*.jsonl
+```
+
+You can view your documents side by side with the original PDF renders using the `dolmaviewer` command.
+
+```python
+
+```
+
+
### Multi-node / Cluster Usage
If you want to convert millions of PDFs, using multiple nodes running in parallel, then olmOCR supports
diff --git a/olmocr/viewer/__init__.py b/olmocr/viewer/__init__.py
new file mode 100644
index 0000000..e69de29