Readme improvements

parent dee494ad7b
commit f16acec296

README.md (82 changed lines)
Toolkit for training language models to work with PDF documents in the wild.

<img src="https://github.com/user-attachments/assets/d70c8644-3e64-4230-98c3-c52fddaeccb6" alt="olmOCR Logo" width="300"/>

View the online demo here: [https://olmocr.allen.ai/](https://olmocr.allen.ai/)
What is included:

- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- Viewing Dolma Docs created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
### Installation

You will need to install poppler-utils, along with some fonts, so that any PDFs you render come out looking nice.

Linux Ubuntu/Debian:

```bash
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```
Then install the olmocr package itself:

```bash
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
```
Finally, make sure you have sglang with flashinfer installed if you want to do efficient inference:

```bash
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
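After installing, a quick way to confirm the inference stack is usable (an illustrative check, not part of the original instructions) is to import the packages and look for a GPU:

```python
# Illustrative sanity check for the packages installed above; if either
# import fails, revisit the pip commands.
import sglang
import torch

print("sglang version:", sglang.__version__)
print("CUDA available:", torch.cuda.is_available())
```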
### Local Usage Example

The easiest way to try out olmOCR on one or two PDFs is to check out the [web demo](https://olmocr.allen.ai/).

Once you are ready to run locally, a local GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood.

This command will convert one PDF into a local workspace:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```

You can also bulk convert many PDFs with a glob pattern:

```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```
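The pipeline writes Dolma-formatted documents into the workspace. As a rough sketch for inspecting a finished local run (assuming the default layout, where results land as JSONL files under the workspace's `results/` folder with one JSON document per line), you could read the extracted text back like this:

```python
# Rough sketch: inspect Dolma-formatted output from a local run.
# The results/*.jsonl layout and the "text" field are assumptions
# based on the default workspace layout; adjust for your setup.
import glob
import json

for path in glob.glob("./localworkspace/results/*.jsonl"):
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            print(path, "->", len(doc.get("text", "")), "characters of text")
```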
### Multi-node / Cluster Usage

If you want to convert millions of PDFs using multiple nodes running in parallel, olmOCR supports reading your PDFs from AWS S3 and coordinating the work through an AWS S3 output bucket.

For example, you can start this command on your first worker node, and it will set up a simple work queue in your AWS bucket and start converting PDFs:

```bash
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
```
Now on any subsequent nodes, just run this and they will start grabbing items from the same workspace queue:

```bash
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
```
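Results from every worker accumulate under the workspace's `results/` prefix in S3. As an illustrative sketch (assuming standard boto3 credentials and the bucket layout from the commands above), you could monitor what has been produced so far:

```python
# Illustrative sketch: list converted output accumulating in the shared
# S3 workspace. The bucket and results/ prefix mirror the example
# commands above; substitute your own workspace path.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(
    Bucket="my_s3_bucket",
    Prefix="pdfworkspaces/exampleworkspace/results/",
):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"], "bytes")
```

The pipeline's own `--stats` flag (see the help output below) reports similar workspace statistics without any custom code.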
If you are at AI2 and want to linearize millions of PDFs efficiently using [beaker](https://www.beaker.org), just add the `--beaker` flag. This will prepare the workspace on your local machine, and then launch N GPU workers in the cluster to start converting PDFs.

For example:

```bash
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
```
```bash
python -m olmocr.pipeline --help
usage: pipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
                   [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--model MODEL]
                   [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
                   [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER]
                   [--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
                   workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The filesystem path where work will be stored, can be a local folder, or an s3 path if coordinating work with many workers, s3://bucket/prefix/

options:
  -h, --help            show this help message and exit
  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
  --workspace_profile WORKSPACE_PROFILE
                        S3 configuration profile for accessing the workspace
  --pdf_profile PDF_PROFILE
  --max_page_error_rate MAX_PAGE_ERROR_RATE
                        Rate of allowable failed pages in a document, 1/250 by default
  --workers WORKERS     Number of workers to run at a time
  --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
  --stats               Instead of running any job, reports some statistics about the current workspace
  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the one which is fastest to access
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
```