- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/data/buildsilver.py)
- An eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/filter/filter.py)
- Finetuning code for Qwen2-VL (and soon other VLMs) - [train.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/train/train.py)
- Processing millions of PDFs through a finetuned model using VLLM (requires [birr](https://github.com/allenai/mise/tree/main/birr)) - [birrpipeline.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/birrpipeline.py)
- Viewing Dolma Docs created from PDFs - [dolmaviewer.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/viewer/dolmaviewer.py)
You will also need to install the latest pypdf, which contains fixes for processing PDF documents. Hopefully these fixes will soon be included in the next full release.
If you want to run a fine-tuned model to linearize millions of PDFs, use the [birrpipeline.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/birrpipeline.py) script.
birrpipeline.py will take as input all of your PDFs (stored in S3), and generate the inputs needed to run those through your fine-tuned model.
After that, you will use [birr](https://github.com/allenai/mise/tree/main/birr) (part of mise) in order to run those batch inference files efficiently via VLLM.
You should expect between 1,400 and 1,800 tokens per second per H100 GPU.
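For capacity planning, the throughput range above can be turned into a rough GPU-hour estimate. This is a back-of-the-envelope sketch only: the page count and the tokens-per-page figure are hypothetical placeholders, not measurements from this project, and it assumes the quoted throughput applies to the tokens being counted.

```python
def estimate_gpu_hours(num_pages, tokens_per_page, tokens_per_second):
    """Rough GPU-hours needed on a single H100 at a given throughput."""
    total_tokens = num_pages * tokens_per_page
    return total_tokens / tokens_per_second / 3600


# Hypothetical workload: 1,000,000 pages at an assumed 1,000 tokens per page.
NUM_PAGES = 1_000_000
TOKENS_PER_PAGE = 1_000  # assumption; measure this on your own data

# Throughput bounds quoted above (tokens per second per H100 GPU).
slow = estimate_gpu_hours(NUM_PAGES, TOKENS_PER_PAGE, 1_400)
fast = estimate_gpu_hours(NUM_PAGES, TOKENS_PER_PAGE, 1_800)

print(f"Estimated GPU-hours: {fast:.0f} to {slow:.0f}")
```

Dividing the result by the number of GPUs available gives an approximate wall-clock time, ignoring startup overhead and any retry rounds.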
Manager for running millions of PDFs through a batch inference pipeline
positional arguments:
workspace The S3 path where work will be done e.g., s3://bucket/prefix/)
options:
-h, --help show this help message and exit
--add_pdfs ADD_PDFS Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
You can use the [qwen2-vl-7b-pdf-weka.yaml](https://github.com/allenai/pdelfin/blob/main/scripts/birr/config/qwen2-vl-7b-pdf-weka.yaml) file here as a template for your birr config.
Once the batch inference job completes, run birrpipeline.py again (without the --add_pdfs argument). This will index all of the
batch inference files and assemble Dolma docs, which you can preview with [dolmaviewer.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/viewer/dolmaviewer.py).
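The two passes over the workspace can be sketched as a pair of commands: one that adds PDFs and one that omits `--add_pdfs` to index results. This is a minimal sketch; the `python -m pdelfin.birrpipeline` invocation is an assumption based on the script's location in the repository, the S3 paths are placeholders, and `build_pipeline_command` is a hypothetical helper, not part of the project.

```python
import shlex


def build_pipeline_command(workspace, add_pdfs=None):
    """Build the argv list for one birrpipeline invocation."""
    # Assumed invocation style; adjust to however you run the script.
    cmd = ["python", "-m", "pdelfin.birrpipeline", workspace]
    if add_pdfs is not None:
        cmd += ["--add_pdfs", add_pdfs]
    return cmd


WORKSPACE = "s3://my-bucket/pdf-workspace/"  # placeholder workspace path

# First pass: register PDFs and generate batch inference inputs.
first_pass = build_pipeline_command(WORKSPACE, add_pdfs="s3://my-bucket/pdfs/*.pdf")

# Second pass, after the birr job finishes: no --add_pdfs, so the
# pipeline only indexes inference output and assembles documents.
second_pass = build_pipeline_command(WORKSPACE)

print(shlex.join(first_pass))
print(shlex.join(second_pass))
```

The printed commands can be reviewed and then run, for example by passing each list to `subprocess.run(cmd, check=True)`.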
Because of the nature of VLMs, you will need to run multiple rounds of inference in order to convert the majority of your files. This is because
sometimes generation will fail due to repetition errors (or if the PDF page was rotated incorrectly, the system will attempt to classify that and rotate it properly on