Mirror of https://github.com/allenai/olmocr.git, synced 2025-08-15 04:11:59 +00:00
pdelfin
Toolkit for truly understanding PDF documents in the wild.
Things supported:
- A prompting strategy for high-quality natural text parsing using ChatGPT 4o (silver_data)
- An eval toolkit for comparing different pipeline versions
- Basic filtering by language and SEO spam removal
- Finetuning code for Qwen2-VL (and soon other VLMs)
Note: Font installation
You will probably need to install some fonts on your machine so that any PDFs you render look right.
sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
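To confirm the fonts are visible after installation, a small check along these lines may help. It is a minimal sketch, not part of this toolkit: it assumes fontconfig's `fc-list` is on your PATH, and the family names in `REQUIRED_FAMILIES` are an illustrative guess at what the packages above provide, so adjust them to the fonts your documents actually use. The function names are hypothetical.

```python
import shutil
import subprocess

# Assumed font families, chosen for illustration only; edit to match your PDFs.
REQUIRED_FAMILIES = ["Arial", "Times New Roman", "Caladea", "Carlito"]


def missing_families(fc_list_output, required=REQUIRED_FAMILIES):
    """Return the required families absent from `fc-list : family` output."""
    installed = set()
    for line in fc_list_output.splitlines():
        # One line may carry several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                installed.add(name)
    # Compare case-insensitively and keep the caller's ordering.
    return [family for family in required if family.lower() not in installed]


def check_installed_fonts():
    """Query fontconfig; return missing families, or None if fc-list is absent."""
    if shutil.which("fc-list") is None:
        return None
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=True
    )
    return missing_families(result.stdout)


if __name__ == "__main__":
    missing = check_installed_fonts()
    if missing is None:
        print("fc-list not found; install fontconfig to run this check")
    elif missing:
        print("Missing font families:", ", ".join(missing))
    else:
        print("All expected font families are installed")
```

The parsing is kept separate from the `fc-list` call so it can be exercised on captured output without touching the system.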
TODOs for future versions
- Equations could be required to follow a more specific format (currently they are just "LaTeX")
- Ask model to predict footnotes in a structured format separately
- Add training data for complex tables
- More training augmentations to improve performance
- Fix pages consisting entirely of references, which sometimes render as empty text