pdelfin

Toolkit for truly understanding PDF documents in the wild.

Things supported:

  • A prompting strategy for high-quality natural text parsing using GPT-4o (silver_data)
  • An eval toolkit for comparing different pipeline versions
  • Basic filtering by language and SEO spam removal
  • Finetuning code for Qwen2-VL (and soon other VLMs)
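
As a rough illustration of what comparing pipeline versions can look like, the sketch below scores each version's page text against a reference using token-level similarity from Python's standard `difflib`. This is not the eval toolkit shipped in this repository; the function names (`similarity`, `compare_versions`) and the choice of metric are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(reference: str, candidate: str) -> float:
    """Token-level similarity in [0, 1] between two page texts."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    # autojunk=False keeps frequent tokens from being ignored on long pages.
    return SequenceMatcher(None, ref_tokens, cand_tokens, autojunk=False).ratio()


def compare_versions(reference_pages, outputs_by_version):
    """Return the mean per-page similarity for each pipeline version.

    reference_pages: list of reference texts, one per page.
    outputs_by_version: dict mapping a version label (hypothetical, e.g. "v1")
    to a list of page texts in the same order as reference_pages.
    """
    scores = {}
    for version, pages in outputs_by_version.items():
        per_page = [similarity(r, c) for r, c in zip(reference_pages, pages)]
        scores[version] = sum(per_page) / len(per_page) if per_page else 0.0
    return scores


if __name__ == "__main__":
    reference = ["a b c d"]
    outputs = {"v1": ["a b c d"], "v2": ["a b x d"]}
    print(compare_versions(reference, outputs))
```

A single averaged score hides where versions differ, so in practice it is usually worth keeping the per-page scores as well and inspecting the lowest ones by hand.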

Note: Font installation

You will probably need to install some fonts on your computer so that any PDFs you render display correctly.

sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
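
To confirm the fonts are visible after installation, a small check along the following lines may help. It is a minimal sketch, assuming fontconfig's `fc-list` is available on the system; the family names in `REQUIRED_FAMILIES` are an illustrative selection, not a list defined by this project, and the helper names are hypothetical.

```python
import shutil
import subprocess

# Illustrative families provided by some of the packages above (an assumption,
# not an official list): Arial and Times New Roman from the MS core fonts,
# Carlito and Caladea from the fonts-crosextra packages.
REQUIRED_FAMILIES = ["Arial", "Times New Roman", "Carlito", "Caladea"]


def parse_fc_list(output: str) -> set[str]:
    """Collect lowercase family names from `fc-list : family` output."""
    families = set()
    for line in output.splitlines():
        # A single line may hold several comma-separated family names.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                families.add(name)
    return families


def missing_families(installed: set[str], required: list[str]) -> list[str]:
    """Return the required families that are not in the installed set."""
    return [f for f in required if f.lower() not in installed]


def check_fonts() -> list[str]:
    """Query fontconfig and report which required families are missing."""
    if shutil.which("fc-list") is None:
        raise RuntimeError("fc-list not found; install fontconfig first")
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=True
    )
    return missing_families(parse_fc_list(result.stdout), REQUIRED_FAMILIES)


if __name__ == "__main__":
    missing = check_fonts()
    print("Missing font families:", missing if missing else "none")
```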

TODOs for future versions

  • Specify a more precise format for equations (currently they are just "LaTeX")
  • Ask model to predict footnotes in a structured format separately
  • Add training data for complex tables
  • Add more training augmentations to improve performance
  • Fix pages consisting entirely of references, which sometimes render as empty text

Description

Toolkit for linearizing PDFs for LLM datasets/training. Licensed under Apache-2.0.