# pdelfin

Toolkit for truly understanding PDF documents in the wild.

<img src="https://github.com/user-attachments/assets/984a645c-096d-4b9a-9c5b-44063004cd8c" alt="image" width="300"/>

Supported features:
 - A prompting strategy for high-quality natural text parsing using ChatGPT 4o (silver_data)
 - An eval toolkit for comparing different pipeline versions
 - Basic filtering by language, plus SEO spam removal
 - Finetuning code for Qwen2-VL (other VLMs coming soon)

### Note: Font installation

You will probably need to install some fonts on your machine so that any PDFs you render look correct.

```
sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```
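
To check that the fonts are visible after installation, something like the following sketch may help. It is not part of pdelfin. It assumes fontconfig's `fc-list` is on your `PATH`, and the family names in `EXPECTED_FAMILIES` are illustrative guesses at what the packages above provide, so adjust them for your system. The helper names `missing_families` and `check_installed_fonts` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Sequence

# Illustrative family names (an assumption, not taken from pdelfin):
# edit this list to match the fonts you actually care about.
EXPECTED_FAMILIES = ["Arial", "Times New Roman", "Carlito", "Caladea"]


def missing_families(fc_list_output: str,
                     expected: Sequence[str] = EXPECTED_FAMILIES) -> List[str]:
    """Return the expected families that do not appear in `fc-list` output."""
    haystack = fc_list_output.lower()
    return [name for name in expected if name.lower() not in haystack]


def check_installed_fonts() -> List[str]:
    """Run `fc-list : family` and report which expected families are missing."""
    if shutil.which("fc-list") is None:
        raise RuntimeError("fc-list not found; install fontconfig first")
    result = subprocess.run(
        ["fc-list", ":", "family"],
        capture_output=True,
        text=True,
        check=True,
    )
    return missing_families(result.stdout)
```

Calling `check_installed_fonts()` returns a list of family names that were not found. An empty list means every name in `EXPECTED_FAMILIES` matched. The match is a case-insensitive substring test, so it is a quick sanity check rather than an exact font lookup.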


### TODOs for future versions
 - Specify a more precise output format for equations (currently just "LaTeX")
 - Ask the model to predict footnotes separately, in a structured format
 - Add training data for complex tables
 - Add more training augmentations to improve performance
 - Fix pages that consist entirely of references sometimes rendering as empty text