yujunjun/olmocr

mirror of https://github.com/allenai/olmocr.git synced 2025-11-19 03:48:03 +00:00

Jake Poznanski a1a4798ce7 Some crazy idea I had to simplify futures and memory limits

2024-10-23 21:51:37 +00:00

.dockerignore: Initial commit (2024-09-17 07:53:43 -07:00)
.gitignore: Adding vllm profile script for reference (2024-10-22 20:00:34 +00:00)
.readthedocs.yaml: Initial commit (2024-09-17 07:53:43 -07:00)
CHANGELOG.md: Initial commit (2024-09-17 07:53:43 -07:00)
gantry-requirements.txt: Fix for unicode errors in big datasets for the future (2024-10-07 17:01:59 +00:00)
LICENSE: Initial commit (2024-09-17 07:53:43 -07:00)
Makefile: Running personalize script on template (2024-09-17 15:06:59 +00:00)
pyproject.toml: Prepping to train (2024-10-16 13:18:24 -07:00)
README.md: Small fixes (2024-10-21 16:45:06 +00:00)
RELEASE_PROCESS.md: Running personalize script on template (2024-09-17 15:06:59 +00:00)

README.md

pdelfin

Toolkit for truly understanding PDF documents in the wild.

Things supported:

A prompting strategy for high-quality natural-text parsing with ChatGPT 4o (silver_data)
An eval toolkit for comparing different pipeline versions
Basic filtering by language and SEO spam removal
Finetuning code for Qwen2-VL (and soon other VLMs)

Note: Font installation

You will probably need to install some fonts on your computer so that any PDFs you render come out looking right.

sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
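After installing, it can help to confirm that the fonts are actually visible to the system. The following is a minimal sketch, not part of pdelfin: it assumes fontconfig's `fc-list` tool is available, and the family names in `WANTED` are an illustrative guess at what the packages above provide. The helper names `missing_fonts` and `check_system_fonts` are hypothetical.

```python
import shutil
import subprocess

# Illustrative family names expected from the packages above
# (an assumption for this sketch, not a list taken from the project).
WANTED = ["Arial", "Times New Roman", "Caladea", "Carlito"]


def missing_fonts(fc_list_output, wanted=WANTED):
    """Return the wanted families absent from `fc-list : family` output."""
    installed = set()
    for line in fc_list_output.splitlines():
        # One line may carry several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                installed.add(name)
    return [family for family in wanted if family.lower() not in installed]


def check_system_fonts(wanted=WANTED):
    """Query fontconfig and report which wanted families are not installed."""
    if shutil.which("fc-list") is None:
        raise RuntimeError("fc-list not found; install fontconfig first")
    result = subprocess.run(
        ["fc-list", ":", "family"],
        capture_output=True,
        text=True,
        check=True,
    )
    return missing_fonts(result.stdout, wanted)
```

Calling `check_system_fonts()` returns a list of the families that are still missing; an empty list means every name in `WANTED` was found. The parsing is kept in a separate function so it can be exercised without fontconfig installed.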

TODOs for future versions

Specify a more precise format for equations (currently they are just "LaTeX")
Ask model to predict footnotes in a structured format separately
Add training data for complex tables
More training augmentations to improve performance
Fix pages that consist entirely of references sometimes rendering as empty text