Mirror of https://github.com/allenai/olmocr.git (synced 2025-10-31 01:55:06 +00:00)

pdelfin
Toolkit for truly understanding PDF documents in the wild.
Supported features:
- A prompting strategy for high-quality natural-text parsing with ChatGPT 4o (silver_data)
- An eval toolkit for comparing different pipeline versions
- Basic language filtering and SEO spam removal
- Fine-tuning code for Qwen2-VL (and soon other VLMs)
Note: Font installation
You will probably need to install some fonts on your machine so that the PDFs you render look correct.
sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
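
After installing, it can help to confirm that the fonts are visible to the system. The sketch below is not part of pdelfin; it assumes fontconfig's `fc-list` is available and that the packages above provide the families named in `EXPECTED_FAMILIES` (an illustrative, non-exhaustive list). The helper names `missing_families` and `check_fonts` are hypothetical.

```python
import shutil
import subprocess

# Families assumed to come from the packages above (illustrative, not exhaustive).
EXPECTED_FAMILIES = ["Arial", "Times New Roman", "Caladea", "Carlito"]


def missing_families(fc_list_output, expected=EXPECTED_FAMILIES):
    """Return the expected families absent from `fc-list : family` output."""
    installed = set()
    for line in fc_list_output.splitlines():
        # One line may carry several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                installed.add(name)
    return [family for family in expected if family.lower() not in installed]


def check_fonts():
    """Query fontconfig; return missing families, or None if fc-list is absent."""
    if shutil.which("fc-list") is None:
        return None
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True
    )
    return missing_families(result.stdout)


if __name__ == "__main__":
    missing = check_fonts()
    if missing is None:
        print("fc-list not found; cannot check installed fonts")
    elif missing:
        print("Missing font families:", ", ".join(missing))
    else:
        print("All expected font families are installed")
```

If any families are reported missing, re-running the install command above and then refreshing the font cache with `fc-cache -f` is a reasonable first step.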