mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
OCRmyPDF
Script that performs optical character recognition (OCR) on PDF files containing only images
To get the script usage, call: ./OCRmyPDF.sh -h
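As a minimal sketch of the call above, the helper below prints the usage text only if the script can actually be found and executed. The function name `show_usage` is hypothetical, and the default path assumes that the script is named OCRmyPDF.sh (after the project) and that the current directory is the checkout.

```shell
#!/bin/sh
# Hypothetical helper: print the script's usage text via its -h option.
# Assumption: the script lives at ./OCRmyPDF.sh unless a path is given.
show_usage() {
    script="${1:-./OCRmyPDF.sh}"
    # Fail with a message instead of a cryptic shell error
    # when the script is missing or not executable.
    if [ ! -x "$script" ]; then
        echo "script not found or not executable: $script" >&2
        return 1
    fi
    "$script" -h
}

# Example call (uncomment inside a checkout):
# show_usage
```

Pass a different path as the first argument if the script is stored elsewhere.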
Features
- Generate a searchable PDF/A file from a PDF file containing only images
- Keep the exact resolution of the original embedded images
- If requested, deskew and/or clean the image before performing OCR
- Validate the generated file against the PDF/A specification using jhove
- Provide a debug mode that makes it easy to verify the OCR results
Motivation
I searched the web for a free tool to OCR PDF files on Linux/Unix. I found many, but none of them was satisfactory:
- Either they produced PDF files with misplaced text below the image (making copy/paste impossible)
- Or they changed the resolution of the embedded images
- Or they generated PDF files of a ridiculously large size
- Or they crashed when trying to OCR some of my PDF files
- Or they did not produce valid PDF files (even though they were readable with my current PDF reader)
On top of that, none of them produced PDF/A files (a format dedicated to long-term storage and archiving)
... so I decided to develop my own tool (using various existing scripts as an inspiration)
Install
TODO
Install jhove:
- Download jhove from here: http://sourceforge.net/projects/jhove/files/jhove/
- After extracting the JHOVE files to some directory "jhove", edit the file "jhove/conf/jhove.conf" and change something in "something" to the actual directory (ending in "/jhove").
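Before editing the configuration, it can help to confirm that the extraction produced the expected layout. The sketch below only checks that "conf/jhove.conf" exists under a given directory. The function name `check_jhove_conf` is hypothetical, and the check does not inspect or modify the file's contents.

```shell
#!/bin/sh
# Hypothetical helper: verify that an extracted JHOVE directory
# contains the configuration file that has to be edited.
check_jhove_conf() {
    jhove_dir="$1"
    conf="$jhove_dir/conf/jhove.conf"
    if [ -f "$conf" ]; then
        echo "found: $conf"
    else
        echo "missing: $conf" >&2
        return 1
    fi
}

# Example call (the path is an assumption; use your own extraction directory):
# check_jhove_conf "$HOME/jhove"
```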