OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-10-30 01:10:32 +00:00

OCRmyPDF
========

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to
be searched or copy-pasted.

.. code-block:: bash

   ocrmypdf                      # it's a scriptable command line program
      -l eng+fra                 # it supports multiple languages
      --rotate-pages             # it can fix pages that are misrotated
      --deskew                   # it can deskew crooked PDFs!
      --title "My PDF"           # it can change output metadata
      --jobs 4                   # it uses multiple cores by default
      --output-type pdfa         # it produces PDF/A by default
      input_scanned.pdf          # takes PDF input (or images)
      output_searchable.pdf      # produces validated PDF output

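The annotated command above is illustrative: a shell will not accept comments between the arguments of a single command. A minimal sketch of a runnable equivalent, assuming Bash and using the same options and the same placeholder file names, collects the arguments in an array:

.. code-block:: bash

   # Collect the options in a Bash array so that each one can carry a comment.
   ocr_args=(
       -l eng+fra               # OCR languages
       --rotate-pages           # fix misrotated pages
       --deskew                 # straighten crooked scans
       --title "My PDF"         # set output metadata
       --jobs 4                 # number of parallel jobs
       --output-type pdfa       # produce PDF/A
       input_scanned.pdf        # input file (placeholder name)
       output_searchable.pdf    # output file (placeholder name)
   )

   # Run OCRmyPDF only if it is installed and the input exists;
   # otherwise print the command that would have been run.
   if command -v ocrmypdf >/dev/null 2>&1 && [ -f input_scanned.pdf ]; then
       ocrmypdf "${ocr_args[@]}"
   else
       printf 'would run: ocrmypdf'
       printf ' %q' "${ocr_args[@]}"
       printf '\n'
   fi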

Main features
-------------

-  Generates a searchable
   `PDF/A <https://en.wikipedia.org/?title=PDF/A>`_ file from a regular PDF
-  Places OCR text accurately below the image to ease copy / paste
-  Keeps the exact resolution of the original embedded images
-  When possible, inserts OCR information as a "lossless" operation without rendering vector information
-  Keeps file size about the same
-  If requested, deskews and/or cleans the image before performing OCR
-  Validates input and output files
-  Provides debug mode to enable easy verification of the OCR results
-  Processes pages in parallel when more than one CPU core is
   available
-  Uses `Tesseract OCR <https://github.com/tesseract-ocr/tesseract>`_ engine
-  Supports more than `100 languages <https://github.com/tesseract-ocr/tessdata>`_ recognized by Tesseract
-  Battle-tested on thousands of PDFs, with a test suite and continuous integration

For details, please consult the `release notes <RELEASE_NOTES.rst>`_.

Motivation
----------

I searched the web for a free command line tool to OCR PDF files on
Linux/UNIX. I found many, but none of them were really satisfactory.

-  Either they produced PDF files with misplaced text under the image (making copy/paste impossible) 
-  Or they did not handle accents and multilingual characters
-  Or they changed the resolution of the embedded images
-  Or they generated ridiculously large PDF files
-  Or they crashed when trying to OCR some of my PDF files
-  Or they did not produce valid PDF files (even though they were readable with my current PDF reader)
-  On top of that, none of them produced PDF/A files (a format designed for long-term storage)

...so I decided to develop my own tool (using various existing scripts
as an inspiration). 

Installation
------------

Linux, UNIX, and macOS are supported. Windows is not directly supported, but a Docker image is available that runs on Windows.

Users of Debian 9 or later or Ubuntu 16.10 or later may simply run
``apt-get install ocrmypdf``.

For everyone else, `see our documentation <https://ocrmypdf.readthedocs.io/en/latest/installation.html>`_ for installation steps.

Languages
---------

OCRmyPDF uses Tesseract for OCR and relies on its language packs. Linux users
can often find distribution packages that provide language packs:

.. code-block:: bash

   # Debian/Ubuntu: display a list of all Tesseract language packs
   apt-cache search tesseract-ocr

   # Debian/Ubuntu: install a language pack, for example Chinese Simplified
   apt-get install tesseract-ocr-chi-sim
   
You can then pass the ``-l LANG`` argument to OCRmyPDF as a hint about which languages it should look for. Multiple
languages can be requested by joining them with ``+``, for example ``-l eng+fra``.

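Before starting a long OCR job, it can help to confirm that every requested language pack is installed. The following is a minimal sketch, assuming Bash and Tesseract's ``--list-langs`` option; the helper name ``missing_langs`` is hypothetical and not part of OCRmyPDF:

.. code-block:: bash

   # Print each language in a "+"-separated request (e.g. "eng+fra") that does
   # not appear as a whole line in the list of installed languages.
   missing_langs() {
       local requested=$1 installed=$2 lang
       local IFS='+'    # split the request on "+"
       for lang in $requested; do
           grep -qx -- "$lang" <<<"$installed" || printf '%s\n' "$lang"
       done
   }

   if command -v tesseract >/dev/null 2>&1; then
       # Capture both streams in case the list is printed on stderr.
       installed=$(tesseract --list-langs 2>&1)
       missing=$(missing_langs "eng+fra" "$installed")
       if [ -n "$missing" ]; then
           printf 'missing language pack: %s\n' $missing
       fi
   fi

Matching whole lines means the header line that Tesseract prints before the language codes is ignored.
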
Documentation and support
-------------------------

Once ocrmypdf is installed, the built-in help, which explains the command syntax and options, can be accessed via:

.. code-block:: bash

   ocrmypdf --help

Our `documentation is served on Read the Docs <https://ocrmypdf.readthedocs.io/en/latest/index.html>`_.

If you detect an issue, please:

-  Check whether your issue is already known
-  If no problem report exists on GitHub, please create one here:
   https://github.com/jbarlow83/OCRmyPDF/issues
-  Describe your problem thoroughly
-  Attach the console output of a run in debug mode
   (``-v 1`` option)
-  If possible, provide your input PDF file as well as the content of the
   temporary folder (using a file sharing service like Dropbox)

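One way to collect the debug output for a report is to save it to a file while still watching it on the console. This is a minimal sketch, assuming Bash; the helper ``run_logged`` and the file names are hypothetical:

.. code-block:: bash

   # Run a command, showing its output on the console while also saving
   # both stdout and stderr to the given log file.
   run_logged() {
       local log=$1
       shift
       "$@" 2>&1 | tee "$log"
   }

   # Example: capture a debug-mode run (skipped if ocrmypdf or the input is absent).
   if command -v ocrmypdf >/dev/null 2>&1 && [ -f input_scanned.pdf ]; then
       run_logged ocrmypdf-debug.log ocrmypdf -v 1 input_scanned.pdf output_searchable.pdf
   fi

The resulting log file can then be attached to the issue.
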
Press & Media
-------------

-  `c't 1-2014, page 59 <http://heise.de/-2279695>`_:
   Detailed presentation of OCRmyPDF v1.0 in the leading German IT
   magazine c't
-  `heise Open Source, 09/2014: Texterkennung mit
   OCRmyPDF <http://heise.de/-2356670>`_

Disclaimer
----------

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied.