.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0

============
Introduction
============

OCRmyPDF is an application and library that adds text "layers" to images
in PDFs, making scanned image PDFs searchable. It uses OCR to guess what text
is contained in images. It is written in Python. OCRmyPDF supports plugins
that allow customization of its processing steps, and is very tolerant of
PDFs that contain scanned images and "born digital" content that needs no
text recognition.


About OCR
=========

`Optical character
recognition <https://en.wikipedia.org/wiki/Optical_character_recognition>`__
is technology that converts images of typed or handwritten text, such as
in a scanned document, to computer text that can be selected, searched and copied.

OCRmyPDF uses
`Tesseract <https://github.com/tesseract-ocr/tesseract>`__, the best
available open source OCR engine, to perform OCR.


.. _raster-vector:

About PDFs
==========

PDFs are page description files that attempt to preserve a layout
exactly. They contain `vector
graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
that can contain raster objects such as scanned images. Because PDFs can
contain multiple pages (unlike many image formats) and can contain fonts
and text, they are a good format for exchanging scanned documents.

|image|

A PDF page might contain multiple images, even if it only appears to
have one image. For example, some scanners or scanning software will
segment pages into monochromatic text and color regions to improve the
compression ratio and appearance of the page.

Rasterizing a PDF is the process of generating corresponding raster images.
OCR engines like Tesseract work with images, not scalable vector graphics
or mixed raster-vector-text graphics such as PDF.

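
To make rasterization concrete, the following minimal sketch estimates how
many pixels a page occupies once it is rendered at a given resolution. It
relies on the PDF convention that one default user space unit is 1/72 inch.
The function name ``raster_size`` and its parameters are illustrative
assumptions, not part of OCRmyPDF's API.

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch


def raster_size(width_pt, height_pt, dpi, user_unit=1.0):
    """Return (width_px, height_px) for a page rendered at ``dpi``.

    ``user_unit`` is a multiplier on the default 1/72 inch unit;
    1.0 means the page uses plain PDF points.
    """
    width_px = round(width_pt * user_unit * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * user_unit * dpi / POINTS_PER_INCH)
    return width_px, height_px


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
print(raster_size(612, 792, dpi=300))  # (2550, 3300)
```

Doubling the resolution quadruples the pixel count, which is why choosing a
resolution per page matters for both OCR quality and processing time.
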

About PDF/A
===========

`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`__ is an ISO-standardized
subset of the full PDF specification that is designed for archiving (the
'A' stands for Archive). PDF/A differs from PDF primarily by omitting
features that would make it difficult to read the file in the future,
such as embedded JavaScript, video, audio and references to external
fonts. All fonts and resources needed to interpret the PDF must be
contained within it. Because PDF/A disables JavaScript and other types
of embedded content, it is probably more secure.

There are various conformance levels and versions, such as "PDF/A-2b".

Generally speaking, the best format for scanned documents is PDF/A. Some
governments and jurisdictions, US Courts in particular, `mandate the use
of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`__ for scanned
documents.

Since most people who scan documents are interested in reading them
indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.

PDF/A has a few drawbacks. Some PDF viewers include an alert that the
file is a PDF/A, which may confuse some users. PDF/A also tends to produce
larger files than PDF, because it embeds certain resources even if they
are commonly available. PDF/A files can be digitally signed, but may not
be encrypted, to ensure they can be read in the future. Fortunately,
converting from PDF/A to a regular PDF is trivial, and any PDF viewer
can view PDF/A.


What OCRmyPDF does
==================

OCRmyPDF analyzes each page of a PDF to determine the colorspace and
resolution (DPI) needed to capture all of the information on that page
without losing content. It uses
`Ghostscript <http://ghostscript.com/>`__ to rasterize the page, and
then performs OCR on the rasterized image to create an OCR "layer".
The layer is then grafted back onto the original PDF.

While one can use a program like Ghostscript or ImageMagick to get an
image and put the image through Tesseract, that actually creates a new
PDF and many details may be lost. OCRmyPDF can produce a minimally
changed PDF as output.

OCRmyPDF also provides some image processing options, such as deskew, which
improve the appearance of files and quality of OCR. When these are used,
the OCR layer is grafted onto the processed image instead.

By default, OCRmyPDF produces archival PDFs (PDF/A), which are a
stricter subset of PDF features designed for long-term archives. If
regular PDFs are desired, this can be disabled with
``--output-type pdf``.

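
As a rough illustration of the options mentioned above, the sketch below
assembles an ``ocrmypdf`` command line from Python. The helper
``build_ocrmypdf_command`` and the file names are hypothetical and exist only
for this example; the ``--output-type`` and ``--deskew`` options are the ones
described in this section.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, *, output_type="pdfa",
                           deskew=False):
    """Assemble an argument list for the ocrmypdf command-line tool."""
    cmd = ["ocrmypdf", "--output-type", output_type]
    if deskew:
        # Straighten crooked scans before OCR.
        cmd.append("--deskew")
    cmd += [input_pdf, output_pdf]
    return cmd


# Request a regular (non-archival) PDF and enable deskewing.
cmd = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf",
                             output_type="pdf", deskew=True)
print(shlex.join(cmd))
# prints: ocrmypdf --output-type pdf --deskew scan.pdf scan_ocr.pdf

# To actually run it (requires ocrmypdf to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the example
usable even on a machine where OCRmyPDF is not installed.
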

Why you shouldn't do this manually
==================================

A PDF is similar to an HTML file, in that it contains document structure
along with images. Sometimes a PDF does nothing more than present a full
page image, but often there is additional content that would be lost.

A manual process could work like either of these:

1. Rasterize each page as an image, OCR the images, and combine the
   output into a PDF. This preserves the layout of each page, but
   resamples all images (possibly losing quality, increasing file size,
   introducing compression artifacts, etc.).
2. Extract each image, OCR, and combine the output into a PDF. This
   loses the context in which images are used in the PDF, meaning that
   cropping, rotation and scaling of pages may be lost. Some scanned
   PDFs use multiple images segmented into black and white, grayscale
   and color regions, with stencil masks to prevent overlap, as this can
   enhance the appearance of a file while reducing file size. Clearly,
   reassembling these images will not be easy. This also loses any text or
   vector art on any pages in a PDF with both scanned and pure digital
   content.

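
Purely to illustrate the first manual approach (not as a recommendation), the
sketch below builds the commands such a workflow might run: one Ghostscript
call that rasterizes every page to PNG, then one Tesseract call per page. The
helper ``manual_ocr_commands``, the output file naming, and the default
resolution and language are assumptions made for this example. The resulting
single-page PDFs would still need to be merged, and anything on the page that
was not an image has been flattened into pixels by then.

```python
def manual_ocr_commands(input_pdf, page_count, dpi=300, language="eng"):
    """Return the commands for rasterizing and OCRing each page separately."""
    commands = [[
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",             # 24-bit RGB PNG output
        f"-r{dpi}",                    # rasterization resolution
        "-sOutputFile=page-%03d.png",  # Ghostscript fills in the page number
        input_pdf,
    ]]
    for number in range(1, page_count + 1):
        base = f"page-{number:03d}"
        # The trailing "pdf" asks Tesseract to write <base>.pdf containing
        # the image plus an invisible text layer.
        commands.append(
            ["tesseract", f"{base}.png", base, "-l", language, "pdf"]
        )
    return commands


for command in manual_ocr_commands("input.pdf", page_count=2):
    print(" ".join(command))
```
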

In the case of a PDF that is nothing other than a container of images
(no rotation, scaling, cropping, one image per page), the second
approach can be lossless.

OCRmyPDF uses several strategies depending on input options and the
input PDF itself, but generally speaking it rasterizes a page for OCR
and then grafts the OCR back onto the original. As such it can handle
complex PDFs and still preserve their contents as much as possible.

OCRmyPDF also supports many edge cases that have cropped up over
several years of development. We support PDF features like images inside
of Form XObjects, and pages with UserUnit scaling. We support rare image
formats like non-monochrome 1-bit images. We warn about files you may
not want to OCR. Thanks to pikepdf and QPDF, we auto-repair PDFs that are
damaged. (Not that you need to know what any of these are! You should be
able to throw any PDF at it.)


Limitations
===========

OCRmyPDF is limited by the Tesseract OCR engine. As such it experiences
these limitations, as do any other programs that rely on Tesseract:

- The OCR is not as accurate as commercial OCR solutions.
- It is not capable of recognizing handwriting.
- It may find gibberish and report this as OCR output.
- If a document contains languages outside of those given in the
  ``-l LANG`` arguments, results may be poor.
- It is not always good at analyzing the natural reading order of
  documents. For example, it may fail to recognize that a document
  contains two columns, and may try to join text across columns.
- Poor quality scans may produce poor quality OCR. Garbage in, garbage
  out.
- It does not expose information about what font family text belongs
  to.

OCRmyPDF is also limited by the PDF specification:

- PDF encodes the position of text glyphs but does not encode document
  structure. There is no markup that divides a document into sections,
  paragraphs, sentences, or even words (since blank spaces are not
  represented). As such all elements of document structure including
  the spaces between words must be derived heuristically. Some PDF
  viewers do a better job of this than others.
- Because some popular open source PDF viewers have a particularly hard
  time with spaces between words, OCRmyPDF appends a space to each text
  element as a workaround (when using ``--pdf-renderer hocr``). While
  this mixes document structure with graphical information that ideally
  should be left to the PDF viewer to interpret, it improves
  compatibility with some viewers and does not cause problems for
  better ones.

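
To show what deriving spaces heuristically can look like, here is a minimal
sketch of one possible approach: insert a space wherever the horizontal gap
between two neighboring glyphs exceeds a fraction of the average glyph width.
The ``Glyph`` class, the ``join_glyphs`` function and the ``gap_ratio`` value
are assumptions made for illustration; real PDF viewers use their own, more
elaborate rules.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float  # left edge of the glyph on the page
    x1: float  # right edge of the glyph on the page


def join_glyphs(glyphs, gap_ratio=0.3):
    """Rebuild one line of text, inserting spaces where gaps are wide."""
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x0)
    mean_width = sum(g.x1 - g.x0 for g in ordered) / len(ordered)
    threshold = gap_ratio * mean_width
    pieces = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        # A gap wider than the threshold is treated as a word boundary.
        if cur.x0 - prev.x1 > threshold:
            pieces.append(" ")
        pieces.append(cur.char)
    return "".join(pieces)


line = [Glyph("t", 0, 5), Glyph("o", 6, 11), Glyph("b", 16, 21), Glyph("e", 22, 27)]
print(join_glyphs(line))  # to be
```

The result depends entirely on the chosen threshold, which is why different
viewers can disagree about where words begin and end in the same file.
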

Ghostscript also imposes some limitations:

- PDFs containing JBIG2-encoded content will be converted to CCITT
  Group4 encoding, which has lower compression ratios, if Ghostscript
  PDF/A is enabled.
- PDFs containing JPEG 2000-encoded content will be converted to JPEG
  encoding, which may introduce compression artifacts, if Ghostscript
  PDF/A is enabled.
- Ghostscript may transcode grayscale and color images, either lossy to
  lossless or lossless to lossy, based on an internal algorithm. This
  behavior can be suppressed by setting ``--pdfa-image-compression`` to
  ``jpeg`` or ``lossless`` to set all images to one type or the other.
  Ghostscript has no option to maintain the input image's format.
  (Modern Ghostscript can copy JPEG images without transcoding them.)
- Ghostscript's PDF/A conversion removes any XMP metadata that is not
  one of the standard XMP metadata namespaces for PDFs. In particular,
  PRISM Metadata is removed.
- Ghostscript's PDF/A conversion seems to remove or deactivate
  hyperlinks and other active content.

You can use ``--output-type pdf`` to disable PDF/A conversion and produce
a standard, non-archival PDF.

Regarding OCRmyPDF itself:

- PDFs that use transparency are not currently represented in the test
  suite.


Similar programs
================

To the author's knowledge, OCRmyPDF is the most feature-rich and
thoroughly tested command line OCR PDF conversion tool. If it does not
meet your needs, contributions and suggestions are welcome.
Alternatively, consider one of these similar open source programs:

- pdf2pdfocr
- pdfsandwich

Ghostscript recently added three "pdfocr" output devices. They work by
rasterizing all content and converting all pages to a single colour space.


Web front-ends
==============

The Docker image ``ocrmypdf`` provides a web service front-end
that allows files to be submitted over HTTP and the results "downloaded".
This is an HTTP server intended to simplify web services deployments; it
is not intended to be deployed on the public internet and has no real
security measures to speak of.

In addition, the following third-party integrations are available:

- `Nextcloud OCR <https://github.com/janis91/ocr>`__ is a free software
  plugin for the Nextcloud private cloud software.

OCRmyPDF is not designed to be secure against malware-bearing PDFs (see
`Using OCRmyPDF online <ocr-service>`__). Users should ensure they
comply with OCRmyPDF's licenses and the licenses of all dependencies. In
particular, OCRmyPDF requires Ghostscript, which is licensed under
AGPLv3.

.. |image| image:: images/bitmap_vs_svg.svg