% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
# Introduction
OCRmyPDF is a Python application and library that adds text "layers" to images in
PDFs, making scanned image PDFs searchable. It uses OCR to guess the text
contained in images. OCRmyPDF also supports plugins
that enable customization of its processing steps, and it is highly tolerant
of PDFs containing scanned images and "born digital" content that doesn't
require text recognition.
## About OCR
[Optical character
recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)
is a technology that converts images of typed or handwritten text, such as
in a scanned document, into computer text that can be selected, searched and copied.

OCRmyPDF uses
[Tesseract](https://github.com/tesseract-ocr/tesseract), a widely
available open source OCR engine, to perform OCR.

(raster-vector)=
## About PDFs
PDFs are page description files that attempt to preserve a layout
exactly. They contain [vector
graphics](http://vector-conversions.com/vectorizing/raster_vs_vector.html)
that can contain raster objects, such as scanned images. Because PDFs can
contain multiple pages (unlike many image formats) and can contain fonts
and text, they are a suitable format for exchanging scanned documents.

:::{image} images/bitmap_vs_svg.svg
:::
A PDF page may contain multiple images, even if it appears to have only
one image. Some scanners or scanning software may segment pages into
monochromatic text and color regions, for example, to enhance the compression
ratio and appearance of the page.

Rasterizing a PDF is the process of rendering its pages as raster images.
OCR engines like Tesseract work with images, not scalable vector graphics
or mixed raster-vector-text graphics such as PDF.
## About PDF/A
[PDF/A](https://en.wikipedia.org/wiki/PDF/A) is an ISO-standardized
subset of the full PDF specification that is designed for archiving (the
'A' stands for Archive). PDF/A differs from PDF primarily by omitting
features that could complicate future file readability,
such as embedded JavaScript, video, audio and references to external
fonts. All fonts and resources needed to interpret the PDF must be
contained within it. Because PDF/A disables JavaScript and other types
of embedded content, it is likely more secure.

There are various conformance levels and versions, such as "PDF/A-2b".

In general, the preferred format for scanned documents is PDF/A. Some
governments and jurisdictions, US Courts in particular, [mandate the use
of PDF/A](https://pdfblog.com/2012/02/13/what-is-pdfa/) for scanned
documents.

Since most individuals scanning documents aim for long-term readability,
OCRmyPDF defaults to generating PDF/A-2b.

PDF/A does have a few drawbacks. Some PDF viewers display an alert
indicating that the file is in PDF/A format, which may confuse some users.
Additionally, it tends to result in larger files than standard PDFs because
it embeds certain resources, even if they are widely available. PDF/A
files can be digitally signed but, to ensure future readability, may not
be encrypted. Fortunately, converting from PDF/A to a regular PDF is
straightforward, and any PDF viewer can handle PDF/A files.
## What OCRmyPDF does
OCRmyPDF analyzes each page of a PDF to determine the required colorspace
and resolution (DPI) for capturing all the information on that page without
losing content. It uses
[Ghostscript](http://ghostscript.com/) to rasterize each page and subsequently
performs OCR on the rasterized image to generate an OCR "layer." This layer
is then integrated back into the original PDF.

While it is possible to use a program like Ghostscript or ImageMagick to
obtain an image and then run that image through Tesseract OCR, this process
actually generates a new PDF, potentially resulting in the loss of various
details (such as the document's metadata). In contrast, OCRmyPDF can produce
a minimally altered PDF as the output.

OCRmyPDF also offers several image processing options, such as deskew, that
improve the visual quality of files and the accuracy of OCR. When these
options are used, the OCR layer is integrated into the processed image.

By default, OCRmyPDF generates archival PDFs in the PDF/A format, which is
a more rigid subset of PDF features designed for long-term archives. If you
prefer regular PDFs, you can disable this feature using the
`--output-type pdf` option.
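As a minimal sketch of how the options above combine on the command line, the following Python helper assembles an `ocrmypdf` invocation. The helper name `build_ocrmypdf_command` and the file names are hypothetical and chosen only for illustration; `--deskew` and `--output-type` are the options discussed in this section. The helper only builds the argument list. Actually running it, for example with `subprocess.run`, assumes OCRmyPDF is installed and on the `PATH`.

```python
def build_ocrmypdf_command(input_pdf, output_pdf, *, deskew=False, output_type=None):
    """Return an argument list for the ocrmypdf CLI.

    Hypothetical helper for illustration; it is not part of OCRmyPDF.
    """
    command = ["ocrmypdf"]
    if deskew:
        # Straighten skewed pages before OCR.
        command.append("--deskew")
    if output_type is not None:
        # "pdf" disables the default PDF/A conversion.
        command += ["--output-type", output_type]
    # Positional arguments: input file first, then output file.
    command += [input_pdf, output_pdf]
    return command


if __name__ == "__main__":
    # File names are placeholders.
    print(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", deskew=True, output_type="pdf"))
    # prints ['ocrmypdf', '--deskew', '--output-type', 'pdf', 'scan.pdf', 'scan_ocr.pdf']
```

Leaving `output_type` unset keeps the default behavior described above, so the output is PDF/A.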
## Why you shouldn't do this manually
A PDF is similar to an HTML file, in that it contains document structure
along with images. While some PDFs may solely display a full-page image,
they often contain additional content that would be lost if not preserved.

A manual process could take one of these approaches:

1. Rasterize each page as an image, perform OCR on the images, and then merge the
   output into a PDF. This method preserves the layout of each page, but
   resamples all images, potentially leading to quality loss, increased file size,
   and the introduction of compression artifacts, among other issues.
2. Extract each image, OCR it, and combine the output into a PDF. This approach
   loses the context in which images are used in the PDF, potentially resulting
   in loss of information related to scaling and position of images. Some scanned
   PDFs contain multiple images segmented into black and white, grayscale
   and color regions, with stencil masks to prevent overlap, as this can
   enhance the appearance of a file while reducing file size.
   Reassembling these images can be challenging, and risks losing vector art
   or text that is not part of an image.

In cases where a PDF solely serves as a container for images without any
rotation, scaling, or cropping, the second approach can be lossless.

OCRmyPDF uses various strategies depending on input options and the input PDF
itself. Generally, it rasterizes a page for OCR and then integrates the OCR
data back into the original PDF. This approach allows it to handle complex
PDFs and preserve their content as much as possible.

Furthermore, OCRmyPDF supports a wide range of edge cases that have emerged
during several years of development. It accommodates PDF features like
images within Form XObjects and pages with UserUnit scaling. It also
supports less common image formats like non-monochrome 1-bit images and
provides warnings about files you may not want to OCR. Thanks to tools
like pikepdf and QPDF, it can auto-repair damaged PDFs. You don't need to
understand the intricacies of these issues; you should be able to use
OCRmyPDF with any PDF file, and expect reasonable results.
## Limitations
OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine.
These limitations are inherent to any software relying on Tesseract:

- The OCR accuracy may not match that of commercial OCR solutions.
- It is incapable of recognizing handwriting.
- It may detect gibberish and report it as OCR output.
- Results may be subpar when a document contains languages not specified
  in the `-l LANG` argument.
- Tesseract may struggle to analyze the natural reading order of documents.
  For instance, it might fail to recognize two columns in a document and
  attempt to join text across columns.
- Poor quality scans can result in subpar OCR quality. In other words, the
  quality of the OCR output depends on the quality of the input.
- Tesseract does not provide information about the font family to which text
  belongs.
- Tesseract does not divide text into paragraphs or headings. It only provides
  the text and its bounding box. As such, the generated PDF does not
  contain any information about the document's structure.

Ghostscript also imposes some limitations:

- PDFs containing JPEG 2000-encoded content may be converted to JPEG
  encoding, which may introduce compression artifacts, if Ghostscript
  PDF/A conversion is enabled.
- Ghostscript may transcode grayscale and color images, potentially
  lossily, based on an internal algorithm. This
  behavior can be suppressed by setting `--pdfa-image-compression` to
  `jpeg` or `lossless` to set all images to one type or the other.
  Ghostscript lacks an option to maintain the input image's format.
  (Modern Ghostscript can copy JPEG images without transcoding them.)
- Ghostscript's PDF/A conversion removes any XMP metadata that is not
  in one of the standard XMP metadata namespaces for PDFs. In particular,
  PRISM Metadata is removed.
- Ghostscript's PDF/A conversion may remove or deactivate
  hyperlinks and other active content.

You can use `--output-type pdf` to disable PDF/A conversion and produce
a standard, non-archival PDF.

Regarding OCRmyPDF itself:

- PDFs using transparency are not currently represented in the test
suite
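Two of the limitations above are tied to command-line options: `-l LANG` for the document's languages and `--pdfa-image-compression` for how Ghostscript encodes images. The sketch below shows one way to assemble those arguments in Python. The function name `limitation_workaround_args` is hypothetical, the language codes `eng` and `deu` are example Tesseract language codes, and joining several languages with `+` follows the Tesseract convention and is assumed here to be what you want to pass. Only the two compression values named above are accepted by this sketch.

```python
def limitation_workaround_args(languages=(), pdfa_image_compression=None):
    """Return extra ocrmypdf arguments addressing two documented limitations.

    Hypothetical helper for illustration; it is not part of OCRmyPDF.
    """
    args = []
    if languages:
        # Name every language in the document, joined Tesseract-style (eng+deu).
        args += ["-l", "+".join(languages)]
    if pdfa_image_compression is not None:
        # This sketch only accepts the two values mentioned in the text above.
        if pdfa_image_compression not in ("jpeg", "lossless"):
            raise ValueError("expected 'jpeg' or 'lossless'")
        args += ["--pdfa-image-compression", pdfa_image_compression]
    return args


if __name__ == "__main__":
    print(limitation_workaround_args(["eng", "deu"], "lossless"))
    # prints ['-l', 'eng+deu', '--pdfa-image-compression', 'lossless']
```

The returned list is meant to be placed between the `ocrmypdf` program name and the input and output file names.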
## Similar programs
To the author's knowledge, OCRmyPDF is the most feature-rich and
thoroughly tested command line OCR PDF conversion tool. If it does not
meet your needs, contributions and suggestions are welcome.

Ghostscript recently added three "pdfocr" output devices. They work by
rasterizing all content and converting all pages to a single color space.
## Web front-ends
The Docker image of OCRmyPDF provides a web service front-end
that allows files to be submitted over HTTP and the results to be downloaded.
This is an HTTP server intended to demonstrate how OCRmyPDF can be
integrated into a web service. It is not intended to be deployed on the
public internet and does not provide any security measures.

In addition, the following third-party integrations are available:

- [Paperless-ngx](https://docs.paperless-ngx.com/) is a free software
  document management system that uses OCRmyPDF to perform OCR on
  uploaded documents.
- [Nextcloud OCR](https://github.com/janis91/ocr) is a free software
  plugin for the Nextcloud private cloud software.

OCRmyPDF is not designed to be secure against malware-bearing PDFs (see
[Using OCRmyPDF online](ocr-service)). Users should ensure they
comply with OCRmyPDF's licenses and the licenses of all dependencies. In
particular, OCRmyPDF requires Ghostscript, which is licensed under
AGPLv3.