.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0

============
Introduction
============

OCRmyPDF is an application and library that adds text "layers" to images
in PDFs, making scanned image PDFs searchable. It uses OCR to guess what text
is contained in images. It is written in Python. OCRmyPDF supports plugins
that allow customization of its processing steps, and is very tolerant of
PDFs that contain scanned images and "born digital" content that needs no
text recognition.

About OCR
=========

`Optical character
recognition <https://en.wikipedia.org/wiki/Optical_character_recognition>`__
is technology that converts images of typed or handwritten text, such as
in a scanned document, to computer text that can be selected, searched and copied.

OCRmyPDF uses
`Tesseract <https://github.com/tesseract-ocr/tesseract>`__, the best
available open source OCR engine, to perform OCR.

.. _raster-vector:

About PDFs
==========

PDFs are page description files that attempt to preserve a layout
exactly. They contain `vector
graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
that can contain raster objects such as scanned images. Because PDFs can
contain multiple pages (unlike many image formats) and can contain fonts
and text, they are a good format for exchanging scanned documents.

|image|

A PDF page might contain multiple images, even if it only appears to
have one image. For example, some scanners or scanning software segment
pages into monochromatic text and color regions to improve the
compression ratio and appearance of the page.

Rasterizing a PDF is the process of generating corresponding raster images.
OCR engines like Tesseract work with images, not scalable vector graphics
or mixed raster-vector-text graphics such as PDF.
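
As a rough illustration, the pixel dimensions of a rasterized page follow
from the page size in PDF units (1/72 inch by default) and the chosen
resolution. The sketch below is plain arithmetic; the helper name
``raster_size`` is hypothetical and is not part of OCRmyPDF.

.. code-block:: python

   POINTS_PER_INCH = 72  # default PDF unit: 1/72 inch


   def raster_size(width_pt, height_pt, dpi):
       """Return (width, height) in pixels for a page rendered at ``dpi``."""
       width_px = round(width_pt * dpi / POINTS_PER_INCH)
       height_px = round(height_pt * dpi / POINTS_PER_INCH)
       return width_px, height_px


   # A US Letter page (612 x 792 points) rendered at 300 DPI
   print(raster_size(612, 792, 300))  # (2550, 3300)

Doubling the resolution quadruples the number of pixels in the image.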

About PDF/A
===========

`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`__ is an ISO-standardized
subset of the full PDF specification that is designed for archiving (the
'A' stands for Archive). PDF/A differs from PDF primarily by omitting
features that would make it difficult to read the file in the future,
such as embedded JavaScript, video, audio and references to external
fonts. All fonts and resources needed to interpret the PDF must be
contained within it. Because PDF/A disables JavaScript and other types
of embedded content, it is probably more secure.

There are various conformance levels and versions, such as "PDF/A-2b".

Generally speaking, the best format for scanned documents is PDF/A. Some
governments and jurisdictions, US Courts in particular, `mandate the use
of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`__ for scanned
documents.

Since most people who scan documents are interested in reading them
indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.

PDF/A has a few drawbacks. Some PDF viewers include an alert that the
file is a PDF/A, which may confuse some users. It also tends to produce
larger files than PDF, because it embeds certain resources even if they
are commonly available. PDF/A files can be digitally signed, but may not
be encrypted, to ensure they can be read in the future. Fortunately,
converting from PDF/A to a regular PDF is trivial, and any PDF viewer
can view PDF/A.

What OCRmyPDF does
==================

OCRmyPDF analyzes each page of a PDF to determine the colorspace and
resolution (DPI) needed to capture all of the information on that page
without losing content. It uses
`Ghostscript <http://ghostscript.com/>`__ to rasterize the page, and
then performs OCR on the rasterized image to create an OCR "layer".
The layer is then grafted back onto the original PDF.

While one can use a program like Ghostscript or ImageMagick to get an
image and put the image through Tesseract, that actually creates a new
PDF and many details may be lost. OCRmyPDF can produce a minimally
changed PDF as output.

OCRmyPDF also provides some image processing options, such as deskew, which
improve the appearance of files and the quality of OCR. When these are used,
the OCR layer is grafted onto the processed image instead.

By default, OCRmyPDF produces archival PDFs (PDF/A), which use a
stricter subset of PDF features designed for long-term archives. If
regular PDFs are desired, this can be disabled with
``--output-type pdf``.
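
The options mentioned above can be combined on one command line. The
following sketch only assembles the argument list; the helper
``build_command`` and the file names ``scan.pdf`` and ``scan_ocr.pdf`` are
hypothetical placeholders, and running the command requires OCRmyPDF to be
installed.

.. code-block:: python

   import shlex


   def build_command(input_pdf, output_pdf, output_type=None, deskew=False):
       """Assemble an ``ocrmypdf`` command line as a list of arguments."""
       args = ["ocrmypdf"]
       if output_type is not None:
           # "pdf" requests a regular PDF instead of the default PDF/A
           args += ["--output-type", output_type]
       if deskew:
           args.append("--deskew")
       args += [input_pdf, output_pdf]
       return args


   cmd = build_command("scan.pdf", "scan_ocr.pdf", output_type="pdf", deskew=True)
   print(shlex.join(cmd))
   # To run it: import subprocess; subprocess.run(cmd, check=True)

Building the arguments as a list avoids shell quoting problems when file
names contain spaces.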

Why you shouldn't do this manually
==================================

A PDF is similar to an HTML file, in that it contains document structure
along with images. Sometimes a PDF does nothing more than present a full
page image, but often there is additional content that would be lost.

A manual process could work like either of these:

1. Rasterize each page as an image, OCR the images, and combine the
   output into a PDF. This preserves the layout of each page, but
   resamples all images (possibly losing quality, increasing file size,
   introducing compression artifacts, etc.).
2. Extract each image, OCR, and combine the output into a PDF. This
   loses the context in which images are used in the PDF, meaning that
   cropping, rotation and scaling of pages may be lost. Some scanned
   PDFs use multiple images segmented into black and white, grayscale
   and color regions, with stencil masks to prevent overlap, as this can
   enhance the appearance of a file while reducing file size. Clearly,
   reassembling these images will not be easy. This also loses any text or
   vector art on any pages in a PDF with both scanned and pure digital
   content.

In the case of a PDF that is nothing other than a container of images
(no rotation, scaling, cropping, one image per page), the second
approach can be lossless.

OCRmyPDF uses several strategies depending on input options and the
input PDF itself, but generally speaking it rasterizes a page for OCR
and then grafts the OCR back onto the original. As such it can handle
complex PDFs and still preserve their contents as much as possible.

OCRmyPDF also supports many edge cases that have cropped up over
several years of development. We support PDF features like images inside
of Form XObjects, and pages with UserUnit scaling. We support rare image
formats like non-monochrome 1-bit images. We warn about files you may
not want to OCR. Thanks to pikepdf and QPDF, we auto-repair PDFs that are
damaged. (Not that you need to know what any of these are! You should be
able to throw any PDF at it.)

Limitations
===========

OCRmyPDF is limited by the Tesseract OCR engine. As such it experiences
these limitations, as do any other programs that rely on Tesseract:

-  The OCR is not as accurate as commercial OCR solutions.
-  It is not capable of recognizing handwriting.
-  It may find gibberish and report this as OCR output.
-  If a document contains languages outside of those given in the
   ``-l LANG`` arguments, results may be poor.
-  It is not always good at analyzing the natural reading order of
   documents. For example, it may fail to recognize that a document
   contains two columns, and may try to join text across columns.
-  Poor quality scans may produce poor quality OCR. Garbage in, garbage
   out.
-  It does not expose information about what font family text belongs
   to.

OCRmyPDF is also limited by the PDF specification:

-  PDF encodes the position of text glyphs but does not encode document
   structure. There is no markup that divides a document into sections,
   paragraphs, sentences, or even words (since blank spaces are not
   represented). As such all elements of document structure including
   the spaces between words must be derived heuristically. Some PDF
   viewers do a better job of this than others.
-  Because some popular open source PDF viewers have a particularly hard
   time with spaces between words, OCRmyPDF appends a space to each text
   element as a workaround (when using ``--pdf-renderer hocr``). While
   this mixes document structure with graphical information that ideally
   should be left to the PDF viewer to interpret, it improves
   compatibility with some viewers and does not cause problems for
   better ones.
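
To make the idea of deriving word breaks heuristically concrete, here is a
toy sketch. It is an assumption-laden illustration, not the method used by
OCRmyPDF or by any particular PDF viewer: the function ``join_glyphs``, the
``(char, x_start, x_end)`` tuple layout and the ``gap_ratio`` threshold are
all hypothetical.

.. code-block:: python

   from statistics import median


   def join_glyphs(glyphs, gap_ratio=0.3):
       """Rebuild one line of text from (char, x_start, x_end) glyph boxes.

       A space is inserted wherever the horizontal gap between two
       neighbouring glyphs exceeds ``gap_ratio`` times the median glyph width.
       """
       if not glyphs:
           return ""
       glyphs = sorted(glyphs, key=lambda g: g[1])
       threshold = gap_ratio * median(x1 - x0 for _, x0, x1 in glyphs)
       text = glyphs[0][0]
       for (_, _, prev_end), (char, start, _) in zip(glyphs, glyphs[1:]):
           if start - prev_end > threshold:
               text += " "
           text += char
       return text


   line = [("O", 0, 10), ("C", 11, 21), ("R", 22, 32),
           ("m", 40, 50), ("y", 51, 61)]
   print(join_glyphs(line))  # OCR my

A fixed threshold like this breaks down with unusual fonts or tight
letter spacing, which is one reason viewers differ in how well they
recover words.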

Ghostscript also imposes some limitations:

-  PDFs containing JBIG2-encoded content will be converted to CCITT
   Group4 encoding, which has lower compression ratios, if Ghostscript
   PDF/A is enabled.
-  PDFs containing JPEG 2000-encoded content will be converted to JPEG
   encoding, which may introduce compression artifacts, if Ghostscript
   PDF/A is enabled.
-  Ghostscript may transcode grayscale and color images, either lossy to
   lossless or lossless to lossy, based on an internal algorithm. This
   behavior can be suppressed by setting ``--pdfa-image-compression`` to
   ``jpeg`` or ``lossless`` to set all images to one type or the other.
   Ghostscript has no option to maintain the input image's format.
   (Modern Ghostscript can copy JPEG images without transcoding them.)
-  Ghostscript's PDF/A conversion removes any XMP metadata that is not
   one of the standard XMP metadata namespaces for PDFs. In particular,
   PRISM Metadata is removed.
-  Ghostscript's PDF/A conversion seems to remove or deactivate
   hyperlinks and other active content.

You can use ``--output-type pdf`` to disable PDF/A conversion and produce
a standard, non-archival PDF.

Regarding OCRmyPDF itself:

-  PDFs that use transparency are not currently represented in the test
   suite.

Similar programs
================

To the author's knowledge, OCRmyPDF is the most feature-rich and
thoroughly tested command line OCR PDF conversion tool. If it does not
meet your needs, contributions and suggestions are welcome.
Alternatively, consider one of these similar open source programs:

-  pdf2pdfocr
-  pdfsandwich

Ghostscript recently added three "pdfocr" output devices. They work by
rasterizing all content and converting all pages to a single color space.

Web front-ends
==============

The Docker image ``ocrmypdf`` provides a web service front-end
that allows files to be submitted over HTTP and the results "downloaded".
This is an HTTP server intended to simplify web services deployments; it
is not intended to be deployed on the public internet and has no real
security measures to speak of.

In addition, the following third-party integrations are available:

-  `Nextcloud OCR <https://github.com/janis91/ocr>`__ is a free software
   plugin for the Nextcloud private cloud software

OCRmyPDF is not designed to be secure against malware-bearing PDFs (see
`Using OCRmyPDF online <ocr-service>`__). Users should ensure they
comply with OCRmyPDF's licenses and the licenses of all dependencies. In
particular, OCRmyPDF requires Ghostscript, which is licensed under
AGPLv3.

.. |image| image:: images/bitmap_vs_svg.svg