% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# Introduction

OCRmyPDF is a Python application and library that adds text "layers" to images in
PDFs, making scanned image PDFs searchable. It uses OCR to guess the text
contained in images. OCRmyPDF also supports plugins
that enable customization of its processing steps, and it tolerates PDFs that
mix scanned images with "born digital" content that doesn't
require text recognition.

## About OCR

[Optical character
recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)
is a technology that converts images of typed or handwritten text, such as
in a scanned document, into computer text that can be selected, searched and copied.

OCRmyPDF uses
[Tesseract](https://github.com/tesseract-ocr/tesseract), a widely
available open source OCR engine, to perform OCR.

(raster-vector)=

## About PDFs

PDFs are page description files that attempt to preserve a layout
exactly. They contain [vector
graphics](http://vector-conversions.com/vectorizing/raster_vs_vector.html)
that can contain raster objects, such as scanned images. Because PDFs can
contain multiple pages (unlike many image formats) and can contain fonts
and text, they are a suitable format for exchanging scanned documents.

:::{image} images/bitmap_vs_svg.svg
:::

A PDF page may contain multiple images, even if it appears to have only
one image. Some scanners or scanning software may segment pages into
monochromatic text and color regions, for example, to enhance the compression
ratio and appearance of the page.

Rasterizing a PDF is the process of generating corresponding raster images.
OCR engines like Tesseract work with images, not scalable vector graphics
or mixed raster-vector-text graphics such as PDF.

## About PDF/A

[PDF/A](https://en.wikipedia.org/wiki/PDF/A) is an ISO-standardized
subset of the full PDF specification that is designed for archiving (the
'A' stands for Archive). PDF/A differs from PDF primarily by omitting
features that could complicate future file readability,
such as embedded JavaScript, video, audio and references to external
fonts. All fonts and resources needed to interpret the PDF must be
contained within it. Because PDF/A disables JavaScript and other types
of embedded content, it is likely more secure.

There are various conformance levels and versions, such as "PDF/A-2b".

In general, the preferred format for scanned documents is PDF/A. Some
governments and jurisdictions, US Courts in particular, [mandate the use
of PDF/A](https://pdfblog.com/2012/02/13/what-is-pdfa/) for scanned
documents.

Since most individuals scanning documents aim for long-term readability,
OCRmyPDF defaults to generating PDF/A-2b.

PDF/A does have a few drawbacks. Some PDF viewers display an alert
indicating that the file is in PDF/A format, which may confuse some users.
Additionally, it tends to result in larger files than standard PDFs because
it embeds certain resources, even if they are widely available. PDF/A
files can be digitally signed, but to ensure future readability they may
not be encrypted. Fortunately, converting from PDF/A to a regular PDF is
straightforward, and any PDF viewer can handle PDF/A files.

## What OCRmyPDF does

OCRmyPDF analyzes each page of a PDF to determine the required colorspace
and resolution (DPI) for capturing all the information on that page without
losing content. It uses
[Ghostscript](http://ghostscript.com/) to rasterize each page and subsequently
performs OCR on the rasterized image to generate an OCR "layer." This layer
is then integrated back into the original PDF.
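
To make the idea of a "required resolution" concrete, the sketch below estimates the resolution of an image from its pixel size and the size at which it is drawn on the page, then picks the highest value on the page. This is only an illustration of the concept, not OCRmyPDF's actual analysis; the function names, the `(pixels, drawn_size)` representation, and the fallback value are all assumptions.

```python
def effective_dpi(pixels: int, drawn_size: float, user_unit: float = 1.0) -> float:
    """Resolution of an image along one axis, given how large it is drawn.

    drawn_size: the size the image occupies on the page, in user space units.
    user_unit: the page's UserUnit; one unit equals user_unit / 72 inch.
    """
    inches = drawn_size * user_unit / 72
    return pixels / inches


def page_dpi(images: list[tuple[int, float]], fallback: float = 300.0) -> float:
    """Pick the highest resolution among (pixels, drawn_size) pairs.

    `fallback` is an arbitrary value used here for pages without images.
    """
    if not images:
        return fallback
    return max(effective_dpi(pixels, size) for pixels, size in images)


# Two images drawn 612 units (8.5 in) wide: one 2550 px wide, one 1275 px wide
print(page_dpi([(2550, 612), (1275, 612)]))  # 300.0
```

Rasterizing at the highest resolution found means the sharpest image on the page is not downsampled before OCR.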

While it is possible to use a program like Ghostscript or ImageMagick to
obtain an image and then run that image through Tesseract OCR, this process
actually generates a new PDF, potentially resulting in the loss of various
details (such as the document's metadata). In contrast, OCRmyPDF can produce
a minimally altered PDF as the output.

OCRmyPDF also offers several image processing options, such as deskew, that
can improve the visual quality of files and the accuracy of OCR. When these
options are used, the OCR layer is integrated into the processed image.

By default, OCRmyPDF generates archival PDFs in the PDF/A format, which is
a more rigid subset of PDF features designed for long-term archives. If you
prefer regular PDFs, you can disable this feature using the
`--output-type pdf` option.
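
The same choice can be made from Python through the `ocrmypdf.ocr()` function. The following is a minimal sketch, assuming OCRmyPDF is installed; the helper names `ocr_options` and `run_ocr` and the file names are hypothetical.

```python
def ocr_options(archival: bool = True, deskew: bool = False) -> dict:
    """Build keyword arguments for ocrmypdf.ocr()."""
    return {
        # "pdfa" requests an archival PDF/A; "pdf" mirrors `--output-type pdf`
        "output_type": "pdfa" if archival else "pdf",
        "deskew": deskew,
    }


def run_ocr(input_pdf: str, output_pdf: str, **options):
    """Run OCR on input_pdf and write the result to output_pdf."""
    # Imported here so that ocr_options() can be used without OCRmyPDF installed
    import ocrmypdf

    return ocrmypdf.ocr(input_pdf, output_pdf, **options)


# Example (requires OCRmyPDF and an input file named scan.pdf):
# run_ocr("scan.pdf", "scan_ocr.pdf", **ocr_options(archival=False, deskew=True))
```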

## Why you shouldn't do this manually

A PDF is similar to an HTML file, in that it contains document structure
along with images. While some PDFs may solely display a full-page image,
they often contain additional content that would be forfeited if not preserved.

A manual process could take one of these approaches:

1. Rasterize each page as an image, perform OCR on the images, and then merge the
   output into a PDF. This method preserves the layout of each page, but
   resamples all images, potentially leading to quality loss, increased file size,
   and the introduction of compression artifacts, among other issues.
2. Extract each image, OCR, and combine the output into a PDF. This approach
   loses the context in which images are used in the PDF, potentially resulting
   in loss of information related to scaling and position of images. Some scanned
   PDFs contain multiple images segmented into black and white, grayscale
   and color regions, with stencil masks to prevent overlap, as this can
   enhance the appearance of a file while reducing file size.
   Reassembling these images can be challenging, and risks losing vector art
   or text that is not part of an image.

In cases where a PDF solely serves as a container for images without any
rotation, scaling, or cropping, the second approach can be lossless.

OCRmyPDF uses various strategies depending on input options and the input PDF
itself. Generally, it rasterizes a page for OCR and then integrates the OCR
data back into the original PDF. This approach allows it to handle complex
PDFs and preserve their content as much as possible.

Furthermore, OCRmyPDF supports a wide range of edge cases that have emerged
during several years of development. It accommodates PDF features like
images within Form XObjects and pages with UserUnit scaling. It also
supports less common image formats like non-monochrome 1-bit images and
provides warnings about files you may not want to OCR. Thanks to tools
like pikepdf and QPDF, it can auto-repair damaged PDFs. You don't need to
understand the intricacies of these issues; you should be able to use
OCRmyPDF with any PDF file, and expect reasonable results.

## Limitations

OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine.
These limitations are inherent to any software relying on Tesseract:

- The OCR accuracy may not match that of commercial OCR solutions.
- It is incapable of recognizing handwriting.
- It may detect gibberish and report it as OCR output.
- Results may be subpar when a document contains languages not specified
  in the `-l LANG` argument.
- Tesseract may struggle to analyze the natural reading order of documents.
  For instance, it might fail to recognize two columns in a document and
  attempt to join text across columns.
- Poor quality scans can result in subpar OCR quality. In other words, the
  quality of the OCR output depends on the quality of the input.
- Tesseract does not provide information about the font family to which text
  belongs.
- Tesseract does not divide text into paragraphs or headings. It only provides
  the text and its bounding box. As such, the generated PDF does not
  contain any information about the document's structure.
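
For multilingual documents, several language codes can be passed in a single `-l` argument, joined with `+`. The sketch below builds such a command line; the helper name `language_args` and the file names are hypothetical, and each language must also be installed as a Tesseract language pack.

```python
def language_args(languages: list[str]) -> list[str]:
    """Return the `-l` argument for a list of Tesseract language codes."""
    if not languages:
        # Leave the option out so that the default language is used
        return []
    return ["-l", "+".join(languages)]


# Command line for a document that mixes English and German
command = ["ocrmypdf", *language_args(["eng", "deu"]), "input.pdf", "output.pdf"]
print(command)  # ['ocrmypdf', '-l', 'eng+deu', 'input.pdf', 'output.pdf']
```

The resulting list can be handed to `subprocess.run()` as is.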

Ghostscript also imposes some limitations:

- When Ghostscript performs the PDF/A conversion, PDFs containing
  JPEG 2000-encoded content may be converted to JPEG encoding, which may
  introduce compression artifacts.
- Ghostscript may transcode grayscale and color images, potentially
  lossily, based on an internal algorithm. This
  behavior can be suppressed by setting `--pdfa-image-compression` to
  `jpeg` or `lossless` to set all images to one type or the other.
  Ghostscript lacks an option to maintain the input image's format.
  (Modern Ghostscript can copy JPEG images without transcoding them.)
- Ghostscript's PDF/A conversion removes any XMP metadata that is not
  one of the standard XMP metadata namespaces for PDFs. In particular,
  PRISM Metadata is removed.
- Ghostscript's PDF/A conversion may remove or deactivate
  hyperlinks and other active content.

You can use `--output-type pdf` to disable PDF/A conversion and produce
a standard, non-archival PDF.
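
As a sketch of how the `--pdfa-image-compression` setting might be applied from a script, the helper below validates the mode and returns the matching command-line arguments. The helper name and file names are hypothetical, and the listed modes reflect the option's choices as understood at the time of writing.

```python
PDFA_IMAGE_COMPRESSION_MODES = ("auto", "jpeg", "lossless")


def pdfa_compression_args(mode: str) -> list[str]:
    """Return command-line arguments that select a PDF/A image compression mode."""
    if mode not in PDFA_IMAGE_COMPRESSION_MODES:
        raise ValueError(f"unknown compression mode: {mode!r}")
    return ["--pdfa-image-compression", mode]


# Ask Ghostscript to store all images losslessly during PDF/A conversion
command = ["ocrmypdf", *pdfa_compression_args("lossless"), "input.pdf", "output.pdf"]
print(command)
```

Choosing `lossless` avoids lossy transcoding at the cost of larger files, while `jpeg` does the opposite.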

Regarding OCRmyPDF itself:

- PDFs using transparency are not currently represented in the test
  suite.

## Similar programs

To the author's knowledge, OCRmyPDF is the most feature-rich and
thoroughly tested command line OCR PDF conversion tool. If it does not
meet your needs, contributions and suggestions are welcome.

Ghostscript recently added three "pdfocr" output devices. They work by
rasterizing all content and converting all pages to a single color space.

## Web front-ends

The Docker image of OCRmyPDF provides a web service front-end
that allows files to be submitted over HTTP and the results to be downloaded.
This is an HTTP server intended to demonstrate how OCRmyPDF can be
integrated into a web service. It is not intended to be deployed on the
public internet and does not provide any security measures.

In addition, the following third-party integrations are available:

- [Paperless-ngx](https://docs.paperless-ngx.com/) is a free software
  document management system that uses OCRmyPDF to perform OCR on
  uploaded documents.
- [Nextcloud OCR](https://github.com/janis91/ocr) is a free software
  plugin for the Nextcloud private cloud software.

OCRmyPDF is not designed to be secure against malware-bearing PDFs (see
[Using OCRmyPDF online](ocr-service)). Users should ensure they
comply with OCRmyPDF's licenses and the licenses of all dependencies. In
particular, OCRmyPDF requires Ghostscript, which is licensed under
AGPLv3.