(security)=

# PDF security issues

> OCRmyPDF should only be used on PDFs you trust. It is not designed to
> protect you against malware.

Recognizing that many users have an interest in handling PDFs and
applying OCR to PDFs they did not generate themselves, this article
discusses the security implications of PDFs and how users can protect
themselves.

The disclaimer applies: this software has no warranties of any kind.

## PDFs may contain malware

PDF is a rich, complex file format. The official PDF 1.7 specification,
ISO 32000-1:2008, is hundreds of pages long and references several
annexes, each of which is similar in length. PDFs can contain video,
audio, XML, JavaScript and other programming, and forms. In some cases,
they can open internet connections to pre-selected URLs. All of these
are possible attack vectors.

In short, PDFs [may contain
viruses](https://security.stackexchange.com/questions/64052/can-a-pdf-file-contain-a-virus).

If you do not trust a PDF or its source, do not open it or use OCRmyPDF
on it. Consider using a Docker container or virtual machine to isolate
an untrusted PDF from your system.

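As a rough sketch of the container approach, the helper below runs
OCRmyPDF in a throwaway container with networking disabled and passes
the PDF through standard input and output, so no host directory is
mounted. The function name `ocr_in_container` is hypothetical, and the
sketch assumes the `jbarlow83/ocrmypdf` image is the one you use. A
container narrows what a malicious PDF can reach; it is not a guarantee.

```bash
# Hypothetical helper: $1 = untrusted input PDF, $2 = output PDF.
# Assumes the jbarlow83/ocrmypdf image; adjust to the image you use.
ocr_in_container() {
    # --rm            discard the container when it exits
    # -i              keep stdin open so the PDF can be piped in
    # --network none  give the container no network access
    # "- -"           tell ocrmypdf to read stdin and write stdout
    docker run --rm -i --network none jbarlow83/ocrmypdf - - < "$1" > "$2"
}
```

For example, `ocr_in_container untrusted.pdf ocr.pdf` would write the
result to `ocr.pdf` in the current directory.
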
## How OCRmyPDF processes PDFs

OCRmyPDF must open and interpret your PDF in order to insert an OCR
layer. First, it runs all PDFs through
[pikepdf](https://github.com/pikepdf/pikepdf), a library based on
[QPDF](https://github.com/qpdf/qpdf), a program that repairs PDFs with
syntax errors. This is done because, in the author's experience, a
significant number of PDFs in the wild, especially those created by
scanners, are not well-formed files. QPDF makes it more likely that
OCRmyPDF will succeed, but offers no security guarantees. QPDF is also
used to split the PDF into single-page PDFs.

Finally, OCRmyPDF rasterizes each page of the PDF using
[Ghostscript](http://ghostscript.com/) in `-dSAFER` mode.

Depending on the options specified, OCRmyPDF may graft the OCR layer
into the existing PDF or it may essentially reconstruct ("re-fry") a
visually identical PDF that may be quite different at the binary level.
That said, OCRmyPDF is not a tool designed for sanitizing PDFs.

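To get a feel for these steps, they can be approximated by hand with the
`qpdf` and `gs` command line tools. This is only a loose analogue:
OCRmyPDF drives QPDF through pikepdf and calls Ghostscript with its own
set of options, so the helper names, output file names, device and
resolution below are illustrative assumptions, not what OCRmyPDF runs.

```bash
# Hypothetical helpers; the flags are standard qpdf and Ghostscript options.

# $1 = input PDF, $2 = existing output directory
check_and_split() {
    # Report syntax problems; qpdf exits non-zero on errors or warnings.
    qpdf --check "$1" || echo "qpdf reported problems in $1" >&2
    # Write one PDF per page; %d is replaced by the page number.
    qpdf --split-pages "$1" "$2/page-%d.pdf"
}

# $1 = single-page PDF, $2 = output PNG
rasterize_page() {
    # -dSAFER restricts file access from within the interpreted document.
    gs -dSAFER -dBATCH -dNOPAUSE -dQUIET \
        -sDEVICE=png16m -r300 -sOutputFile="$2" "$1"
}
```

Neither command makes an untrusted file safe; both still have to parse
it.
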
## Password protected PDFs

Password protected PDFs usually have two passwords, an owner password
and a user password. When the user password is set to empty, PDF readers
will open the file automatically and mark it as "(SECURED)". Password
security can also request certain restrictions on the PDF, but anyone
can remove these restrictions if they have either the owner *or* user
password. Passwords mainly present a barrier for casual users.

OCRmyPDF cannot remove passwords from PDFs. If you want to remove a
password from a PDF, you must use other software, such as `qpdf`.

If both the owner and user passwords are set, a password is required for
`qpdf`. If only the owner password is set, then the password can be
stripped, even if one does not have the owner password. To remove the
password from a PDF using QPDF, use:

:::{code} bash
qpdf --decrypt --password='abc123' input.pdf no_password.pdf
:::

Then you can run OCRmyPDF on the file.

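Before decrypting, it can help to find out which of these cases a file
falls into. The sketch below relies on the exit status of
`qpdf --requires-password`, which recent versions of qpdf provide; the
function name `pdf_password_status` and the messages it prints are
hypothetical.

```bash
# Hypothetical helper: $1 = PDF to inspect.
pdf_password_status() {
    local status=0
    # qpdf prints nothing here; the answer is in the exit status.
    qpdf --requires-password "$1" || status=$?
    case "$status" in
        0) echo "password required" ;;
        2) echo "not encrypted" ;;
        3) echo "encrypted, but opens without a password" ;;
        *) echo "could not determine (qpdf exit status $status)" ;;
    esac
}
```

The third case corresponds to a file with only an owner password set,
which `qpdf --decrypt` can process without `--password`.
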
In its default mode, OCRmyPDF generates PDF/A. Passwords may not be set
on PDF/A documents. If you want to set a password on the output PDF, you
must specify `--output-type pdf`.

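One possible workflow, sketched below, is to produce a plain PDF with
`--output-type pdf` and then apply the password in a separate step with
`qpdf --encrypt`. The function name `ocr_then_encrypt`, the temporary
file layout and the choice of a 256-bit key are assumptions made for
illustration.

```bash
# Hypothetical helper:
#   $1 = input PDF, $2 = encrypted output PDF,
#   $3 = user password, $4 = owner password
ocr_then_encrypt() {
    local workdir status=0
    workdir=$(mktemp -d) || return 1
    # Step 1: OCR to a regular PDF rather than PDF/A.
    # Step 2: encrypt the result with a 256-bit key.
    ocrmypdf --output-type pdf "$1" "$workdir/ocr.pdf" &&
        qpdf --encrypt "$3" "$4" 256 -- "$workdir/ocr.pdf" "$2" || status=$?
    rm -rf "$workdir"
    return "$status"
}
```

Passwords given on the command line may be visible to other users of the
same machine while the command runs.
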
## Signature images

Many programs can insert an image of someone's signature into a PDF. On
its own, this offers no real security: it is trivial to remove the
signature image or apply it to other files.

## Digital signatures

Important documents can be digitally signed and certified to attest to
their authorship, approval or execution of a legal agreement. OCRmyPDF
will detect signed PDFs and will not modify them, unless the
`--invalidate-digital-signatures` option is used, which will invalidate
any signatures. (The signature may still be present in the PDF if
opened, but PDF readers will not validate it.)

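If you decide that losing the signature is acceptable, a cautious way to
use the option is to always write to a new file so the signed original
stays intact. The wrapper below is a minimal sketch; the function name
`ocr_signed_copy` is hypothetical.

```bash
# Hypothetical helper: $1 = signed input PDF, $2 = new output PDF.
ocr_signed_copy() {
    # Keep the signed original: never let the output replace the input.
    if [ "$1" = "$2" ]; then
        echo "refusing to overwrite the signed original" >&2
        return 1
    fi
    ocrmypdf --invalidate-digital-signatures "$1" "$2"
}
```
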
A digital signature adds a cryptographic hash of the document to the
document, so tamper protection is provided. That also precludes OCRmyPDF
from modifying the document while preserving the signature.

Digital signatures are not the same as a signature image. A digital
signature is a cryptographic hash of the document that is encrypted with
the author's private key. The public key is usually distributed in a
certificate issued by a certificate authority. A PDF reader verifies the
signature by decrypting it with the author's public key and comparing
the result with its own hash of the document. If the document is
modified, the signature will be invalidated.

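The sign-then-verify idea can be tried on an ordinary file with
`openssl`. This sketch is an analogy only: it signs a plain text file
with a throwaway RSA key pair and does not reproduce how a signature is
embedded in a PDF. The function name `signature_demo` and the file names
are made up for the example.

```bash
# Hypothetical demo: sign a file, verify it, modify it, verify again.
signature_demo() {
    local dir
    dir=$(mktemp -d) || return 1

    # Throwaway key pair standing in for the author's keys.
    openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
        -out "$dir/private.pem" 2>/dev/null
    openssl pkey -in "$dir/private.pem" -pubout -out "$dir/public.pem"

    printf 'original contents\n' > "$dir/document.txt"

    # Sign: hash the file with SHA-256, then sign the hash with the private key.
    openssl dgst -sha256 -sign "$dir/private.pem" \
        -out "$dir/document.sig" "$dir/document.txt"

    # Verify the untouched file using only the public key.
    if openssl dgst -sha256 -verify "$dir/public.pem" \
        -signature "$dir/document.sig" "$dir/document.txt" >/dev/null 2>&1; then
        echo "original: valid"
    else
        echo "original: invalid"
    fi

    # Any change to the file changes its hash, so the old signature fails.
    printf 'one extra line\n' >> "$dir/document.txt"
    if openssl dgst -sha256 -verify "$dir/public.pem" \
        -signature "$dir/document.sig" "$dir/document.txt" >/dev/null 2>&1; then
        echo "modified: valid"
    else
        echo "modified: invalid"
    fi

    rm -rf "$dir"
}
```

Adding an OCR text layer changes the file in the same way the appended
line does here, which is why the signature cannot survive.
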
## Certificate-encrypted PDFs

PDFs can be encrypted with a certificate. This is a more secure form of
encryption than a password. The certificate is usually issued by a
certificate authority. The document is encrypted using the public key in
the certificate, for the benefit of a specific recipient who possesses
the matching private key.

OCRmyPDF cannot open certificate-encrypted PDFs. If you have the
certificate and its private key, you can use other PDF software, such as
Acrobat, to decrypt the PDF.