.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0

=====================
Common error messages
=====================

Page already has text
=====================

.. code-block::

    ERROR - 1: page already has text! – aborting (use --force-ocr to force OCR)

You ran ocrmypdf on a file that already contains printable text or a
hidden OCR text layer (OCRmyPDF cannot reliably tell the two apart). You
probably don't want to do this, because the file is already searchable.

The error message mentions ``--force-ocr``, but you have three options:

- ``ocrmypdf --force-ocr`` to :ref:`rasterize <raster-vector>` all
  vector content and run OCR on the images. This is useful if a
  previous OCR program failed, or if the document contains a text
  watermark.
- ``ocrmypdf --skip-text`` to skip OCR and other processing on any
  pages that contain text. Text pages will be copied into the output
  PDF without modification.
- ``ocrmypdf --redo-ocr`` to scan the file for any existing OCR
  (non-printing text), remove it, and do OCR again. This is one way
  to take advantage of improvements in OCR accuracy. Printable vector
  text is excluded from OCR, so this can be used on files that contain
  a mix of digital and scanned pages.
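In a batch script, one way to apply the ``--skip-text`` option automatically is to retry only when the first run reports this error. The sketch below is a minimal illustration and not part of OCRmyPDF: the function name ``ocr_with_fallback`` is hypothetical, and it assumes that exit status 6 is how ``ocrmypdf`` signals that a page already has text. Check the exit codes of your installed version before relying on it.

```bash
# Hypothetical helper: run OCR, and retry with --skip-text if the input
# already contains text. ASSUMPTION: exit status 6 means "page already has
# text"; verify this against your installed ocrmypdf version.
ocr_with_fallback() {
    local input="$1" output="$2" status=0
    ocrmypdf "$input" "$output" || status=$?
    if [ "$status" -eq 6 ]; then
        status=0
        # Pages that already have text are copied through unchanged.
        ocrmypdf --skip-text "$input" "$output" || status=$?
    fi
    return "$status"
}
```

Call it as ``ocr_with_fallback input.pdf output.pdf``, where both file names are placeholders. Any other failure is returned to the caller unchanged, so unrelated errors are not hidden by the retry.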

Input file 'filename' is not a valid PDF
========================================

OCRmyPDF checks files with pikepdf, a library that in turn uses libqpdf to fix
errors in PDFs, before it tries to work on them. In most cases this error
appears because the PDF is corrupt and truncated (for example, by an incomplete
file copy), and not much can be done.
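A quick way to see whether a file was truncated is to look for the ``%PDF-`` header at the start and the ``%%EOF`` marker near the end. The following is a rough heuristic sketch, not the validation that OCRmyPDF or pikepdf performs. The function name ``looks_complete_pdf`` and the 1024-byte window are assumptions made for illustration.

```bash
# Hypothetical heuristic: succeed if the file has a PDF header and an
# end-of-file marker. This is NOT the check OCRmyPDF or pikepdf performs.
looks_complete_pdf() {
    local file="$1"
    # A PDF begins with the "%PDF-" header.
    [ "$(head -c 5 "$file")" = "%PDF-" ] || return 1
    # A complete PDF has an "%%EOF" marker close to the end of the file.
    # The 1024-byte window is an arbitrary choice for this sketch.
    tail -c 1024 "$file" | grep -a -q '%%EOF'
}
```

For example, ``looks_complete_pdf input.pdf || echo "input.pdf may be truncated"``. A file that passes this check can still be damaged in other ways, so treat a pass as a hint rather than proof.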

You can try rewriting the file with Ghostscript:

.. code-block:: bash

    gs -o output.pdf -dSAFER -sDEVICE=pdfwrite input.pdf

``pdftk`` can also rewrite PDFs:

.. code-block:: bash

    pdftk input.pdf cat output output.pdf

Sometimes Acrobat can repair PDFs with its `Preflight
tool <https://helpx.adobe.com/acrobat/using/correcting-problem-areas-preflight-tool.html>`__.