=================
Advanced features
=================

Control of unpaper
==================

OCRmyPDF uses ``unpaper`` to provide the implementation of the
``--clean`` and ``--clean-final`` arguments.
`unpaper <https://github.com/Flameeyes/unpaper/blob/master/doc/basic-concepts.md>`__
provides a variety of image processing filters to improve images.

By default, OCRmyPDF uses only ``unpaper`` arguments that were found to
be safe to use on almost all files without having to inspect every page
of the file afterwards. This is particularly true when only ``--clean``
is used, since that instructs OCRmyPDF to only clean the image before
OCR and not the final image.

However, if you wish to use the more aggressive options in ``unpaper``,
you may use ``--unpaper-args '...'`` to override OCRmyPDF's defaults
and forward other arguments to ``unpaper``. This option forwards
arguments to ``unpaper`` without any knowledge of what that program
considers to be valid arguments. The string of arguments must be quoted
as shown in the examples below. No filename arguments may be included.
OCRmyPDF assumes it can append the input and output filenames of
intermediate images to the ``--unpaper-args`` string.

In this example, we tell ``unpaper`` to expect two pages of text on a
sheet (image), such as occurs when two facing pages of a book are
scanned. ``unpaper`` uses this information to deskew each independently
and clean up the margins of both.

.. code-block:: bash

    ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
    ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf

.. warning::

    Some ``unpaper`` features will reposition text within the image.
    ``--clean-final`` is recommended to avoid this issue.

.. warning::

    Some ``unpaper`` features cause multiple input or output files to be
    consumed or produced. OCRmyPDF requires ``unpaper`` to consume one
    file and produce one file. Any deviation from that condition will
    result in errors.

.. note::

    ``unpaper`` uses uncompressed PBM/PGM/PPM files for its intermediate
    files. For large images or documents, it can take a lot of temporary
    disk space.

Control of OCR options
======================

OCRmyPDF provides many features to control the behavior of the OCR
engine, Tesseract.

When OCR is skipped
-------------------

If a page in a PDF seems to have text, by default OCRmyPDF will exit
without modifying the PDF. This is to ensure that PDFs that were
previously OCRed or were "born digital" rather than scanned are not
processed.

If ``--skip-text`` is issued, then no OCR will be performed on pages
that already have text. The page will be copied to the output. This may
be useful for documents that contain both "born digital" and scanned
content, or when using OCRmyPDF to normalize documents and convert them
to PDF/A regardless of their contents.

If ``--redo-ocr`` is issued, then a detailed text analysis is performed.
Text is categorized as either visible or invisible. Invisible text (OCR)
is stripped out. Then an image of each page is created with visible text
masked out. The page image is sent for OCR, and any additional text is
inserted as OCR. If a file contains a mix of text and bitmap images that
contain text, OCRmyPDF will locate the additional text in images without
disrupting the existing text.

If ``--force-ocr`` is issued, then all pages will be rasterized to
images, discarding any hidden OCR text, and rasterizing any printable
text. This is useful for redoing OCR, for fixing OCR text with a damaged
character map (text is selectable but not searchable), and for destroying
redacted information. Any forms and vector graphics will be rasterized
as well.
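
The three options can be compared side by side. The following is a minimal
sketch: ``ocr_mode_flag`` is a hypothetical helper written for this
illustration and is not part of OCRmyPDF. Only the three flags themselves
come from the description above.

.. code-block:: bash

    # Hypothetical helper: map a short mode name to the matching OCRmyPDF flag.
    ocr_mode_flag() {
        case "$1" in
            skip)  echo "--skip-text" ;;   # copy pages that already have text
            redo)  echo "--redo-ocr" ;;    # strip invisible OCR text, keep visible text
            force) echo "--force-ocr" ;;   # rasterize every page and OCR it again
            *)     echo "unknown mode: $1" >&2; return 1 ;;
        esac
    }

    # Example: redo OCR while leaving the visible text untouched.
    # ocrmypdf "$(ocr_mode_flag redo)" input.pdf output.pdf

Keeping the choice in one place like this may be convenient in batch
scripts, where the same mode is applied to many files.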

Time and image size limits
--------------------------

By default, OCRmyPDF permits tesseract to run for three minutes (180
seconds) per page. This is usually more than enough time to find all
text on a reasonably sized page with modern hardware.

If a page is skipped, it will be inserted without OCR. If preprocessing
was requested, the preprocessed image layer will be inserted.

If you want to adjust the amount of time spent on OCR, change
``--tesseract-timeout``. You can also automatically skip images that
exceed a certain number of megapixels with ``--skip-big``. (A 300 DPI,
8.5×11" page is 8.4 megapixels.)

.. code-block:: bash

    # Allow 300 seconds for OCR; skip any page larger than 50 megapixels
    ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
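
The megapixel figure for a letter-size page follows from the page
dimensions in pixels: 8.5 in × 300 DPI = 2550 px and 11 in × 300 DPI =
3300 px, so 2550 × 3300 = 8,415,000 pixels, or about 8.4 megapixels. As a
rough sketch, the same arithmetic can be done in the shell when choosing a
``--skip-big`` threshold. ``page_megapixels`` is a hypothetical helper,
not an OCRmyPDF command.

.. code-block:: bash

    # Hypothetical helper: megapixels for a page of WIDTH x HEIGHT inches at DPI.
    page_megapixels() {
        awk -v w="$1" -v h="$2" -v dpi="$3" \
            'BEGIN { printf "%.1f\n", (w * dpi) * (h * dpi) / 1000000 }'
    }

    page_megapixels 8.5 11 300    # prints 8.4
    page_megapixels 8.5 11 600    # prints 33.7

Doubling the resolution quadruples the pixel count, which is worth keeping
in mind when scans are made at 600 DPI.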

Overriding default tesseract
----------------------------

OCRmyPDF checks the system ``PATH`` for the ``tesseract`` binary.

Some relevant environment variables that influence Tesseract's behavior
include:

.. envvar:: TESSDATA_PREFIX

    Overrides the path to Tesseract's data files. This can allow
    simultaneous installation of the "best" and "fast" training data
    sets. OCRmyPDF does not manage this environment variable.

.. envvar:: OMP_THREAD_LIMIT

    Controls the number of threads Tesseract will use. OCRmyPDF will
    manage this environment variable if it is not already set.

For example, if you have a development build of Tesseract and don't wish
to use the system installation, you can launch OCRmyPDF as follows:

.. code-block:: bash

    env \
        PATH=/home/user/src/tesseract/api:$PATH \
        TESSDATA_PREFIX=/home/user/src/tesseract \
        ocrmypdf input.pdf output.pdf

In this example ``TESSDATA_PREFIX`` is required to redirect Tesseract to
an alternate folder for its "tessdata" files.

Overriding other support programs
---------------------------------

In addition to tesseract, OCRmyPDF uses the following external binaries:

- ``gs`` (Ghostscript)
- ``unpaper``
- ``pngquant``
- ``jbig2``

In each case OCRmyPDF will search the ``PATH`` environment variable to
locate the binaries.
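
Because each program is located through ``PATH``, checking what the shell
would resolve can help when diagnosing a missing or unexpected binary. This
is only a sketch: ``report_tools`` is a hypothetical helper, and it shows
what the shell finds, which is assumed to match what OCRmyPDF finds when
both run with the same ``PATH``.

.. code-block:: bash

    # Hypothetical helper: print where each named program resolves on PATH.
    report_tools() {
        local tool location
        for tool in "$@"; do
            if location=$(command -v "$tool"); then
                echo "$tool: $location"
            else
                echo "$tool: not found"
            fi
        done
    }

    report_tools tesseract gs unpaper pngquant jbig2

To try a different build of one of these programs, prepend its directory to
``PATH`` for a single invocation, in the same way as the ``env`` example
for Tesseract.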

Changing tesseract configuration variables
------------------------------------------

You can override tesseract's default `control
parameters <https://github.com/tesseract-ocr/tesseract/wiki/ControlParams>`__
with a configuration file.

As an example, this configuration will disable Tesseract's dictionary
for the current language. Normally the dictionary is helpful for
interpolating words that are unclear, but it may interfere with OCR if
the document does not contain many words (for example, a list of part
numbers).

Create a file named "no-dict.cfg" with these contents:

::

    load_system_dawg 0
    language_model_penalty_non_dict_word 0
    language_model_penalty_non_freq_dict_word 0

Then run ocrmypdf as follows (along with any other desired arguments):

.. code-block:: bash

    ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf

.. warning::

    Some combinations of control parameters will break Tesseract or break
    assumptions that OCRmyPDF makes about Tesseract's output.

Changing the PDF renderer
=========================

rasterizing
    Converting a PDF to an image for display.

rendering
    Creating a new PDF from other data (such as an existing PDF).

OCRmyPDF has these PDF renderers: ``sandwich`` and ``hocr``. The
renderer may be selected using ``--pdf-renderer``. The default is
``auto``, which lets OCRmyPDF select the renderer to use. Currently,
``auto`` always selects ``sandwich``.

The ``sandwich`` renderer
-------------------------

The ``sandwich`` renderer uses Tesseract's new text-only PDF feature,
which produces a PDF page that lays out the OCR in invisible text. This
page is then "sandwiched" onto the original PDF page, allowing lossless
application of OCR even to PDF pages that contain other vector objects.

Currently this is the best renderer for most uses; however, it is
implemented in Tesseract, so OCRmyPDF cannot influence it. Currently some
problematic PDF viewers like Mozilla PDF.js and macOS Preview have
problems with segmenting its text output, and
mightrunseveralwordstogether.

When image preprocessing features like ``--deskew`` are used, the
original PDF will be rendered as a full page and the OCR layer will be
placed on top.

The ``hocr`` renderer
---------------------

The ``hocr`` renderer works with older versions of Tesseract. The image
layer is copied from the original PDF page if possible, avoiding
potentially lossy transcoding or loss of other PDF information. If
preprocessing is specified, then the image layer is a new PDF.

Unlike ``sandwich``, this renderer is implemented within OCRmyPDF; anyone
looking to customize how OCR is presented should look here. A major
disadvantage of this renderer is that it is not capable of correctly
handling text outside the Latin alphabet (specifically, it supports the
ISO 8859-1 character set). Pull requests to improve the situation are
welcome.

Currently, this renderer has the best compatibility with Mozilla's
PDF.js viewer.

This works in all versions of Tesseract.

The ``tesseract`` renderer
--------------------------

The ``tesseract`` renderer was removed. OCRmyPDF's new approach to text
layer grafting makes it functionally equivalent to ``sandwich``.

Return code policy
==================

OCRmyPDF writes all messages to ``stderr``. ``stdout`` is reserved for
piping output files. ``stdin`` is reserved for piping input files.

The return codes generated by OCRmyPDF are considered part of the
stable user interface. They may be imported from
``ocrmypdf.exceptions``.

.. list-table:: Return codes
    :widths: 5 35 60
    :header-rows: 1

    * - Code
      - Name
      - Interpretation
    * - 0
      - ``ExitCode.ok``
      - Everything worked as expected.
    * - 1
      - ``ExitCode.bad_args``
      - Invalid arguments, exited with an error.
    * - 2
      - ``ExitCode.input_file``
      - The input file does not seem to be a valid PDF.
    * - 3
      - ``ExitCode.missing_dependency``
      - An external program required by OCRmyPDF is missing.
    * - 4
      - ``ExitCode.invalid_output_pdf``
      - An output file was created, but it does not seem to be a valid PDF. The file will be available.
    * - 5
      - ``ExitCode.file_access_error``
      - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
    * - 6
      - ``ExitCode.already_done_ocr``
      - The file already appears to contain text so it may not need OCR. See output message.
    * - 7
      - ``ExitCode.child_process_error``
      - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
    * - 8
      - ``ExitCode.encrypted_pdf``
      - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
    * - 9
      - ``ExitCode.invalid_config``
      - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
    * - 10
      - ``ExitCode.pdfa_conversion_failed``
      - A valid PDF was created, but PDF/A conversion failed. The file will be available.
    * - 15
      - ``ExitCode.other_error``
      - Some other error occurred.
    * - 130
      - ``ExitCode.ctrl_c``
      - The program was interrupted by pressing Ctrl+C.
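
Since the return codes are stable, a wrapper script can branch on them. The
sketch below maps each code in the table to its name. ``explain_exit_code``
is a hypothetical helper, and the labels are copied from the table rather
than produced by OCRmyPDF itself.

.. code-block:: bash

    # Hypothetical helper: translate an OCRmyPDF return code into its table name.
    explain_exit_code() {
        case "$1" in
            0)   echo "ok" ;;
            1)   echo "bad_args" ;;
            2)   echo "input_file" ;;
            3)   echo "missing_dependency" ;;
            4)   echo "invalid_output_pdf" ;;
            5)   echo "file_access_error" ;;
            6)   echo "already_done_ocr" ;;
            7)   echo "child_process_error" ;;
            8)   echo "encrypted_pdf" ;;
            9)   echo "invalid_config" ;;
            10)  echo "pdfa_conversion_failed" ;;
            15)  echo "other_error" ;;
            130) echo "ctrl_c" ;;
            *)   echo "unrecognized ($1)" ;;
        esac
    }

    # Typical use in a script:
    # ocrmypdf input.pdf output.pdf
    # status=$?
    # echo "OCRmyPDF finished: $(explain_exit_code "$status")"

A batch job might, for instance, treat ``already_done_ocr`` as a reason to
copy the input unchanged instead of reporting a failure.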

Debugging the intermediate files
================================

OCRmyPDF normally saves its intermediate results to a temporary folder
and deletes this folder when it exits, whether it succeeded or failed.

If the ``-k`` argument is issued on the command line, OCRmyPDF will keep
the temporary folder and print the location, whether it succeeded or
failed (provided the Python interpreter did not crash). An example
message is:

.. code-block:: none

    Temporary working files retained at:
    /tmp/ocrmypdf.io.u20wpz07

The organization of this folder is an implementation detail and subject
to change between releases. However, the general organization is that
working files on a per page basis have the page number as a prefix
(starting with page 1), an infix indicates the processing stage, and a
suffix indicates the file type. Some important files include:

- ``_rasterize.png`` - what the input page looks like
- ``_ocr.png`` - the file that is sent to Tesseract for OCR; depending
  on arguments this may differ from the presentation image
- ``_pp_deskew.png`` - the image, after deskewing
- ``_pp_clean.png`` - the image, after cleaning with unpaper
- ``_ocr_tess.pdf`` - the OCR file; appears as a blank page with invisible
  text embedded
- ``_ocr_tess.txt`` - the OCR text (not necessarily all text on the page,
  if the page is mixed format)
- ``fix_docinfo.pdf`` - a temporary file created to fix the PDF DocumentInfo
  data structure
- ``graft_layers.pdf`` - the rendered PDF with OCR layers grafted on
- ``pdfa.pdf`` - ``graft_layers.pdf`` after conversion to PDF/A
- ``pdfa.ps`` - a PostScript file used by Ghostscript for PDF/A conversion
- ``optimize.pdf`` - the PDF generated before optimization
- ``optimize.out.pdf`` - the PDF generated by optimization
- ``origin`` - the input file
- ``origin.pdf`` - the input file or the input image converted to PDF
- ``images/*`` - images extracted during the optimization process; here
  the prefix indicates a PDF object ID, not a page number
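
As a sketch of how a retained folder might be inspected, the helper below
lists the per-page files for one processing stage. ``list_stage_files`` is
hypothetical, and its glob assumes only the naming pattern described in this
section (a page-number prefix followed by the infix and suffix). Because the
layout is an implementation detail, it may not match every release.

.. code-block:: bash

    # Hypothetical helper: list per-page files in DIR whose names end in ENDING,
    # for example "_ocr.png" or "_pp_clean.png".
    list_stage_files() {
        local dir="$1" ending="$2" f
        for f in "$dir"/*"$ending"; do
            # Skip the unexpanded pattern when nothing matches.
            [ -e "$f" ] && echo "$f"
        done
        return 0
    }

    # Example, using the folder printed when -k is given:
    # list_stage_files /tmp/ocrmypdf.io.u20wpz07 _ocr.png

Comparing the ``_rasterize.png`` and ``_ocr.png`` files for the same page is
one way to see what preprocessing changed before the image reached
Tesseract.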