.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0

.. _docker:

=====================
OCRmyPDF Docker image
=====================

OCRmyPDF is also available as Docker images that package recent
versions of all dependencies.

For users who already have Docker installed, this may be an easy and
convenient option.

On platforms other than Linux, Docker runs in a virtual machine, and so may
be less performant. You may also want to adjust the Docker virtual machine's
memory and CPU allocation. On Linux, the Docker image runs natively and
performance is comparable to a system installation.

.. _docker-install:

Installing the Docker image
===========================

If you have `Docker <https://docs.docker.com/>`__ installed on your
system, you can install a Docker image of the latest release.

If you can run this command successfully, your system is ready to download and
execute the image:

.. code-block:: bash

    docker run hello-world

.. list-table:: Docker images
    :widths: 30 20 50
    :header-rows: 1

    * - Image
      - Architecture
      - Description
    * - ``jbarlow83/ocrmypdf-alpine``
      - x86_64 and arm64
      - Recommended image, based on Alpine Linux.
    * - ``jbarlow83/ocrmypdf-ubuntu``
      - x86_64 and arm64
      - Alternate image, based on Ubuntu. When the Alpine image is considered
        stable and available for arm64, this image will be deprecated.
    * - ``jbarlow83/ocrmypdf``
      - x86_64 and arm64
      - Currently an alias for ``ocrmypdf-ubuntu``. When the Alpine image is
        considered stable and available for arm64, this name will point to the
        Alpine image. If you don't care about the difference between Alpine and
        Ubuntu, use this image.

To install:

.. code-block:: bash

    docker pull jbarlow83/ocrmypdf-alpine

The ``ocrmypdf`` image is also available, but is deprecated and will be removed
in the future.

OCRmyPDF will use all available CPU cores. If you are using Docker on macOS or
Windows, you may need to manually assign more resources; see the Docker
documentation for
`adjusting memory and CPU on other platforms <https://docs.docker.com/config/containers/resource_constraints/>`__.
On Linux, all resources will be available automatically.
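
If you would rather cap what a single OCR job may consume, Docker's ``--cpus``
and ``--memory`` options can be combined with OCRmyPDF's ``--jobs`` option. The
following is only a sketch: the function name ``ocr_limited``, the limits
(2 CPUs, 2 GB) and the choice of image are illustrative assumptions, not
recommendations.

.. code-block:: bash

    # Hypothetical helper: OCR stdin to stdout with capped CPU and memory.
    # The limits below are arbitrary examples; tune them for your machine.
    ocr_limited() {
        docker run --rm -i --cpus 2 --memory 2g \
            jbarlow83/ocrmypdf-alpine --jobs 2 "$@" - -
    }

It could then be used as ``ocr_limited --deskew <input.pdf >output.pdf``.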

The underlying operating system and other details in Docker images are considered
implementation details and **subject to change at minor releases**. If you are
modifying the image, you should pin the version you intend to use.
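
One way to follow this advice is to pull an explicit tag and give it a short
local name. This is a minimal sketch: the function name ``pull_pinned_ocrmypdf``
is hypothetical, and no particular tag is assumed to exist, so pass one that is
actually published for the image.

.. code-block:: bash

    # Hypothetical helper: pull one specific release and tag it locally.
    # The tag is supplied by the caller; this sketch does not assume any
    # particular tag exists.
    pull_pinned_ocrmypdf() {
        local tag="${1:?usage: pull_pinned_ocrmypdf TAG}"
        docker pull "jbarlow83/ocrmypdf-alpine:${tag}" &&
            docker tag "jbarlow83/ocrmypdf-alpine:${tag}" ocrmypdf
    }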

Using the Docker image on the command line
==========================================

**Unlike typical Docker containers**, in this section the OCRmyPDF Docker
container is ephemeral: it runs for one OCR job and terminates, just like a
command line program. We are using Docker to deliver an application (as opposed
to the more conventional case, where a Docker container runs as a server).
For that reason we usually use the ``--rm`` argument to delete the container
when it exits.

To start a Docker container (instance of the image):

.. code-block:: bash

    # Give the image a short local name
    docker tag jbarlow83/ocrmypdf-alpine ocrmypdf
    # Read the input PDF from stdin ("-") and write the result to stdout ("-")
    docker run --rm -i ocrmypdf (... all other arguments here...) - -

For convenience, create a shell alias to hide the Docker command. It is
easier to send the input file as stdin and read the output from
stdout, because **this avoids the messy permission issues with Docker entirely**.

.. code-block:: bash

    alias docker_ocrmypdf='docker run --rm -i ocrmypdf'
    docker_ocrmypdf --version  # prints the OCRmyPDF version from inside the container
    docker_ocrmypdf - - <input.pdf >output.pdf

Or in the wonderful `fish shell <https://fishshell.com/>`__:

.. code-block:: fish

    alias docker_ocrmypdf 'docker run --rm -i ocrmypdf'
    funcsave docker_ocrmypdf
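
The stdin/stdout approach also extends to several files. The sketch below
assumes the image has been tagged ``ocrmypdf`` as shown earlier; the function
name ``ocr_all_pdfs`` and the ``ocr_output`` directory are hypothetical names
chosen for illustration.

.. code-block:: bash

    # Hypothetical helper: OCR every PDF in the current directory through
    # stdin/stdout. Results go to ./ocr_output/ so the originals are untouched.
    ocr_all_pdfs() {
        mkdir -p ocr_output
        for f in ./*.pdf; do
            [ -e "$f" ] || continue   # the glob matched nothing
            docker run --rm -i ocrmypdf - - <"$f" >"ocr_output/${f##*/}"
        done
    }

Because each file starts a fresh container, one failed job does not stop the
others, although it may leave an empty or partial file in ``ocr_output``.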

Alternately, you could mount the local current working directory as a
Docker volume:

.. code-block:: bash

    alias docker_ocrmypdf='docker run --rm -i --user "$(id -u):$(id -g)" --workdir /data -v "$PWD:/data" ocrmypdf'
    docker_ocrmypdf /data/input.pdf /data/output.pdf

Podman
======

If you use `Podman <https://podman.io/>`__ (or have SELinux enabled on your
system), you may need to add ``--userns keep-id`` to the command above. Otherwise
you may get access errors, because the user inside the container is not mapped to
the same UID as on the host:

.. code-block:: bash

    alias podman_ocrmypdf='podman run --rm -i --user "$(id -u):$(id -g)" --userns keep-id --workdir /data -v "$PWD:/data" ocrmypdf'
    podman_ocrmypdf /data/input.pdf /data/output.pdf

If you use SELinux, you may additionally need to add the ``:Z``
`suffix to the volume <https://docs.podman.io/en/stable/markdown/podman-run.1.html#volume-v-source-volume-host-dir-container-dir-options>`__
or disable SELinux labeling for the container using ``--security-opt label=disable``,
which is suggested for system files as they should not be relabeled. Please refer
to the "Note" section at the end of the linked Podman documentation for details.
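
The two SELinux options could look as follows. This is a sketch that reuses
the Podman command shown above; the function names ``podman_ocrmypdf_z`` and
``podman_ocrmypdf_nolabel`` are hypothetical, and which variant is appropriate
depends on the directory you mount.

.. code-block:: bash

    # Variant 1: add the :Z suffix so Podman relabels the mounted directory.
    podman_ocrmypdf_z() {
        podman run --rm -i --user "$(id -u):$(id -g)" --userns keep-id \
            --workdir /data -v "$PWD:/data:Z" ocrmypdf "$@"
    }

    # Variant 2: leave labels alone and disable SELinux separation for this
    # container instead (suggested above for system files).
    podman_ocrmypdf_nolabel() {
        podman run --rm -i --user "$(id -u):$(id -g)" --userns keep-id \
            --security-opt label=disable \
            --workdir /data -v "$PWD:/data" ocrmypdf "$@"
    }

Either function takes the same arguments as the alias, for example
``podman_ocrmypdf_z /data/input.pdf /data/output.pdf``.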

.. _docker-lang-packs:

Adding languages to the Docker image
====================================

By default the Docker image includes English, German, Simplified Chinese,
French, Portuguese and Spanish, the most popular languages for OCRmyPDF
users based on feedback. You may add other languages by creating a new
Dockerfile based on the public one.

.. code-block:: dockerfile

    FROM jbarlow83/ocrmypdf

    # Example: add Italian (uses apt, so this applies to the Ubuntu-based image)
    # -y is needed because the build cannot answer interactive prompts
    RUN apt-get update && apt-get install -y tesseract-ocr-ita
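
To check that a language was actually added, you could build the Dockerfile
and ask Tesseract which languages it sees. This is a sketch: the function name
``build_and_list_langs`` and the default image name ``ocrmypdf-ita`` are
hypothetical, and it assumes the ``tesseract`` executable is on the ``PATH``
inside the image.

.. code-block:: bash

    # Hypothetical helper: build the Dockerfile in the current directory, then
    # list the languages Tesseract can find inside the resulting image.
    build_and_list_langs() {
        local image="${1:-ocrmypdf-ita}"
        docker build -t "$image" . &&
            docker run --rm --entrypoint tesseract "$image" --list-langs
    }

If ``ita`` appears in the list, the new image can be used like the stock one,
for example ``docker run --rm -i ocrmypdf-ita -l ita - - <input.pdf >output.pdf``.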

To install language packs (training data) such as the
`tessdata_best <https://github.com/tesseract-ocr/tessdata_best>`_ suite or
custom data, you first need to determine the version of Tesseract data files, which
may differ from the Tesseract program version. Use this command to determine the data
file version:

.. code-block:: bash

    docker run -i --rm --entrypoint /bin/ls jbarlow83/ocrmypdf /usr/share/tesseract-ocr

As of 2021, the data file version is probably ``4.00``.

You can then add new data with a Dockerfile:

.. code-block:: dockerfile

    FROM jbarlow83/ocrmypdf:{TAG}

    # Example: add a tessdata_best file
    COPY chi_tra_vert.traineddata /usr/share/tesseract-ocr/<data version>/tessdata/

When creating your own image, you should always pin a specific version of the
OCRmyPDF Docker image. This ensures that your image will not break when a new
version of OCRmyPDF is released.

Alternately, you can copy training data into a Docker container as follows:

.. code-block:: bash

    docker cp mycustomtraining.traineddata name_of_container:/usr/share/tesseract-ocr/<data version>/tessdata/
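
Keep in mind that ``docker cp`` changes only the one container it targets, so
the file is gone once that container is removed. If you want to keep the
result, one option is to commit the container as a new image. The sketch below
assumes exactly that workflow; the function name ``add_traineddata`` and all
three of its arguments are hypothetical and supplied by the caller.

.. code-block:: bash

    # Hypothetical helper: copy a traineddata file into a fresh container and
    # save the result as a new image.
    #   $1 = path to the .traineddata file
    #   $2 = tessdata directory inside the image
    #   $3 = name for the new image
    add_traineddata() {
        local file="$1" tessdata_dir="$2" new_image="$3"
        local container status
        container="$(docker create jbarlow83/ocrmypdf)" || return 1
        docker cp "$file" "${container}:${tessdata_dir}/" &&
            docker commit "$container" "$new_image"
        status=$?
        # Remove the temporary container whether or not the copy succeeded.
        docker rm "$container" >/dev/null
        return "$status"
    }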

Extending the Docker image
==========================

You can extend the Docker image with your own customizations, similar to the way
it is extended to add language packs.

Note that the Docker image is subject to change at any time. For example, the base
image may be updated to a newer version of Ubuntu or Debian. Such changes will be
noted in the release notes but may occur at minor version releases, unless they
affect the way a "casual" user of the Docker image uses it.

If you extend the Docker image, you should pin a specific version of the OCRmyPDF
Docker image.
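
One way to keep the pinned version in a single place is to pass it as a build
argument. The following is a sketch under stated assumptions: the function name
``build_pinned_extension``, the build argument ``OCRMYPDF_TAG`` and the image
name ``my-ocrmypdf`` are all hypothetical, and the function overwrites any
``Dockerfile`` in the current directory.

.. code-block:: bash

    # Hypothetical helper: write a minimal Dockerfile whose base tag is a build
    # argument, then build it against one pinned release.
    # Warning: replaces ./Dockerfile.
    build_pinned_extension() {
        local tag="${1:?usage: build_pinned_extension TAG}"
        # Single quotes keep ${OCRMYPDF_TAG} literal for Docker to expand.
        printf '%s\n' \
            'ARG OCRMYPDF_TAG' \
            'FROM jbarlow83/ocrmypdf:${OCRMYPDF_TAG}' \
            '# Add your customizations below this line.' >Dockerfile
        docker build --build-arg "OCRMYPDF_TAG=${tag}" -t "my-ocrmypdf:${tag}" .
    }

Upgrading then becomes a deliberate step: rerun the function with a newer tag
after checking the release notes.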

Executing the test suite
========================

The OCRmyPDF test suite is installed with the image. To run it:

.. code-block:: bash

    docker run --rm --entrypoint python jbarlow83/ocrmypdf -m pytest

Accessing the shell
===================

To use the shell in the Docker image:

.. code-block:: bash

    docker run -it --entrypoint sh jbarlow83/ocrmypdf

Using the OCRmyPDF web service wrapper
======================================

The OCRmyPDF Docker image includes an example, barebones HTTP web
service. The web service may be launched as follows:

.. code-block:: bash

    docker run --entrypoint python -p 5000:5000 jbarlow83/ocrmypdf webservice.py

We omit the ``--rm`` parameter so that the container will not be
automatically deleted when it exits.

This will configure the machine to listen on port 5000. On Linux machines
this is port 5000 of localhost. On macOS or Windows machines running
Docker, this is port 5000 of the virtual machine that runs your Docker
images. You can find its IP address using the command ``docker-machine ip``.

Unlike command line usage, this program will open a socket and wait for
connections.

.. warning::

    The OCRmyPDF web service wrapper is intended for demonstration or
    development. It provides no security, no authentication, no
    protection against denial of service attacks, and no load balancing.
    The default Flask WSGI server is used, which is intended for
    development only. The server is single-threaded and so can respond to
    only one client at a time. While running OCR, it cannot respond to
    any other clients.
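
Given these limitations, you may prefer to publish the port on the host's
loopback interface only, so that other machines on the network cannot reach
the service. This is a sketch that changes only the ``-p`` argument of the
command shown earlier; the function name ``run_webservice_local`` is
hypothetical.

.. code-block:: bash

    # Hypothetical helper: publish the demo web service on 127.0.0.1 only,
    # rather than on every network interface of the host.
    run_webservice_local() {
        docker run --entrypoint python -p 127.0.0.1:5000:5000 \
            jbarlow83/ocrmypdf webservice.py
    }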

Clients must keep their connection open while waiting for OCR to
complete. This may entail setting a long timeout; this interface is more
useful for internal HTTP API calls.

Unlike the rest of OCRmyPDF, this web service is licensed under the
Affero GPLv3 (AGPLv3) since Ghostscript is also licensed in this way.

In addition to the above, please read our
:ref:`general remarks on using OCRmyPDF as a service <ocr-service>`.