=====================
OCRmyPDF Docker image
=====================

OCRmyPDF is also available in a Docker image that packages recent
versions of all dependencies.

For users who already have Docker installed this may be an easy and
convenient option. However, it is less performant than a system
installation and may require Docker engine configuration.

OCRmyPDF needs a generous amount of RAM, CPU cores, and temporary storage
space, whether running in a Docker container or on its own. It may be
necessary to ensure the container is provisioned with additional
resources.

.. _docker-install:

Installing the Docker image
===========================

If you have `Docker <https://docs.docker.com/>`__ installed on your
system, you can install a Docker image of the latest release.

If you can run this command successfully, your system is ready to download and
execute the image:

.. code-block:: bash

    docker run hello-world

The recommended OCRmyPDF Docker image is currently named
``jbarlow83/ocrmypdf``:

.. code-block:: bash

    docker pull jbarlow83/ocrmypdf

OCRmyPDF will use all available CPU cores. By default, the VirtualBox
machine instance on Windows and macOS has only a single CPU core
enabled. Use the VirtualBox Manager to determine the name of your Docker
engine host, and then follow these optional steps to enable multiple
CPUs:

.. code-block:: bash

    # Optional steps for users whose Docker engine runs in a VirtualBox VM
    docker-machine stop "yourVM"
    VBoxManage modifyvm "yourVM" --cpus 2  # or however many cores are desired
    docker-machine start "yourVM"
    eval $(docker-machine env "yourVM")

See the Docker documentation for
`adjusting memory and CPU on other platforms <https://docs.docker.com/config/containers/resource_constraints/>`__.

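As a rough sketch of setting per-container resources, the shell function
below wraps ``docker run`` with the ``--cpus`` and ``--memory`` options and
passes a matching ``--jobs`` value to OCRmyPDF. The function name
``ocrmypdf_limited``, the ``DOCKER`` override variable, and the specific
limits are assumptions chosen for illustration, not recommendations:

.. code-block:: bash

    # Hypothetical wrapper: cap the container at 2 CPUs and 2g of memory,
    # and ask OCRmyPDF to run a matching number of worker jobs.
    # Set DOCKER to use a different container CLI (defaults to "docker").
    ocrmypdf_limited() {
        ${DOCKER:-docker} run --rm -i --cpus 2 --memory 2g \
            jbarlow83/ocrmypdf --jobs 2 "$@"
    }

For example, ``ocrmypdf_limited --version`` runs the image under these
limits. These options only cap what a container may use; they cannot provide
more than the Docker engine or its virtual machine has available.
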
Using the Docker image on the command line
==========================================

**Unlike typical Docker containers**, the OCRmyPDF Docker container used in
this mode is intended to be ephemeral: it runs for one OCR job and then
terminates, just like a command line program. We are using Docker as a way
of delivering an application, not a server.

To start a Docker container (instance of the image):

.. code-block:: bash

    docker tag jbarlow83/ocrmypdf ocrmypdf
    docker run --rm -i ocrmypdf (... all other arguments here...)

For convenience, create a shell alias to hide the Docker command. It is
easier to send the input file to stdin and read the output from stdout,
which avoids the occasionally messy permission issues with Docker
entirely. Pass ``-`` for both the input and output file arguments to tell
OCRmyPDF to use stdin and stdout.

.. code-block:: bash

    alias ocrmypdf='docker run --rm -i ocrmypdf'
    ocrmypdf --version  # reports the version inside the Docker image
    ocrmypdf - - <input.pdf >output.pdf

Or in the wonderful `fish shell <https://fishshell.com/>`__:

.. code-block:: fish

    alias ocrmypdf 'docker run --rm -i ocrmypdf'
    funcsave ocrmypdf

Alternatively, you can mount the current working directory as a Docker
volume:

.. code-block:: bash

    docker run --rm -v "$(pwd):/data" ocrmypdf /data/input.pdf /data/output.pdf

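Building on the volume mount, a minimal sketch for processing every PDF in
the current directory might look like the following. The function name
``ocr_all``, the ``ocr`` output directory, and the ``DOCKER`` override
variable are hypothetical; the image is assumed to be tagged ``ocrmypdf``
as above:

.. code-block:: bash

    # Hypothetical helper: OCR each PDF in the current directory into ./ocr/.
    # Set DOCKER to use a different container CLI (defaults to "docker").
    ocr_all() {
        mkdir -p ocr
        for f in *.pdf; do
            # Skip the literal pattern when no PDF files are present
            [ -e "$f" ] || continue
            ${DOCKER:-docker} run --rm -v "$(pwd):/data" ocrmypdf \
                "/data/$f" "/data/ocr/$f"
        done
    }

Each file gets its own short-lived container, consistent with the
one-job-per-container model described in this section. Output files are
written by the container's user, so their ownership on the host may differ
from that of the input files.
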
.. _docker-lang-packs:

Adding languages to the Docker image
====================================

By default the Docker image includes English, German and Simplified
Chinese, the most popular languages for OCRmyPDF users based on
feedback. You may add other languages by creating a new Dockerfile based
on the public one:

.. code-block:: dockerfile

    FROM jbarlow83/ocrmypdf

    # Add French; -y is needed because image builds are non-interactive
    RUN apt-get update && apt-get install -y tesseract-ocr-fra

You can also copy training data to ``/usr/share/tesseract-ocr/<tesseract version>/tessdata``.

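Assuming the Dockerfile above is saved in the current directory, the derived
image could be built and used roughly as follows. The tag ``ocrmypdf-fra``,
the function names, and the ``DOCKER`` override variable are hypothetical;
``-l fra`` asks OCRmyPDF to use the French language pack:

.. code-block:: bash

    # Hypothetical helper: build the derived image from ./Dockerfile.
    build_ocrmypdf_fra() {
        ${DOCKER:-docker} build -t ocrmypdf-fra .
    }

    # Hypothetical helper: run one OCR job in the derived image with French.
    ocrmypdf_fra() {
        ${DOCKER:-docker} run --rm -i ocrmypdf-fra -l fra "$@"
    }

After running ``build_ocrmypdf_fra`` once, a call such as
``ocrmypdf_fra - - <input.pdf >output.pdf`` would process a French document
through stdin and stdout.
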
Executing the test suite
========================

The OCRmyPDF test suite is installed with the image. To run it:

.. code-block:: bash

    docker run --entrypoint python3 jbarlow83/ocrmypdf -m pytest

Accessing the shell
===================

To use the bash shell in the Docker image:

.. code-block:: bash

    docker run -it --entrypoint bash jbarlow83/ocrmypdf

Using the OCRmyPDF web service wrapper
======================================

The OCRmyPDF Docker image includes an example, barebones HTTP web
service. The web service may be launched as follows:

.. code-block:: bash

    docker run --entrypoint python3 -p 5000:5000 jbarlow83/ocrmypdf webservice.py

This will configure the machine to listen on port 5000. On Linux machines
this is port 5000 of localhost. On macOS or Windows machines running
Docker, this is port 5000 of the virtual machine that runs your Docker
images. You can find its IP address using the command ``docker-machine ip``.

Unlike command line usage, this program will open a socket and wait for
connections.

.. warning::

    The OCRmyPDF web service wrapper is intended for demonstration or
    development. It provides no security, no authentication, no
    protection against denial of service attacks, and no load balancing.
    The default Flask WSGI server is used, which is intended for
    development only. The server is single-threaded and so can respond to
    only one client at a time. While running OCR, it cannot respond to
    any other clients.

Clients must keep their connection open while waiting for OCR to
complete. This may entail setting a long timeout; this interface is more
useful for internal HTTP API calls.

Unlike the rest of OCRmyPDF, this web service is licensed under the
Affero GPLv3 (AGPLv3) since Ghostscript, a dependency of OCRmyPDF, is
also licensed in this way.

In addition to the above, please read our
:ref:`general remarks on using OCRmyPDF as a service <ocr-service>`.