=====================
OCRmyPDF Docker image
=====================

OCRmyPDF is also available in a Docker image that packages recent
versions of all dependencies.

For users who already have Docker installed this may be an easy and
convenient option. However, it is less performant than a system
installation and may require Docker engine configuration.

OCRmyPDF needs a generous amount of RAM, CPU cores, and temporary storage
space, whether running in a Docker container or on its own. It may be
necessary to ensure the container is provisioned with additional
resources.

.. _docker-install:

Installing the Docker image
===========================

If you have `Docker <https://docs.docker.com/>`__ installed on your
system, you can install a Docker image of the latest release.

If you can run this command successfully, your system is ready to download and
execute the image:

.. code-block:: bash

   docker run hello-world

The recommended OCRmyPDF Docker image is currently named ``jbarlow83/ocrmypdf``:

.. code-block:: bash

   docker pull jbarlow83/ocrmypdf


OCRmyPDF will use all available CPU cores. By default, the VirtualBox
machine instance on Windows and macOS has only a single CPU core
enabled. Use the VirtualBox Manager to determine the name of your Docker
engine host, and then follow these optional steps to enable multiple
CPUs:

.. code-block:: bash

   # Optional step for macOS users
   docker-machine stop "yourVM"
   VBoxManage modifyvm "yourVM" --cpus 2  # or whatever number of cores is desired
   docker-machine start "yourVM"
   eval $(docker-machine env "yourVM")

See the Docker documentation for
`adjusting memory and CPU on other platforms <https://docs.docker.com/config/containers/resource_constraints/>`__.
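
Where the Docker engine exposes resource flags directly, limits can also
be set per container. The following is a minimal sketch rather than a
tested recommendation: ``--cpus`` and ``--memory`` are standard
``docker run`` options, while the function name ``ocrmypdf_limited`` and
the values ``2`` and ``2g`` are placeholders chosen for illustration.

.. code-block:: bash

   # Hypothetical wrapper: run one OCR job with explicit CPU and memory limits.
   # The values 2 and 2g are illustrative; tune them for your documents.
   ocrmypdf_limited() {
       docker run --rm -i --cpus 2 --memory 2g jbarlow83/ocrmypdf "$@"
   }

   # Example (streams the PDF through stdin and stdout):
   #   ocrmypdf_limited - - <input.pdf >output.pdf

These flags only cap a container; they cannot grant more cores or memory
than the Docker engine host (or its virtual machine) provides.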

Using the Docker image on the command line
==========================================

**Unlike typical Docker containers**, the OCRmyPDF Docker container is
intended to be ephemeral in this mode: it runs for one OCR job and then
terminates, just like a command line program. We are using Docker as a
way of delivering an application, not a server.

To start a Docker container (instance of the image):

.. code-block:: bash

   docker tag jbarlow83/ocrmypdf ocrmypdf
   docker run --rm -i ocrmypdf (... all other arguments here...)

For convenience, create a shell alias to hide the Docker command. It is
easier to send the input file as stdin and read the output from
stdout; this avoids the occasionally messy permission issues with
Docker entirely.

.. code-block:: bash

   alias ocrmypdf='docker run --rm -i ocrmypdf'
   ocrmypdf --version  # runs the ocrmypdf inside the Docker image
   ocrmypdf - - <input.pdf >output.pdf  # "-" selects stdin and stdout

Or in the wonderful `fish shell <https://fishshell.com/>`__:

.. code-block:: fish

   alias ocrmypdf 'docker run --rm -i ocrmypdf'
   funcsave ocrmypdf

Alternatively, you can mount the current working directory as a
Docker volume:

.. code-block:: bash

   docker run --rm -v "$(pwd):/data" ocrmypdf /data/input.pdf /data/output.pdf
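
A mounted volume is where the permission issues mentioned above tend to
appear, because files written by the container may be owned by a
different user than the one who invoked Docker. One possible workaround,
sketched below, is to run the container under the caller's user and group
IDs with the standard ``--user`` option of ``docker run``. The function
name ``ocrmypdf_here`` is hypothetical, and the sketch assumes the image
works when run as a non-root user, which this document does not confirm.

.. code-block:: bash

   # Hypothetical helper: OCR a file in the current directory via a volume mount.
   # Assumes the local image tag "ocrmypdf" created with `docker tag` above.
   ocrmypdf_here() {
       docker run --rm \
           --user "$(id -u):$(id -g)" \
           -v "$(pwd):/data" \
           ocrmypdf "/data/$1" "/data/$2"
   }

   # Example:
   #   ocrmypdf_here input.pdf output.pdf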

.. _docker-lang-packs:

Adding languages to the Docker image
====================================

By default the Docker image includes English, German and Simplified
Chinese, the most popular languages for OCRmyPDF users based on
feedback. You may add other languages by creating a new Dockerfile based
on the public one:

.. code-block:: dockerfile

   FROM jbarlow83/ocrmypdf

   # Add French
   # Refresh the package index and install non-interactively (-y)
   RUN apt-get update && apt-get install -y tesseract-ocr-fra

You can also copy training data to ``/usr/share/tesseract-ocr/<tesseract version>/tessdata``.
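
To put such a Dockerfile to use, the derived image has to be built and
then selected at run time. A rough sketch follows; ``docker build -t``
and OCRmyPDF's ``-l`` language option are existing options, whereas the
tag ``ocrmypdf-fra`` and both function names are made up for this
example.

.. code-block:: bash

   # Build the derived image from a Dockerfile in the given directory
   # (defaults to the current directory). The tag name is hypothetical.
   build_ocrmypdf_fra() {
       docker build -t ocrmypdf-fra "${1:-.}"
   }

   # Run OCR with French selected, passing all other arguments through.
   ocrmypdf_fra() {
       docker run --rm -i ocrmypdf-fra -l fra "$@"
   }

   # Example:
   #   build_ocrmypdf_fra
   #   ocrmypdf_fra - - <input.pdf >output.pdf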

Executing the test suite
========================

The OCRmyPDF test suite is installed with the image. To run it:

.. code-block:: bash

   docker run --entrypoint python3 jbarlow83/ocrmypdf -m pytest

Accessing the shell
===================

To use the bash shell in the Docker image:

.. code-block:: bash

   docker run -it --entrypoint bash jbarlow83/ocrmypdf

Using the OCRmyPDF web service wrapper
======================================

The OCRmyPDF Docker image includes an example of a barebones HTTP web
service. The web service may be launched as follows:

.. code-block:: bash

   docker run --entrypoint python3 -p 5000:5000 jbarlow83/ocrmypdf webservice.py

This will configure the machine to listen on port 5000. On Linux machines
this is port 5000 of localhost. On macOS or Windows machines running
Docker, this is port 5000 of the virtual machine that runs your Docker
images. You can find its IP address using the command ``docker-machine ip``.

Unlike command line usage, this program will open a socket and wait for
connections.

.. warning::

   The OCRmyPDF web service wrapper is intended for demonstration or
   development. It provides no security, no authentication, no
   protection against denial of service attacks, and no load balancing.
   The default Flask WSGI server is used, which is intended for
   development only. The server is single-threaded and so can respond to
   only one client at a time. While running OCR, it cannot respond to
   any other clients.

Clients must keep their connection open while waiting for OCR to
complete. This may entail setting a long timeout; this interface is more
useful for internal HTTP API calls.

Unlike the rest of OCRmyPDF, this web service is licensed under the
Affero GPLv3 (AGPLv3) since Ghostscript, a dependency of OCRmyPDF, is
also licensed in this way.

In addition to the above, please read our
:ref:`general remarks on using OCRmyPDF as a service <ocr-service>`.