=====================
OCRmyPDF Docker image
=====================

OCRmyPDF is also available in a Docker image that packages recent
versions of all dependencies.

For users who already have Docker installed, this may be an easy and
convenient option. However, it is less performant than a system
installation and may require Docker engine configuration.

OCRmyPDF needs a generous amount of RAM, CPU cores, and temporary storage
space, whether running in a Docker container or on its own. It may be
necessary to ensure the container is provisioned with additional
resources.
					
						
.. _docker-install:

Installing the Docker image
===========================

If you have `Docker <https://docs.docker.com/>`__ installed on your
system, you can install a Docker image of the latest release.

If you can run this command successfully, your system is ready to download and
execute the image:

.. code-block:: bash

   docker run hello-world

The recommended OCRmyPDF Docker image is currently named ``jbarlow83/ocrmypdf``:

.. code-block:: bash

   docker pull jbarlow83/ocrmypdf

OCRmyPDF will use all available CPU cores. By default, the VirtualBox
machine instance on Windows and macOS has only a single CPU core
enabled. Use the VirtualBox Manager to determine the name of your Docker
engine host, and then follow these optional steps to enable multiple
CPUs:

.. code-block:: bash

   # Optional steps for macOS users
   docker-machine stop "yourVM"
   VBoxManage modifyvm "yourVM" --cpus 2  # or however many cores you want
   docker-machine start "yourVM"
   eval $(docker-machine env "yourVM")

See the Docker documentation for
`adjusting memory and CPU on other platforms <https://docs.docker.com/config/containers/resource_constraints/>`__.
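
As a minimal sketch of provisioning resources for a single job, the
``docker run`` options ``--cpus`` and ``--memory`` set per-container limits.
The helper name ``ocrmypdf_limited`` and the values shown are hypothetical
and only illustrate the idea; choose limits that suit your machine:

.. code-block:: bash

   # Hypothetical helper: run one OCR job with explicit CPU and memory limits.
   # The limits (2 CPUs, 2 GiB) are illustrative values, not recommendations.
   ocrmypdf_limited() {
       docker run --rm -i --cpus 2 --memory 2g jbarlow83/ocrmypdf "$@"
   }

For example, ``ocrmypdf_limited --version`` runs the image under those
limits. When the Docker engine itself runs inside a virtual machine, a
container cannot use more than the virtual machine has been given.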
					
						
Using the Docker image on the command line
==========================================

**Unlike typical Docker containers**, the OCRmyPDF Docker container is
intended to be ephemeral in this mode: it runs for one
OCR job and then terminates, just like a command line program. We are
using Docker as a way of delivering an application, not a server.

To start a Docker container (instance of the image):

.. code-block:: bash

   docker tag jbarlow83/ocrmypdf ocrmypdf
   docker run --rm -i ocrmypdf (... all other arguments here...)

For convenience, create a shell alias to hide the Docker command. It is
easier to send the input file to stdin and read the output from
stdout, which avoids the occasionally messy permission issues with
Docker entirely.

.. code-block:: bash

   alias ocrmypdf='docker run --rm -i ocrmypdf'
   ocrmypdf --version  # reports the version of OCRmyPDF in the image
   ocrmypdf - - <input.pdf >output.pdf  # "-" means read stdin, write stdout

Or in the wonderful `fish shell <https://fishshell.com/>`__:

.. code-block:: fish

   alias ocrmypdf 'docker run --rm -i ocrmypdf'
   funcsave ocrmypdf

Alternatively, you could mount the local current working directory as a
Docker volume:

.. code-block:: bash

   docker run --rm -v "$(pwd):/data" ocrmypdf /data/input.pdf /data/output.pdf
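
The stdin and stdout approach also extends to a whole directory. The sketch
below is an illustration only: the function name ``ocr_all`` and the
``_ocr.pdf`` output suffix are hypothetical, and it assumes the image has
been tagged as ``ocrmypdf``. Each file is passed through stdin and stdout
(the two ``-`` arguments), so no volume needs to be mounted:

.. code-block:: bash

   # Hypothetical helper: OCR every PDF in a directory, one container per file.
   ocr_all() {
       local f
       for f in "${1:-.}"/*.pdf; do
           [ -e "$f" ] || continue                    # no PDFs in the directory
           case "$f" in *_ocr.pdf) continue ;; esac   # skip earlier outputs
           # "-" "-" tells OCRmyPDF to read stdin and write stdout
           if ! docker run --rm -i ocrmypdf - - <"$f" >"${f%.pdf}_ocr.pdf"; then
               echo "OCR failed for $f" >&2
               rm -f "${f%.pdf}_ocr.pdf"              # discard the partial output
           fi
       done
   }

Calling ``ocr_all ~/scans`` would write ``name_ocr.pdf`` next to each
``name.pdf``. Starting a fresh container per file matches the ephemeral,
one-job-per-container usage described in this section.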
					
						
.. _docker-lang-packs:

Adding languages to the Docker image
====================================

By default the Docker image includes English, German and Simplified
Chinese, the most popular languages for OCRmyPDF users based on
feedback. You may add other languages by creating a new Dockerfile based
on the public one:

.. code-block:: dockerfile

   FROM jbarlow83/ocrmypdf

   # Add French
   RUN apk add tesseract-ocr-data-fra

You can also copy training data to ``/usr/share/tessdata``.
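
To use a Dockerfile like the one above, build it under a new tag and select
the language when running. This is a sketch under stated assumptions: the
tag ``ocrmypdf-fra`` and both helper names are hypothetical, and the
Dockerfile is assumed to be in the current directory. ``-l`` is OCRmyPDF's
language option:

.. code-block:: bash

   # Build the customized image from the Dockerfile in the current directory.
   # "ocrmypdf-fra" is a hypothetical tag chosen for this example.
   build_ocrmypdf_fra() {
       docker build -t ocrmypdf-fra .
   }

   # Run one OCR job with French selected; remaining arguments pass through.
   ocrmypdf_fra() {
       docker run --rm -i ocrmypdf-fra -l fra "$@"
   }

Run ``build_ocrmypdf_fra`` once, then, for example,
``ocrmypdf_fra - - <input.pdf >output.pdf``.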
					
						
Executing the test suite
========================

The OCRmyPDF test suite is installed with the image. To run it:

.. code-block:: bash

   docker run --entrypoint python3 jbarlow83/ocrmypdf -m pytest

Accessing the shell
===================

To use the bash shell in the Docker image:

.. code-block:: bash

   docker run -it --entrypoint bash jbarlow83/ocrmypdf
					
						
Using the OCRmyPDF web service wrapper
======================================

The OCRmyPDF Docker image includes an example, barebones HTTP web
service. The web service may be launched as follows:

.. code-block:: bash

   docker run --entrypoint python3 -p 5000:5000 jbarlow83/ocrmypdf webservice.py

This will configure the machine to listen on port 5000. On Linux machines
this is port 5000 of localhost. On macOS or Windows machines running
Docker, this is port 5000 of the virtual machine that runs your Docker
images. You can find its IP address using the command ``docker-machine ip``.

Unlike command line usage, this program will open a socket and wait for
connections.

.. warning::

   The OCRmyPDF web service wrapper is intended for demonstration or
   development. It provides no security, no authentication, no
   protection against denial of service attacks, and no load balancing.
   The default Flask WSGI server is used, which is intended for
   development only. The server is single-threaded and so can respond to
   only one client at a time. While running OCR, it cannot respond to
   any other clients.

Clients must keep their connection open while waiting for OCR to
complete. This may entail setting a long timeout; this interface is more
useful for internal HTTP API calls.

Unlike the rest of OCRmyPDF, this web service is licensed under the
Affero GPLv3 (AGPLv3) since Ghostscript, a dependency of OCRmyPDF, is
also licensed in this way.

In addition to the above, please read our
:ref:`general remarks on using OCRmyPDF as a service <ocr-service>`.