================
Batch processing
================

This article provides information about running OCRmyPDF on multiple
files or configuring it as a service triggered by file system events.

Batch jobs
==========

Consider using the excellent `GNU
Parallel <https://www.gnu.org/software/parallel/>`__ to apply OCRmyPDF
to multiple files at once.

Both ``parallel`` and ``ocrmypdf`` will try to use all available
processors. To maximize parallelism without overloading your system with
processes, consider using ``parallel -j 2`` to limit parallel to running
two jobs at once.

This command will run ``ocrmypdf`` on all files named ``*.pdf`` in the
current directory and write the results to the previously created
``output/`` folder. It will not search subdirectories.

The ``--tag`` argument tells parallel to print the filename as a prefix
whenever a message is printed, so that one can trace any errors to the
file that produced them.

.. code-block:: bash

   parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

OCRmyPDF automatically repairs PDFs before parsing and gathering
information from them.
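
Where GNU Parallel is not available, the same pattern can be sketched in
Python. This is a minimal sketch and not part of OCRmyPDF: the helper names
``plan_jobs``, ``run_one`` and ``run_batch`` are hypothetical, and it assumes
the ``ocrmypdf`` command is on the ``PATH``.

.. code-block:: python

   import subprocess
   from concurrent.futures import ThreadPoolExecutor
   from pathlib import Path


   def plan_jobs(input_dir, output_dir):
       """Pair each top-level *.pdf in input_dir with a path in output_dir."""
       input_dir, output_dir = Path(input_dir), Path(output_dir)
       return [(pdf, output_dir / pdf.name)
               for pdf in sorted(input_dir.glob("*.pdf"))]


   def run_one(job):
       """Run ocrmypdf on one file and return (input path, exit code)."""
       src, dst = job
       result = subprocess.run(["ocrmypdf", str(src), str(dst)])
       return src, result.returncode


   def run_batch(input_dir=".", output_dir="output", max_jobs=2):
       """OCR every PDF in input_dir, running at most max_jobs at a time."""
       Path(output_dir).mkdir(exist_ok=True)
       jobs = plan_jobs(input_dir, output_dir)
       with ThreadPoolExecutor(max_workers=max_jobs) as pool:
           for src, code in pool.map(run_one, jobs):
               # Report the file name with each result, like parallel --tag.
               print(f"{src}: exit code {code}")

Calling ``run_batch()`` from the directory that holds the PDFs mirrors the
``parallel -j 2`` command above. Threads are sufficient here because the work
happens in the ``ocrmypdf`` child processes.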

Directory trees
===============

This will walk through a directory tree and run OCR on all PDF files in
place, printing each filename before the file is processed:

.. code-block:: bash

   find . -name '*.pdf' -printf '%p\n' -exec ocrmypdf '{}' '{}' \;

Alternatively, with a Docker container (mounts a volume to the container
where the PDFs are stored):

.. code-block:: bash

   find . -name '*.pdf' -printf '%p\n' -exec docker run --rm -v <host dir>:<container dir> jbarlow83/ocrmypdf '<container dir>/{}' '<container dir>/{}' \;

This only runs one ``ocrmypdf`` process at a time. This variation uses
``find`` to create a directory list and ``parallel`` to parallelize runs
of ``ocrmypdf``, again updating files in place.

.. code-block:: bash

   find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf '{}' '{}'

In a Windows batch file, use the following. The quotes keep paths that
contain spaces intact:

.. code-block:: bat

   for /r %%f in (*.pdf) do ocrmypdf "%%f" "%%f"
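
A cross-platform alternative to the shell one-liners above can be sketched in
Python. This is only an illustration under stated assumptions: the function
names ``find_pdfs`` and ``ocr_tree_in_place`` are hypothetical, and
``ocrmypdf`` is assumed to be on the ``PATH``.

.. code-block:: python

   import subprocess
   from pathlib import Path


   def find_pdfs(root):
       """Return every file named *.pdf under root, including subdirectories."""
       return sorted(p for p in Path(root).rglob("*.pdf") if p.is_file())


   def ocr_tree_in_place(root="."):
       """Run ocrmypdf on each PDF under root, overwriting the input file."""
       failures = []
       for pdf in find_pdfs(root):
           print(pdf)  # print the file name before its messages appear
           result = subprocess.run(["ocrmypdf", str(pdf), str(pdf)])
           if result.returncode != 0:
               failures.append(pdf)
       return failures

The returned list of failed files makes it possible to retry or inspect them
after the run. Like the ``find -exec`` variant, this processes one file at a
time.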

Sample script
-------------

This user-contributed script also provides an example of batch
processing.

.. literalinclude:: ../misc/batch.py
    :caption: misc/batch.py

Synology DiskStations
---------------------

Synology DiskStations (Network Attached Storage devices) can run the
Docker image of OCRmyPDF if the Synology `Docker
package <https://www.synology.com/en-global/dsm/packages/Docker>`__ is
installed. The script below addresses particular quirks of using
OCRmyPDF on one of these devices.

This is only possible for x86-based Synology products. Some Synology
products use ARM or Power processors and do not support Docker. Further
adjustments might be needed to deal with the Synology's relatively
limited CPU and RAM.

.. literalinclude:: ../misc/synology.py
    :caption: misc/synology.py - Sample script for Synology DiskStations

Huge batch jobs
---------------

If you have thousands of files to work with, contact the author.
Consulting work related to OCRmyPDF helps fund this open source project
and all inquiries are appreciated.

Hot (watched) folders
=====================

Watched folders with Docker
---------------------------

The OCRmyPDF Docker image includes a watcher service. This service can
be launched as follows:

.. code-block:: bash

    docker run \
        -v <path to files to convert>:/input \
        -v <path to store results>:/output \
        -e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
        -e OCR_ON_SUCCESS_DELETE=1 \
        -e OCR_DESKEW=1 \
        -e PYTHONUNBUFFERED=1 \
        -it --entrypoint python3 \
        jbarlow83/ocrmypdf \
        watcher.py

This service will watch for files that match ``/input/*.pdf`` and will
convert each one to an OCRed PDF in ``/output/``. The parameters to this image are:

.. csv-table:: watcher.py parameters for Docker
    :header: "Parameter", "Description"
    :widths: 50, 50

    "``-v <path to files to convert>:/input``", "Files placed in this location will be OCRed"
    "``-v <path to store results>:/output``", "This is where OCRed files will be stored"
    "``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "This will place files in the output in {output}/{year}/{month}/{filename}"
    "``-e OCR_ON_SUCCESS_DELETE=1``", "This will delete the input file if the exit code is 0 (OK)"
    "``-e OCR_DESKEW=1``", "This will enable deskew for crooked PDFs"
    "``-e PYTHONUNBUFFERED=1``", "This will force STDOUT to be unbuffered and allow you to see messages in docker logs"

This service relies on polling to check for changes to the filesystem. It
may not be suitable for some environments, such as filesystems shared on a
slow network.
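
To illustrate what polling means here, the following is a minimal sketch of a
polling loop. It is not the implementation used by ``watcher.py``; the
function names ``snapshot``, ``new_or_changed`` and ``poll`` are hypothetical.

.. code-block:: python

   import time
   from pathlib import Path


   def snapshot(directory, pattern="*.pdf"):
       """Map each matching file to its (size, modification time)."""
       result = {}
       for path in Path(directory).glob(pattern):
           try:
               stat = path.stat()
           except FileNotFoundError:
               continue  # the file vanished between listing and stat
           result[path] = (stat.st_size, stat.st_mtime_ns)
       return result


   def new_or_changed(before, after):
       """Return files that appeared or changed between two snapshots."""
       return sorted(p for p, sig in after.items() if before.get(p) != sig)


   def poll(directory, handle, interval=1.0):
       """Call handle(path) for each new or changed file, forever."""
       seen = snapshot(directory)
       while True:
           time.sleep(interval)
           current = snapshot(directory)
           for path in new_or_changed(seen, current):
               handle(path)
           seen = current

Every pass lists the whole directory and calls ``stat`` on each file, which is
why polling can be costly on a slow network share.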

A configuration manager such as Docker Compose could be used to ensure that the
service is always available.

.. literalinclude:: ../misc/docker-compose.example.yaml
    :language: yaml
    :caption: misc/docker-compose.example.yaml

Watched folders with watcher.py
-------------------------------

The watcher service may also be run natively, without Docker:

.. code-block:: bash

    pip3 install -r requirements/watcher.txt

    env OCR_INPUT_DIRECTORY=/mnt/input-pdfs \
        OCR_OUTPUT_DIRECTORY=/mnt/output-pdfs \
        OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
        python3 watcher.py
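
The environment variables above are plain strings, so a script that consumes
them has to interpret them itself. The sketch below shows one possible
interpretation; it is an assumption for illustration, and ``watcher.py`` may
parse the values differently. The names ``env_flag`` and ``output_path`` are
hypothetical, as is the zero-padded month.

.. code-block:: python

   import os
   from datetime import date
   from pathlib import Path


   def env_flag(name, environ=None):
       """Treat 1, true, yes or y (any letter case) as an enabled flag."""
       environ = os.environ if environ is None else environ
       return environ.get(name, "").strip().lower() in {"1", "true", "yes", "y"}


   def output_path(output_dir, filename, year_month=False, today=None):
       """Build {output}/{year}/{month}/{filename} when year_month is set."""
       output_dir = Path(output_dir)
       if year_month:
           today = today or date.today()
           output_dir = output_dir / f"{today.year}" / f"{today.month:02d}"
       return output_dir / filename

For example, ``output_path("/mnt/output-pdfs", "scan.pdf",
env_flag("OCR_OUTPUT_DIRECTORY_YEAR_MONTH"))`` yields a dated subfolder only
when the variable is set.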

Watched folders with CLI
------------------------

To set up a "hot folder" that will trigger OCR for every file inserted,
use a program like Python
`watchdog <https://pypi.python.org/pypi/watchdog>`__ (supports all major
operating systems).

One could then configure a scanner to automatically place scanned files
in a hot folder, so that they will be queued for OCR and copied to the
destination.

.. code-block:: bash

   pip install watchdog

watchdog installs the command line program ``watchmedo``, which can be
told to run ``ocrmypdf`` on any .pdf added to the current directory
(``.``) and place the result in the previously created ``out/`` folder.

.. code-block:: bash

   cd hot-folder
   mkdir out
   watchmedo shell-command \
       --patterns="*.pdf" \
       --ignore-directories \
       --command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
       .  # don't forget the final dot

On file servers, you could configure watchmedo as a system service so it
will run all the time.

For more complex behavior, you can write a Python script that uses
the watchdog API. You can refer to the watcher.py script as an example.

Caveats
-------

-  ``watchmedo`` may not work properly on a networked file system,
   depending on the capabilities of the file system client and server.
-  This simple recipe does not filter for the type of file system event,
   so file copies, deletes and moves, and directory operations, will all
   be sent to ocrmypdf, producing errors in several cases. Disable your
   watched folder if you are doing anything other than copying files to
   it.
-  If the source and destination directory are the same, watchmedo may
   create an infinite loop.
-  On BSD, FreeBSD and older versions of macOS, you may need to increase
   the number of file descriptors to monitor more files, using
   ``ulimit -n 1024`` to watch a folder of up to 1024 files.
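
A custom script can avoid the event-type and infinite-loop problems listed
above by filtering events before calling ``ocrmypdf``. The following is a
minimal sketch of such a filter, not part of OCRmyPDF or watchdog. The function
name ``should_ocr`` is hypothetical, and the event type strings (``created``,
``modified``, ``moved``, ``deleted``) are assumed to follow watchdog's naming;
check them against the watchdog documentation.

.. code-block:: python

   from pathlib import Path


   def should_ocr(event_type, src_path, output_dir):
       """Decide whether a file system event should trigger ocrmypdf."""
       path = Path(src_path).resolve()
       output_dir = Path(output_dir).resolve()
       if event_type != "created":
           return False  # ignore modifications, moves and deletions
       if path.suffix.lower() != ".pdf":
           return False  # only PDF files are of interest
       if output_dir == path.parent or output_dir in path.parents:
           return False  # never re-process our own output (avoids a loop)
       return True

A file may still be incomplete when its creation event arrives, so a real
script would likely also wait until the file has stopped growing.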

Alternatives
------------

-  On Linux, `systemd user services <https://wiki.archlinux.org/index.php/Systemd/User>`__
   can be configured to automatically perform OCR on a collection of files.

-  `Watchman <https://facebook.github.io/watchman/>`__ is a more
   powerful alternative to ``watchmedo``.

macOS Automator
===============

You can use the Automator app on macOS to create a Workflow or Quick
Action. Use a *Run Shell Script* action in your workflow. In the context
of Automator, the ``PATH`` may be set differently from your Terminal's
``PATH``; you may need to explicitly set the ``PATH`` to include the
directory that contains ``ocrmypdf``. The following example may serve as
a starting point:

.. figure:: images/macos-workflow.png
    :alt: Example macOS Automator workflow

You may customize the command sent to ocrmypdf.
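
If the workflow cannot find ``ocrmypdf``, a small Python helper can show where
(or whether) the program is visible. This is a sketch under assumptions: the
function name ``find_program`` is hypothetical, and the extra directories are
common install locations that may not apply to your system.

.. code-block:: python

   import os
   import shutil


   def find_program(name, extra_dirs=("/usr/local/bin", "/opt/homebrew/bin")):
       """Search PATH plus extra_dirs for an executable called name."""
       dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
       dirs.extend(extra_dirs)
       return shutil.which(name, path=os.pathsep.join(dirs))

``find_program("ocrmypdf")`` returns the full path when the program is found
and ``None`` otherwise. The directory part of that path is what needs to be
added to the ``PATH`` in the *Run Shell Script* action.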