.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0

======================
Using the OCRmyPDF API
======================

OCRmyPDF originated as a command line program and retains that design,
but parts of it can be imported and used in other Python applications.

Some applications may still prefer to run ``ocrmypdf`` in a subprocess,
since this isolates its activities from the calling application.

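As a minimal sketch of the subprocess approach, the following runs the
command line program through ``python -m ocrmypdf`` and returns its exit
status. The helper names ``build_command`` and ``run_isolated`` are
hypothetical and not part of OCRmyPDF, and the sketch assumes OCRmyPDF is
installed for the interpreter that runs it.

.. code-block:: python

    import subprocess
    import sys


    def build_command(input_pdf, output_pdf, *extra_args):
        """Return the argument list for one command line run (hypothetical helper)."""
        return [sys.executable, '-m', 'ocrmypdf', *extra_args, input_pdf, output_pdf]


    def run_isolated(input_pdf, output_pdf, *extra_args):
        """Run OCRmyPDF in a separate process and return its exit status."""
        command = build_command(input_pdf, output_pdf, *extra_args)
        # check=False: a failed run is reported through the return code
        # instead of raising in the calling application.
        result = subprocess.run(command, check=False)
        return result.returncode

For instance, ``run_isolated('input.pdf', 'output.pdf', '--deskew')`` would
pass ``--deskew`` through to the command line program.
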
Example
=======

OCRmyPDF provides one high-level function to run its main engine from an
application. The parameters mirror the command line arguments and largely
have the same meaning.

.. code-block:: python

    import ocrmypdf

    if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
        ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

With some exceptions, all of the command line arguments are available
and may be passed as equivalent keywords.

The main differences are that ``verbose`` and ``quiet`` are not available.
Instead, output should be managed by configuring logging.

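The keyword style can be sketched as follows. This is an illustration
rather than a complete list of options: ``OCR_OPTIONS`` and ``run_ocr`` are
hypothetical names, and the keywords shown (``deskew``, ``jobs``,
``progress_bar``) are the ones mentioned elsewhere in this document.

.. code-block:: python

    # Keyword equivalents of ``--deskew`` and ``--jobs 2``; the progress bar is
    # switched off on the assumption that the caller is not interactive.
    OCR_OPTIONS = {'deskew': True, 'jobs': 2, 'progress_bar': False}


    def run_ocr(input_pdf, output_pdf, **overrides):
        """Call ocrmypdf.ocr() with OCR_OPTIONS, letting the caller override them."""
        # Imported lazily so this module can be imported without OCRmyPDF.
        import ocrmypdf

        options = {**OCR_OPTIONS, **overrides}
        return ocrmypdf.ocr(input_pdf, output_pdf, **options)

A script would call ``run_ocr('input.pdf', 'output.pdf')`` from under an
``if __name__ == '__main__'`` guard, as in the example above.
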
Parent process requirements
---------------------------

The :func:`ocrmypdf.ocr` function runs OCRmyPDF much as command line
execution does. To do this, it will:

- create worker processes or threads
- manage the signal flags of its worker processes
- execute other subprocesses (forking and executing other programs)

The Python process that calls :func:`ocrmypdf.ocr()` must be sufficiently
privileged to perform these actions.

There is currently no option to manage how jobs are scheduled, other
than the ``jobs=`` argument, which limits the number of worker
processes.

Creating a child process to call :func:`ocrmypdf.ocr()` is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF fails for any reason. For example:

.. code-block:: python

    from multiprocessing import Process

    import ocrmypdf

    def ocrmypdf_process():
        ocrmypdf.ocr('input.pdf', 'output.pdf')

    def call_ocrmypdf_from_my_app():
        p = Process(target=ocrmypdf_process)
        p.start()
        p.join()

Programs that call :func:`ocrmypdf.ocr()` should also install a SIGBUS signal
handler (except on Windows), to raise an exception if access to a
memory-mapped file fails. OCRmyPDF may use memory mapping.

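One possible shape for such a handler, using only the standard library
``signal`` module, is sketched below. ``MemoryMapAccessError``,
``raise_on_sigbus`` and ``install_sigbus_handler`` are hypothetical names
chosen for this illustration. The sketch must run in the main thread,
because Python only allows signal handlers to be installed there.

.. code-block:: python

    import signal


    class MemoryMapAccessError(RuntimeError):
        """Hypothetical exception raised when a memory-mapped access fails."""


    def raise_on_sigbus(signum, frame):
        """Turn SIGBUS into a Python exception."""
        raise MemoryMapAccessError('SIGBUS: access to a memory-mapped file failed')


    def install_sigbus_handler():
        """Install the handler where SIGBUS exists; return True if installed."""
        if not hasattr(signal, 'SIGBUS'):  # for example, on Windows
            return False
        signal.signal(signal.SIGBUS, raise_on_sigbus)
        return True

Calling ``install_sigbus_handler()`` once at startup, before the first call
to :func:`ocrmypdf.ocr()`, is sufficient in this sketch.
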
:func:`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
in the same Python interpreter process. The function is not thread-safe, because of how
OCRmyPDF's plugins and Python's library import system work. If you need to parallelize
OCRmyPDF, use processes.

.. warning::

    On Windows and macOS, the script that calls :func:`ocrmypdf.ocr()` must be
    protected by an "ifmain" guard (``if __name__ == '__main__'``). If you do
    not do this, process semantics will prevent OCRmyPDF from working
    correctly.

Logging
-------

OCRmyPDF will log under loggers named ``ocrmypdf``. In addition, it
imports ``pdfminer`` and ``PIL``, both of which post log messages under
those logging namespaces.

You can configure the logging as desired for your application or call
:func:`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. The command line parameters such as ``--quiet``
and ``--verbose`` have no equivalents in the API; you must use the
provided configuration function or do configuration in a way that suits
your use case.

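If you prefer to configure logging yourself, a minimal sketch using only
the standard ``logging`` module might look like this. The function name
``configure_app_logging``, the message format and the chosen levels are
assumptions made for illustration; only the logger names come from the
text above.

.. code-block:: python

    import logging


    def configure_app_logging(level=logging.INFO):
        """Send ocrmypdf messages to stderr and quieten pdfminer and PIL."""
        handler = logging.StreamHandler()  # writes to sys.stderr by default
        handler.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))

        ocr_log = logging.getLogger('ocrmypdf')
        ocr_log.setLevel(level)
        ocr_log.addHandler(handler)

        # These libraries log under their own namespaces; keep only
        # warnings and errors from them.
        for name in ('pdfminer', 'PIL'):
            logging.getLogger(name).setLevel(logging.WARNING)
        return ocr_log

Passing ``logging.DEBUG`` gives roughly what ``--verbose`` is used for on
the command line, and ``logging.ERROR`` roughly what ``--quiet`` is used
for, although the exact correspondence is not specified here.
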
Progress monitoring
-------------------

OCRmyPDF uses the ``rich`` package to implement its progress bars.
:func:`ocrmypdf.configure_logging` will set up logging output to
``sys.stderr`` in a way that is compatible with the display of the
progress bar. Use ``ocrmypdf.ocr(..., progress_bar=False)`` to disable
the progress bar.

Standard output
---------------

OCRmyPDF is strict about not writing to standard output so that
users can safely use it in a pipeline and produce a valid output
file. A calling application must also avoid writing to standard output
if it wants to remain compatible with this behavior and support piping
to a file.

Exceptions
----------

OCRmyPDF may raise standard Python exceptions, ``ocrmypdf.exceptions.*``
exceptions, some exceptions related to multiprocessing, and
:exc:`KeyboardInterrupt`. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.

When OCRmyPDF succeeds conditionally, it returns an integer exit code.

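A parent-process exception handler could be sketched as below. The sketch
makes several assumptions: that OCRmyPDF's own exceptions derive from
``ocrmypdf.exceptions.ExitCodeException`` and carry an ``exit_code``
attribute, and that the return value of :func:`ocrmypdf.ocr` converts to
``int``. ``safe_ocr`` is a hypothetical helper, and the status ``1`` used
for unexpected errors is arbitrary.

.. code-block:: python

    import logging

    log = logging.getLogger(__name__)


    def safe_ocr(input_pdf, output_pdf, **options):
        """Run OCR and return an integer status instead of raising."""
        # Imported lazily so this module can be imported without OCRmyPDF.
        import ocrmypdf
        from ocrmypdf.exceptions import ExitCodeException

        try:
            return int(ocrmypdf.ocr(input_pdf, output_pdf, **options))
        except ExitCodeException as exc:
            # Assumption: OCRmyPDF's own exceptions expose ``exit_code``.
            log.error('OCRmyPDF failed: %s', exc)
            return int(exc.exit_code)
        except KeyboardInterrupt:
            log.warning('OCR interrupted by the user')
            raise
        except Exception:
            log.exception('Unexpected error while running OCRmyPDF')
            return 1  # arbitrary non-zero status chosen for this sketch

:exc:`KeyboardInterrupt` is re-raised so that the application can decide
how to shut down; the other branches convert failures into a status the
caller can inspect.
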
Reference
---------

.. autofunction:: ocrmypdf.ocr

.. autoclass:: ocrmypdf.Verbosity
    :members:
    :undoc-members:

.. autofunction:: ocrmypdf.configure_logging