======================
Using the OCRmyPDF API
======================
OCRmyPDF originated as a command line program and retains that design,
but parts of it can be imported and used in other Python applications.

Some applications may still prefer to run ``ocrmypdf`` through a
subprocess call, since this isolates its activity from the calling program.
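
The subprocess approach can be sketched as follows, assuming the ``ocrmypdf``
package is installed for the interpreter that runs the script. The helper
names ``build_command`` and ``run_ocrmypdf`` are hypothetical and not part of
OCRmyPDF; ``--deskew`` is the command line counterpart of the ``deskew=True``
keyword.

.. code-block:: python

    import subprocess
    import sys


    def build_command(input_pdf, output_pdf, deskew=False):
        """Build an argument list that runs OCRmyPDF as ``python -m ocrmypdf``."""
        command = [sys.executable, '-m', 'ocrmypdf']
        if deskew:
            command.append('--deskew')
        command += [str(input_pdf), str(output_pdf)]
        return command


    def run_ocrmypdf(input_pdf, output_pdf, deskew=False):
        """Run OCRmyPDF in a separate process and return its exit code."""
        completed = subprocess.run(
            build_command(input_pdf, output_pdf, deskew=deskew),
            check=False,  # inspect the exit code instead of raising
        )
        return completed.returncode

Calling ``run_ocrmypdf('input.pdf', 'output.pdf', deskew=True)`` returns the
exit code of the command line program, where zero indicates success.
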

Example
=======

OCRmyPDF provides one high-level function to run its main engine from an
application. Its parameters mirror the command line arguments and largely
have the same meanings.

.. code-block:: python

    import ocrmypdf

    if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
        ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

With some exceptions, all of the command line arguments are available
and may be passed as equivalent keywords.

One difference is that ``verbose`` and ``quiet`` are not available.
Instead, output should be managed by configuring logging.

Parent process requirements
---------------------------
The :func:`ocrmypdf.ocr` function runs OCRmyPDF similarly to command line
execution. To do this, it will:

- create a monitoring thread
- create worker processes (on Linux, forking itself; on Windows and macOS, by
  spawning)
- manage the signal flags of its worker processes
- execute other subprocesses (forking and executing other programs)

The Python process that calls :func:`ocrmypdf.ocr()` must be sufficiently
privileged to perform these actions.

There is currently no option to manage how jobs are scheduled other
than the argument ``jobs=``, which limits the number of worker
processes.
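
As a rough sketch of choosing a value for ``jobs=``, the hypothetical helper
``pick_jobs`` below leaves some CPU cores free for the rest of the
application. Reserving one core is an arbitrary assumption, not an OCRmyPDF
recommendation.

.. code-block:: python

    import os


    def pick_jobs(reserved_cores=1):
        """Return a worker count that leaves ``reserved_cores`` cores unused."""
        available = os.cpu_count() or 1  # cpu_count() may return None
        return max(1, available - reserved_cores)

The result can then be passed as ``ocrmypdf.ocr(..., jobs=pick_jobs())``.
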

Creating a child process to call :func:`ocrmypdf.ocr()` is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF fails for any reason.
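
One possible way to do this with the standard ``multiprocessing`` module is
sketched below. The names ``_ocr_worker`` and ``ocr_in_child`` are
hypothetical, and the ``spawn`` start method is an assumption chosen so that
the sketch behaves the same on every platform.

.. code-block:: python

    import multiprocessing


    def _ocr_worker(input_pdf, output_pdf, options):
        """Run in the child; an uncaught exception yields a nonzero exit code."""
        import ocrmypdf  # imported here so that only the child loads OCRmyPDF

        ocrmypdf.ocr(input_pdf, output_pdf, **options)


    def ocr_in_child(input_pdf, output_pdf, **options):
        """Run ocrmypdf.ocr() in a child process; return True on a clean exit."""
        context = multiprocessing.get_context('spawn')
        child = context.Process(
            target=_ocr_worker, args=(input_pdf, output_pdf, options)
        )
        child.start()
        child.join()  # blocks; a real application might poll instead
        return child.exitcode == 0

Because the child is spawned, ``ocr_in_child('input.pdf', 'output.pdf',
deskew=True)`` should itself be called from under an
``if __name__ == '__main__'`` guard. This sketch reports only success or
failure; any value returned by :func:`ocrmypdf.ocr()` is discarded.
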

Programs that call :func:`ocrmypdf.ocr()` should also install a SIGBUS signal
handler (except on Windows), to raise an exception if access to a memory
mapped file fails. OCRmyPDF may use memory mapping.
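
A minimal sketch of such a handler, using only the standard ``signal``
module, might look like this. The exception class ``MemoryMapAccessError``
and the function names are hypothetical; an application may prefer to raise
one of its own exception types.

.. code-block:: python

    import signal


    class MemoryMapAccessError(Exception):
        """Hypothetical exception raised when SIGBUS is received."""


    def _raise_on_sigbus(signum, frame):
        raise MemoryMapAccessError('Lost access to a memory mapped file')


    def install_sigbus_handler():
        """Install the handler where SIGBUS exists; return True if installed."""
        if not hasattr(signal, 'SIGBUS'):  # for example, on Windows
            return False
        signal.signal(signal.SIGBUS, _raise_on_sigbus)
        return True

Python only allows signal handlers to be installed from the main thread, so
``install_sigbus_handler()`` should be called early in program startup.
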

:func:`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
in the same Python interpreter process. OCRmyPDF is not thread-safe, because of how
its plugins and Python's library import system work. If you need to parallelize
OCRmyPDF, use processes.

.. warning::

    On Windows and macOS, the script that calls :func:`ocrmypdf.ocr()` must be
    protected by an "ifmain" guard (``if __name__ == '__main__'``). Without
    this guard, process semantics will prevent OCRmyPDF from working correctly.

.. warning::

    On macOS with Python 3.7, you must call
    ``multiprocessing.set_start_method("spawn")``. Without this, multiprocessing
    will be unstable. From the command line, OCRmyPDF does this automatically,
    but as an API user you must do this yourself. See Python bpo-33725 for details.
    Python 3.8 and later also resolve this automatically.

Logging
-------
OCRmyPDF will log under loggers named ``ocrmypdf``. In addition, it
imports ``pdfminer`` and ``PIL``, both of which post log messages under
those logging namespaces.
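
For instance, an application could set levels on these loggers with the
standard ``logging`` module. The helper name ``set_ocr_log_levels`` and the
default levels below are illustrative assumptions rather than OCRmyPDF
defaults.

.. code-block:: python

    import logging


    def set_ocr_log_levels(ocrmypdf_level=logging.INFO,
                           library_level=logging.WARNING):
        """Set levels on the loggers that OCRmyPDF and its imports write to."""
        logging.getLogger('ocrmypdf').setLevel(ocrmypdf_level)
        for name in ('pdfminer', 'PIL'):
            logging.getLogger(name).setLevel(library_level)

This only adjusts levels. The application still needs to attach its own
handlers, for example with ``logging.basicConfig()``.
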

You can configure logging as desired for your application, or call
:func:`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. Command line parameters such as ``--quiet``
and ``--verbose`` have no equivalents in the API; you must use the
provided configuration function or configure logging in a way that suits
your use case.

Progress monitoring
-------------------

OCRmyPDF uses the ``tqdm`` package to implement its progress bars.
:func:`ocrmypdf.configure_logging` will set up logging output to
``sys.stderr`` in a way that is compatible with the display of the
progress bar. Use ``ocrmypdf.ocr(..., progress_bar=False)`` to disable
the progress bar.

Exceptions
----------
OCRmyPDF may raise standard Python exceptions, ``ocrmypdf.exceptions.*``
exceptions, some exceptions related to multiprocessing, and
:exc:`KeyboardInterrupt`. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.
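
A generic exception handler of this kind could be sketched as below. The
wrapper name ``run_ocr_safely`` is hypothetical, and it takes the function to
call as an argument so that the sketch does not depend on any particular
OCRmyPDF exception class. It assumes the ``ocrmypdf.exceptions.*`` classes
derive from :exc:`Exception`.

.. code-block:: python

    def run_ocr_safely(ocr_func, *args, **kwargs):
        """Call ``ocr_func`` and return ``(result, error_message)``."""
        try:
            return ocr_func(*args, **kwargs), None
        except KeyboardInterrupt:
            # KeyboardInterrupt is not a subclass of Exception, so it needs
            # its own clause to keep the calling program alive.
            return None, 'interrupted'
        except Exception as exc:
            return None, f'{type(exc).__name__}: {exc}'

A caller would use it as
``result, error = run_ocr_safely(ocrmypdf.ocr, 'input.pdf', 'output.pdf')``
and check ``error`` before using ``result``.
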

Programs that call OCRmyPDF should consider trapping :exc:`KeyboardInterrupt`
so that OCR can terminate without the whole program terminating.

When OCRmyPDF succeeds conditionally, it returns an integer exit code.

Reference
---------

.. autofunction:: ocrmypdf.ocr

.. autoclass:: ocrmypdf.Verbosity
    :members:
    :undoc-members:

.. autofunction:: ocrmypdf.configure_logging