======================
Using the OCRmyPDF API
======================

OCRmyPDF originated as a command line program and retains that legacy, but
parts of it can be imported and used in other Python applications.

Some applications may still prefer to run ``ocrmypdf`` through a subprocess
call, because this isolates its activities from the calling application.

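A minimal sketch of the subprocess approach, assuming the package can be
started with ``python -m ocrmypdf`` in the current environment. The helper
names ``build_ocrmypdf_command`` and ``run_ocrmypdf`` are hypothetical and not
part of OCRmyPDF.

.. code-block:: python

    import subprocess
    import sys


    def build_ocrmypdf_command(input_pdf, output_pdf, deskew=False):
        """Build the argument list for running OCRmyPDF as a separate program."""
        # Assumption: launching the package as a module of the current
        # interpreter, so the example does not rely on PATH.
        command = [sys.executable, '-m', 'ocrmypdf']
        if deskew:
            command.append('--deskew')
        command.extend([str(input_pdf), str(output_pdf)])
        return command


    def run_ocrmypdf(input_pdf, output_pdf, deskew=False):
        """Run OCRmyPDF in its own process and return its exit code."""
        command = build_ocrmypdf_command(input_pdf, output_pdf, deskew=deskew)
        # check=False: a non-zero exit code is returned rather than raised
        completed = subprocess.run(command, check=False)
        return completed.returncode

A crash or hang in the OCR run then affects only the subprocess, and the
caller can decide what to do based on the returned exit code.
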
Example
=======

OCRmyPDF provides one high-level function to run its main engine from an
application. The parameters are symmetric to the command line arguments
and largely have the same functions.

.. code-block:: python

    import ocrmypdf

    if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
        ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

With a few exceptions, all of the command line arguments are available
and may be passed as equivalent keywords.

The exceptions include ``verbose`` and ``quiet``, which are not available.
Instead, output should be managed by configuring logging.

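As an illustration of the correspondence between command line arguments and
keywords, the sketch below assumes the usual convention that a long option
such as ``--rotate-pages`` becomes the keyword ``rotate_pages``. The helper
``cli_option_to_keyword`` is hypothetical, and each resulting name should be
checked against the reference for :func:`ocrmypdf.ocr`.

.. code-block:: python

    def cli_option_to_keyword(option):
        """Translate a long option such as '--rotate-pages' to 'rotate_pages'."""
        if not option.startswith('--'):
            raise ValueError(f'expected a long option, got {option!r}')
        # Assumed convention: strip the leading dashes, turn '-' into '_'
        return option[2:].replace('-', '_')


    # Keywords equivalent to: ocrmypdf --deskew --rotate-pages --jobs 2 ...
    ocr_keywords = {
        cli_option_to_keyword('--deskew'): True,
        cli_option_to_keyword('--rotate-pages'): True,
        cli_option_to_keyword('--jobs'): 2,
    }

The resulting dictionary could then be passed along as
``ocrmypdf.ocr('input.pdf', 'output.pdf', **ocr_keywords)``.
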
Parent process requirements
---------------------------

The :func:`ocrmypdf.ocr` function runs OCRmyPDF much as command line
execution does. To do this, it will:

- create a monitoring thread
- create worker processes (on Linux, by forking itself; on Windows and macOS,
  by spawning)
- manage the signal flags of its worker processes
- execute other subprocesses (forking and executing other programs)

The Python process that calls ``ocrmypdf.ocr()`` must be sufficiently
privileged to perform these actions.

There is currently no option to manage how jobs are scheduled other
than the argument ``jobs=``, which limits the number of worker
processes.

Creating a child process to call ``ocrmypdf.ocr()`` is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF fails for any reason.

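One way to create such a child process is with the standard
``multiprocessing`` module, as in the sketch below. The function names
``ocr_in_child``, ``describe_exit`` and ``run_isolated`` are hypothetical, and
a real application would probably poll the child instead of blocking on it.

.. code-block:: python

    import multiprocessing


    def ocr_in_child(input_pdf, output_pdf):
        """Entry point of the child process: run OCR on a single file."""
        # Imported here so that only the child process loads OCRmyPDF
        import ocrmypdf

        ocrmypdf.ocr(input_pdf, output_pdf, deskew=True)


    def describe_exit(exitcode):
        """Summarize the ``exitcode`` attribute of a multiprocessing.Process."""
        if exitcode is None:
            return 'still running'
        if exitcode == 0:
            return 'succeeded'
        if exitcode < 0:
            # multiprocessing reports termination by signal N as -N
            return f'killed by signal {-exitcode}'
        return f'failed with exit code {exitcode}'


    def run_isolated(input_pdf, output_pdf):
        """Run OCR in a child process and report how it ended."""
        child = multiprocessing.Process(
            target=ocr_in_child, args=(input_pdf, output_pdf)
        )
        child.start()
        child.join()  # an interactive program would poll child.is_alive()
        return describe_exit(child.exitcode)

``run_isolated('input.pdf', 'output.pdf')`` should itself be called from
inside an ``if __name__ == '__main__'`` block so that the example also
behaves on platforms that spawn rather than fork.
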
Programs that call ``ocrmypdf.ocr()`` should also install a SIGBUS signal
handler (except on Windows) that raises an exception if access to a
memory-mapped file fails, because OCRmyPDF may use memory mapping.

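A minimal sketch of such a handler, using only the standard ``signal``
module. The exception class ``MemoryMapAccessError`` and the function names
are hypothetical; an application may prefer to raise one of its own exception
types.

.. code-block:: python

    import signal


    class MemoryMapAccessError(RuntimeError):
        """Hypothetical exception for lost access to a memory-mapped file."""


    def raise_on_sigbus(signum, frame):
        """Signal handler: turn SIGBUS into a Python exception."""
        raise MemoryMapAccessError('Lost access to a memory-mapped file')


    def install_sigbus_handler():
        """Install the handler where SIGBUS exists; return True if installed."""
        if not hasattr(signal, 'SIGBUS'):  # for example, on Windows
            return False
        # signal.signal() may only be called from the main thread
        signal.signal(signal.SIGBUS, raise_on_sigbus)
        return True

Calling ``install_sigbus_handler()`` once at startup, before
``ocrmypdf.ocr()``, is enough; the return value tells the caller whether the
platform supports the signal at all.
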
.. warning::

    On Windows and macOS, the script that calls ``ocrmypdf.ocr()`` must be
    protected by an "ifmain" guard (``if __name__ == '__main__'``). Without
    this guard, process semantics will prevent OCRmyPDF from working
    correctly.

Logging
-------

OCRmyPDF logs under loggers named ``ocrmypdf``. In addition, it
imports ``pdfminer`` and ``PIL``, both of which post log messages under
their own logging namespaces.

You can configure logging as desired for your application, or call
:func:`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. Command line parameters such as ``--quiet``
and ``--verbose`` have no equivalents in the API; you must use the
provided configuration function or configure logging in a way that suits
your use case.

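If you configure logging yourself, the standard ``logging`` module is
sufficient. The sketch below is one possible setup, not what
:func:`ocrmypdf.configure_logging` does internally; the function name
``configure_ocr_logging`` and the message format are assumptions.

.. code-block:: python

    import logging
    import sys


    def configure_ocr_logging(level=logging.INFO):
        """Send OCRmyPDF messages to stderr and quiet its dependencies."""
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter('%(name)s %(levelname)s: %(message)s')
        )

        ocr_log = logging.getLogger('ocrmypdf')
        ocr_log.setLevel(level)
        ocr_log.addHandler(handler)

        # pdfminer and PIL log under their own namespaces;
        # drop anything below WARNING from them
        for name in ('pdfminer', 'PIL'):
            logging.getLogger(name).setLevel(logging.WARNING)
        return ocr_log

Passing ``logging.DEBUG`` plays a role similar to a verbose run, and
``logging.ERROR`` is close to a quiet one.
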
Progress monitoring
-------------------

OCRmyPDF uses the ``tqdm`` package to implement its progress bars.
:func:`ocrmypdf.configure_logging` will set up logging output to
``sys.stderr`` in a way that is compatible with the display of the
progress bar. Use ``ocrmypdf.ocr(..., progress_bar=False)`` to disable
the progress bar.

Exceptions
----------

OCRmyPDF may throw standard Python exceptions, ``ocrmypdf.exceptions.*``
exceptions, some exceptions related to multiprocessing, and
``KeyboardInterrupt``. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.

Programs that call OCRmyPDF should consider trapping ``KeyboardInterrupt``
so that OCR can be terminated without the whole program terminating.

When OCRmyPDF succeeds conditionally, it returns an integer exit code.

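The exception handling described in this section could be sketched as
follows. The wrapper ``run_ocr_safely``, its ``(status, value)`` return
convention, and the ``myapp`` logger name are hypothetical; the wrapper takes
the function to call as an argument, so it works with ``ocrmypdf.ocr`` or any
other callable.

.. code-block:: python

    import logging

    log = logging.getLogger('myapp')  # hypothetical application logger


    def run_ocr_safely(ocr_callable, *args, **kwargs):
        """Call ``ocr_callable`` and return ``(status, value)`` without raising."""
        try:
            return 'finished', ocr_callable(*args, **kwargs)
        except KeyboardInterrupt:
            # Ctrl+C stops this OCR job only; the application keeps running
            log.warning('OCR interrupted by user')
            return 'interrupted', None
        except Exception as exc:
            # Standard exceptions, ocrmypdf.exceptions.* and multiprocessing
            # errors all end up here
            log.error('OCR failed: %s', exc)
            return 'failed', exc

A call such as ``run_ocr_safely(ocrmypdf.ocr, 'input.pdf', 'output.pdf')``
then yields the exit code as the second element when the status is
``'finished'``.
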
Reference
---------

.. autofunction:: ocrmypdf.ocr

.. autoclass:: ocrmypdf.Verbosity
    :members:
    :undoc-members:

.. autofunction:: ocrmypdf.configure_logging