======================
Using the OCRmyPDF API
======================
OCRmyPDF originated as a command line program and retains that
heritage, but parts of it can be imported and used in other Python
applications.

Some applications may still prefer to run ``ocrmypdf`` through a
subprocess call, as this isolates its activities from the calling
process.

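As a minimal sketch of the subprocess approach, the following builds an
``ocrmypdf`` command line and runs it with the standard library's
``subprocess`` module. The helper names ``build_command`` and
``run_ocrmypdf`` are hypothetical, and the sketch assumes that the
``ocrmypdf`` executable is on ``PATH``. Only the ``--deskew`` option is
wired up here.

```python
import subprocess


def build_command(input_file, output_file, deskew=False):
    """Return the argument list for one ocrmypdf command line run."""
    cmd = ["ocrmypdf"]  # assumption: the executable is found on PATH
    if deskew:
        cmd.append("--deskew")
    cmd += [str(input_file), str(output_file)]
    return cmd


def run_ocrmypdf(input_file, output_file, deskew=False):
    """Run ocrmypdf in a separate process and return its exit code."""
    # check=False: a failed OCR run is reported through the return code
    # rather than by raising CalledProcessError in the caller.
    # FileNotFoundError is still raised if the executable is missing.
    result = subprocess.run(
        build_command(input_file, output_file, deskew=deskew), check=False
    )
    return result.returncode
```

Calling ``run_ocrmypdf('input.pdf', 'output.pdf', deskew=True)`` returns
the exit code of the child process, so a crash inside OCRmyPDF cannot
take the calling application down with it.
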

Example
=======

OCRmyPDF provides one high-level function to run its main engine from an
application. The parameters mirror the command line arguments
and largely have the same functions.

.. code-block:: python

    import ocrmypdf

    ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

With a few exceptions, all of the command line arguments are available
and may be passed as equivalent keywords.

One difference is that ``verbose`` and ``quiet`` are not available.
Instead, output should be managed by configuring logging.

Parent process requirements
---------------------------
The :func:`ocrmypdf.ocr` function runs OCRmyPDF similarly to command line
execution. To do this, it will:

- create a monitoring thread
- create worker processes (forking itself)
- manage the signal flags of worker processes
- execute other subprocesses (forking and executing other programs)

The Python process that calls ``ocrmypdf.ocr()`` must be sufficiently
privileged to perform these actions. If it is not, ``ocrmypdf.ocr()`` will
fail.

There is currently no option to manage how jobs are scheduled other
than the argument ``jobs=``, which limits the number of worker
processes.

Forking a child process to call ``ocrmypdf.ocr()`` is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF does not.

.. warning::

    On Windows, the script that calls ``ocrmypdf.ocr()`` must be protected
    by an "ifmain" guard (``if __name__ == '__main__'``) or you must use
    ``ocrmypdf.ocr(..., use_threads=True)``. If you do not take at least one
    of these steps, Windows process creation semantics will prevent
    OCRmyPDF from working correctly.

Logging
-------
OCRmyPDF logs under loggers named ``ocrmypdf``. In addition, it
imports ``pdfminer`` and ``PIL``, both of which post log messages under
their own logging namespaces.

You can configure logging as desired for your application or call
:func:`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. Command line parameters such as ``--quiet``
and ``--verbose`` have no equivalents in the API; you must use the
provided configuration function or configure logging in a way that suits
your use case.

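For applications that prefer to configure logging themselves, a minimal
sketch using only the standard ``logging`` module might look like the
following. The function name ``configure_ocr_logging``, the message
format, and the chosen levels are illustrative assumptions; this is not
what :func:`ocrmypdf.configure_logging` does internally.

```python
import logging


def configure_ocr_logging(level=logging.INFO, library_level=logging.WARNING):
    """Attach a stderr handler to the ``ocrmypdf`` logger namespace."""
    handler = logging.StreamHandler()  # writes to sys.stderr by default
    handler.setFormatter(
        logging.Formatter("%(name)s %(levelname)s: %(message)s")
    )

    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(level)
    ocr_log.addHandler(handler)

    # pdfminer and PIL log under their own namespaces; raise their
    # thresholds so that only warnings and errors pass through.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(library_level)

    return ocr_log
```

Calling ``configure_ocr_logging(level=logging.DEBUG)`` before
``ocrmypdf.ocr()`` would then play roughly the role that ``--verbose``
plays on the command line, while leaving the rest of the application's
logging untouched.
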

Progress monitoring
-------------------

OCRmyPDF uses the ``tqdm`` package to implement its progress bars.
:func:`ocrmypdf.configure_logging` will set up logging output to
``sys.stderr`` in a way that is compatible with the display of the
progress bar. Use ``ocrmypdf.ocr(..., progress_bar=False)`` to disable
the progress bar.

Exceptions
----------
OCRmyPDF may raise standard Python exceptions, ``ocrmypdf.exceptions.*``
exceptions, some exceptions related to multiprocessing, and
``KeyboardInterrupt``. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.

Programs that call OCRmyPDF should consider trapping ``KeyboardInterrupt``
so that OCR can be terminated without the whole program terminating.

When OCRmyPDF succeeds conditionally, it returns an integer exit code.

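One way to structure such a handler is sketched below. The wrapper
``run_guarded`` and its status strings are hypothetical and not part of
OCRmyPDF; it accepts any zero-argument callable, so the OCR call is
passed in rather than hard-coded. The sketch assumes that the
library-specific exceptions derive from ``Exception``.

```python
import logging

log = logging.getLogger(__name__)


def run_guarded(job):
    """Call ``job()`` and report the outcome without crashing the caller.

    Returns a ``(status, value)`` tuple whose status is "ok",
    "interrupted", or "failed".
    """
    try:
        return "ok", job()
    except KeyboardInterrupt:
        # KeyboardInterrupt does not derive from Exception, so it needs
        # its own clause; trapping it keeps the application alive.
        log.warning("OCR interrupted by user")
        return "interrupted", None
    except Exception as exc:
        # Covers standard exceptions and any others derived from Exception.
        log.error("OCR failed: %s", exc)
        return "failed", exc
```

An application could call
``run_guarded(lambda: ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True))``
and inspect the returned status; on "ok", the second element holds
whatever ``ocrmypdf.ocr()`` returned.
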

Reference
---------

.. autofunction:: ocrmypdf.ocr

.. autoclass:: ocrmypdf.Verbosity
    :members:
    :undoc-members:

.. autofunction:: ocrmypdf.configure_logging

.. automodule:: ocrmypdf.exceptions
    :members:
    :undoc-members: