======================
Using the OCRmyPDF API
======================

OCRmyPDF originated as a command line program and continues to have this
legacy, but parts of it can be imported and used in other Python
applications.

Some applications may want to consider running ocrmypdf from a
subprocess call anyway, as this provides isolation of its activities.

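
The subprocess approach could look roughly like the sketch below. It
assumes the ``ocrmypdf`` executable is on ``PATH``; the helper names
``build_command`` and ``run_ocrmypdf`` are hypothetical and not part of
OCRmyPDF.

.. code-block:: python

    import subprocess


    def build_command(input_pdf, output_pdf, deskew=False):
        """Build the argument list for the ``ocrmypdf`` command line program."""
        command = ["ocrmypdf"]
        if deskew:
            command.append("--deskew")
        command += [str(input_pdf), str(output_pdf)]
        return command


    def run_ocrmypdf(input_pdf, output_pdf, deskew=False):
        """Run OCRmyPDF in a separate process and return its exit status.

        Raises FileNotFoundError if the ``ocrmypdf`` executable is not found.
        """
        # check=False: failure is reported through the return code, not raised
        completed = subprocess.run(
            build_command(input_pdf, output_pdf, deskew), check=False
        )
        return completed.returncode

A caller would then inspect the value returned by
``run_ocrmypdf('input.pdf', 'output.pdf', deskew=True)``; a nonzero status
means the run did not complete normally, and the calling application is
unaffected either way.
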
Example
=======

OCRmyPDF provides one high-level function to run its main engine from an
application. The parameters are symmetric to the command line arguments
and largely have the same functions.

.. code-block:: python

    import ocrmypdf

    ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

With a few exceptions, all of the command line arguments are available
and may be passed as equivalent keywords.

Among the differences, ``verbose`` and ``quiet`` are not available.
Instead, output should be managed by configuring logging.

Parent process requirements
---------------------------

The :func:`ocrmypdf.ocr` function runs OCRmyPDF similarly to command line
execution. To do this, it will:

- create a monitoring thread
- create worker processes (forking itself)
- manage the signal flags of worker processes
- execute other subprocesses (forking and executing other programs)

The Python process that calls ``ocrmypdf.ocr()`` must be sufficiently
privileged to perform these actions. If it is not, ``ocrmypdf.ocr()`` will
fail.

There is currently no option to manage how jobs are scheduled other
than the ``jobs=`` argument, which limits the number of worker
processes.

Forking a child process to call ``ocrmypdf.ocr()`` is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF does not.

.. warning::

    On Windows, the script that calls ``ocrmypdf.ocr()`` must be protected
    by an "ifmain" guard (``if __name__ == '__main__'``) or you must use
    ``ocrmypdf.ocr(...use_threads=True)``. If you do not take at least one
    of these steps, Windows process semantics will prevent OCRmyPDF from
    working correctly.

Logging
-------

OCRmyPDF will log under loggers named ``ocrmypdf``. In addition, it
imports ``pdfminer`` and ``PIL``, both of which post log messages under
those logging namespaces.

You can configure the logging as desired for your application or call
:func:`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. The command line parameters such as ``--quiet``
and ``--verbose`` have no equivalents in the API; you must use the
provided configuration function or do configuration in a way that suits
your use case.

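
For applications that prefer to configure logging themselves, a minimal
sketch using only the standard ``logging`` module might look like this.
The function name ``configure_ocr_logging``, the message format, and the
chosen levels are assumptions for illustration; only the logger names
``ocrmypdf``, ``pdfminer`` and ``PIL`` come from the text above.

.. code-block:: python

    import logging
    import sys


    def configure_ocr_logging(level=logging.INFO, dependency_level=logging.WARNING):
        """Route ``ocrmypdf`` log records to stderr and quiet its dependencies."""
        ocr_logger = logging.getLogger("ocrmypdf")
        ocr_logger.setLevel(level)

        # Attach a handler only once so repeated calls do not duplicate output.
        if not ocr_logger.handlers:
            handler = logging.StreamHandler(sys.stderr)
            handler.setFormatter(
                logging.Formatter("%(name)s: %(levelname)s: %(message)s")
            )
            ocr_logger.addHandler(handler)

        # Keep records from also being emitted by root logger handlers.
        ocr_logger.propagate = False

        # pdfminer and PIL log under their own namespaces.
        for name in ("pdfminer", "PIL"):
            logging.getLogger(name).setLevel(dependency_level)

        return ocr_logger

Because this plain stream handler is not aware of the progress bar, it is
probably best paired with ``ocrmypdf.ocr(...progress_bar=False)``.
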
Progress monitoring
-------------------

OCRmyPDF uses the ``tqdm`` package to implement its progress bars.
:func:`ocrmypdf.configure_logging` will set up logging output to
``sys.stderr`` in a way that is compatible with the display of the
progress bar. Use ``ocrmypdf.ocr(...progress_bar=False)`` to disable
the progress bar.

Exceptions
----------

OCRmyPDF may raise standard Python exceptions, ``ocrmypdf.exceptions.*``
exceptions, some exceptions related to multiprocessing, and
``KeyboardInterrupt``. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.

Programs that call OCRmyPDF should consider trapping ``KeyboardInterrupt``
so that they allow OCR to terminate without the whole program terminating.

When OCRmyPDF succeeds conditionally, it returns an integer exit code.

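
One way a parent process might wrap the call is sketched below. The
wrapper ``run_guarded`` and its status strings are hypothetical; it takes
the function to call as a parameter so that it does not assume anything
about OCRmyPDF beyond what is described above.

.. code-block:: python

    import logging

    log = logging.getLogger(__name__)


    def run_guarded(ocr_callable, *args, **kwargs):
        """Call ``ocr_callable`` and report how it ended.

        Returns ``("ok", result)``, ``("interrupted", None)`` or
        ``("failed", exception)``.
        """
        try:
            return "ok", ocr_callable(*args, **kwargs)
        except KeyboardInterrupt:
            # Trap Ctrl+C so only the OCR job stops, not the whole program.
            log.warning("OCR interrupted; the application keeps running")
            return "interrupted", None
        except Exception as exc:
            # Covers standard exceptions and library-specific ones alike.
            log.error("OCR failed: %s", exc)
            return "failed", exc

In an application this might be called as
``run_guarded(ocrmypdf.ocr, 'input.pdf', 'output.pdf', deskew=True)``, with
the returned value checked for a conditional-success exit code when the
status is ``"ok"``.
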
Reference
---------

.. autofunction:: ocrmypdf.ocr

.. autoclass:: ocrmypdf.Verbosity
    :members:
    :undoc-members:

.. autofunction:: ocrmypdf.configure_logging

.. automodule:: ocrmypdf.exceptions
    :members:
    :undoc-members: