mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2025-12-27 06:59:12 +00:00
Begin API documentation
parent db6aa22eae
commit ed236e0c27
72
docs/api.rst
Normal file
@ -0,0 +1,72 @@
Using the OCRmyPDF API
======================

OCRmyPDF originated as a command line program and continues to have this legacy, but parts of it can be imported and used in other Python applications.

Some applications may want to consider running ocrmypdf from a subprocess call anyway, as this provides isolation of its activities.
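
As a rough sketch of the subprocess approach, the helper below builds an ``ocrmypdf`` command line and runs it with the standard library. The names ``build_command`` and ``run_cli`` are hypothetical, the sketch assumes the ``ocrmypdf`` executable is on ``PATH``, and only the ``--deskew`` and ``--jobs`` options are covered.

```python
import subprocess


def build_command(input_file, output_file, deskew=False, jobs=None):
    """Build an argument list for the ocrmypdf command line program."""
    cmd = ["ocrmypdf"]  # assumption: the executable is on PATH
    if deskew:
        cmd.append("--deskew")
    if jobs is not None:
        cmd.extend(["--jobs", str(jobs)])
    cmd.extend([str(input_file), str(output_file)])
    return cmd


def run_cli(input_file, output_file, **options):
    """Run ocrmypdf in a separate process and return its exit code."""
    completed = subprocess.run(build_command(input_file, output_file, **options))
    return completed.returncode
```

Because the work happens in another process, a crash there surfaces in the caller only as a nonzero return code.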

Example
-------

OCRmyPDF provides one high-level function to run its main engine from an application. The parameters are symmetric to the command line arguments and largely have the same functions.

.. code-block:: python

    from ocrmypdf import ocrmypdf

    ocrmypdf('input.pdf', 'output.pdf', deskew=True)

With a few exceptions, all of the command line arguments are available and may be passed as equivalent keywords.

The main differences are that ``verbose`` and ``quiet`` are not available. Instead, output should be managed by configuring logging.

Parent process requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func:`ocrmypdf.ocrmypdf` function runs OCRmyPDF similarly to command line execution. To do this, it will:

- create a monitoring thread
- create worker processes (forking itself)
- manage the signal flags of worker processes
- execute other subprocesses (forking and executing other programs)

The Python process that calls ``ocrmypdf()`` must be sufficiently privileged to perform these actions. If it is not, ``ocrmypdf()`` will fail.

There is currently no option to manage how jobs are scheduled other than the argument ``jobs=``, which limits the number of worker processes.

Forking a child process to call ``ocrmypdf()`` is suggested. That way your application will survive even if OCRmyPDF does not.
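
One way to follow this suggestion on a POSIX system is sketched below using ``os.fork``. The helper name ``run_in_child`` is hypothetical, ``job`` stands in for a call to ``ocrmypdf()``, and the sketch assumes Python 3.9 or later for ``os.waitstatus_to_exitcode``.

```python
import os
import sys
import traceback


def run_in_child(job):
    """Run job() in a forked child process and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        # Child process: it must never return into the parent's code path.
        status = 1  # reported if job() raises
        try:
            result = job()
            status = 0 if result is None else int(result)
        except BaseException:
            traceback.print_exc()
            sys.stderr.flush()
        finally:
            os._exit(status)
    # Parent process: wait for the child and decode how it ended.
    _, wait_status = os.waitpid(pid, 0)
    # Negative values mean the child was killed by that signal number.
    return os.waitstatus_to_exitcode(wait_status)
```

With this helper, a call such as ``run_in_child(lambda: ocrmypdf('input.pdf', 'output.pdf', deskew=True))`` would hand the job's integer result back to the parent, and a crash in the child would not take the parent down with it.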

Logging
^^^^^^^

OCRmyPDF will log under loggers named ``ocrmypdf``. In addition, it imports ``pdfminer`` and ``PIL``, both of which post log messages under those logging namespaces.

You can configure the logging as desired for your application or call :func:`ocrmypdf.configure_logging` to configure logging the same way OCRmyPDF itself does. The command line parameters such as ``--quiet`` and ``--verbose`` have no equivalents in the API; you must configure logging.
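
A minimal sketch of configuring these loggers yourself with the standard ``logging`` module follows. The function name ``configure_ocr_logging``, the message format, and the choice to limit ``pdfminer`` and ``PIL`` to warnings are illustrative assumptions, not a description of what :func:`ocrmypdf.configure_logging` does.

```python
import logging
import sys


def configure_ocr_logging(level=logging.INFO):
    """Send messages from the 'ocrmypdf' logger namespace to stderr."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))

    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(level)
    ocr_log.addHandler(handler)
    ocr_log.propagate = False  # avoid duplicates if the root logger also has handlers

    # Assumption: messages from these imported libraries are only wanted as warnings.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(logging.WARNING)
    return ocr_log
```

Child loggers such as ``ocrmypdf.something`` inherit the level and handler set here, so one call covers the whole namespace.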

Progress monitoring
^^^^^^^^^^^^^^^^^^^

OCRmyPDF uses the ``tqdm`` package to implement its progress bars. :func:`ocrmypdf.configure_logging` will set up logging output to ``sys.stderr`` in a way that is compatible with the display of the progress bar.

Exceptions
^^^^^^^^^^

OCRmyPDF may raise standard Python exceptions, ``ocrmypdf.exceptions.*`` exceptions, some exceptions related to multiprocessing, and ``KeyboardInterrupt``. The parent process should provide an exception handler. OCRmyPDF will clean up its temporary files and worker processes automatically when an exception occurs.

Programs that call OCRmyPDF should consider trapping ``KeyboardInterrupt`` so that they allow OCR to terminate without the whole program terminating.

When OCRmyPDF succeeds conditionally, it may return an integer exit code.
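
The handling described above might look like the following sketch. The helper name ``run_guarded`` and its status strings are hypothetical, and ``job`` stands in for a call such as ``lambda: ocrmypdf('input.pdf', 'output.pdf')``.

```python
def run_guarded(job):
    """Call job() and report how it ended instead of letting the program die."""
    try:
        # job() may return an integer exit code, or None.
        return ("finished", job())
    except KeyboardInterrupt:
        # The user interrupted the job; the calling program keeps running.
        return ("interrupted", None)
    except Exception as exc:
        # Standard, ocrmypdf-specific, or multiprocessing-related errors land here.
        return ("failed", exc)
```

Returning a status tuple is only one design choice; an application could equally log the exception and re-raise it.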

Reference
---------

.. autofunction:: ocrmypdf.ocrmypdf

.. autoclass:: ocrmypdf.Verbosity
    :members:
    :undoc-members:

.. autoclass:: ocrmypdf.ExitCode
    :members:
    :undoc-members:

.. autofunction:: ocrmypdf.configure_logging
@ -95,14 +95,6 @@ This user contributed script also provides an example of batch processing.
print("OCR complete")
logging.info(result)

API
"""

OCRmyPDF is currently supported as a command line interface. This means that even if you are using OCRmyPDF in a Python script, you should run it in a subprocess rather than importing the ocrmypdf package.

(If you find individual functions implemented in OCRmyPDF useful (such as ``ocrmypdf.pdfinfo``), you can use these if you wish to.)


Synology DiskStations
"""""""""""""""""""""

@ -27,6 +27,7 @@ PDF is the best format for storing and exchanging scanned documents. Unfortunat
   cookbook
   docker
   advanced
   api
   batch
   security
   errors
@ -95,7 +95,6 @@ Ghostscript also imposes some limitations:
Regarding OCRmyPDF itself:

* PDFs that use transparency are not currently represented in the test suite
* The Python API exported by ``import ocrmypdf`` is designed to help scripts that use OCRmyPDF but is not currently capable of running OCRmyPDF jobs due to limitations in an underlying library.

Similar programs
----------------
@ -5,8 +5,6 @@ OCRmyPDF uses `semantic versioning <http://semver.org/>`_ for its command line i

The ``ocrmypdf`` package may now be imported. The public API may be useful in scripts that launch OCRmyPDF processes or that wish to use some of its features for working with PDFs.

Unfortunately, the public API does **not** expose the ability to actually OCR a PDF. This is due to a limitation in an underlying library (ruffus) that makes OCRmyPDF non-reentrant.

Note that it is licensed under GPLv3, so scripts that ``import ocrmypdf`` and are released publicly should probably also be licensed under GPLv3.

.. Issue regex
@ -44,4 +44,4 @@ from . import hocrtransform
from . import leptonica
from . import pdfa
from . import pdfinfo
from .api import ocrmypdf
from .api import ocrmypdf, configure_logging, Verbosity
@ -22,14 +22,14 @@ import sys

from . import __version__
from .cli import parser
from .api import configure_logging
from .api import configure_logging, Verbosity
from ._jobcontext import make_logger
from ._sync import run_pipeline
from ._validation import check_closed_streams, check_options
from .exceptions import ExitCode, BadArgsError, MissingDependencyError


def main(args=None):
def run(args=None):
    options = parser.parse_args(args=args)

    if not check_closed_streams(options):
@ -44,7 +44,7 @@ def main(args=None):
    if not os.isatty(sys.stderr.fileno()):
        options.progress_bar = False
    if options.quiet:
        verbosity = -1
        verbosity = Verbosity.quiet
        options.progress_bar = False
    configure_logging(
        verbosity, progress_bar_friendly=options.progress_bar, manage_root_logger=True
@ -68,4 +68,4 @@ def main(args=None):


if __name__ == '__main__':
    sys.exit(main())
    sys.exit(run())
@ -17,6 +17,7 @@

import logging
import sys
from enum import IntEnum
from pathlib import Path

from tqdm import tqdm
@ -46,8 +47,17 @@ class TqdmConsole:
        self.file.flush()


class Verbosity(IntEnum):
    """Verbosity level for configure_logging."""

    quiet = -1  #: Suppress most messages
    default = 0  #: Default level of logging
    debug = 1  #: Output ocrmypdf debug messages
    debug_all = 2  #: More detailed debugging from ocrmypdf and dependent modules


def configure_logging(verbosity, progress_bar_friendly=True, manage_root_logger=False):
    """Set up logging
    """Set up logging.

    Library users may wish to use this function if they want their log output to be
    similar to ocrmypdf command line interface. If not used, the external application
@ -61,11 +71,7 @@ def configure_logging(verbosity, progress_bar_friendly=True, manage_root_logger=
    Library users may perform additional configuration afterwards.

    Args:
        verbosity: Verbosity level.
            * `-1`: Quiet
            * `0`: Default
            * `1`: Output ocrmypdf debug messages
            * `2`: More detailed debugging from ocrmypdf and dependent modules
        verbosity (Verbosity): Verbosity level.
        progress_bar_friendly (bool): Install the TqdmConsole log handler, which is
            compatible with the tqdm progress bar; without this log messages will
            overwrite the progress bar
@ -176,7 +182,34 @@ def ocrmypdf( # pylint: disable=unused-argument
    user_patterns=None,
    keep_temporary_files=None,
    progress_bar=None,
    process_ocr_image=None,
):
    """Run OCRmyPDF on one PDF or image.

    Raises:
        ocrmypdf.PdfMergeFailedError: If the input PDF is malformed, preventing merging
            with the OCR layer.
        ocrmypdf.MissingDependencyError: If a required dependency program is missing or
            was not found on PATH.
        ocrmypdf.UnsupportedImageFormatError: If the input file type was an image that
            could not be read, or some other file type that is not a PDF.
        ocrmypdf.DpiError: If the input file is an image, but the resolution of the
            image is not credible (allowing it to proceed would cause poor OCR).
        ocrmypdf.OutputFileAccessError: If an attempt to write to the intended output
            file failed.
        ocrmypdf.PriorOcrFoundError: If the input PDF seems to have OCR or digital
            text already, and settings did not tell us to proceed.
        ocrmypdf.InputFileError: Any other problem with the input file.
        ocrmypdf.SubprocessOutputError: Any error related to executing a subprocess.
        ocrmypdf.EncryptedPdfError: If the input PDF is encrypted (password protected).
            OCRmyPDF does not remove passwords.
        ocrmypdf.TesseractConfigError: If Tesseract reported its configuration was not
            valid.

    Returns:
        :class:`ocrmypdf.ExitCode`
    """

    options = create_options(**locals())
    check_options(options)
    return run_pipeline(options, api=True)