From 10aadefd6a2ddd7fd64328511af36dc6d05feeb2 Mon Sep 17 00:00:00 2001
From: "James R. Barlow"
Date: Sat, 14 Apr 2018 00:18:58 -0700
Subject: [PATCH] Document return codes

---
 docs/advanced.rst | 84 +++++++++++++++++++++++++++++++++++++++--------
 docs/cookbook.rst |  1 -
 2 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/docs/advanced.rst b/docs/advanced.rst
index 5f592079..3ae0c0b9 100644
--- a/docs/advanced.rst
+++ b/docs/advanced.rst
@@ -27,8 +27,8 @@ If you want to adjust the amount of time spent on OCR, change ``--tesseract-time
 
 .. code-block:: bash
 
-	# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
-	ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
+    # Allow 300 seconds for OCR; skip any page larger than 50 megapixels
+    ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
 
 Overriding default tesseract
 """"""""""""""""""""""""""""
@@ -39,20 +39,20 @@ Some relevant environment variables that influence Tesseract's behavior include:
 
 .. envvar:: TESSDATA_PREFIX
 
-	Overrides the path to Tesseract's data files. This can allow simultaneous installation of the "best" and "fast" training data sets. OCRmyPDF does not manage this environment variable.
+    Overrides the path to Tesseract's data files. This can allow simultaneous installation of the "best" and "fast" training data sets. OCRmyPDF does not manage this environment variable.
 
 .. envvar:: OMP_THREAD_LIMIT
 
-	Controls the number of threads Tesseract will use. OCRmyPDF will manage this environment if it is not already set. (Currently, it will set it to 1 because this gives the best results in testing.)
+    Controls the number of threads Tesseract will use. OCRmyPDF will manage this environment if it is not already set. (Currently, it will set it to 1 because this gives the best results in testing.)
 
 For example, if you are testing tesseract 4.00 and don't wish to use an existing tesseract 3.04 installation, you can launch OCRmyPDF as follows:
 
 .. code-block:: bash
 
-	env \
-		PATH=/home/user/src/tesseract4/api:$PATH \
-		TESSDATA_PREFIX=/home/user/src/tesseract4 \
-		ocrmypdf --tesseract-oem 2 input.pdf output.pdf
+    env \
+        PATH=/home/user/src/tesseract4/api:$PATH \
+        TESSDATA_PREFIX=/home/user/src/tesseract4 \
+        ocrmypdf --tesseract-oem 2 input.pdf output.pdf
 
 In this example ``TESSDATA_PREFIX`` directs Tesseract 4.0 to use LSTM training data. ``--tesseract-oem 1`` requests tesseract 4.0's new LSTM engine. (Tesseract 4.0 only.)
 
@@ -80,19 +80,19 @@ Create a file named "no-dict.cfg" with these contents:
 
 ::
 
-	load_system_dawg 0
-	language_model_penalty_non_dict_word 0
-	language_model_penalty_non_freq_dict_word 0
+    load_system_dawg 0
+    language_model_penalty_non_dict_word 0
+    language_model_penalty_non_freq_dict_word 0
 
 then run ocrmypdf as follows (along with any other desired arguments):
 
 .. code-block:: bash
 
-	ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
+    ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
 
 .. warning::
 
-	Some combinations of control parameters will break Tesseract or break assumptions that OCRmyPDF makes about Tesseract's output.
+    Some combinations of control parameters will break Tesseract or break assumptions that OCRmyPDF makes about Tesseract's output.
 
 
 Changing the PDF renderer
@@ -136,4 +136,60 @@ The ``tesseract`` renderer creates a PDF with the image and text layers precompo
 
 If a PDF created with this renderer using Tesseract versions older than 3.05.00 is then passed through Ghostscript's pdfwrite feature, the OCR text *may* be corrupted. The ``--output-type=pdfa`` argument will produce a warning in this situation.
 
-*This renderer is deprecated and will be removed whenever support for older versions of Tesseract is dropped.*
\ No newline at end of file
+*This renderer is deprecated and will be removed whenever support for older versions of Tesseract is dropped.*
+
+
+Return code policy
+------------------
+
+OCRmyPDF writes all messages to ``stderr``. ``stdout`` is reserved for piping
+output files. ``stdin`` is reserved for piping input files.
+
+The return codes generated by OCRmyPDF are considered part of the stable
+user interface.
+
+.. list-table:: Return codes
+    :widths: 5 35 60
+    :header-rows: 1
+
+    * - Code
+      - Name
+      - Interpretation
+    * - 0
+      - ``ocrmypdf.exceptions.ExitCode.ok``
+      - Everything worked as expected.
+    * - 1
+      - ``ocrmypdf.exceptions.ExitCode.bad_args``
+      - Invalid arguments, exited with an error.
+    * - 2
+      - ``ocrmypdf.exceptions.ExitCode.input_file``
+      - The input file does not seem to be a valid PDF.
+    * - 3
+      - ``ocrmypdf.exceptions.ExitCode.missing_dependency``
+      - An external program required by OCRmyPDF is missing.
+    * - 4
+      - ``ocrmypdf.exceptions.ExitCode.invalid_output_pdf``
+      - An output file was created, but it does not seem to be a valid PDF or
+        PDF/A. The file will be available.
+    * - 5
+      - ``ocrmypdf.exceptions.ExitCode.file_access_error``
+      - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
+    * - 6
+      - ``ocrmypdf.exceptions.ExitCode.already_done_ocr``
+      - The file already appears to contain text so it may not need OCR. See output message.
+    * - 7
+      - ``ocrmypdf.exceptions.ExitCode.child_process_error``
+      - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
+    * - 8
+      - ``ocrmypdf.exceptions.ExitCode.encrypted_pdf``
+      - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
+    * - 9
+      - ``ocrmypdf.exceptions.ExitCode.invalid_config``
+      - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
+    * - 15
+      - ``ocrmypdf.exceptions.ExitCode.other_error``
+      - Some other error occurred.
+    * - 130
+      - ``ocrmypdf.exceptions.ExitCode.ctrl_c``
+      - The program was interrupted by pressing Ctrl+C.
+
diff --git a/docs/cookbook.rst b/docs/cookbook.rst
index a3c4dbef..02fd4e34 100644
--- a/docs/cookbook.rst
+++ b/docs/cookbook.rst
@@ -198,4 +198,3 @@
 The `Image processing`_ features can improve OCR quality. Rotating pages and deskewing helps to ensure that the page orientation is correct before OCR begins. Removing the background and/or cleaning the page can also improve results. The ``--oversample DPI`` argument can be specified to resample images to higher resolution before attempting OCR; this can improve results as well.
 
 OCR quality will suffer if the resolution of input images is not correct (since the range of pixel sizes that will be checked for possible fonts will also be incorrect).
-
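
Note on using the documented return codes: because the patch declares the codes part of the stable user interface, a wrapper script can branch on them. The following is a minimal sketch, assuming a POSIX shell. The helper name ``ocrmypdf_exit_name`` is hypothetical and not part of OCRmyPDF; its mapping is copied by hand from the return-code table in the ``docs/advanced.rst`` hunk rather than read from the program.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with OCRmyPDF): translate an exit status
# into the short name used in the "Return codes" table of this patch.
ocrmypdf_exit_name() {
    case "$1" in
        0)   echo "ok" ;;
        1)   echo "bad_args" ;;
        2)   echo "input_file" ;;
        3)   echo "missing_dependency" ;;
        4)   echo "invalid_output_pdf" ;;
        5)   echo "file_access_error" ;;
        6)   echo "already_done_ocr" ;;
        7)   echo "child_process_error" ;;
        8)   echo "encrypted_pdf" ;;
        9)   echo "invalid_config" ;;
        15)  echo "other_error" ;;
        130) echo "ctrl_c" ;;
        *)   echo "unknown" ;;   # any status the table does not list
    esac
}

# Usage sketch (file names are placeholders):
#   ocrmypdf input.pdf output.pdf
#   status=$?
#   echo "ocrmypdf exited with $status ($(ocrmypdf_exit_name "$status"))" >&2
```

The status report in the usage sketch goes to ``stderr``, matching the policy in the patch that ``stdout`` is reserved for piped output files.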