Document return codes

James R. Barlow 2018-04-14 00:18:58 -07:00
parent e75b6280fd
commit 10aadefd6a
2 changed files with 70 additions and 15 deletions


@ -27,8 +27,8 @@ If you want to adjust the amount of time spent on OCR, change ``--tesseract-time
.. code-block:: bash

    # Allow 300 seconds for OCR; skip any page larger than 50 megapixels
    ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf

Overriding default tesseract
""""""""""""""""""""""""""""
@ -39,20 +39,20 @@ Some relevant environment variables that influence Tesseract's behavior include:
.. envvar:: TESSDATA_PREFIX

    Overrides the path to Tesseract's data files. This can allow simultaneous installation of the "best" and "fast" training data sets. OCRmyPDF does not manage this environment variable.

.. envvar:: OMP_THREAD_LIMIT

    Controls the number of threads Tesseract will use. OCRmyPDF will manage this environment variable if it is not already set. (Currently, it sets it to 1 because this gives the best results in testing.)

For example, if you are testing Tesseract 4.00 and don't wish to use an existing Tesseract 3.04 installation, you can launch OCRmyPDF as follows:

.. code-block:: bash

    env \
        PATH=/home/user/src/tesseract4/api:$PATH \
        TESSDATA_PREFIX=/home/user/src/tesseract4 \
        ocrmypdf --tesseract-oem 1 input.pdf output.pdf

In this example ``TESSDATA_PREFIX`` directs Tesseract 4.0 to use LSTM training data. ``--tesseract-oem 1`` requests Tesseract 4.0's new LSTM engine. (Tesseract 4.0 only.)
@ -80,19 +80,19 @@ Create a file named "no-dict.cfg" with these contents:
::

    load_system_dawg 0
    language_model_penalty_non_dict_word 0
    language_model_penalty_non_freq_dict_word 0

then run ocrmypdf as follows (along with any other desired arguments):

.. code-block:: bash

    ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf

.. warning::

    Some combinations of control parameters will break Tesseract or break assumptions that OCRmyPDF makes about Tesseract's output.
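
As a minimal sketch, the configuration file described above could also be created directly from the shell with a here-document. The file name ``no-dict.cfg`` and the three settings come from the text above; the ``ocrmypdf`` invocation is left as a comment because it assumes OCRmyPDF is installed and that ``input.pdf`` exists.

```bash
# Write the three settings shown above into no-dict.cfg.
# The quoted 'EOF' delimiter prevents any shell expansion of the contents.
cat > no-dict.cfg <<'EOF'
load_system_dawg 0
language_model_penalty_non_dict_word 0
language_model_penalty_non_freq_dict_word 0
EOF

# Then, assuming ocrmypdf is installed (file names are placeholders):
#   ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
```

Generating the file this way keeps the Tesseract settings next to the command that uses them, which may be convenient in a batch script.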
Changing the PDF renderer
@ -136,4 +136,60 @@ The ``tesseract`` renderer creates a PDF with the image and text layers precompo
If a PDF created with this renderer using Tesseract versions older than 3.05.00 is then passed through Ghostscript's pdfwrite feature, the OCR text *may* be corrupted. The ``--output-type=pdfa`` argument will produce a warning in this situation.
*This renderer is deprecated and will be removed whenever support for older versions of Tesseract is dropped.*

Return code policy
------------------
OCRmyPDF writes all messages to ``stderr``. ``stdout`` is reserved for piping
output files. ``stdin`` is reserved for piping input files.

The return codes generated by OCRmyPDF are considered part of the stable
user interface.

.. list-table:: Return codes
    :widths: 5 35 60
    :header-rows: 1

    * - Code
      - Name
      - Interpretation
    * - 0
      - ``ocrmypdf.exceptions.ExitCode.ok``
      - Everything worked as expected.
    * - 1
      - ``ocrmypdf.exceptions.ExitCode.bad_args``
      - Invalid arguments, exited with an error.
    * - 2
      - ``ocrmypdf.exceptions.ExitCode.input_file``
      - The input file does not seem to be a valid PDF.
    * - 3
      - ``ocrmypdf.exceptions.ExitCode.missing_dependency``
      - An external program required by OCRmyPDF is missing.
    * - 4
      - ``ocrmypdf.exceptions.ExitCode.invalid_output_pdf``
      - An output file was created, but it does not seem to be a valid PDF or
        PDF/A. The file will be available.
    * - 5
      - ``ocrmypdf.exceptions.ExitCode.file_access_error``
      - The user running OCRmyPDF does not have sufficient permissions to read the input file or write the output file.
    * - 6
      - ``ocrmypdf.exceptions.ExitCode.already_done_ocr``
      - The file already appears to contain text so it may not need OCR. See output message.
    * - 7
      - ``ocrmypdf.exceptions.ExitCode.child_process_error``
      - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
    * - 8
      - ``ocrmypdf.exceptions.ExitCode.encrypted_pdf``
      - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
    * - 9
      - ``ocrmypdf.exceptions.ExitCode.invalid_config``
      - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
    * - 15
      - ``ocrmypdf.exceptions.ExitCode.other_error``
      - Some other error occurred.
    * - 130
      - ``ocrmypdf.exceptions.ExitCode.ctrl_c``
      - The program was interrupted by pressing Ctrl+C.
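
Because the return codes are a stable interface, a wrapper script can branch on them. The following is only a sketch: the function name ``describe_ocrmypdf_exit`` is hypothetical and not part of OCRmyPDF, and the short names simply mirror the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of OCRmyPDF): translate an exit status
# from the return code table into its short name.
describe_ocrmypdf_exit() {
    case "$1" in
        0)   echo "ok" ;;
        1)   echo "bad_args" ;;
        2)   echo "input_file" ;;
        3)   echo "missing_dependency" ;;
        4)   echo "invalid_output_pdf" ;;
        5)   echo "file_access_error" ;;
        6)   echo "already_done_ocr" ;;
        7)   echo "child_process_error" ;;
        8)   echo "encrypted_pdf" ;;
        9)   echo "invalid_config" ;;
        15)  echo "other_error" ;;
        130) echo "ctrl_c" ;;
        *)   echo "unknown" ;;
    esac
}

# Example use in a wrapper script (file names are placeholders, and
# ocrmypdf is assumed to be on PATH):
#   ocrmypdf input.pdf output.pdf
#   status=$?
#   echo "ocrmypdf exited with $status ($(describe_ocrmypdf_exit "$status"))" >&2
```

A batch job might, for instance, treat ``already_done_ocr`` as a skip instead of a failure while still aborting on any other non-zero status.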


@ -198,4 +198,3 @@ The `Image processing`_ features can improve OCR quality.
Rotating pages and deskewing helps to ensure that the page orientation is correct before OCR begins. Removing the background and/or cleaning the page can also improve results. The ``--oversample DPI`` argument can be specified to resample images to higher resolution before attempting OCR; this can improve results as well.
OCR quality will suffer if the resolution of input images is not correct (since the range of pixel sizes that will be checked for possible fonts will also be incorrect).