Update docs for --redo-ocr and --mask-barcodes

2025-12-29 16:10:06 +00:00 · 2018-11-10 01:34:33 -08:00 · 2018-11-10 01:34:33 -08:00 · 16a6fd2ea9
commit 16a6fd2ea9
parent e3fce112ed
3 changed files with 18 additions and 25 deletions
--- a/docs/advanced.rst
+++ b/docs/advanced.rst
@@ -13,7 +13,9 @@ If a page in a PDF seems to have text, by default OCRmyPDF will exit without mod

 If ``--skip-text`` is issued, then no OCR will be performed on pages that already have text. The page will be copied to the output. This may be useful for documents that contain both "born digital" and scanned content, or to use OCRmyPDF to normalize and convert to PDF/A regardless of their contents.

-If ``--force-ocr`` is issued, then all pages will be rasterized to images, discarding any hidden OCR text, and rasterizing any printable text. This is useful for redoing OCR, for fixing OCR text with a damaged character map (text is selectable but not searchable), and destroying redacted information.
+If ``--redo-ocr`` is issued, then a detailed text analysis is performed. Text is categorized as either visible or invisible. Invisible text (OCR) is stripped out. Then an image of each page is created with visible text masked out. The page image is sent for OCR, and any additional text is inserted as OCR. If a file contains a mix of text and bitmap images that contain text, OCRmyPDF will locate the additional text in images without disrupting the existing text.
+
+If ``--force-ocr`` is issued, then all pages will be rasterized to images, discarding any hidden OCR text, and rasterizing any printable text. This is useful for redoing OCR, for fixing OCR text with a damaged character map (text is selectable but not searchable), and destroying redacted information. Any forms and vector graphics will be rasterized as well.


 Time and image size limits
--- a/docs/cookbook.rst
+++ b/docs/cookbook.rst
@@ -57,7 +57,6 @@ You can increase (decrease) the parameter ``--rotate-pages-threshold`` to make p

 If the page is "just a little off horizontal", like a crooked picture, then you want ``--deskew``. ``--rotate-pages`` is for when the cardinal angle is wrong.

-
 OCR languages other than English
 """"""""""""""""""""""""""""""""

@@ -70,7 +69,6 @@ By default OCRmyPDF assumes the document is English.

 Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <lang-packs>`.

-
 Produce PDF and text file containing OCR text
 """""""""""""""""""""""""""""""""""""""""""""

@@ -116,7 +114,6 @@ If you have multiple images, you must use ``img2pdf`` to convert the images to P

    ImageMagick ``convert`` can also convert a group of images to PDF, but in the author's experience it takes a long time, transcodes unnecessarily and gives poor results.

-
 Image processing
 ----------------

@@ -132,6 +129,8 @@ OCRmyPDF perform some image processing on each page of a PDF, if desired.  The s

 * ``--clean-final`` uses unpaper to clean up pages before OCR and inserts the page into the final output.  You will want to review each page to ensure that unpaper did not remove something important.

+* ``--mask-barcodes`` will "cover up" any barcodes detected in the image of a page. Barcodes are known to confuse Tesseract OCR and interfere with the recognition of text on the same baseline as a barcode. The output file will contain the unaltered image of the barcode.
+
 .. note::

    In many cases image processing will rasterize PDF pages as images, potentially losing quality.
@@ -140,7 +139,6 @@ OCRmyPDF perform some image processing on each page of a PDF, if desired.  The s

    ``--clean-final`` and ``--remove-background`` may leave undesirable visual artifacts in some images where their algorithms have shortcomings. Files should be visually reviewed after using these options.

-
 OCR and correct document skew (crooked scan)
 """"""""""""""""""""""""""""""""""""""""""""

@@ -167,28 +165,22 @@ If you set ``--tesseract-timeout 0`` OCRmyPDF will apply its image processing wi
    ocrmypdf --tesseract-timeout=0 --remove-background input.pdf output.pdf


-Redo OCR
-""""""""
+Redo existing OCR
+"""""""""""""""""

-To redo OCR on a file OCRed with other OCR software or a previous version of OCRmyPDF and/or Tesseract, you may use the ``--force-ocr`` argument. Normally, OCRmyPDF does not modify files that already appear to contain OCR text.
+To redo OCR on a file OCRed with other OCR software or a previous version of OCRmyPDF and/or Tesseract, you may use the ``--redo-ocr`` argument. (Normally, OCRmyPDF will exit with an error if asked to modify a file with OCR.)
+
+This may be helpful for users who want to take advantage of accuracy improvements in Tesseract 4.0 for files they previously OCRed with an earlier version of Tesseract and OCRmyPDF.

 .. code-block:: bash

-    ocrmypdf --force-ocr input.pdf output.pdf
+    ocrmypdf --redo-ocr input.pdf output.pdf

-Note that the method above will force rasterization of all pages, potentially reducing quality or losing vector content.
+This method replaces OCR without rasterizing pages, so it neither reduces quality nor removes vector content. If a file contains a mix of pure digital text and OCR, digital text will be ignored and OCR will be replaced. As such, this mode is incompatible with image processing options, since they alter the appearance of the file.

-To ensure quality is preserved, one could extract all of the images and rebuild the PDF for a lossless transformation. This recipe does not work when PDFs contain multiple images per page, as many do in practice. It will also lose any page rotation information.
-
-.. code-block:: bash
-
-    pdfimages -all old-ocr.pdf prefix  # extract all images
-    img2pdf -o temp.pdf prefix*        # construct new PDF from the images
-    # review the new PDF to ensure it visually matches the old one
-    ocrmypdf --output-type pdf temp.pdf new-ocr.pdf
-
-``--output-type pdf`` is used here to avoid using Ghostscript which will also rasterize images.
+In some cases, existing OCR cannot be detected or replaced. Files produced by OCRmyPDF v2.2 or earlier, for example, are internally represented as having visible text with an opaque image drawn on top. This situation cannot be detected.

+If ``--redo-ocr`` does not work, you can use ``--force-ocr``, which will force rasterization of all pages, potentially reducing quality or losing vector content.

 Improving OCR quality
 ---------------------
@@ -199,7 +191,6 @@ Rotating pages and deskewing helps to ensure that the page orientation is correc

 OCR quality will suffer if the resolution of input images is not correct (since the range of pixel sizes that will be checked for possible fonts will also be incorrect).

-
 PDF optimization
 ----------------

--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@@ -64,7 +64,7 @@ In the case of a PDF that is nothing other than a container of images (no rotati

 OCRmyPDF uses several strategies depending on input options and the input PDF itself, but generally speaking it rasterizes a page for OCR and then grafts the OCR back onto the original. As such it can handle complex PDFs and still preserve their contents as much as possible.

-OCRmyPDF also supports a many, many edge cases that have cropped over several years of development. We support PDF features like images inside of Form XObjects, and pages with UserUnit scaling. We support rare image formats like non-monochrome 1-bit images. Thanks to pikepdf and QPDF, we auto-repair PDFs that are damaged. (Not that you need to know what any of these are! You should be able to throw any PDF at it.)
+OCRmyPDF also supports many, many edge cases that have cropped up over several years of development. We support PDF features like images inside of Form XObjects, and pages with UserUnit scaling. We support rare image formats like non-monochrome 1-bit images. We warn about files you may not want to OCR. Thanks to pikepdf and QPDF, we auto-repair PDFs that are damaged. (Not that you need to know what any of these are! You should be able to throw any PDF at it.)


 Limitations
@@ -76,20 +76,20 @@ OCRmyPDF is limited by the Tesseract OCR engine.  As such it experiences these l
 * It is not capable of recognizing handwriting.
 * It may find gibberish and report this as OCR output.
 * If a document contains languages outside of those given in the ``-l LANG`` arguments, results may be poor.
-* It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns and join text across the columns.
+* It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns, and may try to join text across columns.
 * Poor quality scans may produce poor quality OCR. Garbage in, garbage out.
 * It does not expose information about what font family text belongs to.

 OCRmyPDF is also limited by the PDF specification:

 * PDF encodes the position of text glyphs but does not encode document structure.  There is no markup that divides a document in sections, paragraphs, sentences, or even words (since blank spaces are not represented). As such all elements of document structure including the spaces between words must be derived heuristically.  Some PDF viewers do a better job of this than others.
-* Because some popular open source PDF viewers have a particularly hard time with spaces betweem words, OCRmyPDF appends a space to each text element as a workaround. While this mixes document structure with graphical information that ideally should be left to the PDF viewer to interpret, it improves compatibility with some viewers and does not cause problems for better ones.
+* Because some popular open source PDF viewers have a particularly hard time with spaces between words, OCRmyPDF appends a space to each text element as a workaround (when using ``--pdf-renderer hocr``). While this mixes document structure with graphical information that ideally should be left to the PDF viewer to interpret, it improves compatibility with some viewers and does not cause problems for better ones.

 Ghostscript also imposes some limitations:

 * PDFs containing JBIG2-encoded content will be converted to CCITT Group4 encoding, which has lower compression ratios, if Ghostscript PDF/A is enabled.
 * PDFs containing JPEG 2000-encoded content will be converted to JPEG encoding, which may introduce compression artifacts, if Ghostscript PDF/A is enabled.
-* Ghostscript may transcode grayscale and color images, either lossy to lossless or lossless to lossy, based on an internal algorithm. This behavior can be suppressed by setting ``--pdfa-image-compression`` to ``jpeg`` or ``lossless`` to set all images to one type or the other. Ghostscript has no option to maintain the input image's format.
+* Ghostscript may transcode grayscale and color images, either lossy to lossless or lossless to lossy, based on an internal algorithm. This behavior can be suppressed by setting ``--pdfa-image-compression`` to ``jpeg`` or ``lossless`` to set all images to one type or the other. Ghostscript has no option to maintain the input image's format. (Ghostscript 9.25+ can copy JPEG images without transcoding them; earlier versions will transcode.)

 Regarding OCRmyPDF itself: