docs: expand ocr of image usage

2025-12-30 00:31:59 +00:00 · 2018-04-09 13:06:09 -07:00 · 75d37eb103
commit 75d37eb103
parent 11b6f77df0
3 changed files with 21 additions and 10 deletions
--- a/docs/advanced.rst
+++ b/docs/advanced.rst
@@ -136,4 +136,4 @@ The ``tesseract`` renderer creates a PDF with the image and text layers precompo

 If a PDF created with this renderer using Tesseract versions older than 3.05.00 is then passed through Ghostscript's pdfwrite feature, the OCR text *may* be corrupted. The ``--output-type=pdfa`` argument will produce a warning in this situation.

-*This renderer is deprecated and will be removed whenever support for older versions of Tesseract is dropped.*
+*This renderer is deprecated and will be removed whenever support for older versions of Tesseract is dropped.*
--- a/docs/cookbook.rst
+++ b/docs/cookbook.rst
@@ -55,6 +55,8 @@ OCR will attempt to automatic correct the rotation of each page. This can help f

 You can increase (decrease) the parameter ``--rotate-pages-threshold`` to make page rotation more (less) aggressive.

+If the page is "just a little off horizontal", like a crooked picture, then you want ``--deskew``. ``--rotate-pages`` is for when the cardinal angle is wrong.
+

 OCR languages other than English
 """"""""""""""""""""""""""""""""
@@ -81,15 +83,28 @@ This produces a file named "output.pdf" and a companion text file named "output.
 OCR images, not PDFs
 --------------------

-Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the results to run ocrmypdf:
+If you are starting with images, you can just use Tesseract 3.04 or later directly to convert images to PDFs:
+
+.. code-block:: bash
+
+    tesseract my-image.jpg output-prefix pdf
+
+.. code-block:: bash
+
+    # When there are multiple images
+    tesseract text-file-containing-list-of-image-filenames.txt output-prefix pdf
+
+Tesseract's PDF output is quite good – OCRmyPDF uses it internally by default. However, OCRmyPDF has many features not available in Tesseract, like image processing, metadata control, and PDF/A generation.
+
+Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the results to ocrmypdf.  The ``-`` tells ocrmypdf to read standard input.

 .. code-block:: bash

    img2pdf my-images*.jpg | ocrmypdf - myfile.pdf

-``img2pdf`` also has features to control the position of images on a page, if desired.
+``img2pdf`` is recommended because it does an excellent job of generating PDFs without transcoding images.

-For convenience, OCRmyPDF can convert single images to PDFs on its own. If the resolution (dots per inch, DPI) of an image is not set or is incorrect, it can be overridden with ``--image-dpi``. (As 1 inch is 2.54 cm, 1 dpi = 0.39 dpcm).
+For convenience, OCRmyPDF can also convert single images to PDFs on its own. If the resolution (dots per inch, DPI) of an image is not set or is incorrect, it can be overridden with ``--image-dpi``. (As 1 inch is 2.54 cm, 1 dpi = 0.39 dpcm).

 .. code-block:: bash

@@ -101,11 +116,6 @@ If you have multiple images, you must use ``img2pdf`` to convert the images to P

    ImageMagick ``convert`` can also convert a group of images to PDF, but in the author's experience it takes a long time, transcodes unnecessarily and gives poor results.

-You can also use Tesseract 3.04+ directly to convert single page images or multi-page TIFFs to PDF:
-
-.. code-block:: bash
-
-    tesseract my-image.jpg output-prefix pdf

 Image processing
 ----------------
--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@@ -79,7 +79,8 @@ OCRmyPDF is limited by the Tesseract OCR engine.  As such it experiences these l
 * It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns and join text across the columns.
 * Poor quality scans may produce poor quality OCR. Garbage in, garbage out.
 * PDFs that use transparent layers are not currently checked in the test suite, so they may not work correctly.
-  
+* It does not expose information about what font family text belongs to.
+
 OCRmyPDF is also limited by the PDF specification:

 * PDF encodes the position of text glyphs but does not encode document structure.  There is no markup that divides a document in sections, paragraphs, sentences, or even words (since blank spaces are not represented). As such all elements of document structure including the spaces between words must be derived heuristically.  Some PDF viewers do a better job of this than others.