Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the results to ocrmypdf.
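
As a minimal sketch, assuming JPEG files that match the hypothetical pattern ``my-images*.jpg`` and a placeholder output name of ``myfile.pdf``, the pipeline could look like the following. img2pdf writes the PDF to standard output when no output file is given, and ``-`` tells ocrmypdf to read its input from standard input:

.. code-block:: bash

    # Placeholder output name; adjust it and the image pattern to your files.
    output="myfile.pdf"

    # img2pdf combines the images into one PDF on stdout;
    # ocrmypdf reads that PDF from stdin ("-") and writes the OCR result.
    img2pdf my-images*.jpg | ocrmypdf - "$output"
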
If given a single image as input, OCRmyPDF will try converting it to a PDF on its own. If the DPI specified in the image is incorrect, it can be overridden with ``--image-dpi``:

.. code-block:: bash

    ocrmypdf --image-dpi 300 image.png myfile.pdf

This feature may be removed at some point, because OCRmyPDF does not specialize in converting images to PDFs.
OCRmyPDF can perform some image processing on each page of a PDF, if desired. The same processing is applied to each page. Review files after image processing, because these options might remove desirable content, especially from poor-quality scans.

* ``--rotate-pages`` attempts to determine the correct orientation for each page and rotates the page if necessary.
* ``--remove-background`` attempts to detect and remove a noisy background from grayscale or color images. Monochrome images are ignored. This should not be used on documents that contain color photos as it may remove them.
* ``--deskew`` will correct pages that were scanned at a skewed angle by rotating them back into place. Skew determination and correction are performed using `Postl's variance of line sums <http://www.leptonica.com/skew-measurement.html>`_ algorithm as implemented in `Leptonica <http://www.leptonica.com/index.html>`_.
* ``--clean`` uses `unpaper <https://www.flameeyes.eu/projects/unpaper>`_ to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.
* ``--clean-final`` uses unpaper to clean up pages before OCR and inserts the cleaned page into the final output. You will want to review each page to ensure that unpaper did not remove something important.

Image processing commands can be combined. The order in which options are given does not matter. OCRmyPDF always applies the steps of the image processing pipeline in the same order (rotate, remove background, deskew, clean).
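
For instance, the following sketch combines three of the options described above. The file names ``input.pdf`` and ``output.pdf`` are placeholders, and ``--clean`` assumes unpaper is installed. Although ``--clean`` is written first here, OCRmyPDF would still rotate, then deskew, then clean:

.. code-block:: bash

    # Placeholder file names; replace with your own paths.
    input="input.pdf"
    output="output.pdf"

    # The order of the options on the command line does not change
    # the order in which the processing steps are applied.
    ocrmypdf --clean --deskew --rotate-pages "$input" "$output"
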
By default, OCRmyPDF permits tesseract to run for only three minutes (180 seconds) per page. This is usually more than enough time to find all text on a reasonably sized page with modern hardware. If tesseract exceeds this limit, the page is skipped and inserted into the output without any OCR text.
If you want to adjust the amount of time spent on OCR, change ``--tesseract-timeout``. You can also use ``--skip-big`` to automatically skip images that exceed a certain number of megapixels.
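
As a sketch, the command below raises the per-page limit to 300 seconds and passes ``--skip-big`` so that OCR is skipped on pages larger than 50 megapixels. Both values and the file names are illustrative assumptions, not recommendations:

.. code-block:: bash

    # Illustrative values; tune them for your documents and hardware.
    timeout_seconds=300   # per-page OCR time limit, in seconds
    max_megapixels=50     # pages larger than this are not sent to OCR

    ocrmypdf --tesseract-timeout "$timeout_seconds" --skip-big "$max_megapixels" bigfile.pdf output.pdf
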