docs: various fixes

As suggested by @Chealer. Closes #829, #830, #831, #832
2021-09-14 00:24:18 -07:00
commit a4da05b66b
parent eb8992e58b
1 changed file with 14 additions and 9 deletions
--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@@ -2,7 +2,12 @@
 Introduction
 ============

-OCRmyPDF is a Python 3 application and library that adds OCR layers to PDFs.
+OCRmyPDF is an application and library that adds text "layers" to images
+in PDFs, making scanned image PDFs searchable. It uses OCR to guess what text
+is contained in images. It is written in Python. OCRmyPDF supports plugins
+that allow customization of its processing steps, and is very tolerant of
+PDFs that contain scanned images and "born digital" content that needs no
+text recognition.

 About OCR
 =========
@@ -26,7 +31,7 @@ exactly. They contain `vector
 graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
 that can contain raster objects such as scanned images. Because PDFs can
 contain multiple pages (unlike many image formats) and can contain fonts
-and text, it is a good formats for exchanging scanned documents.
+and text, it is a good format for exchanging scanned documents.

 |image|

@@ -35,9 +40,9 @@ have one image. Some scanners or scanning software will segment pages
 into monochromatic text and color regions for example, to improve the
 compression ratio and appearance of the page.

-Rasterizing a PDF is the process of generating an image suitable for
-display or analyzing with an OCR engine. OCR engines like Tesseract work
-with images, not vector objects.
+Rasterizing a PDF is the process of generating corresponding raster images.
+OCR engines like Tesseract work with images, not scalable vector graphics
+or mixed raster-vector-text graphics such as PDF.

 About PDF/A
 ===========
@@ -76,7 +81,7 @@ OCRmyPDF analyzes each page of a PDF to determine the colorspace and
 resolution (DPI) needed to capture all of the information on that page
 without losing content. It uses
 `Ghostscript <http://ghostscript.com/>`__ to rasterize the page, and
-then performs on OCR on the rasterized image to create an OCR "layer".
+then performs OCR on the rasterized image to create an OCR "layer".
 The layer is then grafted back onto the original PDF.

 While one can use a program like Ghostscript or ImageMagick to get an
@@ -84,9 +89,9 @@ image and put the image through Tesseract, that actually creates a new
 PDF and many details may be lost. OCRmyPDF can produce a minimally
 changed PDF as output.

-OCRmyPDF also some image processing options like deskew which improve
-the appearance of files and quality of OCR. When these are used, the OCR
-layer is grafted onto the processed image instead.
+OCRmyPDF also provides some image processing options, like deskew, which
+improve the appearance of files and quality of OCR. When these are used,
+the OCR layer is grafted onto the processed image instead.

 By default, OCRmyPDF produces archival PDFs – PDF/A, which are a
 stricter subset of PDF features designed for long term archives. If