docs: various fixes

As suggested by @Chealer. Closes #829, #830, #831, #832
2025-12-08 04:52:21 +00:00 · 2021-09-14 00:24:18 -07:00 · 2021-09-14 00:24:18 -07:00 · a4da05b66b
commit a4da05b66b
parent eb8992e58b
1 changed file with 14 additions and 9 deletions
--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@@ -2,7 +2,12 @@
 Introduction
 ============
-OCRmyPDF is a Python 3 application and library that adds OCR layers to PDFs.
+OCRmyPDF is an application and library that adds text "layers" to images
 in PDFs, making scanned image PDFs searchable. It uses OCR to guess what text
 is contained in images. It is written in Python. OCRmyPDF supports plugins
 that allow customization of its processing steps, and is very tolerant of
 PDFs that contain scanned images and "born digital" content that needs no
 text recognition.
 About OCR
 =========
@@ -26,7 +31,7 @@ exactly. They contain `vector
 graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
 that can contain raster objects such as scanned images. Because PDFs can
 contain multiple pages (unlike many image formats) and can contain fonts
-and text, it is a good formats for exchanging scanned documents.
+and text, it is a good format for exchanging scanned documents.
 |image|
@@ -35,9 +40,9 @@ have one image. Some scanners or scanning software will segment pages
 into monochromatic text and color regions for example, to improve the
 compression ratio and appearance of the page.
-Rasterizing a PDF is the process of generating an image suitable for
+Rasterizing a PDF is the process of generating corresponding raster images.
-display or analyzing with an OCR engine. OCR engines like Tesseract work
+OCR engines like Tesseract work with images, not scalable vector graphics
-with images, not vector objects.
+or mixed raster-vector-text graphics such as PDF.
 About PDF/A
 ===========
@@ -76,7 +81,7 @@ OCRmyPDF analyzes each page of a PDF to determine the colorspace and
 resolution (DPI) needed to capture all of the information on that page
 without losing content. It uses
 `Ghostscript <http://ghostscript.com/>`__ to rasterize the page, and
-then performs on OCR on the rasterized image to create an OCR "layer".
+then performs OCR on the rasterized image to create an OCR "layer".
 The layer is then grafted back onto the original PDF.
 While one can use a program like Ghostscript or ImageMagick to get an
@@ -84,9 +89,9 @@ image and put the image through Tesseract, that actually creates a new
 PDF and many details may be lost. OCRmyPDF can produce a minimally
 changed PDF as output.
-OCRmyPDF also some image processing options like deskew which improve
+OCRmyPDF also provides some image processing options, like deskew, which
-the appearance of files and quality of OCR. When these are used, the OCR
+improve the appearance of files and quality of OCR. When these are used,
-layer is grafted onto the processed image instead.
+the OCR layer is grafted onto the processed image instead.
 By default, OCRmyPDF produces archival PDFs – PDF/A, which are a
 stricter subset of PDF features designed for long term archives. If