mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2025-12-08 04:52:21 +00:00
docs: various fixes
As suggested by @Chealer. Closes #829, #830, #831, #832
parent eb8992e58b
commit a4da05b66b
@@ -2,7 +2,12 @@
 Introduction
 ============
 
-OCRmyPDF is a Python 3 application and library that adds OCR layers to PDFs.
+OCRmyPDF is an application and library that adds text "layers" to images
+in PDFs, making scanned image PDFs searchable. It uses OCR to guess what text
+is contained in images. It is written in Python. OCRmyPDF supports plugins
+that allow customization of its processing steps, and is very tolerant of
+PDFs that contain scanned images and "born digital" content that needs no
+text recognition.
 
 About OCR
 =========
@@ -26,7 +31,7 @@ exactly. They contain `vector
 graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
 that can contain raster objects such as scanned images. Because PDFs can
 contain multiple pages (unlike many image formats) and can contain fonts
-and text, it is a good formats for exchanging scanned documents.
+and text, it is a good format for exchanging scanned documents.
 
 |image|
 
@@ -35,9 +40,9 @@ have one image. Some scanners or scanning software will segment pages
 into monochromatic text and color regions for example, to improve the
 compression ratio and appearance of the page.
 
-Rasterizing a PDF is the process of generating an image suitable for
-display or analyzing with an OCR engine. OCR engines like Tesseract work
-with images, not vector objects.
+Rasterizing a PDF is the process of generating corresponding raster images.
+OCR engines like Tesseract work with images, not scalable vector graphics
+or mixed raster-vector-text graphics such as PDF.
 
 About PDF/A
 ===========
@@ -76,7 +81,7 @@ OCRmyPDF analyzes each page of a PDF to determine the colorspace and
 resolution (DPI) needed to capture all of the information on that page
 without losing content. It uses
 `Ghostscript <http://ghostscript.com/>`__ to rasterize the page, and
-then performs on OCR on the rasterized image to create an OCR "layer".
+then performs on OCR the rasterized image to create an OCR "layer".
 The layer is then grafted back onto the original PDF.
 
 While one can use a program like Ghostscript or ImageMagick to get an
@@ -84,9 +89,9 @@ image and put the image through Tesseract, that actually creates a new
 PDF and many details may be lost. OCRmyPDF can produce a minimally
 changed PDF as output.
 
-OCRmyPDF also some image processing options like deskew which improve
-the appearance of files and quality of OCR. When these are used, the OCR
-layer is grafted onto the processed image instead.
+OCRmyPDF also provides some image processing options, like deskew, which
+improves the appearance of files and quality of OCR. When these are used,
+the OCR layer is grafted onto the processed image instead.
 
 By default, OCRmyPDF produces archival PDFs – PDF/A, which are a
 stricter subset of PDF features designed for long term archives. If