Remove redundant optimizer content

James R. Barlow 2025-04-17 15:10:59 -07:00
parent d1a45e4abc
commit e4a8f7a354
GPG Key ID: E54A300D567E1260
2 changed files with 21 additions and 43 deletions


@@ -283,8 +283,7 @@ as little as possible:
ocrmypdf --pages 1 --output-type pdf --optimize 0 input.pdf output.pdf
```
Redo existing OCR
-----------------
## Redo existing OCR
To redo OCR on a file OCRed with other OCR software or a previous
version of OCRmyPDF and/or Tesseract, you may use the `--redo-ocr`
@@ -330,8 +329,7 @@ OCR quality will suffer if the resolution of input images is not correct
(since the range of pixel sizes that will be checked for possible fonts
will also be incorrect).
PDF optimization
----------------
## PDF optimization
By default OCRmyPDF will attempt to perform lossless optimizations on
the images inside PDFs after OCR is complete. Optimization is performed
@@ -339,40 +337,9 @@ even if no OCR text is found.
The `--optimize N` (short form `-O`) argument controls optimization,
where `N` ranges from 0 to 3 inclusive, analogous to the optimization
levels in the GCC compiler.
levels in the GCC compiler. `-O1` is the default.
:::{list-table}
---
widths: auto
header-rows: 1
---
* - Level
- Comments
* - <nobr>``--optimize=0``</nobr>
- Disables optimization.
* - <nobr>``--optimize 1``</nobr>
- Enables lossless optimizations, such as transcoding images to more
efficient formats. Also compresses other uncompressed objects in the
PDF and enables the more efficient "object streams" within the PDF.
(If ``--jbig2-lossy`` is issued, then lossy JBIG2 optimization is used.
The decision to use lossy JBIG2 is separate from standard optimization
settings.)
* - <nobr>``--optimize 2``</nobr>
- All of the above, and enables lossy optimizations and color quantization.
* - <nobr>``--optimize 3``</nobr>
- All of the above, and enables more aggressive optimizations and targets lower image quality.
:::
Optimization is improved when a JBIG2 encoder is available and when
`pngquant` is installed. If either of these components is missing, then
some types of images cannot be optimized.
The types of optimization available may expand over time. By default,
OCRmyPDF compresses data streams inside PDFs, and will change
inefficient compression modes to more modern versions. A program like
`qpdf` can be used to change encodings, e.g. to inspect the internals
of a PDF.
For further details, see the section on [PDF optimization](optimizer).
```bash
ocrmypdf --optimize 3 in.pdf out.pdf # Make it small


@@ -25,17 +25,23 @@ header-rows: 1
- Disable most optimizations.
* - ``--optimize 1`` (default)
- ``-O1``
- Safe and lossless optimizations.
- Enables lossless optimizations, such as transcoding images to more
efficient formats. Also compresses other uncompressed objects in the
PDF and enables the more efficient "object streams" within the PDF.
(If ``--jbig2-lossy`` is issued, then lossy JBIG2 optimization is used.
The decision to use lossy JBIG2 is separate from standard optimization
settings.)
* - ``--optimize 2``
- ``-O2``
- Safe and lossy optimizations.
- All of the above, and enables lossy optimizations and color quantization.
* - ``--optimize 3``
- ``-O3``
- Aggressive lossy optimizations.
- All of the above, and enables more aggressive optimizations and targets lower
image quality.
:::
The exact type of optimizations performed will vary over time, and
depend on the availability of third-party tools.
depend on what third party tools are installed.
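The optimization levels in the table above map directly onto the `--optimize N` argument. As a minimal sketch, the hypothetical shell helper below (`ocrmypdf_optimize_cmd` is an illustrative name, not part of OCRmyPDF) validates the level and prints the corresponding command line rather than running it. It assumes file names without spaces.

```bash
# Hypothetical helper, not part of OCRmyPDF: print the ocrmypdf command
# for a given optimization level (0-3). It only prints; it does not run OCR.
ocrmypdf_optimize_cmd() {
    level="$1"
    input="$2"
    output="$3"

    # Reject anything outside the documented range 0..3.
    case "$level" in
        0|1|2|3) ;;
        *)
            echo "optimize level must be 0-3, got: $level" >&2
            return 1
            ;;
    esac

    printf 'ocrmypdf --optimize %s %s %s\n' "$level" "$input" "$output"
}
```

For example, `ocrmypdf_optimize_cmd 3 in.pdf out.pdf` prints `ocrmypdf --optimize 3 in.pdf out.pdf`, while a level such as `4` is rejected with a non-zero exit status.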
Despite optimizations, OCRmyPDF might still increase the overall file
size, since it must embed information about the recognized text, and
@@ -83,8 +89,13 @@ objects more aggressively.
## Lossy optimizations
At optimization levels `-O2` and `-O3`, OCRmyPDF will attempt some lossy
image optimization.
At optimization levels `-O1`, `-O2` and `-O3`, OCRmyPDF will attempt some
lossy image optimization.
If Ghostscript is used to create a PDF/A (the default), Ghostscript will
optimize some images by converting them to JPEG, which is lossy. If
`--output-type pdf` is used, there are no lossy optimizations. Ghostscript's
JPEG conversion is quite safe.
If `pngquant` is installed, OCRmyPDF will use it to quantize
paletted images to reduce their size.
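Because these optimizations depend on external programs, it can help to check which ones are on `PATH` before choosing an optimization level. The following is a minimal sketch, assuming the tools are installed under the executable names `pngquant` and `gs` (the usual Ghostscript name on Unix-like systems). `check_optimizer_tools` is a hypothetical helper, not part of OCRmyPDF.

```bash
# Hypothetical helper, not part of OCRmyPDF: report whether each named
# executable can be found on PATH. Executable names are assumptions.
check_optimizer_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}
```

Calling `check_optimizer_tools pngquant gs` prints one line per tool, marked either `found` or `missing`.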