Of course, this introduces recompression artifacts, and is unnecessary
if no options are given that modify the final image (no -d, -c, -i).
But rather than worry about that, it would be better to ultimately find
a way to combine the original PDF page with the output PDF text in the
case where we want no changes to the original. This is good enough for
now.
The better option can apparently be achieved using pdftk background, or
probably better, PyPDF2's merge. If Tesseract PDF generation is used
then we need a way to remove the image. Tesseract's PDF generation as of
3.03 does layout better (I think) and also properly encodes the hidden
layer, which is less likely to give display issues (I think).
Ghostscript has the clunkiest imaginable syntax, obscure documentation,
quirky behavior, and poor diagnostics... but it *actually works*, unlike
pdftoppm/poppler, which gets things wrong.
In this case I observed poppler incorrectly decompressing certain
CCITT-encoded monochrome PDFs. So set up Ghostscript to do the job instead.
For the moment this performs monochrome -> RGB conversion via reportlab.
The flag -dUseCIEColor is now deprecated, as it invokes the old engine
which introduces color errors. The new engine requires a PDF/A file
header with the hardcoded location of an ICC profile to use, now included in
the project. Portable iterations should generate a PDFA_def.ps based on
the target system; for now OS X with homebrew is presumed.
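A rough sketch of how a portable PDFA_def.ps could be generated, assuming
the file is rendered from a template at setup time. The template here is
hypothetical and heavily abbreviated (only the ICC profile line is shown),
and render_pdfa_def and ps_escape are illustrative names, not existing code.

```python
import os
from string import Template

# Hypothetical, abbreviated template: a real PDFA_def.ps contains more
# than this; only the ICC profile definition is sketched here.
PDFA_DEF_TEMPLATE = Template(
    "%!\n"
    "% Define an ICC profile:\n"
    "/ICCProfile ($icc_profile) def\n"
)


def ps_escape(text):
    """Escape backslashes and parentheses for a PostScript string literal."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def render_pdfa_def(icc_path):
    """Return PDFA_def.ps text that points at icc_path, made absolute."""
    absolute = os.path.abspath(icc_path)
    return PDFA_DEF_TEMPLATE.substitute(icc_profile=ps_escape(absolute))
```

The ICC path would be looked up per target system instead of being
hardcoded for one Homebrew layout.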
I have selected sRGB since scanners tend to capture RGB and printing
is not a major consideration for PDF/A.
Also note all file paths given to gs must be absolute. May its creators
be forever haunted for their failure to document this unexpected quirk.
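A minimal sketch of the absolute path rule, assuming the gs command is
assembled in Python. gs_pdfa_command is a hypothetical helper, and the
particular set of flags is my guess at a reasonable PDF/A invocation, not
necessarily what the script passes.

```python
import os


def gs_pdfa_command(input_pdf, output_pdf, pdfa_def):
    """Build a Ghostscript argv list with every file path made absolute."""
    return [
        "gs",
        "-dQUIET", "-dBATCH", "-dNOPAUSE",
        "-dPDFA",
        "-sDEVICE=pdfwrite",
        "-sProcessColorModel=DeviceRGB",
        # Relative paths are resolved here, before gs ever sees them.
        "-sOutputFile=" + os.path.abspath(output_pdf),
        os.path.abspath(pdfa_def),
        os.path.abspath(input_pdf),
    ]
```

The list can be handed to subprocess.check_call as is.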
It appears to be possible to have a PDF with an embedded font that is
either unused or used only for whitespace. So check for some amount of
actual text instead.
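Roughly, the check could look like this, assuming the page text has
already been extracted to a string by some other tool. has_real_text and
the min_chars threshold are hypothetical; the actual cutoff is a judgment
call.

```python
def has_real_text(extracted_text, min_chars=3):
    """Return True if the text has at least min_chars non-whitespace chars.

    A page whose font is unused, or used only for spaces and newlines,
    yields no visible characters and is treated as having no text.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible >= min_chars
```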
It appears to be necessary to disable each stage of the pipeline that is
inactive, not just the initial and terminal stages of an inactive segment.
If nothing else this makes what is going on more explicit.
pdftoppm in recent versions (0.26.4 and 0.26.5) seems to be incapable of
producing valid TIFFs, so have it dump a .pnm file and let ImageMagick
figure out how to convert it to TIFF. This is not ideal, but at least
it works.
convert .pnm -deskew <...> .pnm seems to have a bug that produces an
invalid .pnm file, which later causes tesseract (specifically,
leptonica) to choke (observed with tesseract 3.02 and leptonica 1.71).
Will change the pipeline to use TIFFs internally since they are less stupid.
Put TESS_CFG_FILES last because it is optional and can be blank. If
omitted it breaks the sequence of subsequent parameters. Also clean up
text output in this new mode.
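To illustrate the ordering problem, here is a small simulation, assuming
the value is expanded unquoted on a shell command line so that a blank one
simply vanishes. The script name page.sh and the other arguments are made
up for the example.

```python
import shlex


def positional_view(command_line):
    """Return the positional parameters a shell script would receive."""
    return shlex.split(command_line)[1:]


cfg = ""  # TESS_CFG_FILES left blank

# Blank value in the middle: everything after it shifts one slot left.
middle = positional_view("page.sh input.pdf %s eng output.pdf" % cfg)

# Blank value last: the earlier parameters keep their positions.
last = positional_view("page.sh input.pdf eng output.pdf %s" % cfg)
```

In the first case the second parameter ends up being "eng" where the
config files were expected; in the second case nothing moves.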
Previously, if a page contained font data, the script would abort unless
-f was given, in which case it would use pdftoppm to rasterize the font
into a bitmap and then attempt to OCR it. -f is almost certainly not what
users want unless they want to debug OCR or something.
If a PDF already has fonts, it either was OCRed already or it is
a composite file containing, for example, some scanned documents appended
to a text report. In the latter case, the -s option provides OCR on
pages that don't have it without changing those that do, and if a PDF
was completely OCRed it will be converted to PDF/A. In batch jobs with a
mix of OCRed and non-OCRed files the implicit conversion to PDF/A is also
useful.
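The per-page decision described above can be sketched as follows.
page_action and its return labels are hypothetical, and the precedence of
-s over -f when both are given is my assumption.

```python
def page_action(has_text, force_ocr=False, skip_text=False):
    """Decide what to do with one page.

    has_text  -- the page already contains real text
    force_ocr -- the -f option was given
    skip_text -- the -s option was given
    """
    if not has_text:
        return "ocr"                # ordinary scanned page
    if skip_text:
        return "copy"               # -s: leave the existing text alone
    if force_ocr:
        return "rasterize-and-ocr"  # -f: flatten to a bitmap, then OCR
    return "abort"                  # default: refuse to touch the page
```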
When I upgraded to poppler 0.24.5, pdftoppm was not compiled because the
script had --disable-splash-output set for some reason.
For OS X Homebrew the solution is:
    brew uninstall poppler
    brew install poppler --with-splash-output