Ruffus treats an omitted parameter as -j1. For our purposes it makes
more sense for omitting the parameter to mean "use all CPUs", so we
must be able to distinguish -j1 from the parameter -j being omitted.
Telling ruffus to ignore the argument only stops it from auto-generating
the argument, so we can add an argument back with the same name.
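A minimal sketch of the idea, using plain argparse rather than the
ruffus-generated parser. The names build_parser and resolve_jobs are
hypothetical and only for illustration; the point is that default=None
keeps "omitted" distinguishable from an explicit -j1.

```python
import argparse
import os


def build_parser():
    # Hypothetical stand-in for the parser that ruffus would generate.
    parser = argparse.ArgumentParser(prog="example")
    # default=None lets us tell "omitted" apart from an explicit -j1.
    parser.add_argument("-j", "--jobs", type=int, default=None, metavar="N")
    return parser


def resolve_jobs(jobs):
    # Omitted (None) means "use all CPUs"; os.cpu_count() may return
    # None on some platforms, so fall back to a single job.
    if jobs is None:
        return os.cpu_count() or 1
    return jobs
```

With this, parse_args([]) leaves jobs as None, while parse_args(["-j1"])
yields 1, so the two cases stay distinct until resolve_jobs is called.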
Because we don't really use ruffus's checkpoint feature, putting the
database in a permanent location does not help anything, but it does
cause large database files, and problems when .ruffus_history.sqlite
needs a writable location.
For splitting a PDF into single pages, qpdf won so hard it wasn't
funny, even though it must be called once per page to do the job.
Perhaps Ghostscript interprets the split as a request to render each
page?
time bash qpdf-test.fish ../tests/resources/multipage.pdf
0.07 real 0.02 user 0.03 sys
time gs -sDEVICE=pdfwrite -dSAFER -o '%06d.pdf' ../tests/resources/multipage.pdf
5.12 real 5.06 user 0.04 sys
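The contents of qpdf-test.fish are not shown here, so the following is
only a sketch of what "called once per page" could look like, assuming
qpdf's --pages option is used to pull out one page per invocation. The
helper names and the page_count parameter are hypothetical.

```python
import subprocess


def qpdf_split_command(input_pdf, page_number, output_pdf):
    # One qpdf invocation extracts exactly one page (numbered from 1).
    return [
        "qpdf", input_pdf,
        "--pages", input_pdf, str(page_number), "--",
        output_pdf,
    ]


def split_pages(input_pdf, page_count, output_pattern="{:06d}.pdf"):
    # Assumes the caller already knows page_count; requires qpdf on PATH.
    for n in range(1, page_count + 1):
        cmd = qpdf_split_command(input_pdf, n, output_pattern.format(n))
        subprocess.run(cmd, check=True)
```

The output pattern mirrors the '%06d.pdf' naming of the Ghostscript
command above, so both approaches produce the same file names.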
Tess 3.03 has various quality problems, like wrong DPI, that are fixed
in Tess 3.04. The idea here is to introduce an option that lets
OCRmyPDF select the rendering backend based on the options and system.
However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract fails some test cases, most notably
the one where --tesseract-timeout is triggered.
It appears that extractText() does not find all text. At a glance it
may be that Tesseract's PDF renderer generates a font and uses glyphs
that map to different Unicode code points than PyPDF expects, so it
discards the content and finds nothing. As a proxy in lieu of better
PDF parsing, assume that a "GlyphLessFont" means there is text there.
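One way the proxy could be sketched is a raw byte search for the font
name. This is an assumption, not necessarily how the check is done
here, and it would miss a font name stored inside a compressed object
stream. The function names are hypothetical.

```python
def has_glyphless_font(pdf_bytes):
    # Proxy only: the font name appearing anywhere in the raw bytes is
    # taken to mean that Tesseract wrote a text layer.
    return b"GlyphLessFont" in pdf_bytes


def file_has_glyphless_font(path):
    # Reads the whole file into memory; fine for test-sized PDFs.
    with open(path, "rb") as f:
        return has_glyphless_font(f.read())
```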
I had previously found that checking for the presence of a font on the
page does not work. Some PDF generators create a font resource entry
even if the font is never used.