Works for a single-page file, probably.
Arguably rotation is not quite lossless, and the two could be mutually
exclusive anyway, so maybe this is it. Some debugging changes
(lossless=False, text debugging=True) were not checked in.
PyPDF seems to get merging wrong when one of the pages is rotated.
Ruffus treats an omitted -j parameter as -j1. For our purposes it makes
more sense for omitting the parameter to mean "use all CPUs". As such
we must be able to distinguish -j1 from -j being omitted entirely.
Telling ruffus to ignore the argument actually just stops it from
auto-generating the argument, so we can add an argument back with the
same name.
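A minimal sketch of the "omitted vs. -j1" distinction, using plain
argparse only. It does not show the ruffus side, and the names
build_parser and resolve_jobs are hypothetical, chosen for illustration.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(prog="example")
    # default=None lets us tell "-j omitted" apart from an explicit "-j 1".
    parser.add_argument("-j", "--jobs", type=int, default=None, metavar="N",
                        help="number of parallel jobs (default: all CPUs)")
    return parser


def resolve_jobs(argv):
    """Return the job count: the explicit -j value, else the CPU count."""
    options = build_parser().parse_args(argv)
    if options.jobs is None:
        # os.cpu_count() may return None on unusual platforms.
        return os.cpu_count() or 1
    return options.jobs
```

With this shape, resolve_jobs(["-j1"]) yields 1 while resolve_jobs([])
falls back to the CPU count.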
Because we don't really use ruffus's checkpoint feature, putting the
database in a permanent location does not help anything, but does cause
large database files and problems when .ruffus_history.sqlite needs
to be in a writable location.
Tesseract 3.03 has various quality problems, like wrong DPI, that are
fixed in Tesseract 3.04. The idea here is to introduce an option to let
OCRmyPDF select the rendering backend based on the options and system.
However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract does not pass all test cases, mainly
the one where --tesseract-timeout is triggered, and some others.
Someone reported a bug where the .png input to unpaper ended up being
type 'P' (palette) for some reason, which was not supported in unpaper.
Not sure how it happened, but it seemed easier to fix by explicitly
supporting it. Here we use png256 if it would capture all colors in the
input file. It's up to tesseract/reportlab to make use of the palette
PNG when rendering.
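A rough sketch of the png256 decision, under two assumptions that are
not stated above: that "capture all colors" means the input has at most
256 distinct colors, and that the fallback is Ghostscript's png16m
device. The function name choose_png_device is hypothetical.

```python
def choose_png_device(pixels):
    """Pick a PNG device name for an iterable of (r, g, b) tuples."""
    distinct = set()
    for pixel in pixels:
        distinct.add(pixel)
        if len(distinct) > 256:
            # Too many colors for an 8-bit palette (assumed fallback).
            return "png16m"
    return "png256"
```

Stopping as soon as the 257th distinct color appears avoids scanning the
whole image when a palette clearly cannot hold it.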
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ASCII encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in an hOCR file. Rework some surrounding code as
well and add multilingual test cases.
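One way to do an in-place rewrite with an explicit encoding and without
fileinput is sketched below. This is an illustration, not the code from
this change; rewrite_in_place and its parameters are hypothetical names.

```python
import os
import tempfile


def rewrite_in_place(path, transform, encoding="utf-8"):
    """Apply transform() to each line of a text file, replacing the file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so os.replace() stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="") as dst, \
                open(path, "r", encoding=encoding, newline="") as src:
            for line in src:
                dst.write(transform(line))
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial copy.
        os.unlink(tmp_path)
        raise
```

Passing the encoding to both handles keeps non-ASCII text from an hOCR
file intact across the rewrite.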
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types. Reportlab calls for
Pillow 2.4.0. On Ubuntu 14.04 LTS this will trigger an upgrade of
Pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution. Instead, you see nasty
stack traces.
So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
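A sketch of what such a run-time check could look like: try to encode a
tiny image in each format and report the ones that fail. This is not
necessarily how the check is implemented here; missing_pillow_formats is
a hypothetical name, and returning ["Pillow"] when the import fails is
an assumption made for the example.

```python
import io


def missing_pillow_formats(formats=("JPEG", "PNG")):
    """Return the formats Pillow cannot encode, or ["Pillow"] if absent."""
    try:
        from PIL import Image
    except ImportError:
        return ["Pillow"]
    missing = []
    for fmt in formats:
        try:
            # Encoding a tiny image exercises the compiled codec.
            Image.new("RGB", (8, 8)).save(io.BytesIO(), format=fmt)
        except (OSError, KeyError, ValueError):
            missing.append(fmt)
    return missing
```

A caller could exit early with a readable message whenever the returned
list is non-empty, instead of letting a stack trace surface later.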