Executing a package with python -m packagename will check for
__main__.py inside the package. In other words main.py should have
always been named __main__.py.
In the unlikely event that someone depends on "import ocrmypdf.main"
being meaningful, main.py continues to exist and replicates the
behavior of __main__. (It's unlikely because import ocrmypdf.main does
unpythonic ruffus-related things at import time, essentially
configuring itself to work with sys.argv. To fix another day.)
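As a rough illustration of the layout described above, the sketch below
builds a throwaway package with a __main__.py and a thin main.py shim,
then runs both through python -m. The package name demo_pkg, the run()
function and the helper names are hypothetical and are not taken from
ocrmypdf; the real main.py does more at import time.

```python
import os
import subprocess
import sys
import tempfile
import textwrap


def build_demo_package(root):
    """Create a hypothetical package 'demo_pkg' under root."""
    pkg = os.path.join(root, "demo_pkg")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("")
    # __main__.py is what "python -m demo_pkg" executes.
    with open(os.path.join(pkg, "__main__.py"), "w") as f:
        f.write(textwrap.dedent('''
            def run():
                print("hello from demo_pkg")

            if __name__ == "__main__":
                run()
        '''))
    # main.py stays as a shim so the old module name keeps working.
    with open(os.path.join(pkg, "main.py"), "w") as f:
        f.write(textwrap.dedent('''
            from demo_pkg.__main__ import run

            if __name__ == "__main__":
                run()
        '''))
    return pkg


def run_module(root, module):
    """Run "python -m module" with root as the working directory."""
    result = subprocess.run(
        [sys.executable, "-m", module],
        cwd=root, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo():
    """Return the output of both entry points of the demo package."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_package(root)
        return run_module(root, "demo_pkg"), run_module(root, "demo_pkg.main")
```

Both entry points end up calling the same run(), so the shim adds no
behavior of its own.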
This should solve the problem of Debian needing to run test suites
before installation and afterwards for continuous integration without
having to patch either file, as python -m ocrmypdf will follow import
order. That is, if the current directory contains "ocrmypdf/" (e.g.
staging a new version) then that will be tested, else sys.path will
be checked.
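The lookup order can be sketched with the standard import machinery.
This is only an illustration: demo_pkg and the "staging" and
"installed" directories are made-up stand-ins for a source checkout and
a site-packages directory, and the search path is passed explicitly
instead of modifying sys.path.

```python
import importlib.machinery
import os
import tempfile


def make_package(parent, name):
    """Create an empty package called name inside parent."""
    pkg = os.path.join(parent, name)
    os.makedirs(pkg)
    init = os.path.join(pkg, "__init__.py")
    with open(init, "w") as f:
        f.write("")
    return init


def resolve(name, search_path):
    """Return the file that name would be imported from, or None."""
    spec = importlib.machinery.PathFinder.find_spec(name, search_path)
    return spec.origin if spec else None


def demo():
    """Show that the first matching entry on the search path wins."""
    with tempfile.TemporaryDirectory() as tmp:
        staging = os.path.join(tmp, "staging")
        installed = os.path.join(tmp, "installed")
        empty = os.path.join(tmp, "empty")
        os.makedirs(empty)
        staged_init = make_package(staging, "demo_pkg")
        installed_init = make_package(installed, "demo_pkg")
        # A local copy is listed first, so it shadows the installed one.
        local_first = resolve("demo_pkg", [staging, installed])
        # No local copy, so the lookup falls through to the installed one.
        fallback = resolve("demo_pkg", [empty, installed])
        return (
            os.path.samefile(local_first, staged_init),
            os.path.samefile(fallback, installed_init),
        )
```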
The recent commit to accept files from stdin broke the feature of
returning the input filename on an error, returning the temp filename
instead, which is confusing.
I tried "qpdf merge + PyPDF2 metadata patching" first. The problem is
that PyPDF2 produces a PDF 1.3 file by default and generally I have less
confidence in it.
New approach is to stuff the Document Info metadata in the first page
with PyPDF2, cross fingers and use qpdf to merge. It's not quite as
clean and might harm the first page, but it's better than shipping
files produced by PyPDF2.
Tests mostly passing. For the moment this is the new default.
However, PyPDF2 produces a PDF-1.3, which will be wrong for some
contents and possibly should be repaired with qpdf. Again.
Looks like it could work better to merge with PyPDF2 and fix everything
with qpdf.
This test is exercised by page 4 of multipage.pdf. If all images are
JPEGs, and one of deskew/clean removes DPI information, make sure that
we can get the right information back and that the DPI stays square.
Some called functions are particular about the data format of DPI and
don't like to deal with the Decimal() returned by PyPDF2. Convert to
float and int where needed.
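A minimal sketch of that kind of conversion, assuming the DPI arrives
as a pair of Decimal values. The helper names dpi_as_floats and
dpi_as_int are hypothetical and are not the functions used in the
code base. Mixing Decimal with float in arithmetic raises TypeError,
which is one way a callee can choke on the raw values.

```python
from decimal import Decimal


def dpi_as_floats(dpi):
    """Convert an (x, y) DPI pair of Decimal values to plain floats."""
    xres, yres = dpi
    return float(xres), float(yres)


def dpi_as_int(value):
    """Round a single DPI value to the nearest int."""
    return round(float(value))


def scale_fails_with_decimal(value, factor=1.5):
    """Return True if multiplying value by a float raises TypeError."""
    try:
        value * factor
    except TypeError:
        return True
    return False
```

For example, dpi_as_floats((Decimal("300"), Decimal("300"))) gives two
equal floats, so a square DPI stays square after conversion.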
I found this issue in ruffus 2.6.3
https://github.com/bunbun/ruffus/issues/65
also discussed here
https://github.com/bunbun/ruffus/pull/67
ruffus 2.6.3's RethrownJobError doesn't follow the normal exception
conventions, so its exceptions cause problems when they cross process
boundaries.
This change examines the various forms of ruffus exception objects
that can appear in 2.6.3 and parses them more carefully. It
also removes any direct posting of the exception to the logger because
this triggers another serializing of the exception object, mutating it
further.
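The general failure mode can be reproduced without ruffus. In the
sketch below, OddError is a made-up exception class (not
RethrownJobError) whose __init__ takes required arguments but does not
pass them to Exception.__init__, so it cannot be rebuilt after
pickling. summarize_exception is a hypothetical helper showing one way
to reduce an exception to plain data before it is logged or sent to
another process; it is not the parsing this change implements.

```python
import pickle


class OddError(Exception):
    """Hypothetical exception that ignores the usual args convention."""

    def __init__(self, job, detail):
        super().__init__()  # args stays empty, so unpickling cannot rebuild it
        self.job = job
        self.detail = detail


def survives_pickle(obj):
    """Return True if obj can be pickled and unpickled without error."""
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception:
        return False
    return True


def summarize_exception(exc):
    """Reduce an exception to a dict of plain strings."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "attributes": {k: repr(v) for k, v in vars(exc).items()},
    }
```

The summary contains only strings, so it can be pickled or logged any
number of times without touching the original exception object.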
The purpose of the change that caused the problem was a minor
optimization for the tesseract renderer path that had it pull an image
from select_image_for_pdf so that it could use a JPEG instead of PNG,
instead of taking it from preprocess_clean where it would only get a PNG
and make large files.