-update requirements.txt and dev_requirements.txt to more recent versions
-setup.py updated to Ubuntu 14.04 rather than 12.04 backports
-request at least Pillow 3.1.1 now (since this makes jpeg/png mandatory)
Executing a package with python -m packagename will check for
__main__.py inside the package. In other words main.py should have
always been named __main__.py.
In the unlikely event that someone depends on "import ocrmypdf.main"
being meaningful, main.py continues to exist and replicates the
behavior of __main__. (It's unlikely because import ocrmypdf.main does
unpythonic ruffus-related things at import time, essentially
configuring itself to work with sys.argv. To fix another day.)
This should solve the problem of Debian needing to run test suites
before installation and afterwards for continuous integration without
having to patch either file, as python -m ocrmypdf will follow import
order. That is, if the current directory contains "ocrmypdf/" (e.g.
staging a new version) then that will be tested, else sys.path will
be checked.
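A rough sketch of the lookup order described above, using a throwaway package instead of ocrmypdf. The names demo_pkg, make_pkg and which_copy_runs are invented for the demo, and PYTHONPATH stands in for an installed copy; a real install would live in site-packages.

```python
import os
import subprocess
import sys
import tempfile


def make_pkg(root, label):
    """Create root/demo_pkg/ with a __main__.py that prints `label`."""
    pkg = os.path.join(root, "demo_pkg")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    with open(os.path.join(pkg, "__main__.py"), "w") as f:
        f.write("print({!r})\n".format(label))


def which_copy_runs(cwd, installed_dir):
    """Run `python -m demo_pkg` from cwd and return what it printed."""
    # PYTHONPATH plays the role of the installed location (an assumption
    # made for this demo only).
    env = dict(os.environ, PYTHONPATH=installed_dir)
    # If PYTHONSAFEPATH is set, the current directory is not searched.
    env.pop("PYTHONSAFEPATH", None)
    out = subprocess.check_output(
        [sys.executable, "-m", "demo_pkg"],
        cwd=cwd, env=env, universal_newlines=True,
    )
    return out.strip()


def demo():
    with tempfile.TemporaryDirectory() as top:
        installed = os.path.join(top, "installed")
        staging = os.path.join(top, "staging")
        elsewhere = os.path.join(top, "elsewhere")
        make_pkg(installed, "installed copy")
        make_pkg(staging, "staged copy")
        os.mkdir(elsewhere)
        # From the staging directory, the local package shadows the
        # installed one; from anywhere else, the installed one is used.
        return (which_copy_runs(staging, installed),
                which_copy_runs(elsewhere, installed))
```

Calling demo() runs the package once from the staging directory and once from an unrelated directory, so the same command line can exercise either copy without patching anything.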
I found this issue in ruffus 2.6.3
https://github.com/bunbun/ruffus/issues/65
also discussed here
https://github.com/bunbun/ruffus/pull/67
In ruffus 2.6.3, RethrownJobError doesn't follow the normal exception
conventions, so it causes problems when it crosses process boundaries.
This change examines the various forms of ruffus exception objects
that can appear in 2.6.3 and parses them more carefully. It
also removes any direct posting of the exception to the logger because
this triggers another serializing of the exception object, mutating it
further.
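For illustration only, here is one common way an exception class can break the usual conventions and then fail when serialized between processes. I am not claiming this is exactly what RethrownJobError does; both classes and the helper are made up. Pickle rebuilds an exception by calling its class with exc.args, so an __init__ whose signature does not match args cannot be reconstructed on the other side.

```python
class UnconventionalError(Exception):
    """Hypothetical: __init__ takes two arguments but stores only one in args."""

    def __init__(self, task_name, detail):
        super().__init__("job failed")
        self.task_name = task_name
        self.detail = detail


class ConventionalError(Exception):
    """Hypothetical: args mirrors the __init__ signature."""

    def __init__(self, task_name, detail):
        super().__init__(task_name, detail)
        self.task_name = task_name
        self.detail = detail


def can_be_rebuilt(exc):
    """Mimic pickle's reconstruction step: call the class with exc.args."""
    factory, args = exc.__reduce__()[:2]
    try:
        clone = factory(*args)
    except TypeError:
        # The class could not be called with its own args.
        return False
    return clone.args == exc.args
```

With these definitions, can_be_rebuilt(UnconventionalError("ocr", "boom")) is False because the class gets called with a single argument, while the conventional version round-trips.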
It seems the normal thing is to wire up python setup.py test to invoke
the test suite, rather than calling py.test directly. This may be the
reason for the past chain of cffi-related commits.
Switched from Ubuntu to debian:stretch because stretch has more recent
versions of our binary packages and starts smaller. In particular,
stretch has both pillow==2.9.0 and reportlab==3.2.0 available as system
packages, which saves the considerable hassle of installing a toolchain.
Instead, a pyvenv is set up with access to the system's site-packages (note:
needs two steps), making the binary-dependent packages available. Then
the remaining packages are installed into the pyvenv with --no-cache-dir
to avoid saving files. And there we are.
Image is still very large (>500 MB), but programs like reportlab require
font rendering capabilities so they pull in large portions of the Linux
graphics stack. Not much will shrink that.
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types. Reportlab calls for
pillow 2.4.0. On Ubuntu 14.04 LTS this will trigger an upgrade of
pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution. Instead, you see nasty
stack traces.
So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
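A minimal sketch of such a check, not necessarily the one committed here: try to encode a tiny image in each format and report what fails. The function name is hypothetical, and it assumes a missing encoder shows up as OSError (or KeyError for an unregistered format) when saving.

```python
import io


def missing_pillow_formats(formats=("JPEG", "PNG")):
    """Return the formats Pillow cannot write; an empty list means all is well."""
    # Imported here so that a missing Pillow surfaces as ImportError to the
    # caller rather than when this module is loaded.
    from PIL import Image

    missing = []
    for fmt in formats:
        try:
            # Encoding an 8x8 image in memory exercises the real encoder.
            Image.new("RGB", (8, 8)).save(io.BytesIO(), format=fmt)
        except (OSError, KeyError):
            missing.append(fmt)
    return missing
```

A caller could abort at startup with a readable message if the returned list is not empty, instead of letting a stack trace appear mid-job.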
Drop two dependencies and replace them with one that does the job of
both. Smells like progress.
mupdf does PDF file repair and rendering
poppler does rendering and page splitting
qpdf does PDF file repair and page splitting
ghostscript does PDF file repair, rendering, and page splitting (sort of)
So we use qpdf. Ghostscript's page splitting is supposedly less
efficient because it reprints the page (PDF -> Postscript -> PDF) and
possibly loses quality. qpdf's library could be used to improve
performance.
This causes a slight performance regression:
py.test tests/test_main.py::test_maximum_options went from 187 seconds
up to 192. This is likely due to O(n) serialized invocations of qpdf
compared to a single serialized call to pdfseparate. Could improve on
this situation by using the example code in qpdf (pdf-split-pages.cc),
or by creating marker files in split_pages() and then writing a new @transform
function that would split pages on each CPU. Probably not worth it,
overall, unless this causes problems on files with hundreds of pages.
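To make the O(n) cost concrete, here is a sketch of per-page splitting as one qpdf invocation per page. The function names and the output file naming are assumptions for illustration; only the qpdf --pages syntax is taken from qpdf itself.

```python
import subprocess


def qpdf_split_commands(input_pdf, n_pages, out_pattern="{:06d}.page.pdf"):
    """Build one qpdf command line per page (pages are 1-based)."""
    commands = []
    for page in range(1, n_pages + 1):
        commands.append([
            "qpdf", input_pdf,
            "--pages", input_pdf, str(page), "--",
            out_pattern.format(page),
        ])
    return commands


def split_pages_serially(input_pdf, n_pages):
    """Run the commands one after another; requires qpdf on PATH."""
    for cmd in qpdf_split_commands(input_pdf, n_pages):
        subprocess.check_call(cmd)
```

Each command is independent of the others, so the same list could be handed to a pool of workers if the serialized loop ever becomes a bottleneck.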