It seems that Tesseract v4, on a platform where OpenMP is working correctly,
performs poorly with ocrmypdf because each Tesseract process will also soak
up all available CPUs. Running N^2 processes/threads on an N-core CPU where
each wants 100% of a CPU turns out to be detrimental.
So, for now, we restrict ocrmypdf w/tessv4 to a single Tesseract process at
a time. An alternative may be to limit OpenMP threads instead, if that
gives higher throughput.
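A minimal sketch of the alternative mentioned above, assuming the OpenMP
runtime that Tesseract is linked against honours the standard
OMP_THREAD_LIMIT environment variable. The helper names tesseract_env and
threads_per_worker are hypothetical and not part of ocrmypdf.

```python
import os


def tesseract_env(max_threads, base_env=None):
    """Return a copy of the environment with an OpenMP thread cap applied."""
    env = dict(os.environ if base_env is None else base_env)
    # OMP_THREAD_LIMIT is a standard OpenMP variable; whether a given
    # Tesseract build respects it is an assumption of this sketch.
    env["OMP_THREAD_LIMIT"] = str(max_threads)
    return env


def threads_per_worker(cpu_count, workers):
    """Split the cores between workers so the total stays near cpu_count."""
    return max(1, cpu_count // workers)
```

The resulting dictionary could be passed as env= to subprocess.run when
launching Tesseract, so that N workers with one thread each replace N
workers with N threads each.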
Because ruffus doesn’t handle exceptions well, I tended to call sys.exit
to make sure we got out of dodge when needed. However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now; everything else has to
raise an exception.
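The pattern described above can be sketched roughly as follows. The class
names and numeric codes are hypothetical placeholders, not the ones this
commit introduces; the point is only that each exception carries its
suggested exit code and a single entry point translates it.

```python
import sys


class ExitCodeException(Exception):
    """Base class; exit_code is the suggested process exit status."""
    exit_code = 15  # hypothetical default


class MissingDependencyError(ExitCodeException):
    exit_code = 3  # hypothetical value


class EncryptedPdfError(ExitCodeException):
    exit_code = 8  # hypothetical value


def run_pipeline(encrypted):
    """Stand-in for library code: it raises and never calls sys.exit."""
    if encrypted:
        raise EncryptedPdfError("input PDF is encrypted")


def main(encrypted=False):
    """Stand-in for the entry point: maps exceptions to exit codes."""
    try:
        run_pipeline(encrypted)
    except ExitCodeException as e:
        print(e, file=sys.stderr)
        return e.exit_code
    return 0
```

Under this layout only the entry point would call sys.exit(main()), and
API users can catch the exceptions themselves.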
For now, the worker raising a fatal exception logs its messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object, so we
just check “what is the name of the exception class that caused ruffus
to throw a RethrownJobError?”
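A rough sketch of matching on the class name alone, assuming the name is
available as a string once the job error has been unpacked. The table, the
names in it, and the codes are hypothetical, and no ruffus API is used
here.

```python
# Hypothetical mapping from exception class name to suggested exit code.
EXIT_CODES_BY_NAME = {
    "MissingDependencyError": 3,
    "EncryptedPdfError": 8,
}


def exit_code_for(exception_name, default=15):
    """Look up an exit code from a possibly module-qualified class name."""
    # Keep only the last dotted component, e.g. "pkg.mod.Cls" -> "Cls".
    short_name = exception_name.rsplit(".", 1)[-1]
    return EXIT_CODES_BY_NAME.get(short_name, default)
```

Matching on names is fragile compared with catching real exception
objects, which is why it is described as a stopgap.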
Also fixed along the way: the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 on failure to find, which is truthy).
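To illustrate the str.find pitfall: the function names below are
hypothetical, and the sample strings are made up for the demonstration.

```python
def mentions_buggy(output, needle):
    """Wrong: str.find returns -1 (truthy) when missing, 0 (falsy) at start."""
    return bool(output.find(needle))


def mentions(output, needle):
    """Right: the 'in' operator gives a real boolean membership test."""
    return needle in output
```

With the buggy version, a missing substring reads as found, and a match
at position 0 reads as not found.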
This leaves __main__.py to handle command line arguments while pipeline.py
runs the pipeline - mostly. They are still somewhat intertwined, with
__main__.py doing essential things for pipeline.py, etc., and some
helper functions that could go in their own module.
All tests pass after this major refactor.
Redirect stderr->stdout to hopefully make GS output easier to work with
overall, since the previous code didn’t seem to account for mixed use
properly.
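A minimal sketch of the redirection using the standard subprocess module.
A small Python child process stands in for Ghostscript here, and
run_merged is a hypothetical helper name.

```python
import subprocess
import sys


def run_merged(args):
    """Run a command with stderr folded into stdout; return (code, text)."""
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # send stderr to the same pipe as stdout
        text=True,
    )
    return proc.returncode, proc.stdout


# Stand-in child that writes one line to each stream.
child = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); sys.stdout.flush(); "
    "print('to stderr', file=sys.stderr)",
]
returncode, output = run_merged(child)
```

Both lines end up in a single string, so the caller only has one stream
to parse.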
As of Pillow 3.1.1, our minimum version, the JPEG and PNG codecs are
required by default for a successful installation, effectively solving
the problem of Pillow being installed without libjpeg/libpng.
-update requirements.txt and dev_requirements.txt to more recent versions
-setup.py updated to Ubuntu 14.04 rather than 12.04 backports
-request at least Pillow 3.1.1 now (since this makes jpeg/png mandatory)