It seems that Tesseract v4, on a platform where OpenMP works correctly,
will perform poorly with ocrmypdf because each Tesseract process will
also soak up all available CPUs. Running N^2 processes/threads on an
N-core CPU where each wants 100% of CPU turns out to be detrimental.
So, we restrict ocrmypdf w/tessv4 to a single Tesseract process at a
time, for now. An alternative may be to limit OpenMP threads instead,
if that gives higher throughput.
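The alternative mentioned above could look roughly like the sketch below, which caps OpenMP threads through the standard OMP_THREAD_LIMIT environment variable. The helper name tesseract_env and the default limit of 1 are assumptions for illustration, not ocrmypdf code, and whether this beats serializing Tesseract has not been measured.

```python
import os


def tesseract_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OpenMP threads capped.

    Hypothetical helper: OMP_THREAD_LIMIT is a standard OpenMP variable,
    but the function name and default value are illustrative assumptions.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env
```

The returned dict could be passed as the env argument of subprocess.run when launching a child process.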
Because ruffus doesn’t handle exceptions well, I tended to call sys.exit
to make sure we got out of dodge when needed. However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now; everything else has to
raise an exception.
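A minimal sketch of the pattern described above, where exceptions carry a suggested exit code and only the entry point turns them into a process exit status. The class names, the exit code values and run_pipeline are hypothetical placeholders, not the actual ocrmypdf definitions.

```python
import sys


class ExitCodeException(Exception):
    """Base class for errors that suggest a process exit code (hypothetical)."""
    exit_code = 15  # placeholder value, not ocrmypdf's actual code


class InputFileError(ExitCodeException):
    exit_code = 2  # placeholder value


def run_pipeline(path):
    # Library code raises instead of calling sys.exit.
    if not path.endswith(".pdf"):
        raise InputFileError("not a PDF: " + path)
    return 0


def main(argv):
    # Only the command line entry point maps exceptions to exit codes.
    try:
        return run_pipeline(argv[0])
    except ExitCodeException as e:
        print(e, file=sys.stderr)
        return e.exit_code
```

In this sketch, __main__.py would be the one place that calls sys.exit(main(sys.argv[1:])).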
For now the worker raising a fatal exception is logging messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object, so we
just check the name of the exception class that caused ruffus to throw
a RethrownJobError.
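The name-based check could be sketched as below. This assumes the exception class name has already been extracted from the RethrownJobError as a string; how ruffus exposes it is not shown here. The names and exit codes in the table are hypothetical.

```python
# Hypothetical exception names and exit codes, for illustration only.
EXIT_CODES_BY_NAME = {
    "InputFileError": 2,
    "EncryptedPdfError": 8,
}
DEFAULT_EXIT_CODE = 15


def exit_code_for(exception_name):
    """Map an exception class name (a string) to an exit code."""
    # The name may arrive qualified, e.g. "pkg.module.InputFileError".
    short_name = exception_name.rsplit(".", 1)[-1]
    return EXIT_CODES_BY_NAME.get(short_name, DEFAULT_EXIT_CODE)
```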
Also fixed along the way was the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 when the substring is not found,
and -1 is truthy).
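The str.find pitfall can be shown in a few lines. The sample output string is made up for illustration; the replacement uses the in operator, which returns a real boolean.

```python
output = "This file is encrypted"  # made-up sample text

# str.find returns an index, or -1 when the substring is absent.
assert output.find("password") == -1
assert bool(output.find("password")) is True   # -1 is truthy: wrong branch
assert bool(output.find("This")) is False      # index 0 is falsy: also wrong


def mentions(text, needle):
    """Substring test that is safe to use in boolean logic."""
    return needle in text
```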
This leaves __main__.py to handle command line arguments while pipeline.py
runs the pipeline - mostly. They are still somewhat intertwined, with
__main__.py doing essential things for pipeline.py, etc., and some
helper functions that could go in their own module.
All tests pass after this major refactor.
Redirect stderr->stdout to hopefully make GS output easier to work with
overall, since the previous code didn’t seem to account for mixed use
properly.
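A sketch of merging the two streams with the standard subprocess module, so a caller only has one stream to parse. The helper name run_merged is an assumption, and this is not necessarily how ocrmypdf invokes Ghostscript.

```python
import subprocess
import sys


def run_merged(args):
    """Run a command with stderr folded into stdout; return (code, text)."""
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # stderr goes to the same pipe as stdout
        universal_newlines=True,   # decode output to str
    )
    return proc.returncode, proc.stdout
```

Both streams end up in one string, although the relative order of lines from the two streams depends on the child's buffering.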
As of 3.1.1, our minimum version, these codecs are now required by
default for a successful installation, effectively solving the problem
of Pillow installed without libjpeg/libpng.
-update requirements.txt and dev_requirements.txt to more recent versions
-setup.py updated to Ubuntu 14.04 rather than 12.04 backports
-request at least Pillow 3.1.1 now (since this makes jpeg/png mandatory)
Never really investigated the reason why ruffus returns a mutex to go
along with its logger. It seems that the mutex is only needed if one
wanted to make multiple successive calls to a log function and have
them appear atomically. It is not needed to protect the logger
proxy because accessing the proxy triggers IPC in the child process
that handles the multiprocessing.Manager() object.
The logging wrapper only logs one line at a time, so the mutex does not
actually protect logging sequence. Cut it.
Also, manager.Lock() returns a proxy for a threading.Lock object, so its
purpose is actually to help processes share a thread-level lock. It
would be more appropriate to use a semaphore-based multiprocessing.Lock.
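If several log lines ever did need to appear together, a process-level lock could guard them, roughly as sketched here. The function name log_lines_atomically and the emit callback are assumptions for illustration; the removed ruffus mutex was not used this way.

```python
import multiprocessing


def log_lines_atomically(lock, emit, lines):
    """Emit several lines while holding the lock so they stay contiguous."""
    with lock:
        for line in lines:
            emit(line)


def make_lock():
    # multiprocessing.Lock works across processes, unlike threading.Lock.
    return multiprocessing.Lock()
```

A multiprocessing.Lock has to be handed to child processes when they are created (by inheritance), not sent to them later as a task argument.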
I experimented with the idea of using asyncio-based processing but
realized that it does not solve the import-time binding problem
that is the real issue. Therefore the simpler refactoring is to convert
to ruffus-oo syntax and get things working again.
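One reading of the import-time binding problem is that decorator arguments are evaluated when the module is imported, before command line options are known. The toy below illustrates that reading in plain Python. It does not use ruffus, and the transform decorator, OPTIONS and TASKS are made-up stand-ins.

```python
OPTIONS = {"input_file": None}  # filled in later by argument parsing
TASKS = []


def transform(input_file):
    """Toy decorator standing in for a decorator-based pipeline API."""
    def register(func):
        TASKS.append((func.__name__, input_file))
        return func
    return register


# Decorator arguments are evaluated at import time, so this task
# captures None rather than the file the user asks for later.
@transform(OPTIONS["input_file"])
def ocr_page():
    pass


def build_pipeline(options):
    """Object-style construction: wire tasks up after options are known."""
    return [("ocr_page", options["input_file"])]
```

Building the task list inside a function defers the binding until the options exist, which is the property the object-style syntax is meant to provide.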
build_pipeline() is really ugly at the moment. The old syntax had its
advantages.
This test reproduces the complete pipeline graph but does not work
otherwise.