Because ruffus doesn’t handle exceptions well I tended to call sys.exit
to make sure we got out of dodge when needed. However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now; everything else must
raise an exception.
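
A minimal sketch of the pattern described above, not the actual code in this change: the class names, exit code values, and the run_pipeline/main helpers are all hypothetical and only illustrate exceptions that carry a suggested exit code, with a single entry point translating them.

```python
import sys


class ExitCodeError(Exception):
    """Base class for errors that carry a suggested process exit code."""
    exit_code = 1  # hypothetical default value


class InputFileError(ExitCodeError):
    exit_code = 2  # hypothetical value


class EncryptedPdfError(ExitCodeError):
    exit_code = 8  # hypothetical value


def run_pipeline(path):
    """Stand-in for library code: it raises and never calls sys.exit."""
    if path.endswith(".enc.pdf"):
        raise EncryptedPdfError("input PDF is encrypted")
    return "ok"


def main(argv):
    """Entry-point logic: translate exceptions into a return code."""
    try:
        run_pipeline(argv[0])
    except ExitCodeError as e:
        print(f"error: {e}", file=sys.stderr)
        return e.exit_code
    return 0


# Only the command line entry point would then do:
#     sys.exit(main(sys.argv[1:]))
```

Callers using the Python API can catch ExitCodeError (or a subclass) themselves and never have their process terminated from inside the library.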
For now, a worker that raises a fatal exception logs its message rather
than passing an exception object that carries the message, mainly
because ruffus doesn’t properly marshal the exception object. Instead we
just check the name of the exception class that caused ruffus to throw
a RethrownJobError.
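
A rough sketch of the name-based check, under stated assumptions: it assumes the job errors arrive as 5-tuples of (task name, job name, exception class name, exception value, stack trace), all as strings, and the name-to-code table and function name are hypothetical. It deliberately does not import ruffus.

```python
# Hypothetical mapping from exception class name to suggested exit code.
EXIT_CODES_BY_NAME = {
    "EncryptedPdfError": 8,
    "InputFileError": 2,
}


def exit_code_from_job_errors(job_errors, default=15):
    """Return the exit code for the first recognised exception name.

    job_errors is assumed to be an iterable of 5-tuples:
    (task_name, job_name, exception_name, exception_value, stack_trace)
    """
    for _task, _job, exc_name, _value, _stack in job_errors:
        # The name may be module-qualified, so compare the last part only.
        short_name = exc_name.rsplit(".", 1)[-1]
        if short_name in EXIT_CODES_BY_NAME:
            return EXIT_CODES_BY_NAME[short_name]
    return default
```

Matching on the class name is fragile (a rename silently falls through to the default), which is why this is described as a stopgap until the exception object itself can be passed along.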
Also fixed along the way: the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 when the substring is not found,
which is truthy, and 0 when it is found at the start, which is falsy).
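
To illustrate the str.find pitfall, here is a small standalone example; the sample output string is made up and is not real tool output.

```python
output = "Error: file is encrypted"  # made-up sample text

# Buggy: find() returns -1 when the text is absent, and -1 is truthy;
# it returns 0 when the text is at the very start, and 0 is falsy.
buggy_missing = bool(output.find("password"))   # truthy although absent
buggy_at_start = bool(output.find("Error"))     # falsy although present

# Correct: use the `in` operator for a membership test.
ok_missing = "password" in output
ok_at_start = "Error" in output
```

If the index is actually needed, comparing explicitly with `output.find(...) != -1` also works.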
This leaves __main__.py to handle command line arguments while pipeline.py
runs the pipeline - mostly. They are still somewhat intertwined, with
__main__.py doing essential things for pipeline.py, etc., and some
helper functions that could go in their own module.
All tests pass after this major refactor.
Some versions of tesseract installed by homebrew end up without a
functional tessdata folder, and tesseract is not helpful in this
situation, so add a new test to make sure our output is at least
indicative of the problem.
In the process of properly handling return codes I discovered
test_override_metadata triggers an NPE inside JHOVE, probably due to the
Unicode character checking. This could be specific to my JRE (1.6.0_65,
Oracle) but it's probably JHOVE's fault. A valid PDF/A (per Acrobat)
is still generated.