JHOVE is not an effective PDF/A validator, as detailed in this article:
http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/
In short, it's buggy. Out of 670 invalid PDF/A files in a test suite,
it flagged only 5. It only looks for certain problems that
Ghostscript-generated PDFs are unlikely to have. So use qpdf as a final
check for general ill-formed PDF problems since it is quite reliable.
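A minimal sketch of what that final check could look like, assuming
qpdf is on the PATH. The function name qpdf_check and the injectable
run parameter are made up for illustration, and treating warnings as
a pass is an assumption, not something this project has decided.

```python
import subprocess

# Exit codes as documented by qpdf: 0 = clean, 2 = errors, 3 = warnings only.
QPDF_OK = 0
QPDF_ERRORS = 2
QPDF_WARNINGS = 3


def qpdf_check(pdf_path, run=subprocess.run):
    """Return True if `qpdf --check` finds no errors in pdf_path.

    `run` is injectable (hypothetical convenience) so the logic can be
    exercised without qpdf installed.
    """
    result = run(
        ["qpdf", "--check", pdf_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    # Assumption: warnings are tolerated, only hard errors fail the check.
    return result.returncode in (QPDF_OK, QPDF_WARNINGS)
```

Anything other than 0 or 3 (including unexpected codes) is treated as
a failure, which errs on the side of rejecting the file.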
JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF
support. I also don't know if it's appropriate to bundle JHOVE, which
is LGPL-licensed, under this project and its current license.
Removing a dependency on Java is a huge win. A world with less Java is
a world with less AbstractFactoryConstructorInterfaces.
ruffus swallows the return code if, in the process of handling an
exception, we hit an error in ruffus' own code, which can happen. So
pick through its error stack and find out if there's an interesting
return code in there. Had to use eval() of all things.
Also suppress the stack trace for normal error conditions that don't
need one.
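For what it's worth, here is a sketch of the same idea using
ast.literal_eval() in place of eval(). This is not the code in this
commit. It assumes the error stack can be reduced to a list of strings
holding stringified exception arguments such as "(2, 'bad input')";
that shape, and the name find_exit_code, are hypothetical.

```python
import ast


def _is_code(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def find_exit_code(error_values, default=1):
    """Return the first integer exit code found in stringified exception args.

    `error_values` is a hypothetical list of strings standing in for
    whatever the pipeline's error stack actually contains.
    """
    for text in error_values:
        try:
            # literal_eval only accepts Python literals, unlike eval().
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            continue  # not a Python literal, skip this record
        if isinstance(parsed, tuple) and parsed and _is_code(parsed[0]):
            return parsed[0]
        if _is_code(parsed):
            return parsed
    return default
```

Records that are not plain literals are skipped rather than executed,
and the default is returned if nothing useful turns up.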
Some versions of tesseract installed by homebrew end up without a
functional tessdata folder, and tesseract is not helpful in this
situation, so add a new test to make sure our output is at least
indicative of the problem.
In the process of properly handling return codes I discovered
test_override_metadata triggers a NPE inside JHOVE probably due to the
Unicode character checking. This could be specific to my JRE (1.6.0_65,
Oracle) but it's probably JHOVE's fault. A valid PDF/A (per Acrobat)
is still generated.
Modified pipeline to fix regression and return the proper error code if
we did not produce a PDF/A as expected. The wrapper forces the output
to be PDF 1.3 which is not PDF/A compliant.
The funny thing is that in some cases JHOVE incorrectly states that a
file is PDF/A-1b compliant, well formed and valid, even when it is not
according to Acrobat XI and is missing the PDF/A metadata marker, as
far as I can tell. JHOVE may not be as beneficial as hoped.
Drop two dependencies and replace them with one that does the job of
both. Smells like progress.
mupdf does PDF file repair and rendering
poppler does rendering and page splitting
qpdf does PDF file repair and page splitting
ghostscript does PDF file repair, rendering, and page splitting (sort of)
So we use qpdf. Ghostscript's page splitting is supposedly less
efficient because it reprints the page (PDF -> Postscript -> PDF) and
possibly loses quality. qpdf's library could be used to improve
performance.
This causes a slight performance regression:
py.test tests/test_main.py::test_maximum_options went from 187 seconds
up to 192. This is likely due to O(n) serialized invocations of qpdf
compared to a single serialized call to pdfseparate. Could improve on
this situation by using the example code in qpdf (pdf-split-pages.cc),
or by creating marker files in split_pages() and then writing a new
@transform function that would split pages on each CPU. Probably not
worth it, overall, unless this causes problems on files with hundreds
of pages.
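A rough sketch of the per-page invocations and one way to run them
concurrently. It uses a thread pool rather than the ruffus @transform
approach mentioned above, and the function names, output file pattern
and worker count are all made up for illustration. It assumes qpdf is
on the PATH and that the page count is already known.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_command(input_pdf, page, output_pdf):
    """Build the qpdf command that copies one 1-based page to output_pdf."""
    return ["qpdf", input_pdf, "--pages", input_pdf, str(page), "--", output_pdf]


def split_pages(input_pdf, npages, out_pattern="{:06d}.page.pdf",
                run=subprocess.check_call, workers=4):
    """Run one qpdf invocation per page, several at a time.

    `run` is injectable (hypothetical convenience) so the function can
    be tried without qpdf. Returns the output file names in page order.
    """
    commands = [
        split_command(input_pdf, n, out_pattern.format(n))
        for n in range(1, npages + 1)
    ]
    # Threads are enough here: the real work happens in the qpdf processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises the first failure, if any.
        list(pool.map(run, commands))
    return [cmd[-1] for cmd in commands]
```

This still costs one process launch per page, so it only hides the
O(n) overhead behind parallelism rather than removing it.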
It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so it's progress. Not available for
Tesseract 3.02, obviously, so both modes need to remain available.