The real issue, though, was that the ruffus pipeline cannot be executed
twice in the same process because it relies on global variables.
The new OO pipeline in ruffus 2.6 would be one resolution that would
allow for more comprehensive testing as opposed to farming out the
execution to subprocess and inspecting the results, as is currently
done.
Specifically it trips over the need to reimport ocrmypdf.main. That in
turn raises questions about whether to make that function into an
external script that imports ocrmypdf... or something else. It would be
possible with a loop that manipulates sys.argv and then reloads
ocrmypdf.main; might need that anyway.
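A minimal sketch of the reload loop mentioned above, assuming the module
reads sys.argv when its module-level code runs. The helper names
patched_argv and run_module_fresh are hypothetical and not part of
ocrmypdf; whether ocrmypdf.main tolerates being reloaded is not verified
here.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv, restoring it afterwards."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


def run_module_fresh(module_name, argv):
    """Import module_name, or reload it if already imported, under argv.

    Module-level code runs again on reload, so a module that parses
    sys.argv at import time sees the new arguments.
    """
    with patched_argv(argv):
        if module_name in sys.modules:
            return importlib.reload(sys.modules[module_name])
        return importlib.import_module(module_name)
```

A test loop could then call run_module_fresh("ocrmypdf.main", args) once
per case. Note that importlib.reload only re-executes the named module;
state held in other modules, such as globals inside ruffus, is left
alone, so this may not be sufficient on its own.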
It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so it's progress. Not available for
Tesseract 3.02 obviously, so both modes need to remain available.
GitHub supports both, and PyPI expects .rst files, so use .rst and make
everyone happy.
Auto-converted using pandoc
find . -name '*.md' | parallel pandoc --from=markdown --to=rst --output='{.}.rst' '{}'
http://bfroehle.com/2013/04/26/converting-md-to-rst/
What a pain getting Unicode right, but there it is.
I cannot find anything to confirm that it is acceptable to put the PDF/A
definition file at the end of the Ghostscript inputs. I did this because
Ghostscript seems to copy document info from the last document on the
list, so in the normal order reportlab's information "wins". Putting the
definition file last fixes that issue. It matters because reportlab
'helpfully' fills in all of those fields even if it does not have
information.
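The input ordering described above can be sketched as an argument list.
The switches shown are standard Ghostscript options, but the exact set
used by the pipeline is an assumption, and the function name and file
names are hypothetical.

```python
def ghostscript_pdfa_args(output_pdf, page_pdfs, pdfa_def):
    """Build a Ghostscript argument list with the PDF/A definition last."""
    return [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-dPDFA",
        "-sDEVICE=pdfwrite",
        "-sOutputFile=" + output_pdf,
        *page_pdfs,
        # Deliberately last: document info appears to be taken from the
        # final input, so this keeps the page PDFs from overriding it.
        pdfa_def,
    ]


args = ghostscript_pdfa_args("out.pdf", ["p1.pdf", "p2.pdf"], "pdfa_def.ps")
```

The list can be passed to subprocess.check_call once the inputs exist.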
It could also work to pass document information along to reportlab, and
set it in each output PDF: .debug.pdf, .rendered.pdf, and .page.pdf to
ensure that whatever page is last in the pipeline has the right
information. Or perhaps it's possible to write a PostScript trailer that
overwrites any previous docinfo with no side effects, but I can't find
any information on how to do that. I don't think it's worth pursuing
unless this arrangement causes some problem with PDF/A generation.
On a minor note, Jhove misreads the way I have encoded the strings in
producing its validation log. It reads them as UTF-16 little endian, so
it tends to produce a string of Asian characters in place of the real
data.
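A small illustration of that effect, assuming the strings were written
as UTF-16 big endian (the message does not state the encoding; this is
the byte order PDF uses for Unicode text strings). The function name is
hypothetical.

```python
def misread_as_utf16_le(text):
    """Encode text as UTF-16 big endian, then decode it as little endian."""
    return text.encode("utf-16-be").decode("utf-16-le")


garbled = misread_as_utf16_le("Title")
print([hex(ord(ch)) for ch in garbled])
# ['0x5400', '0x6900', '0x7400', '0x6c00', '0x6500']
```

Each ASCII byte ends up as the high byte of a code point, which puts
these letters in the CJK Unified Ideographs range (U+4E00 to U+9FFF).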
Little point to this feature - on most platforms the environment
variable can be overridden if desired to set a new root location.
At the same time, because this change deletes all of the results on
failure, it removes the ability to resume a partially executed pipeline.
If -k is provided then the temporary files will survive but there's no
way to resume from them. Because resuming doesn't really work anyway and
would only be useful to users experiencing very specific problems, this
is probably not worth it, so no major loss. The intent of -k is to
assist debugging.
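The cleanup behaviour described above could look roughly like this. It
is a sketch only: pipeline_workdir and its keep parameter are
hypothetical names standing in for whatever -k controls in the real
code.

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def pipeline_workdir(keep=False):
    """Yield a scratch directory; delete it on exit unless keep is set."""
    workdir = tempfile.mkdtemp(prefix="pipeline.")
    try:
        yield workdir
    finally:
        # Runs on success and on failure alike, so nothing is left
        # behind to resume from unless the caller asked to keep it.
        if not keep:
            shutil.rmtree(workdir, ignore_errors=True)
```

With keep=True the directory survives an exception and can be inspected
afterwards, which matches the debugging intent of -k.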