Drop two dependencies and replace them with one that does the job of
both. Smells like progress.
mupdf does PDF file repair and rendering
poppler does rendering and page splitting
qpdf does PDF file repair and page splitting
ghostscript does PDF file repair, rendering, and page splitting (sort of)
So we use qpdf. Ghostscript's page splitting is supposedly less
efficient because it reprints the page (PDF -> PostScript -> PDF) and
possibly loses quality. qpdf's library could be used to improve
performance.
This causes a slight performance regression:
py.test tests/test_main.py::test_maximum_options went from 187 seconds
up to 192. This is likely due to O(n) serialized invocations of qpdf
compared to a single serialized call to pdfseparate. Could improve on
this situation by using the example code in qpdf (pdf-split-pages.cc),
or by creating marker files in split_pages() and then writing a new
@transform function that would split pages on each CPU. Probably not
worth it, overall, unless this causes problems on files with hundreds
of pages.
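
A rough sketch of the per-page qpdf invocations and one way to run them
concurrently. This is not the pipeline's actual code: it swaps the
ruffus @transform idea for a plain thread pool, and the function names,
output file pattern, and worker count are all made up for illustration.
Only the qpdf "--pages FILE RANGE --" syntax comes from qpdf itself.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def qpdf_split_command(input_pdf, page_number, output_pdf):
    """Build the qpdf argv that copies one 1-based page to output_pdf."""
    return ["qpdf", input_pdf,
            "--pages", input_pdf, str(page_number), "--",
            output_pdf]


def split_commands(input_pdf, page_count, pattern="{:06d}.page.pdf"):
    """One command per page; the output name pattern is hypothetical."""
    return [qpdf_split_command(input_pdf, n, pattern.format(n))
            for n in range(1, page_count + 1)]


def split_pages_parallel(input_pdf, page_count, max_workers=4):
    """Run the O(n) qpdf calls concurrently instead of one after another."""
    commands = split_commands(input_pdf, page_count)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every call and re-raises the first failure.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True),
                      commands))
    return [cmd[-1] for cmd in commands]
```

Threads are enough here because each worker only waits on a qpdf
subprocess. Whether this recovers the 5 seconds is untested.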
Although the real issue was that the ruffus pipeline cannot be executed
twice in the same process due to its reliance on global variables.
The new OO pipeline in ruffus 2.6 would be one resolution that would
allow for more comprehensive testing as opposed to farming out the
execution to subprocess and inspecting the results, as is currently
done.
Specifically it trips over the need to reimport ocrmypdf.main. That in
turn raises questions about whether to make that function into an
external script that imports ocrmypdf... or something else. Would be
possible with a loop that manipulates sys.argv and then reloads
ocrmypdf.main; might need that anyway.
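
A minimal sketch of that loop, assuming the target module reads
sys.argv and does its work at import time. The helper names are
hypothetical. Note that reloading one module re-executes only that
module's code; it would not reset global state held inside ruffus
itself, so this may not be sufficient on its own.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv, restoring it afterwards."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


def run_module_with_args(module_name, arg_lists):
    """Import module_name once per argument list, reloading after the first."""
    module = None
    for args in arg_lists:
        with patched_argv([module_name] + list(args)):
            if module_name in sys.modules:
                # Re-executes the module's top-level code with the new argv.
                module = importlib.reload(sys.modules[module_name])
            else:
                module = importlib.import_module(module_name)
    return module
```

A test could then call, for example,
run_module_with_args("ocrmypdf.main", [["in.pdf", "out.pdf"]]) with one
argument list per test case.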
It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so it's progress. Not available for
Tesseract 3.02 obviously, so both modes need to remain available.
GitHub supports both, and PyPI expects .rst files, so use .rst and make
everyone happy.
Auto-converted using pandoc
find . -name '*.md' | parallel pandoc --from=markdown --to=rst --output='{.}.rst' '{}'
http://bfroehle.com/2013/04/26/converting-md-to-rst/