Previous incarnation was only suitable for generating a local cache
where the suite was executed repeatedly. Now the cache ignores
differences, so it can be checked into Github and shared.
Source files to GPL3
Exceptions:
-tests/spoof/* to MIT
-hocrtransform.py
-_unicodefun.py
Test resources to CC BY-SA 4.0 except when otherwise noted.
Add GPL license.
Tesseract 3.05.01 backported the textonly_pdf=1 which allows the use
of this superior PDF renderer prior to 4.00 alpha. This means that
the tess4 name is no longer accurate, so call it a sandwich because of
its merge-preserve characteristic. Preserve the tess4 name. Fix the
documentation and tests to reflect this.
Make it the default, because it’s better. It does not have the issues
the “tesseract” renderer does prior to Tess 3.05.00 with rendering
PDFs that Ghostscript corrupts, and it produces better output without
re-rastering.
Deprecate some old stuff to avoid the test suite growing obscenely
large.
Forgot to update tesseract spoofers to account for change in tesseract
parameters. Also the change to outputting multiple files in the collate
steps affected how ruffus passes information into downstream consumers
of those files.
For sanity's sake, deal with tesseract streams in binary without
transcoding (via universal_newlines, etc.). The only differences are
printing messages regarding spoofing.
Also hash the source file so that changes to the cache mechanism
invalidate old cache automatically. That is probably too aggressive,
but simple and safer than the previous approach.