22 Commits

Author SHA1 Message Date
James R. Barlow
6b315e8315 Add ability to disable cache 2018-05-01 15:52:00 -07:00
James R. Barlow
49fa7f6b5c tesseract_cache: don't reveal host system file paths in manifest file 2018-04-12 00:47:28 -07:00
James R. Barlow
a2d00f5f1d tess cache: fix tess3 error for -psm instead of --psm 2018-03-25 00:43:02 -07:00
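A minimal sketch of the flag choice this fix deals with, assuming (as the commit message suggests) that Tesseract 3 expects `-psm` while later versions expect `--psm`. The function names and the integer version argument are hypothetical and are not taken from the project's code.

```python
def psm_flag(tesseract_major_version):
    """Return the page-segmentation-mode flag spelling for a major version."""
    # Assumption: versions before 4 only accept the single-dash form.
    return "--psm" if tesseract_major_version >= 4 else "-psm"


def build_psm_args(tesseract_major_version, page_seg_mode):
    """Build the argument pair that selects a page segmentation mode."""
    return [psm_flag(tesseract_major_version), str(page_seg_mode)]
```

A cache that matches on the argument list would need to recognize whichever spelling the installed version produces.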
James R. Barlow
8c1c61f207 test cache: fix Path + str error 2018-03-25 00:02:03 -07:00
James R. Barlow
77476965ae test cache: use .bin extension, fix .gitignore .gitattributes 2018-03-24 23:54:16 -07:00
James R. Barlow
909eaeeead spoof: Allow tesseract cache to share cache
Previous incarnation was only suitable for generating a local cache
where the suite was executed repeatedly. Now the cache ignores
differences, so it can be checked into GitHub and shared.
2018-03-24 22:17:36 -07:00
James R. Barlow
11d74dea09 Remove the OCRMYPDF_program environment variables
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:07:02 -07:00
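The reasoning above, that PATH already covers per-program overrides, can be sketched with the standard library's `shutil.which`. The helper name `find_program` and its `extra_dir` parameter are hypothetical, used only for illustration.

```python
import os
import shutil


def find_program(name, extra_dir=None):
    """Resolve an executable via PATH, optionally searching extra_dir first."""
    search_path = os.environ.get("PATH", "")
    if extra_dir is not None:
        # Prepending a directory to PATH has the same effect that a
        # per-program override variable would have had.
        search_path = extra_dir + os.pathsep + search_path
    # Returns the full path of the first match, or None if nothing is found.
    return shutil.which(name, path=search_path)
```

Under this assumption, a test suite can substitute a stand-in executable by placing its directory at the front of PATH.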
James R. Barlow
6756016572 Add license notice to all files
Source files to GPL3

Exceptions:
-tests/spoof/* to MIT
-hocrtransform.py
-_unicodefun.py

Test resources to CC BY-SA 4.0 except when otherwise noted.

Add GPL license.
2018-03-24 02:33:24 -07:00
James R. Barlow
2c24f67deb Rename “tess4” renderer to “sandwich” and make it default in Tess 3.05.01
Tesseract 3.05.01 backported the textonly_pdf=1 which allows the use
of this superior PDF renderer prior to 4.00 alpha. This means that
the tess4 name is no longer accurate, so call it a sandwich because of
its merge-preserve characteristic. Preserve the tess4 name. Fix the
documentation and tests to reflect this.

Make it the default, because it’s better. It does not have the issues
the “tesseract” renderer does prior to Tess 3.05.00 with rendering
PDFs that Ghostscript corrupts, and it produces better output without
re-rastering.

Deprecate some old stuff to avoid the test suite growing obscenely
large.
2017-06-13 13:09:12 -07:00
James R. Barlow
5de107d44c tesseract_cache: update explanatory notes 2017-05-14 23:54:09 -07:00
James R. Barlow
048ae40e75 Update copyrights 2017-05-14 23:38:28 -07:00
James R. Barlow
234183ecd2 Fix: Tesseract 3.04 is sensitive to order of configuration commands
“txt hocr” is not acceptable and does not produce expected output .txt
while “hocr txt” works fine, so switch the order everywhere.

Should fix #169
2017-05-14 23:27:46 -07:00
James R. Barlow
cb06359c0b Turn on Tesseract 4 cache in test suite
Travis is too slow without it, and perhaps it’s overly paranoid to
never cache Tess4. Maybe nuke the cache occasionally to be safe…
2017-05-12 11:42:27 -07:00
James R. Barlow
c8a4cbcf17 Fix test suite breakage after sidecar feature added
Forgot to update tesseract spoofers to account for change in tesseract
parameters. Also the change to outputting multiple files in the collate
steps affected how ruffus passes information into downstream consumers
of those files.
2017-05-11 00:17:24 -07:00
James R. Barlow
8c17c9918e Add documentation and test cases for --tesseract-config
This parameter has existed for a long time but never really got any
attention.
2017-01-28 22:06:51 -08:00
James R. Barlow
a4f07756a5 tesseract caching: don't transcode tesseract's output, hash source file
For sanity's sake, deal with tesseract streams in binary without
transcoding (via universal_newlines, etc.). The only differences are
printing messages regarding spoofing.

Also hash the source file so that changes to the cache mechanism
invalidate old cache automatically. That is probably too aggressive,
but simple and safer than the previous approach.
2016-10-28 16:44:12 -07:00
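A minimal sketch of the invalidation idea described in this commit, assuming the cache key is a digest over the caching script's own source, the tesseract arguments, and the input bytes. The function `cache_key`, its parameters, and the choice of SHA-256 are hypothetical and do not reproduce the project's implementation.

```python
import hashlib


def cache_key(args, input_bytes, script_source):
    """Return a hex digest identifying one cached tesseract invocation."""
    h = hashlib.sha256()
    # Hashing the caching script's own source means any edit to the cache
    # mechanism changes every key, so old entries are never reused.
    h.update(script_source)
    for arg in args:
        h.update(arg.encode("utf-8"))
        h.update(b"\0")  # separator so ["ab", "c"] and ["a", "bc"] differ
    # Input is hashed as raw bytes, with no text decoding or newline
    # translation applied.
    h.update(input_bytes)
    return h.hexdigest()
```

In a real script, `script_source` could be obtained by reading the script's own file in binary mode. As the commit notes, this is aggressive: even a comment change in the script invalidates every entry.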
James R. Barlow
cc7e328358 Improve some documentation for tests 2016-08-26 15:04:08 -07:00
James R. Barlow
8246cc0538 Gracefully recover from tesseract's failure to process very large images
Also add test cases to check this.
2016-02-20 04:53:23 -08:00
James R. Barlow
b907234d5c Update tesseract spoofing to cache orientation and script detection checks
No cache: 269 s
With cache: 144 s

test_oversample[tesseract] now fails, all others good
2016-02-08 02:21:56 -08:00
James R. Barlow
3b53e9adac Use tesseract cache for -psm 2016-01-11 17:22:50 -08:00
James R. Barlow
09782242c8 Adjust test cases to use cache and noop more effectively
This reduces total execution time to 164s on my machine, down from
about double that.
2015-12-17 14:00:17 -08:00
James R. Barlow
9ec4aa039d Add tesseract caching to speed up tests 2015-12-17 12:52:12 -08:00