James R. Barlow
16e4d342d2
Bug fix: --force-ocr should still run on pages with no images
...
Useful for people who want to reprocess text.
This also requires --oversample because DPI is undefined. To be fixed
in next commit.
2016-07-27 15:06:49 -07:00
James R. Barlow
b4a734fc0d
Test case for "algorithm 4" test
...
Algorithm 4 -> PDF version 1.6
2016-06-23 13:21:26 -07:00
James R. Barlow
ff092c8629
Fix race condition between these tests when run in parallel
2016-04-28 00:39:15 -07:00
James R. Barlow
40baab32ac
Remove dead code "import stuff in testcase"
2016-04-14 14:22:34 -07:00
James R. Barlow
e877d37ac8
--rotate-pages: Only apply rotation if we're reasonably confident
...
Take the threshold from tesseract's default value for -psm 1.
2016-04-14 13:49:44 -07:00
James R. Barlow
322085933b
unpaper: fix check for missing and old versions, add test case
2016-03-10 15:37:09 -08:00
James R. Barlow
7c5e58a497
Fix test cases that break in Docker, improve test for running in Docker
2016-02-20 23:47:37 -08:00
James R. Barlow
cab381a339
Add JPEG 2000 test case
2016-02-20 05:13:19 -08:00
James R. Barlow
8246cc0538
Gracefully recover from tesseract's failure to process very large images
...
And test cases to check this
2016-02-20 04:53:23 -08:00
James R. Barlow
4206e74f42
tests: also check that monochrome correlation correctly detects matches
2016-02-19 14:35:31 -08:00
James R. Barlow
68c3ce56a9
Don't do chmod unless necessary (breaks py.test on Docker)
2016-02-19 14:09:56 -08:00
James R. Barlow
ab0e5fa425
Improve error checking for tesseract -psm 0 (orientation) errors
2016-02-19 03:58:39 -08:00
James R. Barlow
f3b0434a87
Improve ability to capture error messages from tesseract on a crash
2016-02-19 03:48:49 -08:00
James R. Barlow
ef0aab060a
Make debug output more verbose on failure
2016-02-16 05:17:18 -08:00
James R. Barlow
88433e4c34
Fiddle with travis, try to get better debug output
...
Essentially cffi failed somehow, not clear how
2016-02-16 02:12:14 -08:00
James R. Barlow
ab13342931
Revise rotation tests in prep for adding a few more
2016-02-15 17:17:43 -08:00
James R. Barlow
d7913da484
Test case: remove filename conflict
2016-02-15 16:49:28 -08:00
James R. Barlow
6510bcad19
DPI information not transferred automatically from PNG to JPEG
2016-02-09 02:18:54 -08:00
James R. Barlow
16c7ac2582
Fix test_deskew for new Leptonica API
2016-02-08 15:20:01 -08:00
James R. Barlow
4ceb59215f
Leptonica: classes are better
2016-02-08 15:14:44 -08:00
James R. Barlow
2e6879ee51
Introduce Leptonica class for Pix
2016-02-08 14:52:01 -08:00
James R. Barlow
66fc2e9d7d
Add rotate 180 correlation sanity check
2016-02-08 13:10:11 -08:00
James R. Barlow
2c7a6e574f
Shorten names of _make_input/output
2016-02-08 12:57:26 -08:00
James R. Barlow
78c3bf5dba
Check autorotate using leptonica correlation
2016-02-08 12:55:50 -08:00
James R. Barlow
98c115e3bb
Cache wasn't enabled properly for test_autorotate
2016-02-08 12:55:28 -08:00
James R. Barlow
7c0940609a
Take a stab at writing test case for autorotate
2016-02-08 12:32:39 -08:00
James R. Barlow
9058dedfbe
New tests for ccitt, jbig2 encodings
2016-01-19 13:01:56 -08:00
James R. Barlow
354e61946e
Use os.makedirs for test output directories
...
Broke Travis
2016-01-16 02:47:56 -08:00
James R. Barlow
360acd1e2c
Adjust test_oversample test case
...
Add -f to force generation of the background image at the desired
oversample resolution. Our new behavior is to only send the oversampled
image to Tesseract while leaving the main page intact unless asked to
deskew, clean, etc.
2016-01-15 15:55:23 -08:00
James R. Barlow
7c558b3713
Move pageinfo test into tests folder
2016-01-11 17:40:44 -08:00
James R. Barlow
3b53e9adac
Use tesseract cache for -psm
2016-01-11 17:22:50 -08:00
James R. Barlow
074c1d71b4
Activate --tesseract-pagesegmode
2016-01-11 17:19:32 -08:00
James R. Barlow
09782242c8
Adjust test cases to use cache and noop more effectively
...
This reduces total execution time to 164s on my machine, down from
about double that.
2015-12-17 14:00:17 -08:00
James R. Barlow
9ec4aa039d
Add tesseract caching to speed up tests
2015-12-17 12:52:12 -08:00
James R. Barlow
ecebe2f24b
Let some tests use the spoofed tesseract
...
Where getting OCR doesn't matter
2015-12-17 11:56:09 -08:00
James R. Barlow
7313a77c2a
Implement pdf renderer side of tess spoof
2015-12-17 11:41:54 -08:00
James R. Barlow
45113676a3
Add Tesseract spoofing
2015-12-17 11:36:47 -08:00
James R. Barlow
102bd07019
Check for encrypted PDF and complain appropriately
2015-12-17 10:37:54 -08:00
James R. Barlow
9622e31da9
Use envvars in a new test case
...
And get rid of the messy binary replacement spoofing
2015-12-17 09:29:01 -08:00
James R. Barlow
276fe49867
Better error messages for input file not found or invalid
...
Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.
2015-12-04 03:07:53 -08:00
James R. Barlow
acb31abe86
Fix issue #20 - fails on uppercase .PDF
2015-12-04 02:14:09 -08:00
James R. Barlow
7ed60429b3
Test case: No longer using JHOVE
...
So JHOVE will not claim this is an invalid PDF and we should see it
reported as valid.
2015-09-05 01:12:33 -07:00
James R. Barlow
3d26257710
Add test cases for additional image formats
2015-08-28 04:51:11 -07:00
James R. Barlow
b376672dbc
Bug fix: exception thrown if input PDF was missing DocumentInfo block
2015-08-24 01:23:30 -07:00
James R. Barlow
859b063444
Fixup other docker test suite errors
...
Outstanding failures:
test_pageinfo::test_jpeg
tests involving unpaper due to version <6.1 failures
2015-08-20 02:37:03 -07:00
James R. Barlow
9dad40b5a3
Major overhaul of the Dockerfile
...
Switched from Ubuntu to debian:stretch because stretch has more recent
versions of our binary packages and starts smaller. In particular,
stretch has both pillow==2.9.0 and reportlab==3.2.0 available as system
packages which saves the considerable hassle of installing a toolchain.
Instead, a pyvenv is set up with access to system's site-packages (note:
needs two steps), making the binary-dependent packages available. Then
the remaining packages are installed into the pyvenv with --no-cache-dir
to avoid saving files. And there we are.
Image is still very large (>500 MB), but programs like reportlab require
font rendering capabilities so they pull in large portions of the Linux
graphics stack. Not much will shrink that.
2015-08-20 01:25:31 -07:00
James R. Barlow
630e6cbf1e
pip chokes on Unicode filenames?
2015-08-18 23:56:30 -07:00
James R. Barlow
cc161780df
Replace fileinput with regular open-replace
...
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ascii encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
2015-08-18 23:27:50 -07:00
James R. Barlow
0ec13d3a17
Fix test cases: minor issues
...
- os.environ directly modified when whole suite run, breaking subsequent
  tests
- no longer trusting JHOVE for PDF/A validation
2015-08-16 01:57:35 -07:00
James R. Barlow
85af0f0d03
Add test case for blank PDF page
2015-08-14 00:46:50 -07:00