177 Commits

Author SHA1 Message Date
James R. Barlow
bc56b8e058 Move metadata tests to new test_metadata 2018-03-26 01:49:25 -07:00
James R. Barlow
874ec6a87f Add missing fixture to test_unpaper 2018-03-24 22:24:14 -07:00
James R. Barlow
c138161fae Tests: more cleanup 2018-03-24 15:35:57 -07:00
James R. Barlow
e48590d66c Refactor out unpaper-specific tests 2018-03-24 15:21:44 -07:00
James R. Barlow
5b1c8541fc Review some skipped tests to make sure reasons still valid 2018-03-24 15:13:23 -07:00
James R. Barlow
e5e011021b Remove the OCRMYPDF_program environment variables
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:09:08 -07:00
James R. Barlow
11d74dea09 Remove the OCRMYPDF_program environment variables
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:07:02 -07:00
James R. Barlow
6756016572 Add license notice to all files
Source files to GPL3

Exceptions:
-tests/spoof/* to MIT
-hocrtransform.py
-_unicodefun.py

Test resources to CC BY-SA 4.0 except when otherwise noted.

Add GPL license.
2018-03-24 02:33:24 -07:00
James R. Barlow
d700154e0e Fix regressions after --skip-text improvements 2018-03-24 02:24:45 -07:00
James R. Barlow
8159cc6b88 Skip one test that fails for qpdf 8.0.[0,1], due to qpdf regression 2018-03-09 07:57:22 -08:00
James R. Barlow
4046766ca5 Fix Python 3.5 test suite failure on symlinks
Did not account for API difference in pathlib
2018-03-02 16:57:46 -08:00
James R. Barlow
74ca736333 Issue #223: improve text of encrypted PDF error message 2018-02-27 15:08:22 -08:00
James R. Barlow
e7bcb95635 Fix pylint errors 2018-02-24 11:59:01 -08:00
James R. Barlow
3de83627a9 Handle output to /dev/null or directory (#219)
Previously we threw an exception if the output name was a directory (and only after doing OCR), and triggered a PermissionError when trying to flip the permission bits of /dev/null, due to the shutil.copyfile implementation. Instead of copying the file, use shutil.copyfileobj, which should also respect umask etc.
2018-02-19 22:15:07 -08:00
James R. Barlow
a9da839c39 Add vector-only PDF test case 2018-02-08 00:17:35 -08:00
James R. Barlow
1dfc32d7e6 Preserve "text as curves" vector content
Never updated the checking logic to deal with a pure vector file with no text that needs an OCR layer. This is doable, so allow it.
2018-02-07 16:05:48 -08:00
James R. Barlow
ad7a4476db hugemono.pdf needs --max-image-mpixels to pass with Pillow 5.0 2018-01-10 16:55:18 -08:00
James R. Barlow
4812b20fb2 Fix tesseract_noop.py generating wrong size of output PDF in tests
This caused trouble before with test_deskew
2018-01-10 16:35:31 -08:00
James R. Barlow
882fc2257c Add --max-image-mpixels argument to support Pillow 5.0 2018-01-10 15:43:59 -08:00
James R. Barlow
91b42cbfa8 Fix issue in sandwich renderer when skipping OCR on a rotated and deskewed page
If OCR is skipped due to --tesseract-timeout or similar, the skipped page is rotated with /Rotate, and the skipped page was deskewed or had other image processing, then the skip page was created with the wrong dimensions, causing the output page to be cropped.
2018-01-09 00:17:53 -08:00
James R. Barlow
44a45fc3fb Add "bad UTF8 output from Tesseract" test 2017-11-29 14:08:07 -08:00
James R. Barlow
a7b307af04 Looks like issue was negzero.pdf with qpdf 5.1.1 on travis, which is why osx passes
Reorganize and see if this is better now
2017-11-29 12:47:09 -08:00
James R. Barlow
731c9ea55e Set timeouts on the tests that seem to be stalling on travis (but not elsewhere) 2017-11-27 14:46:10 -08:00
James R. Barlow
92ca9e954c Fix test warning/failures, hopefully 2017-11-27 13:41:32 -08:00
James R. Barlow
56614fcaa4 Add support and tests for handling page count > ulimit - fixes issue #181 2017-11-27 00:32:35 -08:00
James R. Barlow
4d9169e15f Add merge ulimit test case 2017-11-26 23:34:36 -08:00
James R. Barlow
965de3a235 Test case for issue #200 2017-11-26 22:52:53 -08:00
James R. Barlow
7bbf6bc7f4 Travis didn't like LANG, use LC_ALL 2017-11-16 20:37:30 -08:00
James R. Barlow
40aa82ab41 Check that the locale is sane before allowing OCR to proceed 2017-11-16 17:18:02 -08:00
James R. Barlow
c7b8b6e18b Fix issue #194 - --sidecar creates blank txt file 2017-10-26 18:15:31 -07:00
James R. Barlow
4b7135f0e5 Add option to produce PDF/A-1B 2017-10-11 14:32:58 -07:00
James R. Barlow
952f0cca15 Dockerfiles: set LANG=C.UTF-8
Issue #184 to avoid issue with printing UTF-8 text to sidecar
2017-08-30 13:25:54 -07:00
James R. Barlow
b3097a2384 Fix broken test case related to language packs 2017-08-24 13:01:02 -07:00
James R. Barlow
f7ce8f44e9 Weaken the --user-words test so it will pass on Travis 2017-07-26 21:03:51 -07:00
James R. Barlow
52483072dc Add a differential test that checks tesseract uses supplied word list 2017-07-21 16:40:20 -07:00
James R. Barlow
7f0b8621f3 Tests: accept rich path objects without having to str() everything 2017-07-21 16:39:22 -07:00
James R. Barlow
cd8db60b06 Crash test all renderers, not just two 2017-07-21 14:10:02 -07:00
James R. Barlow
1aa34f5d2e Make some interfaces accepting of both str-paths and Path objects 2017-07-21 13:28:30 -07:00
James R. Barlow
d792ef7222 Give the ‘auto’ renderer setting more test covfefe 2017-06-13 13:13:58 -07:00
James R. Barlow
2c24f67deb Rename “tess4” renderer to “sandwich” and make it default in Tess 3.05.01
Tesseract 3.05.01 backported the textonly_pdf=1 which allows the use
of this superior PDF renderer prior to 4.00 alpha. This means that
the tess4 name is no longer accurate, so call it a sandwich because of
its merge-preserve characteristic. Preserve the tess4 name. Fix the
documentation and tests to reflect this.

Make it the default, because it’s better. It does not have the issues
the “tesseract” renderer does prior to Tess 3.05.00 with rendering
PDFs that Ghostscript corrupts, and it produces better output without
re-rastering.

Deprecate some old stuff to avoid the test suite growing obscenely
large.
2017-06-13 13:09:12 -07:00
James R. Barlow
28341b755f Refactor common test fixtures 2017-05-29 12:47:55 -07:00
James R. Barlow
08e47117a3 Rename pageinfo to pdfinfo 2017-05-19 15:48:23 -07:00
James R. Barlow
8694f8d2eb Replace magic strings colorspace and encoding with Enums 2017-05-18 22:32:27 -07:00
James R. Barlow
56d2aae963 Refactor from ImageInfo index to attribute accessing 2017-05-18 18:39:14 -07:00
James R. Barlow
caee5b1428 Access PageInfo instance variables instead of dictionary 2017-05-18 17:12:04 -07:00
James R. Barlow
cd04ae6949 Refactor PdfInfo(str(filename)) -> PdfInfo(filename) 2017-05-18 16:43:50 -07:00
James R. Barlow
6a0b68298f Refactor pdf_get_all_pageinfo to PdfInfo 2017-05-18 16:31:18 -07:00
James R. Barlow
e1e9135e93 Test suite: tidy up imports 2017-05-14 23:15:29 -07:00
James R. Barlow
96045e98f4 Update develop with master changes
We’re well out of the “trivial updates” zone
2017-05-11 22:54:27 -07:00
James R. Barlow
01b7205e2c Ensure skipped pages are explained in sidecars 2017-05-11 00:43:36 -07:00