James R. Barlow
45c7bd9a60
lint: Remove shebangs from non-executable files
2018-02-24 12:38:58 -08:00
James R. Barlow
e7bcb95635
Fix pylint errors
2018-02-24 11:59:01 -08:00
James R. Barlow
3de83627a9
Handle output to /dev/null or directory (#219)
...
Previously, we raised an exception if the output name was a directory (but only after doing OCR), and writing to /dev/null triggered a PermissionError when trying to flip its permission bits, due to the shutil.copyfile implementation. Instead of copying the file, use shutil.copyfileobj, which should also respect umask etc.
2018-02-19 22:15:07 -08:00
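The approach described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `copy_to_output` and the early directory check are assumptions made for the example. It copies only the byte stream with `shutil.copyfileobj`, so nothing attempts to change the destination's permission bits, and it rejects a directory before any expensive work would start.

```python
import os
import shutil


def copy_to_output(src_path, dest_path):
    """Copy file contents only (hypothetical helper, not the project's API)."""
    # Fail early if the output name is a directory, rather than after OCR.
    if os.path.isdir(dest_path):
        raise IsADirectoryError(f"output path is a directory: {dest_path}")
    # Stream bytes from source to destination. No metadata or permission
    # bits are copied, so special files such as /dev/null are left alone.
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

A newly created destination file gets its mode from `open()`, which is subject to the process umask.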
James R. Barlow
a9da839c39
Add vector-only PDF test case
2018-02-08 00:17:35 -08:00
James R. Barlow
1dfc32d7e6
Preserve "text as curves" vector content
...
The checking logic was never updated to handle a pure vector file with no text that needs an OCR layer. This is doable, so allow it.
2018-02-07 16:05:48 -08:00
James R. Barlow
019513696b
Ghostscript spoof scripts did not report their --version correctly
2018-01-10 17:08:14 -08:00
James R. Barlow
ad7a4476db
hugemono.pdf needs --max-image-mpixels to pass with Pillow 5.0
2018-01-10 16:55:18 -08:00
James R. Barlow
4812b20fb2
Fix tesseract_noop.py generating wrong size of output PDF in tests
...
This caused trouble before with test_deskew
2018-01-10 16:35:31 -08:00
James R. Barlow
882fc2257c
Add --max-image-mpixels argument to support Pillow 5.0
2018-01-10 15:43:59 -08:00
James R. Barlow
91b42cbfa8
Fix issue in sandwich renderer when skipping OCR on a rotated and deskewed page
...
If OCR is skipped due to --tesseract-timeout or similar, the skip page is rotated with /Rotate, and the skip page was deskewed or had other image processing, then the skip page was created with the wrong dimensions, causing the output page to be cropped.
2018-01-09 00:17:53 -08:00
James R. Barlow
da11fd17ee
qpdf dummy: needs to return version now
2017-11-29 14:35:37 -08:00
James R. Barlow
44a45fc3fb
Add "bad UTF8 output from Tesseract" test
2017-11-29 14:08:07 -08:00
James R. Barlow
c5a1d22e81
That fixed it. Complain about old versions of qpdf now
2017-11-29 12:53:34 -08:00
James R. Barlow
a7b307af04
Looks like issue was negzero.pdf with qpdf 5.1.1 on travis, which is why osx passes
...
Reorganize and see if this is better now
2017-11-29 12:47:09 -08:00
James R. Barlow
731c9ea55e
Set timeouts on the tests that seem to be stalling on travis (but not elsewhere)
2017-11-27 14:46:10 -08:00
James R. Barlow
92ca9e954c
Fix test warning/failures, hopefully
2017-11-27 13:41:32 -08:00
James R. Barlow
56614fcaa4
Add support and tests for handling page count > ulimit - fixes issue #181
2017-11-27 00:32:35 -08:00
James R. Barlow
4d9169e15f
Add merge ulimit test case
2017-11-26 23:34:36 -08:00
James R. Barlow
3a167af2c4
Nearly smallest possible PDF-1.3 with all required fields
2017-11-26 23:32:21 -08:00
James R. Barlow
965de3a235
Test case for issue #200
2017-11-26 22:52:53 -08:00
James R. Barlow
7bbf6bc7f4
Travis didn't like LANG, use LC_ALL
2017-11-16 20:37:30 -08:00
James R. Barlow
40aa82ab41
Check that the locale is sane before allowing OCR to proceed
2017-11-16 17:18:02 -08:00
James R. Barlow
c7b8b6e18b
Fix issue #194 - --sidecar creates blank txt file
2017-10-26 18:15:31 -07:00
James R. Barlow
4b7135f0e5
Add option to produce PDF/A-1B
2017-10-11 14:32:58 -07:00
James R. Barlow
34fc1f5fd7
Add reminder that blank.pdf is not trivial
2017-09-13 01:19:18 -07:00
James R. Barlow
6af7d61ee5
Fix CI failure due to spoofers not being updated to Tesseract 3.05 strings
2017-09-01 16:17:26 -07:00
James R. Barlow
d04e43d46d
Update copyright info for test files
...
[ci skip]
2017-09-01 01:00:32 -07:00
James R. Barlow
952f0cca15
Dockerfiles: set LANG=C.UTF-8
...
Issue #184 to avoid issue with printing UTF-8 text to sidecar
2017-08-30 13:25:54 -07:00
James R. Barlow
b3097a2384
Fix broken test case related to language packs
2017-08-24 13:01:02 -07:00
James R. Barlow
f7ce8f44e9
Weaken the --user-words test so it will pass on Travis
2017-07-26 21:03:51 -07:00
James R. Barlow
52483072dc
Add a differential test that checks tesseract uses supplied word list
2017-07-21 16:40:20 -07:00
James R. Barlow
7f0b8621f3
Tests: accept rich path objects without having to str() everything
2017-07-21 16:39:22 -07:00
James R. Barlow
cd8db60b06
Crash test all renderers, not just two
2017-07-21 14:10:02 -07:00
James R. Barlow
1aa34f5d2e
Make some interfaces accept both str paths and Path objects
2017-07-21 13:28:30 -07:00
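One common way to accept both plain string paths and `pathlib.Path` objects is to normalize at the interface boundary. The sketch below is an illustration only: the helper name `as_path_str` is hypothetical, and it assumes Python 3.6 or later for `os.fspath`, which may not match what the project did at the time.

```python
import os
from pathlib import Path


def as_path_str(path):
    """Return a str path for either a str or a path-like object (hypothetical helper)."""
    # os.fspath returns str input unchanged and calls __fspath__ on
    # path-like objects such as pathlib.Path.
    return os.fspath(path)
```

With this, a caller can pass `Path("tests") / "blank.pdf"` or the equivalent string without wrapping it in `str()` first.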
James R. Barlow
d792ef7222
Give the ‘auto’ renderer setting more test coverage
2017-06-13 13:13:58 -07:00
James R. Barlow
2c24f67deb
Rename “tess4” renderer to “sandwich” and make it default in Tess 3.05.01
...
Tesseract 3.05.01 backported the textonly_pdf=1 which allows the use
of this superior PDF renderer prior to 4.00 alpha. This means that
the tess4 name is no longer accurate, so call it a sandwich because of
its merge-preserve characteristic. Preserve the tess4 name. Fix the
documentation and tests to reflect this.
Make it the default, because it’s better. It does not have the issues
the “tesseract” renderer does prior to Tess 3.05.00 with rendering
PDFs that Ghostscript corrupts, and it produces better output without
re-rastering.
Deprecate some old stuff to avoid the test suite growing obscenely
large.
2017-06-13 13:09:12 -07:00
James R. Barlow
47298be132
Remove Python <3.5 test
2017-06-13 10:14:28 -07:00
James R. Barlow
3d2f6f0772
Fix tess4 test using old-style pageinfo API
2017-05-29 13:51:21 -07:00
James R. Barlow
d3c54fbbde
For --rotate-pages, rasterize preview at half DPI instead of 200 DPI
...
Ensures that time is not wasted on previews at higher resolution than
the input as was sometimes the case
2017-05-29 13:01:18 -07:00
James R. Barlow
28341b755f
Refactor common test fixtures
2017-05-29 12:47:55 -07:00
James R. Barlow
4b5cd420e1
Add new test file
2017-05-29 12:16:08 -07:00
James R. Barlow
9b50ede977
Partially solve ghostscript rasterize_pdf producing wrong file size
...
Kludge. Assumes JPEG for now. Messy.
2017-05-25 01:17:43 -07:00
James R. Barlow
82cf010333
Error out if trying to produce PDF/A >200" due to Ghostscript limitation
2017-05-25 00:07:29 -07:00
James R. Barlow
6ff6c8614f
--output-type=pdf now outputs /UserUnit PDFs at the correct size
...
This currently distorts the output size because Tesseract assumes it
knows the DPI better than we do.
Does not work for Ghostscript, because it emerges that Ghostscript
honors /UserUnit for rasterizing but not in pdfwrite (resolve/wontfix).
https://bugs.ghostscript.com/show_bug.cgi?id=690781
Ghostscript’s output would need to be patched in a PDF/A safe way for
this to work. Temporary route may be to block Ghostscript if
/UserUnit.
2017-05-24 23:26:07 -07:00
James R. Barlow
148b632b4f
Prove multiprocessing works, although it is still racy in some places
2017-05-23 16:32:13 -07:00
James R. Barlow
75f2262659
Ensure JobContext stuff is actually tested for IPC consistency
2017-05-19 17:57:07 -07:00
James R. Barlow
d9005a1074
pdfinfo: replace most remaining dict-style access
2017-05-19 16:17:36 -07:00
James R. Barlow
08e47117a3
Rename pageinfo to pdfinfo
2017-05-19 15:48:23 -07:00
James R. Barlow
8694f8d2eb
Replace magic strings colorspace and encoding with Enums
2017-05-18 22:32:27 -07:00
James R. Barlow
56d2aae963
Refactor ImageInfo from index access to attribute access
2017-05-18 18:39:14 -07:00