579 Commits

Author SHA1 Message Date
James R. Barlow
caee5b1428 Access PageInfo instance variables instead of dictionary 2017-05-18 17:12:04 -07:00
James R. Barlow
cd04ae6949 Refactor PdfInfo(str(filename)) -> PdfInfo(filename) 2017-05-18 16:43:50 -07:00
James R. Barlow
6a0b68298f Refactor pdf_get_all_pageinfo to PdfInfo 2017-05-18 16:31:18 -07:00
James R. Barlow
5de107d44c tesseract_cache: update explanatory notes 2017-05-14 23:54:09 -07:00
James R. Barlow
048ae40e75 Update copyrights 2017-05-14 23:38:28 -07:00
James R. Barlow
234183ecd2 Fix: Tesseract 3.04 is sensitive to order of configuration commands
“txt hocr” is not acceptable and does not produce expected output .txt
while “hocr txt” works fine, so switch the order everywhere.

Should fix #169
2017-05-14 23:27:46 -07:00
James R. Barlow
e1e9135e93 Test suite: tidy up imports 2017-05-14 23:15:29 -07:00
James R. Barlow
cb06359c0b Turn on Tesseract 4 cache in test suite
Travis is too slow without it, and perhaps it’s overly paranoid to
never cache Tess4. Maybe nuke the cache occasionally to be safe…
2017-05-12 11:42:27 -07:00
James R. Barlow
b0e95842b8 Fix Travis CI errors while looking around for Tess4 2017-05-12 00:40:00 -07:00
James R. Barlow
21982cf1cb baiona_gray remove alpha channel 2017-05-11 23:23:37 -07:00
James R. Barlow
edc01408da Update the .png files, again, hopefully without corruption 2017-05-11 23:20:50 -07:00
James R. Barlow
96045e98f4 Update develop with master changes
We’re well out of the “trivial updates” zone
2017-05-11 22:54:27 -07:00
James R. Barlow
01b7205e2c Ensure skipped pages are explained in sidecars 2017-05-11 00:43:36 -07:00
James R. Barlow
c8a4cbcf17 Fix test suite breakage after sidecar feature added
Forgot to update tesseract spoofers to account for change in tesseract
parameters.  Also the change to outputting multiple files in the collate
steps affected how ruffus passes information into downstream consumers
of those files.
2017-05-11 00:17:24 -07:00
James R. Barlow
183eafa587 Implement sidecar text files (#126) 2017-05-10 15:22:44 -07:00
James R. Barlow
01a1c2b576 Implement --pdfa-image-compression to control Ghostscript’s compression
Fixes #163
2017-05-09 16:37:29 -07:00
James R. Barlow
c97ea1f2a9 Update high DPI test case to confirm the output image is not downsampled 2017-05-06 22:34:01 -07:00
James R. Barlow
bf04f03c4c Fix corrupt test file “typewriter.png”
This file is not currently used in any tests, but could be, so replace
corrupt version with a useful one.
2017-05-06 22:28:34 -07:00
James R. Barlow
93e802f473 Fix issue #163, color and grayscale images JPEG compressed when not needed 2017-05-06 22:27:25 -07:00
James R. Barlow
aa859a4139 Fix #156 - NoneType has no ‘getObject’ for pages with no /Contents 2017-05-01 15:46:15 -07:00
James R. Barlow
b9b12e2879 Ensure that ocrmypdf stops and reports an error if Ghostscript fails
Past behavior was to continue and let ruffus puke eventually
2017-05-01 15:44:21 -07:00
James R. Barlow
554fcc8b9d Add test case for #152 2017-04-18 15:20:25 -07:00
James R. Barlow
7b7e3a3e03 Enable lossless reconstruction for --pdf-renderer tess4 where appropriate 2017-03-29 23:44:12 -07:00
James R. Barlow
1e7fbd4202 Fix issues with --pdf-renderer tess4 page skipping
If tess4 renderer needed to skip OCR on a page it would end up
duplicating the page contents onto the new page, rather than creating
a blank OCR layer and placing it on the output page. This created
duplicated content in output files.
2017-03-29 23:43:26 -07:00
James R. Barlow
89599b4812 Drop Python 3.4 compatibility 2017-03-29 15:46:53 -07:00
James R. Barlow
88ef2718f1 Reject high Unicode metadata at command line
Ghostscript 9.21 does not seem to accept Unicode above U+FFFF. Previous
versions did, but it now exits with a rangecheck error (-15).

Reject on the command line for now. Complete fix would also need to
check input PDF’s metadata.
2017-03-28 11:08:38 -07:00
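A check of the kind described in this commit message could be sketched as follows. This is only an illustration: the function name `has_high_unicode` is hypothetical, and the log does not show how ocrmypdf actually performs the check.

```python
def has_high_unicode(text: str) -> bool:
    """Return True if any character lies above U+FFFF.

    Hypothetical helper for illustration; not taken from ocrmypdf.
    """
    # ord() gives the code point of each character; anything above
    # 0xFFFF is outside the Basic Multilingual Plane.
    return any(ord(ch) > 0xFFFF for ch in text)


# Accented Latin text stays within the BMP.
bmp_only = has_high_unicode("Título del documento")

# An emoji such as U+1F600 is above U+FFFF.
with_emoji = has_high_unicode("Report \U0001F600")
```

A command-line front end could call a helper like this on each metadata argument and reject the input before Ghostscript runs.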
James R. Barlow
e71e8ca3ad Workaround for GS VMerror -25 bug
Avoid inserting docinfo keys that would be translated to null strings,
to avoid running afoul of
https://bugs.ghostscript.com/show_bug.cgi?id=697684
2017-03-28 11:05:43 -07:00
James R. Barlow
199de96cff Ghostscript 9.21 seems to have a regression related to Unicode metadata 2017-03-24 15:15:46 -07:00
James R. Barlow
8ddbe81513 Fix issue #147: unpaper loses DPI information, affects --pdf-renderer tess4 2017-03-24 13:23:03 -07:00
James R. Barlow
f035cb1088 Fixed issue #142 — closed streams raise an exception on fork attempt 2017-03-13 15:52:57 -07:00
James R. Barlow
72660d0dec macOS: skip the one test that needs poppler, to avoid installing poppler 2017-03-11 17:03:26 -08:00
James R. Barlow
4a1fec8328 Improvements to macOS test and work on homebrew tap autobrew
Squashed commits:
[3f06c1e] Try setting up homebrew tap autobuilding
[01532f1] Strict mode error in brew
2017-03-11 17:00:54 -08:00
James R. Barlow
7cd2770a13 Fix issue #137 - proportions of non-square resolution distorted
Distortion mainly affected --force-ocr
2017-02-26 17:13:16 -08:00
James R. Barlow
d1a0065ef8 Create test case for Form XObjects 2017-02-14 12:51:15 -08:00
James R. Barlow
9f800736bc Fix running_in_docker() check failing on newer Docker
This test has to work to ensure spoof/tesseract_cache.py has a writable
directory to put cache into. Otherwise those tests fail.
2017-02-13 02:16:06 -08:00
James R. Barlow
a0657ad937 Prevent use of --pdf-renderer tess4 on tesseract 3 2017-02-06 13:49:43 -08:00
James R. Barlow
005216bc57 Support ocrmypdf-tess4 2017-01-29 18:26:52 -08:00
James R. Barlow
8c17c9918e Add documentation and test cases for --tesseract-config
This parameter has existed for a long time but never really got any
attention.
2017-01-28 22:06:51 -08:00
James R. Barlow
9a15a4db10 Ensure specified destination is writable before starting pipeline process 2017-01-26 22:08:24 -08:00
James R. Barlow
55aeaec293 Autorotation check: Replace duplicated tests with parameterized test 2017-01-26 18:07:59 -08:00
James R. Barlow
f6df1fb40c Fix test suite regression: output files dumped in tests/resources 2017-01-26 18:07:09 -08:00
James R. Barlow
b889a89c36 Fix remaining 3.4/3.5 regressions 2017-01-26 17:53:27 -08:00
James R. Barlow
1976dc6f30 Fix issue #121 “pop from empty list” (content stream parsing error) 2017-01-26 17:24:40 -08:00
James R. Barlow
e864c65d26 (Hopefully) Fix Path <-> py.path conversion on Py3.4/3.5 2017-01-26 17:19:15 -08:00
James R. Barlow
02fba02d31 Refactor test suite to use fixtures to manage paths 2017-01-26 16:38:59 -08:00
James R. Barlow
fb9e7c82f6 Move duplicate test code into common namespace 2017-01-26 13:36:52 -08:00
James R. Barlow
bad67c6dc5 Rename ‘tesstop’ to ‘tess4’
There’s no reason text-only PDF shouldn’t become the default for
tesseract 4.
2017-01-26 12:28:51 -08:00
James R. Barlow
b8767e5ba9 Rename exe -> exec, more Unix-y and suggestive 2016-12-10 15:34:00 -08:00
James R. Barlow
d33a50660d Replace most sys.exit() with raising exceptions
Because ruffus doesn’t handle exceptions well I tended to call sys.exit
to make sure we got out of dodge when needed.  However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now, everyone else has to
throw an exception.

For now the worker raising a fatal exception is logging messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object so we
just check “what is the name of the exception class that caused ruffus
to throw a RethrownJobError”?

Also fixed along the way was the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 on failure to find, which is truthy).
2016-12-10 15:24:24 -08:00
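The str.find pitfall mentioned at the end of this commit message can be shown with a minimal sketch. The variable `output` and its contents are made up for illustration and are not taken from the ocrmypdf source.

```python
# Hypothetical program output; not taken from ocrmypdf.
output = "page 1 processed"

# str.find returns an index, or -1 when the substring is absent.
# -1 and every positive index are truthy; only index 0 is falsy.
missing = output.find("Error")    # substring absent
at_start = output.find("page")    # substring at index 0

# Using the result directly in boolean logic gives the wrong answer
# in both cases.
buggy_missing = bool(missing)     # truthy although "Error" is absent
buggy_at_start = bool(at_start)   # falsy although "page" is present

# The `in` operator expresses the intended membership test.
correct_missing = "Error" in output
correct_at_start = "page" in output
```

Comparing the result explicitly (`output.find("Error") != -1`) would also work, but `in` reads more directly when only presence matters.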
James R. Barlow
4ee9658e97 Move external program wrappers to ocrmypdf.exe package 2016-12-09 16:54:24 -08:00