James R. Barlow
c7b8b6e18b
Fix issue #194 - --sidecar creates blank txt file
2017-10-26 18:15:31 -07:00
James R. Barlow
4b7135f0e5
Add option to produce PDF/A-1B
2017-10-11 14:32:58 -07:00
James R. Barlow
952f0cca15
Dockerfiles: set LANG=C.UTF-8
...
Issue #184 to avoid issue with printing UTF-8 text to sidecar
2017-08-30 13:25:54 -07:00
James R. Barlow
b3097a2384
Fix broken test case related to language packs
2017-08-24 13:01:02 -07:00
James R. Barlow
f7ce8f44e9
Weaken the --user-words test so it will pass on Travis
2017-07-26 21:03:51 -07:00
James R. Barlow
52483072dc
Add a differential test that checks tesseract uses supplied word list
2017-07-21 16:40:20 -07:00
James R. Barlow
7f0b8621f3
Tests: accept rich path objects without having to str() everything
2017-07-21 16:39:22 -07:00
James R. Barlow
cd8db60b06
Crash test all renderers, not just two
2017-07-21 14:10:02 -07:00
James R. Barlow
1aa34f5d2e
Make some interfaces accepting of both str-paths and Path objects
2017-07-21 13:28:30 -07:00
James R. Barlow
d792ef7222
Give the ‘auto’ renderer setting more test coverage
2017-06-13 13:13:58 -07:00
James R. Barlow
2c24f67deb
Rename “tess4” renderer to “sandwich” and make it default in Tess 3.05.01
...
Tesseract 3.05.01 backported the textonly_pdf=1 option, which allows the use
of this superior PDF renderer prior to 4.00 alpha. This means that
the tess4 name is no longer accurate, so call it a sandwich because of
its merge-preserve characteristic. Preserve the tess4 name. Fix the
documentation and tests to reflect this.
Make it the default, because it’s better. It does not have the issues
the “tesseract” renderer does prior to Tess 3.05.00 with rendering
PDFs that Ghostscript corrupts, and it produces better output without
re-rastering.
Deprecate some old stuff to avoid the test suite growing obscenely
large.
2017-06-13 13:09:12 -07:00
James R. Barlow
28341b755f
Refactor common test fixtures
2017-05-29 12:47:55 -07:00
James R. Barlow
08e47117a3
Rename pageinfo to pdfinfo
2017-05-19 15:48:23 -07:00
James R. Barlow
8694f8d2eb
Replace magic strings colorspace and encoding with Enums
2017-05-18 22:32:27 -07:00
James R. Barlow
56d2aae963
Refactor from ImageInfo index to attribute accessing
2017-05-18 18:39:14 -07:00
James R. Barlow
caee5b1428
Access PageInfo instance variables instead of dictionary
2017-05-18 17:12:04 -07:00
James R. Barlow
cd04ae6949
Refactor PdfInfo(str(filename)) -> PdfInfo(filename)
2017-05-18 16:43:50 -07:00
James R. Barlow
6a0b68298f
Refactor pdf_get_all_pageinfo to PdfInfo
2017-05-18 16:31:18 -07:00
James R. Barlow
e1e9135e93
Test suite: tidy up imports
2017-05-14 23:15:29 -07:00
James R. Barlow
96045e98f4
Update develop with master changes
...
We’re well out of the “trivial updates” zone
2017-05-11 22:54:27 -07:00
James R. Barlow
01b7205e2c
Ensure skipped pages are explained in sidecars
2017-05-11 00:43:36 -07:00
James R. Barlow
183eafa587
Implement sidecar text files (#126)
2017-05-10 15:22:44 -07:00
James R. Barlow
01a1c2b576
Implement --pdfa-image-compression to control Ghostscript’s compression
...
Fixes #163
2017-05-09 16:37:29 -07:00
James R. Barlow
c97ea1f2a9
Update high DPI test case to confirm the output image is not downsampled
2017-05-06 22:34:01 -07:00
James R. Barlow
93e802f473
Fix issue #163, color and grayscale images JPEG compressed when not needed
2017-05-06 22:27:25 -07:00
James R. Barlow
aa859a4139
Fix #156 - NoneType has no ‘getObject’ for pages with no /Contents
2017-05-01 15:46:15 -07:00
James R. Barlow
b9b12e2879
Ensure that ocrmypdf stops and reports an error if Ghostscript fails
...
Past behavior was to continue and let ruffus puke eventually
2017-05-01 15:44:21 -07:00
James R. Barlow
554fcc8b9d
Add test case for #152
2017-04-18 15:20:25 -07:00
James R. Barlow
89599b4812
Drop Python 3.4 compatibility
2017-03-29 15:46:53 -07:00
James R. Barlow
88ef2718f1
Reject high Unicode metadata at command line
...
Ghostscript 9.21 does not seem to accept Unicode above U+FFFF. Previous
versions did, but it now exits with a rangecheck error (-15).
Reject on the command line for now. Complete fix would also need to
check input PDF’s metadata.
2017-03-28 11:08:38 -07:00
James R. Barlow
e71e8ca3ad
Workaround for GS VMerror -25 bug
...
Avoid inserting docinfo keys that would be translated to null strings,
to avoid running afoul of
https://bugs.ghostscript.com/show_bug.cgi?id=697684
2017-03-28 11:05:43 -07:00
James R. Barlow
199de96cff
Ghostscript 9.21 seems to have a regression related to Unicode metadata
2017-03-24 15:15:46 -07:00
James R. Barlow
8ddbe81513
Fix issue #147: unpaper loses DPI information, affects --pdf-renderer tess4
2017-03-24 13:23:03 -07:00
James R. Barlow
f035cb1088
Fixed issue #142 — closed streams raise an exception on fork attempt
2017-03-13 15:52:57 -07:00
James R. Barlow
72660d0dec
macOS: skip the one test that needs poppler, to save installing poppler
2017-03-11 17:03:26 -08:00
James R. Barlow
4a1fec8328
Improvements to macOS test and work on homebrew tap autobrew
...
Squashed commits:
[3f06c1e] Try setting up homebrew tap autobuilding
[01532f1] Strict mode error in brew
2017-03-11 17:00:54 -08:00
James R. Barlow
7cd2770a13
Fix issue #137 - proportions of non-square resolution distorted
...
Distortion mainly affected --force-ocr
2017-02-26 17:13:16 -08:00
James R. Barlow
d1a0065ef8
Create test case for Form XObjects
2017-02-14 12:51:15 -08:00
James R. Barlow
005216bc57
Support ocrmypdf-tess4
2017-01-29 18:26:52 -08:00
James R. Barlow
8c17c9918e
Add documentation and test cases for --tesseract-config
...
This parameter has existed for a long time but never really got any
attention.
2017-01-28 22:06:51 -08:00
James R. Barlow
9a15a4db10
Ensure specified destination is writable before starting pipeline process
2017-01-26 22:08:24 -08:00
James R. Barlow
55aeaec293
Autorotation check: Replace duplicated tests with parameterized test
2017-01-26 18:07:59 -08:00
James R. Barlow
f6df1fb40c
Fix test suite regression: output files dumped in tests/resources
2017-01-26 18:07:09 -08:00
James R. Barlow
b889a89c36
Fix remaining 3.4/3.5 regressions
2017-01-26 17:53:27 -08:00
James R. Barlow
1976dc6f30
Fix issue #121 “pop from empty list” (content stream parsing error)
2017-01-26 17:24:40 -08:00
James R. Barlow
02fba02d31
Refactor test suite to use fixtures to manage paths
2017-01-26 16:38:59 -08:00
James R. Barlow
fb9e7c82f6
Move duplicate test code into common namespace
2017-01-26 13:36:52 -08:00
James R. Barlow
b8767e5ba9
Rename exe -> exec, more Unix-y and suggestive
2016-12-10 15:34:00 -08:00
James R. Barlow
d33a50660d
Replace most sys.exit() with raising exceptions
...
Because ruffus doesn’t handle exceptions well, I tended to call sys.exit
to make sure we got out of Dodge when needed. However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now, everyone else has to
throw an exception.
For now the worker raising a fatal exception is logging messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object so we
just check “what is the name of the exception class that caused ruffus
to throw a RethrownJobError”?
Also fixed along the way was the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 on failure to find, which is truthy).
2016-12-10 15:24:24 -08:00
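The str.find pitfall described in the commit above can be illustrated with a minimal sketch. The function names and sample strings here are hypothetical and are not taken from the ocrmypdf source; the sketch only shows why using the result of str.find directly as a condition misbehaves, and one common alternative.

```python
def mentions_buggy(output: str, needle: str) -> bool:
    # Buggy pattern: str.find returns -1 when the needle is absent,
    # and -1 is truthy. It returns 0 (falsy) when the needle is at
    # the very start of the string, so the test is inverted there too.
    return bool(output.find(needle))


def mentions(output: str, needle: str) -> bool:
    # The "in" operator expresses a substring test directly.
    return needle in output


if __name__ == "__main__":
    for text in ("all good", "error: encrypted", "fatal error"):
        print(repr(text), mentions_buggy(text, "error"), mentions(text, "error"))
```

An equivalent fix that keeps str.find is to compare explicitly, as in `output.find(needle) != -1`.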
James R. Barlow
4ee9658e97
Move external program wrappers to ocrmypdf.exe package
2016-12-09 16:54:24 -08:00