943 Commits

Author SHA1 Message Date
James R. Barlow
5480da4f04 Additional docs updates for v4.4 v4.4 2017-01-26 23:02:44 -08:00
James R. Barlow
9a15a4db10 Ensure specified destination is writable before starting pipeline process 2017-01-26 22:08:24 -08:00
James R. Barlow
55aeaec293 Autorotation check: Replace duplicated tests with parameterized test 2017-01-26 18:07:59 -08:00
James R. Barlow
f6df1fb40c Fix test suite regression: output files dumped in tests/resources 2017-01-26 18:07:09 -08:00
James R. Barlow
b889a89c36 Fix remaining 3.4/3.5 regressions 2017-01-26 17:53:27 -08:00
James R. Barlow
1976dc6f30 Fix issue #121 “pop from empty list” (content stream parsing error) 2017-01-26 17:24:40 -08:00
James R. Barlow
e864c65d26 (Hopefully) Fix Path <-> py.path conversion on Py3.4/3.5 2017-01-26 17:19:15 -08:00
James R. Barlow
02fba02d31 Refactor test suite to use fixtures to manage paths 2017-01-26 16:38:59 -08:00
James R. Barlow
fb9e7c82f6 Move duplicate test code into common namespace 2017-01-26 13:36:52 -08:00
James R. Barlow
77d31bf646 Add renderers page (missed from previous) 2017-01-26 13:20:44 -08:00
James R. Barlow
29ca799bcf Move pytest.ini into setup.cfg 2017-01-26 12:45:38 -08:00
James R. Barlow
467b7f0163 Update docs for eventual v4.4 release 2017-01-26 12:29:11 -08:00
James R. Barlow
bad67c6dc5 Rename ‘tesstop’ to ‘tess4’
There’s no reason text-only PDF shouldn’t become the default for
tesseract 4.
2017-01-26 12:28:51 -08:00
James R. Barlow
ac40426971 Implement “tesstop” (tesseract v4 text-only pages - working name) 2017-01-20 17:16:01 -08:00
James R. Barlow
7acfaf6d34 pipeline: rename some of the stages, for clarity 2017-01-20 17:15:00 -08:00
James R. Barlow
99e47c9c04 tesseract: add support for using v4 textonly_pdf feature 2017-01-20 17:06:23 -08:00
James R. Barlow
d7904e2251 Travis now has Python 3.6, test against it 2017-01-20 14:26:17 -08:00
James R. Barlow
68aef489de Merge branch 'master' (4.3.5, Python 3.6 support) into develop
# Conflicts:
#	dev_requirements.txt
#	requirements.txt
2017-01-20 14:25:28 -08:00
James R. Barlow
3f9adcd5e0 Document idea for producing companion text files 2017-01-19 16:48:05 -08:00
James R. Barlow
6cc5135d2d Output to stdout: ensure stdout is flushed to prevent truncation errors 2017-01-19 16:41:10 -08:00
James R. Barlow
d4c72b371f Forward --oem argument to tesseract 4 2017-01-18 21:37:50 -08:00
James R. Barlow
18b6f05657 Resolve issue #124 - poor performance with Tesseract v4
It seems that Tesseract v4 on a platform with OpenMP working correctly
will perform poorly with ocrmypdf because each Tesseract process will
also soak up all available CPUs. Running N^2 processes/threads on an
N-core CPU where each wants 100% of CPU turns out to be detrimental.

So, we restrict ocrmypdf w/tessv4 to a single Tesseract process at a
time, for now. An alternative may be to limit OpenMP threads if
throughput is higher.
2017-01-18 17:52:12 -08:00
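
The alternative mentioned in the message above (limiting OpenMP threads instead of limiting processes) could look roughly like the sketch below. This is an illustration only, not the project's code: `tesseract_env` and its parameters are hypothetical names. `OMP_THREAD_LIMIT` is the standard OpenMP environment variable that caps the number of threads a program may use.

```python
import os


def tesseract_env(jobs, cpu_count=None, base_env=None):
    """Return an environment mapping that caps OpenMP threads per worker.

    Hypothetical helper: splits the available cores across `jobs`
    concurrent workers so that jobs * threads stays within the core count
    whenever jobs <= cores (each worker still gets at least one thread).
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    cores = cpu_count or os.cpu_count() or 1
    threads = max(1, cores // jobs)
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(threads)
    return env
```

The returned mapping would be passed as `env=` to `subprocess.run` when launching each worker. Whether this beats a single Tesseract process at a time depends on measured throughput, as the message notes.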
James R. Barlow
c42d9baa26 tesseract: for v4, use --psm while keeping -psm for v3
At the moment v4 accepts both, but who knows if this will get dropped,
so do as documented for each version.
2017-01-18 17:43:47 -08:00
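
The version-dependent flag described in the message above can be sketched as follows. `psm_args` is a hypothetical helper name used for illustration, and the caller is assumed to have already determined the Tesseract major version.

```python
def psm_args(tesseract_major_version, page_seg_mode):
    """Build the page segmentation mode arguments for a Tesseract call.

    Tesseract v4 documents the double-dash form (--psm), while v3
    documents the single-dash form (-psm).
    """
    flag = "--psm" if tesseract_major_version >= 4 else "-psm"
    return [flag, str(page_seg_mode)]
```

The returned list would be appended to the rest of the Tesseract command line.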
James R. Barlow
6e27ecd2b9 Finalize ‘exec’ migration and make it backward compatible for now 2017-01-18 17:40:50 -08:00
James R. Barlow
482692396e Add installation instructions for Ubuntu 16.04 2017-01-17 08:17:56 -08:00
James R. Barlow
c48acf165a v4.3.5: Python 3.6 compatibility v4.3.5 2017-01-03 00:45:33 -08:00
James R. Barlow
9e004c3ec0 Another attempt at py 3.4/3.5
Revert to exactly what the previous passing build specified.
2017-01-03 00:34:26 -08:00
James R. Barlow
7be4e9c919 fix setuptools-scm for py 3.4, 3.5 2017-01-03 00:25:57 -08:00
James R. Barlow
5ec38a4bed Update requirements files and documentation for Python 3.6 - no code changes 2017-01-03 00:11:34 -08:00
James R. Barlow
f246779b8e pdfa: documentation, remove from __future__ 2016-12-12 15:10:10 -08:00
James R. Barlow
a7d8cdf061 Don’t copy pageinfo - job manager already provides a copy of real pdfinfo 2016-12-12 15:09:41 -08:00
James R. Barlow
620745c812 pipeline: don’t use qpdf to check page count again
We already know the number of pages at this stage.
2016-12-12 15:09:11 -08:00
James R. Barlow
b8767e5ba9 Rename exe -> exec, more Unix-y and suggestive 2016-12-10 15:34:00 -08:00
James R. Barlow
d33a50660d Replace most sys.exit() with raising exceptions
Because ruffus doesn’t handle exceptions well I tended to call sys.exit
to make sure we got out of dodge when needed.  However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now, everyone else has to
throw an exception.

For now the worker raising a fatal exception is logging messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object, so we
just check “what is the name of the exception class that caused ruffus
to throw a RethrownJobError”?

Also fixed along the way was the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 on failure to find, which is truthy).
2016-12-10 15:24:24 -08:00
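
A minimal sketch of the two ideas in the message above: exceptions that carry a suggested exit code, with only the entry point converting them into a process exit status, and the `str.find` pitfall. All names here (`ExitCodeError`, `EncryptedPdfError`, `mentions`, `check_tool_output`, `main`) and the exit code values are hypothetical and chosen for illustration; they are not taken from the project.

```python
import sys


class ExitCodeError(Exception):
    """Hypothetical base class: an error that suggests a process exit code."""
    exit_code = 1


class EncryptedPdfError(ExitCodeError):
    exit_code = 8  # illustrative value only


def mentions(output, needle):
    # str.find returns -1 when the substring is absent, and -1 is truthy,
    # while a match at index 0 is falsy. So `if output.find(needle):` is
    # wrong in both directions; the `in` operator expresses the intent.
    return needle in output


def check_tool_output(output):
    # Library code raises instead of calling sys.exit().
    if mentions(output, "password"):
        raise EncryptedPdfError("input PDF is encrypted")


def main(output):
    # Only the entry point maps exceptions to an exit status.
    try:
        check_tool_output(output)
    except ExitCodeError as e:
        print(e, file=sys.stderr)
        return e.exit_code
    return 0
```

A command-line entry point would then end with `sys.exit(main(...))`, while API users can catch the exceptions directly.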
James R. Barlow
4ee9658e97 Move external program wrappers to ocrmypdf.exe package 2016-12-09 16:54:24 -08:00
James R. Barlow
dd1b84e7ba More refactoring - helpers.py 2016-12-09 16:31:08 -08:00
James R. Barlow
4c677e6c47 Extract pipeline out of __main__.py and into pipeline.py
This leaves __main__.py to handle command line arguments while pipeline.py
runs the pipeline - mostly. They are still somewhat intertwined, with
__main__.py doing essential things for pipeline.py, etc., and some
helper functions that could go in their own module.

All tests pass after this major refactor.
2016-12-09 16:17:12 -08:00
James R. Barlow
f0f889440b Merge branch 'master' into feature/ooruffus 2016-12-08 16:36:03 -08:00
James R. Barlow
cc9ceaeb74 v4.3.4: release notes v4.3.4 2016-12-08 16:34:09 -08:00
James R. Barlow
ad2fa8d1d7 Fix MANIFEST for .png 2016-12-08 16:25:04 -08:00
James R. Barlow
adc1580742 Help py.test collect output in more cases 2016-12-08 16:21:07 -08:00
James R. Barlow
4d3b44d6df ghostscript: cleanup harmless error message printed for overprint
Redirect stderr->stdout to hopefully make GS output easier to work with
overall, since the previous code didn’t seem to account for mixed use
properly.
2016-12-08 16:19:15 -08:00
James R. Barlow
e57aa0eee2 pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:06:53 -08:00
James R. Barlow
1ae1d116c7 Make setup.py license internally consistent 2016-12-08 16:06:31 -08:00
James R. Barlow
097a69d07f pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:04:14 -08:00
James R. Barlow
a81ce87a50 Remove non-reentrant options checking and logging setup 2016-12-05 14:13:36 -08:00
James R. Barlow
88be0d43a0 Make setup.py license internally consistent 2016-12-03 21:37:24 -08:00
James R. Barlow
ff16a00a3d Remove test for Pillow JPEG and PNG
As of 3.1.1, our minimum version, these codecs are now required by
default for a successful installation, effectively solving the problem
of Pillow installed without libjpeg/libpng.
2016-12-03 14:25:46 -08:00
James R. Barlow
8982b3e1e2 Update requirements
-update requirements.txt and dev_requirements.txt to more recent versions
-setup.py updated to Ubuntu 14.04 rather than 12.04 backports
-request at least Pillow 3.1.1 now (since this makes jpeg/png mandatory)
2016-12-03 14:14:07 -08:00
James R. Barlow
be0fa35d14 Merge branch 'master' into feature/ooruffus 2016-12-03 14:02:43 -08:00