2676 Commits

Author SHA1 Message Date
James R. Barlow
68aef489de Merge branch 'master' (4.3.5, Python 3.6 support) into develop
# Conflicts:
#	dev_requirements.txt
#	requirements.txt
2017-01-20 14:25:28 -08:00
James R. Barlow
3f9adcd5e0 Document idea for producing companion text files 2017-01-19 16:48:05 -08:00
James R. Barlow
6cc5135d2d Output to stdout: ensure stdout is flushed to prevent truncation errors 2017-01-19 16:41:10 -08:00
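The truncation issue in the commit above can be illustrated with a minimal sketch. It assumes output is written as bytes to the binary layer of stdout; the helper name `write_output` and the sample payload are hypothetical, not taken from the project.

```python
import io
import sys


def write_output(data: bytes, stream=None) -> int:
    """Write binary data to a stream and flush it before returning.

    Hypothetical helper: defaults to the binary buffer underneath
    sys.stdout so that bytes are not passed through text encoding.
    """
    if stream is None:
        stream = sys.stdout.buffer
    written = stream.write(data)
    # Push buffered bytes out now, so nothing is left unwritten if the
    # process exits or the pipe is closed shortly afterwards.
    stream.flush()
    return written


# Demonstration against an in-memory stream instead of the real stdout.
demo_stream = io.BytesIO()
demo_written = write_output(b"%PDF-1.4\n", demo_stream)
```

Passing the stream as a parameter keeps the helper testable without capturing the real stdout.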
James R. Barlow
d4c72b371f Forward --oem argument to tesseract 4 2017-01-18 21:37:50 -08:00
James R. Barlow
18b6f05657 Resolve issue #124 - poor performance with Tesseract v4
It seems that Tesseract v4 on a platform with OpenMP working correctly
will perform poorly with ocrmypdf because each Tesseract process will
also soak up all available CPUs. Running N^2 processes/threads on an
N-core CPU where each wants 100% of CPU turns out to be detrimental.

So, we restrict ocrmypdf w/tessv4 to a single Tesseract process at a
time, for now. An alternative may be to limit OpenMP threads, if that
gives higher throughput.
2017-01-18 17:52:12 -08:00
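The alternative mentioned in the commit above, limiting OpenMP threads, could be sketched as follows. This is an assumption-laden illustration and not what the commit does: `tesseract_env` and `threads_per_process` are hypothetical helpers, and the sketch relies on the standard OpenMP environment variable `OMP_THREAD_LIMIT` being honored by the OpenMP runtime that Tesseract was built against.

```python
import os


def threads_per_process(cpu_count: int, processes: int) -> int:
    """Split the available cores evenly, never going below one thread."""
    if cpu_count < 1 or processes < 1:
        raise ValueError("cpu_count and processes must be at least 1")
    return max(1, cpu_count // processes)


def tesseract_env(max_threads: int, base_env=None) -> dict:
    """Return a copy of the environment that caps OpenMP threads.

    Hypothetical helper: the returned dict is meant to be passed as the
    env= argument of subprocess.run() when launching a child process.
    """
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(max_threads)
    return env
```

With N worker processes on an N-core machine, `threads_per_process(N, N)` gives one thread each, which avoids the N^2 oversubscription described in the commit message. Whether this beats a single multi-threaded process would need to be measured.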
James R. Barlow
c42d9baa26 tesseract: for v4, use --psm while keeping -psm for v3
At the moment v4 accepts both, but who knows if this will get dropped,
so do as documented for each version.
2017-01-18 17:43:47 -08:00
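The version-dependent flag choice described in the commit above might look like this minimal sketch. The function name `psm_args` and the assumption that the version string starts with a numeric major component are illustrative, not taken from the project's code.

```python
def psm_args(tesseract_version: str, page_seg_mode: int) -> list:
    """Build the page segmentation mode arguments for a Tesseract version.

    Hypothetical helper: uses --psm for major version 4 and later, and
    the single-dash -psm spelling for version 3, as the commit describes.
    """
    # Assumes a version string such as "3.04.01" or "4.00.00alpha",
    # where everything before the first dot is the major version.
    major = int(tesseract_version.split(".")[0])
    flag = "--psm" if major >= 4 else "-psm"
    return [flag, str(page_seg_mode)]
```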
James R. Barlow
6e27ecd2b9 Finalize ‘exec’ migration and make it backward compatible for now 2017-01-18 17:40:50 -08:00
James R. Barlow
482692396e Add installation instructions for Ubuntu 16.04 2017-01-17 08:17:56 -08:00
James R. Barlow
c48acf165a v4.3.5: Python 3.6 compatibility v4.3.5 2017-01-03 00:45:33 -08:00
James R. Barlow
9e004c3ec0 Another attempt at py 3.4/3.5
Revert to exactly what the previous passing build specified.
2017-01-03 00:34:26 -08:00
James R. Barlow
7be4e9c919 fix setuptools-scm for py 3.4, 3.5 2017-01-03 00:25:57 -08:00
James R. Barlow
5ec38a4bed Update requirements files and documentation for Python 3.6 - no code changes 2017-01-03 00:11:34 -08:00
James R. Barlow
f246779b8e pdfa: documentation, remove from __future__ 2016-12-12 15:10:10 -08:00
James R. Barlow
a7d8cdf061 Don’t copy pageinfo - job manager already provides a copy of real pdfinfo 2016-12-12 15:09:41 -08:00
James R. Barlow
620745c812 pipeline: don’t use qpdf to check page count again
We already know the number of pages at this stage.
2016-12-12 15:09:11 -08:00
James R. Barlow
b8767e5ba9 Rename exe -> exec, more Unix-y and suggestive 2016-12-10 15:34:00 -08:00
James R. Barlow
d33a50660d Replace most sys.exit() with raising exceptions
Because ruffus doesn’t handle exceptions well I tended to call sys.exit
to make sure we got out of dodge when needed.  However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now, everyone else has to
throw an exception.

For now the worker raising a fatal exception is logging messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object so we
just check “what is the name of the exception class that caused ruffus
to throw a RethrownJobError”?

Also fixed along the way was the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 on failure to find, which is truthy).
2016-12-10 15:24:24 -08:00
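Two points from the commit above can be sketched together: exceptions that carry a suggested exit code so that only the entry point decides how to exit, and the `str.find` pitfall in boolean logic. All class names, function names, and the exit code value below are hypothetical placeholders, not the project's actual definitions.

```python
class ExitCodeError(Exception):
    """Base class for errors that suggest a process exit code."""
    exit_code = 1  # placeholder default


class EncryptedPdfDemoError(ExitCodeError):
    exit_code = 2  # arbitrary placeholder value for illustration


def run_main(task) -> int:
    """Run a task and translate known exceptions into an exit code.

    Only the caller of run_main (the entry point) would pass the result
    to sys.exit(); library code raises exceptions instead.
    """
    try:
        task()
    except ExitCodeError as e:
        return e.exit_code
    return 0


def mentions_buggy(output: str, needle: str) -> bool:
    # Bug: find() returns -1 when the needle is absent, and -1 is truthy;
    # it returns 0 for a match at the start, and 0 is falsy.
    return bool(output.find(needle))


def mentions(output: str, needle: str) -> bool:
    # The membership operator states the intent directly.
    return needle in output


def encrypted_task():
    raise EncryptedPdfDemoError("input PDF is encrypted")
```

A command-line entry point could then end with `sys.exit(run_main(some_task))`, while callers using the code as a library catch the exception types directly.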
James R. Barlow
4ee9658e97 Move external program wrappers to ocrmypdf.exe package 2016-12-09 16:54:24 -08:00
James R. Barlow
dd1b84e7ba More refactoring - helpers.py 2016-12-09 16:31:08 -08:00
James R. Barlow
4c677e6c47 Extract pipeline out of __main__.py and into pipeline.py
This leaves __main__.py to handle command line arguments while pipeline.py
runs the pipeline - mostly. They are still somewhat intertwined, with
__main__.py doing essential things for pipeline.py, etc., and some
helper functions that could go in their own module.

All tests pass after this major refactor.
2016-12-09 16:17:12 -08:00
James R. Barlow
f0f889440b Merge branch 'master' into feature/ooruffus 2016-12-08 16:36:03 -08:00
James R. Barlow
cc9ceaeb74 v4.3.4: release notes v4.3.4 2016-12-08 16:34:09 -08:00
James R. Barlow
ad2fa8d1d7 Fix MANIFEST for .png 2016-12-08 16:25:04 -08:00
James R. Barlow
adc1580742 Help py.test collect output in more cases 2016-12-08 16:21:07 -08:00
James R. Barlow
4d3b44d6df ghostscript: cleanup harmless error message printed for overprint
Redirect stderr->stdout to hopefully make GS output easier to work with
overall, since the previous code didn’t seem to account for mixed use
properly.
2016-12-08 16:19:15 -08:00
James R. Barlow
e57aa0eee2 pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:06:53 -08:00
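The error message quoted in the commit above can be reproduced with the standard `decimal` module. The sketch below only shows how the exception arises when a quantized result needs more digits than the context precision allows. The input that triggered it in pageinfo and the fix actually applied are not stated in the log, so the values and the precision-raising workaround here are assumptions for illustration.

```python
from decimal import Decimal, InvalidOperation, localcontext


def quantize_fails(value: Decimal, step: Decimal) -> bool:
    """Return True if quantize() raises InvalidOperation for these inputs."""
    try:
        value.quantize(step)
    except InvalidOperation:
        return True
    return False


def quantize_with_precision(value: Decimal, step: Decimal, prec: int) -> Decimal:
    """Quantize inside a local context with a chosen precision.

    One possible workaround, not necessarily the one used in the commit.
    """
    with localcontext() as ctx:
        ctx.prec = prec
        return value.quantize(step)


# With the default precision of 28 digits, rounding 1e30 to two decimal
# places would need 33 digits, so quantize() signals InvalidOperation.
big = Decimal("1e30")
cents = Decimal("0.01")
```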
James R. Barlow
1ae1d116c7 Make setup.py license internally consistent 2016-12-08 16:06:31 -08:00
James R. Barlow
097a69d07f pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:04:14 -08:00
James R. Barlow
a81ce87a50 Remove non-reentrant options checking and logging setup 2016-12-05 14:13:36 -08:00
James R. Barlow
88be0d43a0 Make setup.py license internally consistent 2016-12-03 21:37:24 -08:00
James R. Barlow
ff16a00a3d Remove test for Pillow JPEG and PNG
As of 3.1.1, our minimum version, these codecs are now required by
default for a successful installation, effectively solving the problem
of Pillow installed without libjpeg/libpng.
2016-12-03 14:25:46 -08:00
James R. Barlow
8982b3e1e2 Update requirements
-update requirements.txt and dev_requirements.txt to more recent versions
-setup.py updated to Ubuntu 14.04 rather than 12.04 backports
-request at least Pillow 3.1.1 now (since this makes jpeg/png mandatory)
2016-12-03 14:14:07 -08:00
James R. Barlow
be0fa35d14 Merge branch 'master' into feature/ooruffus 2016-12-03 14:02:43 -08:00
James R. Barlow
9f51ed9d01 Finalize v4.3.3 release notes v4.3.3 2016-12-03 00:39:24 -08:00
James R. Barlow
731e6792c7 Add test cases for Ghostscript PDF/A warnings 2016-12-03 00:32:09 -08:00
James R. Barlow
c35ec0b4aa ghostscript: more effort at error logging 2016-12-03 00:22:03 -08:00
James R. Barlow
03aaf575dc v4.3.3 release notes, fix more gs 9.20 issues 2016-12-02 16:26:34 -08:00
James R. Barlow
9a060579ba Move work_folder into multiprocessing manager 2016-12-02 01:39:17 -08:00
James R. Barlow
d40a5c4f7a Remove all remaining traces of ‘options’ global state from task runners 2016-12-02 01:31:57 -08:00
James R. Barlow
21f7dc3377 Distribute ‘options’ to worker processes via the multiprocessing manager 2016-12-02 01:06:11 -08:00
James R. Barlow
43c13a1ed9 Replace pdfinfo, pdfinfo_lock with multiprocessing manager
Using a context manager to guard the pdfinfo list makes the lock
unnecessary. (Although it was probably unnecessary in the first place
anyway.)
2016-12-01 23:36:30 -08:00
James R. Barlow
6bc3f189e1 Remove “WrappedLogger” - does not do anything useful
Never really investigated the reason why ruffus returns a mutex to go
along with its logger. It seems that the mutex is only needed if one
wanted to make multiple successive calls to a log function and have
them appear atomically. It is not needed to protect the logger
proxy because accessing the proxy triggers IPC in the child process
that handles the multiprocessing.Manager() object.

The logging wrapper only logs one line at a time, so the mutex does not
actually protect logging sequence. Cut it.

Also manager.Lock() returns a threading.Lock object so the purpose of it
is actually to help processes share a thread-level lock. It would be
more appropriate to use a semaphore based multiprocessing.Lock.
2016-12-01 15:27:07 -08:00
James R. Barlow
2c5437135c Remove temporary re_symlink logging shim 2016-12-01 00:31:42 -08:00
James R. Barlow
444da02523 Fix mistake made in converting pipeline; incredibly, all tests pass now 2016-12-01 00:30:19 -08:00
James R. Barlow
00e8af2381 Reactivate the pipeline; surprisingly works in quick test 2016-12-01 00:03:03 -08:00
James R. Barlow
401b21864f Convert to object oriented ruffus syntax (does not run)
I experimented with the idea of using asyncio-based processing but
realized that that does not solve the import time binding problem
that is the real issue. Therefore the simpler refactoring is to convert
to ruffus-oo syntax and get things working again.

build_pipeline() is really ugly at the moment. The old syntax had its
advantages.

This test reproduces the complete pipeline graph but does not work
otherwise.
2016-11-30 23:58:26 -08:00
James R. Barlow
de939951d4 Record version in debug log 2016-11-29 15:30:50 -08:00
James R. Barlow
7725d16a26 Fix exception on inline stencil masks with no /CS attribute 2016-11-24 22:37:00 -08:00
James R. Barlow
8a74408d83 Add security suggestions 2016-11-21 20:58:31 -08:00
James R. Barlow
3d0dc95a06 Moved venvs 2016-11-21 20:40:22 -08:00