305 Commits

Author SHA1 Message Date
James R. Barlow
9a15a4db10 Ensure specified destination is writable before starting pipeline process 2017-01-26 22:08:24 -08:00
James R. Barlow
1976dc6f30 Fix issue #121 “pop from empty list” (content stream parsing error) 2017-01-26 17:24:40 -08:00
James R. Barlow
02fba02d31 Refactor test suite to use fixtures to manage paths 2017-01-26 16:38:59 -08:00
James R. Barlow
bad67c6dc5 Rename ‘tesstop’ to ‘tess4’
There’s no reason text-only PDF shouldn’t become the default for
tesseract 4.
2017-01-26 12:28:51 -08:00
James R. Barlow
ac40426971 Implement “tesstop” (tesseract v4 text-only pages - working name) 2017-01-20 17:16:01 -08:00
James R. Barlow
7acfaf6d34 pipeline: rename some of the stages, for clarity 2017-01-20 17:15:00 -08:00
James R. Barlow
99e47c9c04 tesseract: add support for using v4 textonly_pdf feature 2017-01-20 17:06:23 -08:00
James R. Barlow
6cc5135d2d Output to stdout: ensure stdout is flushed to prevent truncation errors 2017-01-19 16:41:10 -08:00
James R. Barlow
d4c72b371f Forward --oem argument to tesseract 4 2017-01-18 21:37:50 -08:00
James R. Barlow
18b6f05657 Resolve issue #124 - poor performance with Tesseract v4
It seems that Tesseract v4 on a platform with OpenMP working correctly
will perform poorly with ocrmypdf because each Tesseract process will
also soak up all available CPUs. Running N^2 processes/threads on an
N-core CPU where each wants 100% of CPU turns out to be detrimental.

So, we restrict ocrmypdf w/tessv4 to a single Tesseract process at a
time, for now. An alternative may be to limit OpenMP threads if
throughput is higher.
2017-01-18 17:52:12 -08:00
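The alternative mentioned above, limiting OpenMP threads per process, could be sketched roughly as follows. This is only an illustration, not what the commit does: `openmp_env` is a hypothetical helper, and it assumes the OpenMP runtime that Tesseract is built against honors the standard `OMP_THREAD_LIMIT` environment variable.

```python
import os


def openmp_env(jobs, cpu_count=None, base_env=None):
    """Return an environment dict that caps OpenMP threads per child process.

    Hypothetical helper: splits the available CPUs evenly across `jobs`
    concurrent processes so that jobs * threads does not exceed the core count.
    """
    cpus = cpu_count or os.cpu_count() or 1
    threads = max(1, cpus // max(1, jobs))
    env = dict(os.environ if base_env is None else base_env)
    # OMP_THREAD_LIMIT is a standard OpenMP variable; whether a given
    # Tesseract build respects it is an assumption here.
    env["OMP_THREAD_LIMIT"] = str(threads)
    return env
```

The resulting dict would be passed as `env=` to `subprocess.run` when launching each Tesseract process.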
James R. Barlow
c42d9baa26 tesseract: for v4, use --psm while keeping -psm for v3
At the moment v4 accepts both, but who knows if this will get dropped,
so do as documented for each version.
2017-01-18 17:43:47 -08:00
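A minimal sketch of the version-dependent flag selection described above. The function name `psm_args` and the way the version string is parsed are assumptions for illustration, not the project's actual code.

```python
def psm_args(tesseract_version, page_seg_mode):
    """Build the page segmentation arguments for a given Tesseract version.

    Hypothetical helper: v4 and later get the documented `--psm` spelling,
    v3 keeps the older `-psm` spelling.
    """
    major = int(tesseract_version.split(".")[0])
    flag = "--psm" if major >= 4 else "-psm"
    return [flag, str(page_seg_mode)]
```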
James R. Barlow
6e27ecd2b9 Finalize ‘exec’ migration and make it backward compatible for now 2017-01-18 17:40:50 -08:00
James R. Barlow
f246779b8e pdfa: documentation, remove from __future__ 2016-12-12 15:10:10 -08:00
James R. Barlow
a7d8cdf061 Don’t copy pageinfo - job manager already provides a copy of real pdfinfo 2016-12-12 15:09:41 -08:00
James R. Barlow
620745c812 pipeline: don’t use qpdf to check page count again
We already know the number of pages at this stage.
2016-12-12 15:09:11 -08:00
James R. Barlow
b8767e5ba9 Rename exe -> exec, more Unix-y and suggestive 2016-12-10 15:34:00 -08:00
James R. Barlow
d33a50660d Replace most sys.exit() with raising exceptions
Because ruffus doesn’t handle exceptions well I tended to call sys.exit
to make sure we got out of dodge when needed.  However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now, everyone else has to
throw an exception.

For now the worker raising a fatal exception is logging messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object so we
just check “what is the name of the exception class that caused ruffus
to throw a RethrownJobError”?

Also fixed along the way were the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 on failure to find, which is truthy).
2016-12-10 15:24:24 -08:00
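The `str.find` pitfall named in the commit above can be illustrated with a short sketch. The function names and sample strings are hypothetical; only the behavior of `str.find` and the `in` operator is taken as given.

```python
def contains_buggy(output, needle):
    # str.find returns -1 when the needle is missing (truthy) and 0 when
    # the needle is at the very start (falsy), so this test is inverted
    # in exactly those cases.
    return bool(output.find(needle))


def contains_fixed(output, needle):
    # The `in` operator states the intent directly.
    return needle in output
```

With these definitions, `contains_buggy("no match here", "password")` reports a match that is not there, while `contains_fixed` gives the expected answer in both cases.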
James R. Barlow
4ee9658e97 Move external program wrappers to ocrmypdf.exe package 2016-12-09 16:54:24 -08:00
James R. Barlow
dd1b84e7ba More refactoring - helpers.py 2016-12-09 16:31:08 -08:00
James R. Barlow
4c677e6c47 Extract pipeline out of __main__.py and into pipeline.py
This leaves __main__.py to handle command line arguments while pipeline.py
runs the pipeline - mostly. They are still somewhat intertwined, with
__main__.py doing essential things for pipeline.py, etc., and some
helper functions that could go in their own module.

All tests pass after this major refactor.
2016-12-09 16:17:12 -08:00
James R. Barlow
f0f889440b Merge branch 'master' into feature/ooruffus 2016-12-08 16:36:03 -08:00
James R. Barlow
4d3b44d6df ghostscript: cleanup harmless error message printed for overprint
Redirect stderr->stdout to hopefully make GS output easier to work with
overall, since the previous code didn’t seem to account for mixed use
properly.
2016-12-08 16:19:15 -08:00
James R. Barlow
e57aa0eee2 pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:06:53 -08:00
James R. Barlow
097a69d07f pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:04:14 -08:00
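For context, this class of error arises when `Decimal.quantize` would need more digits than the active context precision allows. The sketch below only reproduces the error condition; `quantize_fails` is a hypothetical helper, and it makes no claim about how pageinfo was actually fixed.

```python
from decimal import Decimal, InvalidOperation, localcontext


def quantize_fails(value, exponent="0.01", precision=None):
    """Return True if quantizing `value` to `exponent` raises InvalidOperation."""
    with localcontext() as ctx:
        if precision is not None:
            ctx.prec = precision  # shrink precision to make the failure easy to trigger
        try:
            Decimal(value).quantize(Decimal(exponent))
        except InvalidOperation:
            # The quantized coefficient would exceed the context precision.
            return True
        return False
```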
James R. Barlow
a81ce87a50 Remove non-reentrant options checking and logging setup 2016-12-05 14:13:36 -08:00
James R. Barlow
ff16a00a3d Remove test for Pillow JPEG and PNG
As of 3.1.1, our minimum version, these codecs are now required by
default for a successful installation, effectively solving the problem
of Pillow installed without libjpeg/libpng.
2016-12-03 14:25:46 -08:00
James R. Barlow
be0fa35d14 Merge branch 'master' into feature/ooruffus 2016-12-03 14:02:43 -08:00
James R. Barlow
c35ec0b4aa ghostscript: more effort at error logging 2016-12-03 00:22:03 -08:00
James R. Barlow
03aaf575dc v4.3.3 release notes, fix more gs 9.20 issues 2016-12-02 16:26:34 -08:00
James R. Barlow
9a060579ba Move work_folder into multiprocessing manager 2016-12-02 01:39:17 -08:00
James R. Barlow
d40a5c4f7a Remove all remaining traces of ‘options’ global state from task runners 2016-12-02 01:31:57 -08:00
James R. Barlow
21f7dc3377 Distribute ‘options’ to worker processes via the multiprocessing manager 2016-12-02 01:06:11 -08:00
James R. Barlow
43c13a1ed9 Replace pdfinfo, pdfinfo_lock with multiprocessing manager
Using a context manager to guard the pdfinfo list makes the lock
unnecessary. (Although it was probably unnecessary in the first place
anyway.)
2016-12-01 23:36:30 -08:00
James R. Barlow
6bc3f189e1 Remove “WrappedLogger” - does not do anything useful
Never really investigated the reason why ruffus returns a mutex to go
along with its logger. It seems that the mutex is only needed if one
wanted to make multiple successive calls to a log function and have
them appear atomically. It is not needed to protect the logger
proxy because accessing the proxy triggers IPC in the child process
that handles the multiprocessing.Manager() object.

The logging wrapper only logs one line at a time, so the mutex does not
actually protect logging sequence. Cut it.

Also, manager.Lock() returns a threading.Lock object so the purpose of it
is actually to help processes share a thread-level lock. It would be
more appropriate to use a semaphore-based multiprocessing.Lock.
2016-12-01 15:27:07 -08:00
James R. Barlow
2c5437135c Remove temporary re_symlink logging shim 2016-12-01 00:31:42 -08:00
James R. Barlow
444da02523 Fix mistake made in converting pipeline; incredibly, all tests pass now 2016-12-01 00:30:19 -08:00
James R. Barlow
00e8af2381 Reactivate the pipeline; surprisingly works in quick test 2016-12-01 00:03:03 -08:00
James R. Barlow
401b21864f Convert to object oriented ruffus syntax (does not run)
I experimented with the idea of using asyncio-based processing but
realized that that does not solve the import time binding problem
that is the real issue. Therefore the simpler refactoring is to convert
to ruffus-oo syntax and get things working again.

build_pipeline() is really ugly at the moment. The old syntax had its
advantages.

This test reproduces the complete pipeline graph but does not work
otherwise.
2016-11-30 23:58:26 -08:00
James R. Barlow
de939951d4 Record version in debug log 2016-11-29 15:30:50 -08:00
James R. Barlow
7725d16a26 Fix exception on inline stencil masks with no /CS attribute 2016-11-24 22:37:00 -08:00
James R. Barlow
23c95e9660 ghostscript: elide overprinting to fix PDF/A errors in GS 9.20
It looks like GS 9.19 can incorrectly set overprinting for the text layer
even though this makes no sense in PDF/A, or at least someone produced
PDFs that have this after a Tesseract PDF -> GS PDF/A conversion. GS 9.20
complains about this. Instead of aborting, elide the feature.

See
http://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=094d5a1880f1cb9ed320ca9353eb69436e09b594
and
issue #107.

It looks like it is better to elide features and warn about elision rather
than abort with an error.
2016-11-10 14:48:02 -08:00
James R. Barlow
eecab9b95d pdfa: fix KeyError on pdfa_dict if document has some xmp metadata but
not exactly what we’re looking for
2016-11-09 05:41:12 -08:00
James R. Barlow
bb91393b85 Fix “deskew-rotate” bug.
Turns out this occurred in any case where pdf-renderer hocr was used
and a tesseract timeout or error occurred. We created a replacement
page based on the unrotated page dimensions instead of the input image’s
dimensions.
2016-11-07 14:17:31 -08:00
James R. Barlow
cc9c0d819e Add test case for documents that get rotated incorrectly after deskew 2016-11-07 14:15:03 -08:00
James R. Barlow
a72b8caf47 Update documentation on other languages, multilingual documents 2016-11-07 14:14:06 -08:00
James R. Barlow
c096b4ca8c Make debug dump of pageinfo at the end of processing readable 2016-11-04 02:23:02 -07:00
James R. Barlow
427add3008 Add @posttask debug hooks 2016-11-03 18:15:21 -07:00
James R. Barlow
c45871700d Fix bug: LeptonicaErrorTrap() leaks file handles 2016-11-03 15:51:27 -07:00
James R. Barlow
73b88a0a6f More work on documentation 2016-10-28 01:22:40 -07:00
James R. Barlow
cab65d1f11 pageinfo: add a python3.4 implementation of isclose() 2016-10-28 00:31:04 -07:00
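`math.isclose` only exists from Python 3.5 onward, so a 3.4-compatible stand-in is needed. The following is a sketch based on the comparison described in PEP 485, not the code from this commit; the default tolerances mirror those of `math.isclose`.

```python
import math


def isclose(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Approximate equality test in the style of PEP 485."""
    if rel_tol < 0.0 or abs_tol < 0.0:
        raise ValueError("tolerances must be non-negative")
    if a == b:
        # Covers exact matches, including equal infinities.
        return True
    if math.isinf(a) or math.isinf(b):
        # An infinity is only close to itself, which was handled above.
        return False
    diff = abs(a - b)
    return diff <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
```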