OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-01-07 04:32:45 +00:00

Author	SHA1	Message	Date
James R. Barlow	9a15a4db10	Ensure specified destination is writable before starting pipeline process	2017-01-26 22:08:24 -08:00
James R. Barlow	1976dc6f30	Fix issue #121 “pop from empty list” (content stream parsing error)	2017-01-26 17:24:40 -08:00
James R. Barlow	02fba02d31	Refactor test suite to use fixtures to manage paths	2017-01-26 16:38:59 -08:00
James R. Barlow	bad67c6dc5	Rename ‘tesstop’ to ‘tess4’. There’s no reason text-only PDF shouldn’t become the default for tesseract 4.	2017-01-26 12:28:51 -08:00
James R. Barlow	ac40426971	Implement “tesstop” (tesseract v4 text-only pages - working name)	2017-01-20 17:16:01 -08:00
James R. Barlow	7acfaf6d34	pipeline: rename some of the stages, for clarity	2017-01-20 17:15:00 -08:00
James R. Barlow	99e47c9c04	tesseract: add support for using v4 textonly_pdf feature	2017-01-20 17:06:23 -08:00
James R. Barlow	6cc5135d2d	Output to stdout: ensure stdout is flushed to prevent truncation errors	2017-01-19 16:41:10 -08:00
James R. Barlow	d4c72b371f	Forward --oem argument to tesseract 4	2017-01-18 21:37:50 -08:00
James R. Barlow	18b6f05657	Resolve issue #124 - poor performance with Tesseract v4. It seems that Tesseract v4, on a platform with OpenMP working correctly, will perform poorly with ocrmypdf because each process will also soak up all available CPUs. Running N^2 processes/threads on an N-core CPU where each wants 100% of CPU turns out to be detrimental. So, we restrict ocrmypdf w/tessv4 to a single Tesseract process at a time, for now. An alternative may be to limit OpenMP threads if throughput is higher.	2017-01-18 17:52:12 -08:00
James R. Barlow	c42d9baa26	tesseract: for v4, use --psm while keeping -psm for v3. At the moment v4 accepts both, but who knows if this will get dropped, so do as documented for each version.	2017-01-18 17:43:47 -08:00
James R. Barlow	6e27ecd2b9	Finalize ‘exec’ migration and make it backward compatible for now	2017-01-18 17:40:50 -08:00
James R. Barlow	f246779b8e	pdfa: documentation, remove from __future__	2016-12-12 15:10:10 -08:00
James R. Barlow	a7d8cdf061	Don’t copy pageinfo - job manager already provides a copy of real pdfinfo	2016-12-12 15:09:41 -08:00
James R. Barlow	620745c812	pipeline: don’t use qpdf to check page count again. We already know the number of pages at this stage.	2016-12-12 15:09:11 -08:00
James R. Barlow	b8767e5ba9	Rename exe -> exec, more Unix-y and suggestive	2016-12-10 15:34:00 -08:00
James R. Barlow	d33a50660d	Replace most sys.exit() with raising exceptions. Because ruffus doesn’t handle exceptions well, I tended to call sys.exit to make sure we got out of Dodge when needed. However, sys.exit is not ideal for the Python API this is moving towards, so this introduces proper exceptions for the various cases that retain suggested error codes. Only __main__.py should call sys.exit now; everyone else has to throw an exception. For now the worker raising a fatal exception logs messages rather than passing an exception object with the fatal error message, mainly because ruffus doesn’t properly marshal the exception object, so we just check the name of the exception class that caused ruffus to throw a RethrownJobError. Also fixed along the way were the wrong return code being shown for encrypted PDF checking, and incorrect use of str.find (e.output.find) in boolean logic (str.find returns -1 on failure to find, which is truthy).	2016-12-10 15:24:24 -08:00
James R. Barlow	4ee9658e97	Move external program wrappers to ocrmypdf.exe package	2016-12-09 16:54:24 -08:00
James R. Barlow	dd1b84e7ba	More refactoring - helpers.py	2016-12-09 16:31:08 -08:00
James R. Barlow	4c677e6c47	Extract pipeline out of __main__.py and into pipeline.py. This leaves __main__.py to handle command line arguments while pipeline.py runs the pipeline - mostly. They are still somewhat intertwined, with __main__.py doing essential things for pipeline.py, etc., and some helper functions that could go in their own module. All tests pass after this major refactor.	2016-12-09 16:17:12 -08:00
James R. Barlow	f0f889440b	Merge branch 'master' into feature/ooruffus	2016-12-08 16:36:03 -08:00
James R. Barlow	4d3b44d6df	ghostscript: clean up harmless error message printed for overprint. Redirect stderr->stdout to hopefully make GS output easier to work with overall, since the previous code didn’t seem to account for mixed use properly.	2016-12-08 16:19:15 -08:00
James R. Barlow	e57aa0eee2	pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”. And add new test case for this.	2016-12-08 16:06:53 -08:00
James R. Barlow	097a69d07f	pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”. And add new test case for this.	2016-12-08 16:04:14 -08:00
James R. Barlow	a81ce87a50	Remove non-reentrant options checking and logging setup	2016-12-05 14:13:36 -08:00
James R. Barlow	ff16a00a3d	Remove test for Pillow JPEG and PNG. As of 3.1.1, our minimum version, these codecs are now required by default for a successful installation, effectively solving the problem of Pillow installed without libjpeg/libpng.	2016-12-03 14:25:46 -08:00
James R. Barlow	be0fa35d14	Merge branch 'master' into feature/ooruffus	2016-12-03 14:02:43 -08:00
James R. Barlow	c35ec0b4aa	ghostscript: more effort at error logging	2016-12-03 00:22:03 -08:00
James R. Barlow	03aaf575dc	v4.3.3 release notes, fix more gs 9.20 issues	2016-12-02 16:26:34 -08:00
James R. Barlow	9a060579ba	Move work_folder into multiprocessing manager	2016-12-02 01:39:17 -08:00
James R. Barlow	d40a5c4f7a	Remove all remaining traces of ‘options’ global state from task runners	2016-12-02 01:31:57 -08:00
James R. Barlow	21f7dc3377	Distribute ‘options’ to worker processes via the multiprocessing manager	2016-12-02 01:06:11 -08:00
James R. Barlow	43c13a1ed9	Replace pdfinfo, pdfinfo_lock with multiprocessing manager. Using a context manager to guard the pdfinfo list makes the lock unnecessary. (Although it was probably unnecessary in the first place anyway.)	2016-12-01 23:36:30 -08:00
James R. Barlow	6bc3f189e1	Remove “WrappedLogger” - does not do anything useful. Never really investigated the reason why ruffus returns a mutex to go along with its logger. It seems that the mutex is only needed if one wanted to make multiple successive calls to a log function and have them appear atomically. It is not needed to protect the logger proxy because accessing the proxy triggers IPC in the child process that handles the multiprocessing.Manager() object. The logging wrapper only logs one line at a time, so the mutex does not actually protect logging sequence. Cut it. Also manager.Lock() returns a threading.Lock object, so the purpose of it is actually to help processes share a thread-level lock. It would be more appropriate to use a semaphore-based multiprocessing.Lock.	2016-12-01 15:27:07 -08:00
James R. Barlow	2c5437135c	Remove temporary re_symlink logging shim	2016-12-01 00:31:42 -08:00
James R. Barlow	444da02523	Fix mistake made in converting pipeline; incredibly, all tests pass now	2016-12-01 00:30:19 -08:00
James R. Barlow	00e8af2381	Reactivate the pipeline; surprisingly works in quick test	2016-12-01 00:03:03 -08:00
James R. Barlow	401b21864f	Convert to object oriented ruffus syntax (does not run). I experimented with the idea of using asyncio-based processing but realized that it does not solve the import-time binding problem that is the real issue. Therefore the simpler refactoring is to convert to ruffus-oo syntax and get things working again. build_pipeline() is really ugly at the moment. The old syntax had its advantages. This test reproduces the complete pipeline graph but does not work otherwise.	2016-11-30 23:58:26 -08:00
James R. Barlow	de939951d4	Record version in debug log	2016-11-29 15:30:50 -08:00
James R. Barlow	7725d16a26	Fix exception on inline stencil masks with no /CS attribute	2016-11-24 22:37:00 -08:00
James R. Barlow	23c95e9660	ghostscript: elide overprinting to fix PDF/A errors in GS 9.20. It looks like GS 9.19 can incorrectly set overprinting for the text layer even though this makes no sense in PDF/A, or at least someone produced PDFs that have this after a Tesseract PDF -> GS PDF/A conversion. GS 9.20 complains about this. Instead of aborting, elide the feature. See http://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=094d5a1880f1cb9ed320ca9353eb69436e09b594 and issue #107. It looks like it is better to elide features and warn about elision rather than abort with an error.	2016-11-10 14:48:02 -08:00
James R. Barlow	eecab9b95d	pdfa: fix KeyError on pdfa_dict if document has some xmp metadata but not exactly what we’re looking for	2016-11-09 05:41:12 -08:00
James R. Barlow	bb91393b85	Fix “deskew-rotate” bug. Turns out this occurred in any case where pdf-renderer hocr was used and a tesseract timeout or error occurred. We created a replacement page based on the unrotated page dimensions instead of the input image’s dimensions.	2016-11-07 14:17:31 -08:00
James R. Barlow	cc9c0d819e	Add test case for documents that get rotated incorrectly after deskew	2016-11-07 14:15:03 -08:00
James R. Barlow	a72b8caf47	Update documentation on other languages, multilingual documents	2016-11-07 14:14:06 -08:00
James R. Barlow	c096b4ca8c	Make debug dump of pageinfo at the end of processing readable	2016-11-04 02:23:02 -07:00
James R. Barlow	427add3008	Add @posttask debug hooks	2016-11-03 18:15:21 -07:00
James R. Barlow	c45871700d	Fix bug: LeptonicaErrorTrap() leaks file handles	2016-11-03 15:51:27 -07:00
James R. Barlow	73b88a0a6f	More work on documentation	2016-10-28 01:22:40 -07:00
James R. Barlow	cab65d1f11	pageinfo: add a python3.4 implementation of isclose()	2016-10-28 00:31:04 -07:00

305 Commits