Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							7638a88a6a 
							
						 
					 
					
						
						
							
							Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does  
						
						 
						
						
						
						
					 
					
						2015-07-24 01:55:54 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							bed12d2021 
							
						 
					 
					
						
						
							
							Remove 'pdftoppm' renderer  
						
						 
						
						
						
						
Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as the only open source tool that can produce
a PDF/A file, while Poppler could be removed. pdftoppm has awkward
syntax, with some special handling needed for different versions. I have
found isolated rendering bugs with pdftoppm as well.

With that, I'm removing support for multiple rasterizers.

A minor advantage of pdftoppm is that its code produced JPEGs where
possible, but this can be achieved with gs.
						
						
					 
					
						2015-07-24 01:35:33 -07:00  
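As a rough illustration of the commit above, a single-rasterizer setup might build one Ghostscript command per page. This is a minimal sketch, not the project's actual code: the function name `ghostscript_raster_args`, the file names, and the default resolution are hypothetical, and the command is only constructed, not executed. The switches used (`-dSAFER`, `-dBATCH`, `-dNOPAUSE`, `-sDEVICE`, `-dFirstPage`, `-dLastPage`, `-r`, `-sOutputFile`) are standard Ghostscript options.

```python
import shlex


def ghostscript_raster_args(pdf_path, out_path, page_number,
                            dpi=300, device="png16m"):
    """Build (but do not run) a Ghostscript command that rasterizes one page.

    All parameter names and defaults are illustrative assumptions.
    """
    return [
        "gs",
        "-dQUIET", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        f"-sDEVICE={device}",           # e.g. png16m, pnggray, pngmono, jpeg
        f"-dFirstPage={page_number}",   # restrict output to a single page
        f"-dLastPage={page_number}",
        f"-r{dpi}",                     # resolution in dots per inch
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


args = ghostscript_raster_args("input.pdf", "page-0001.png", page_number=1)
print(shlex.join(args))
```

Passing `device="jpeg"` selects Ghostscript's JPEG output device, which is one way the JPEG behaviour mentioned in the commit message could be kept without pdftoppm.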
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							587569fcb6 
							
						 
					 
					
						
						
							
							Tidy up  
						
						 
						
						
						
						
					 
					
						2015-07-24 01:27:01 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							8c0dc9a06d 
							
						 
					 
					
						
						
							
							Platform independent search for iccprofiles for PDF/A  
						
						 
						
						
						
						
					 
					
						2015-07-24 01:18:46 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							289e4025ad 
							
						 
					 
					
						
						
							
							First successful PDF/A produced by new pipeline  
						
						 
						
						
						
						
					 
					
						2015-07-23 23:28:32 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							5476eafe4c 
							
						 
					 
					
						
						
							
							Rasterize PDF pages and generate .hocr files  
						
						 
						
						
						
						
					 
					
						2015-07-23 23:09:29 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							df32f283cd 
							
						 
					 
					
						
						
							
							Language checking  
						
						 
						
						
						
						
					 
					
						2015-07-23 18:38:59 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							68ecaac9cc 
							
						 
					 
					
						
						
							
							Add tesseract version check  
						
						 
						
						
						
						
					 
					
						2015-07-23 17:06:00 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							cffd4623ca 
							
						 
					 
					
						
						
							
							Add PDF/A validation  
						
						 
						
						
						
						
					 
					
						2015-07-23 17:05:34 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							6dc2782e80 
							
						 
					 
					
						
						
							
							Can now generate PDF/A files, multipage and single page  
						
						 
						
						
						
						
					 
					
						2015-07-23 04:57:31 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							5df187c086 
							
						 
					 
					
						
						
							
							Wrap a proxy around pdfinfo block so it can be passed around processes  
						
						 
						
						
						
						
					 
					
						2015-07-23 03:49:30 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							7fd172e41e 
							
						 
					 
					
						
						
							
							Get rid of chdir, replace deprecated @split with @subdivide  
						
						 
						
						
						
						
					 
					
						2015-07-23 03:09:03 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							619528a1b5 
							
						 
					 
					
						
						
							
							Try a method for passing along the pdfinfo struct  
						
						 
						
						
						
						
					 
					
						2015-07-23 02:39:42 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							596d468c14 
							
						 
					 
					
						
						
							
							Reinstate WrappedLogger with more multiprocessing fixes  
						
						 
						
						
						
						
					 
					
						2015-07-23 02:26:09 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							eddbf1060a 
							
						 
					 
					
						
						
							
							diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py  
						
						 
						
						
						
						
						index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ import ruffus.cmdline as cmdline
 from .hocrtransform import HocrTransform
 import warnings
+import multiprocessing
 warnings.simplefilter('ignore', pypdf.utils.PdfReadWarning)
@@ -96,7 +97,7 @@ debugging.add_argument(
     '-k', '--keep-temporary-files', action='store_true',
     help="keep temporary files (helpful for debugging)")
 debugging.add_argument(
-    '-g' ,'--debug-rendering', action='store_true',
+    '-g', '--debug-rendering', action='store_true',
     help="render each page twice with debug information on second page")
@@ -106,51 +107,19 @@ if not options.temp_folder:
     options.temp_folder = 'tmp'
-_logger, _logger_mutex = cmdline.setup_logging(__name__, options.log_file,
-                                               options.verbose)
+log, log_mutex = cmdline.setup_logging(__name__, options.log_file,
+                                       options.verbose)
-class WrappedLogger:
-
-    def __init__(self, my_logger, my_mutex):
-        self.logger = my_logger
-        self.mutex = my_mutex
-
-    def log(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.log(*args, **kwargs)
-
-    def debug(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.debug(*args, **kwargs)
-
-    def info(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.info(*args, **kwargs)
-
-    def warning(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.warning(*args, **kwargs)
-
-    def error(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.error(*args, **kwargs)
-
-    def critical(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.critical(*args, **kwargs)
-
-log = WrappedLogger(_logger, _logger_mutex)
-
-
-def re_symlink(input_file, soft_link_name, log=log):
+def re_symlink(input_file, soft_link_name, log, mutex):
     """
     Helper function: relinks soft symbolic link if necessary
     """
     if input_file == soft_link_name:
-        log.debug("Warning: No symbolic link made. You are using " +
-                     "the original data directory as the working directory.")
+        with mutex:
+            log.debug("Warning: No symbolic link made. You are using " +
+                      "the original data directory as the working directory.")
         return
@@ -161,12 +130,14 @@ def re_symlink(input_file, soft_link_name, log=log):
         try:
             os.unlink(soft_link_name)
         except:
-            log.debug("Can't unlink %s" % (soft_link_name))
+            with mutex:
+                log.debug("Can't unlink %s" % (soft_link_name))
     if not os.path.exists(input_file):
         raise Exception("trying to create a broken symlink to %s" % input_file)
-    log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
+    with mutex:
+        log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
     os.symlink( 
						
						
					 
					
						2015-07-23 02:22:12 -07:00  
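The diff above drops the wrapper class and instead passes the logger and its mutex explicitly, taking the lock around each call. A minimal sketch of that pattern follows. It uses only the standard library: `threading.Lock` stands in for the mutex returned by `cmdline.setup_logging`, and `debug_locked` and `ListHandler` are hypothetical helpers added here so the example can run on its own.

```python
import logging
import threading


class ListHandler(logging.Handler):
    """Collects log messages in a list so the example can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def debug_locked(log, mutex, message):
    # Hold the lock only for the duration of the logging call,
    # mirroring the "with mutex:" blocks in the diff.
    with mutex:
        log.debug(message)


log = logging.getLogger("ocr_sketch")
log.setLevel(logging.DEBUG)
log.propagate = False
handler = ListHandler()
log.addHandler(handler)
log_mutex = threading.Lock()  # stand-in for the mutex paired with the logger

debug_locked(log, log_mutex, "os.symlink(%s, %s)" % ("in.pdf", "tmp/in.pdf"))
```

Passing `log` and `mutex` as arguments, rather than relying on a module-level default, keeps each function usable from a worker process that receives both objects explicitly.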
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							33731a6864 
							
						 
					 
					
						
						
							
							Move pageinfo code out of the pipeline  
						
						 
						
						
						
						
					 
					
						2015-07-23 02:17:13 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							0c36cd2e24 
							
						 
					 
					
						
						
							
							Fix errors related to use of working directory  
						
						 
						
						
						
						
						Mainly work around the lack of @split(...output_dir) in ruffus 
						
						
					 
					
						2015-07-23 01:16:05 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							5cef1be26d 
							
						 
					 
					
						
						
							
							New pipeline runs, splits pages  
						
						 
						
						
						
						
					 
					
						2015-07-22 22:58:13 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							e89f482c3d 
							
						 
					 
					
						
						
							
							Fixes from early testing of new pipeline  
						
						 
						
						
						
						
					 
					
						2015-07-22 22:51:38 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							fe3e40305d 
							
						 
					 
					
						
						
							
							Learn to split PDF into pages  
						
						 
						
						
						
						
					 
					
						2015-07-22 22:46:00 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							a92b5ceb6b 
							
						 
					 
					
						
						
							
							Begin unifying main script and page script  
						
						 
						
						
						
						
					 
					
						2015-07-22 22:30:00 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							0e7e7d8437 
							
						 
					 
					
						
						
							
							Suppress the xref warning for now  
						
						 
						
						
						
						
					 
					
						2015-07-22 11:24:14 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							f47fa98f33 
							
						 
					 
					
						
						
							
							Fixes to colorspace and other inquiries  
						
						 
						
						
						
						
					 
					
						2015-07-22 11:24:06 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							d3d5879911 
							
						 
					 
					
						
						
							
							Replace pdfimages -list call to poppler with PyPDF test for image  
						
						 
						
						
						
						
						The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the parse string. It appears to trigger exponential
behavior in the underlying regex. In any case, replacing subprocesses
with native Python is usually better. 
						
						
					 
					
						2015-07-22 11:22:12 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							b2168e11db 
							
						 
					 
					
						
						
							
							Require Py3 for tests  
						
						 
						
						
						
						
					 
					
						2015-07-22 11:21:33 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							6d5d8be708 
							
						 
					 
					
						
						
							
							New test: check skew  
						
						 
						
						
						
						
					 
					
						2015-07-22 04:00:59 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							ce2dbdf372 
							
						 
					 
					
						
						
							
							Add another test  
						
						 
						
						
						
						
					 
					
						2015-07-22 03:16:19 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							ec8a35a7a6 
							
						 
					 
					
						
						
							
							Basic test cases  
						
						 
						
						
						
						
					 
					
						2015-07-22 02:59:25 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							f6577c22c3 
							
						 
					 
					
						
						
							
							Complete wrapping of logger/logger_mutex  
						
						 
						
						
						
						
					 
					
						2015-07-22 02:57:13 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							43d6c03093 
							
						 
					 
					
						
						
							
							Implement oversampling in ocrpage.py  
						
						 
						
						
						
						
					 
					
						2015-03-27 18:32:55 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							1870f116bb 
							
						 
					 
					
						
						
							
							More consistent spacing  
						
						 
						
						
						
						
					 
					
						2015-03-24 23:05:42 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							8b87def013 
							
						 
					 
					
						
						
							
							Don't presume two jobs  
						
						 
						
						
						
						
					 
					
						2015-03-24 23:04:49 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							de599d97b5 
							
						 
					 
					
						
						
							
							Tidy up readme  
						
						 
						
						
						
						
					 
					
						2015-03-24 23:04:33 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							5d7e6b45c4 
							
						 
					 
					
						
						
							
							Cleanup logger  
						
						 
						
						
						
						
					 
					
						2015-03-24 22:46:33 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							c6091bcfe1 
							
						 
					 
					
						
						
							
							Change python2 -> python3 for readlink()  
						
						 
						
						
						
						
					 
					
						2015-03-24 22:36:13 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							466a8a1318 
							
						 
					 
					
						
						
							
							It's now py3 that uses lxml, reportlab  
						
						 
						
						
						
						
					 
					
						2015-03-19 17:12:32 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							a99ba3b696 
							
						 
					 
					
						
						
							
							Add rudimentary support for combining OCR layer with existing content  
						
						 
						
						
						
						
						It appears to be very fragile due to weaknesses in PyPDF. A better
option is probably to use pdftk's watermark feature. 
						
						
					 
					
						2015-03-10 14:28:38 -07:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							9229f7c6cc 
							
						 
					 
					
						
						
							
							Add option to render text as invisible OCR text  
						
						 
						
						
						
						
Prior to this change, hocrtransform would render printable text (black
on white) and then a fully opaque image on top of the text. According to
the PDF spec, text that is the output of OCR should be marked invisible,
so that PDF viewers /know/ it's OCR output in a document that might mix
OCR and text overlays. Another benefit is that PDF viewers would know
to skip rendering text if they are not smart enough to figure out the
image will completely overwrite it.

However, for debugging, visible text is nice, so retain it as an option.
						
						
					 
					
						2015-02-22 12:43:27 -08:00  
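In PDF content streams, the text rendering mode is set with the `Tr` operator, and mode 3 means the glyphs are neither filled nor stroked, which is what makes a text layer invisible. The sketch below builds such a text block as a plain string to show the difference between the visible and invisible variants. It is an illustration only: `text_block`, the font resource name `/F1`, and the coordinates are hypothetical and do not come from hocrtransform.

```python
def text_block(text, x, y, font="/F1", size=12, invisible=True):
    """Return a PDF content-stream fragment that draws one line of text.

    Rendering mode 3 (neither fill nor stroke) hides the text;
    mode 0 (fill) draws it normally, which is handy for debugging.
    """
    render_mode = 3 if invisible else 0
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return f"BT {font} {size} Tf {render_mode} Tr {x} {y} Td ({escaped}) Tj ET"


hidden = text_block("Hello", 72, 720)
visible = text_block("Hello", 72, 720, invisible=False)
print(hidden)
print(visible)
```

The only difference between the two fragments is the operand of `Tr`, so keeping visible text as a debug option costs a single flag.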
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							bf114bb188 
							
						 
					 
					
						
						
							
							Clean up pixel transform logic with namedtuple  
						
						 
						
						
						
						
					 
					
						2015-02-21 14:14:34 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							b8eed2f861 
							
						 
					 
					
						
						
							
							More PEP8/lint  
						
						 
						
						
						
						
					 
					
						2015-02-21 13:00:46 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							ccb1e347be 
							
						 
					 
					
						
						
							
							Call HocrTransform directly instead of through a subprocess  
						
						 
						
						
						
						
					 
					
						2015-02-20 17:20:48 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							8698974f11 
							
						 
					 
					
						
						
							
							Rename hocrTransform -> hocrtransform  
						
						 
						
						
						
						
					 
					
						2015-02-20 16:47:36 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							f2c79c4341 
							
						 
					 
					
						
						
							
							Convert hocrtransform to py3  
						
						 
						
						
						
						
					 
					
						2015-02-20 16:38:24 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							4966d1346b 
							
						 
					 
					
						
						
							
							Module marker for src folder  
						
						 
						
						
						
						
					 
					
						2015-02-20 15:43:05 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							4a9337f757 
							
						 
					 
					
						
						
							
							PEP8  
						
						 
						
						
						
						
					 
					
						2015-02-20 15:42:06 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							db311fb6a2 
							
						 
					 
					
						
						
							
							Add support for -b (skip big pages)  
						
						 
						
						
						
						
					 
					
						2015-02-20 15:26:33 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							02c1dcec8e 
							
						 
					 
					
						
						
							
							Remove filenames from .hocr files  
						
						 
						
						
						
						
						As documented, Tesseract does not escape the filename when inserting it
into .hocr, potentially creating an invalid XML file as a result. Since
there is no use for the title, regex it and nuke it. 
						
						
					 
					
						2015-02-13 13:41:14 -08:00  
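A minimal sketch of the kind of regex clean-up described in the commit above. It assumes the file name appears inside the page element's `title` attribute in the form `image "<name>"`; that layout, the pattern, and the helper name `strip_image_filename` are assumptions made for illustration, not the code from this commit.

```python
import re

# Matches: image "<anything without a double quote>"
# The two groups keep the surrounding text so only the name is dropped.
IMAGE_NAME = re.compile(r'(\bimage\s+")[^"]*(")')


def strip_image_filename(hocr_text):
    """Blank out the unescaped image file name in hOCR markup."""
    return IMAGE_NAME.sub(r"\1\2", hocr_text)


# The apostrophe and ampersand in this name would make the attribute invalid XML.
raw = """<div class='ocr_page' title='image "it's & co.png"; bbox 0 0 10 10'>"""
cleaned = strip_image_filename(raw)
print(cleaned)
```

A file name that itself contains a double quote would defeat this pattern, so a real implementation may need a stricter or more defensive match.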
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							52dc74d3ce 
							
						 
					 
					
						
						
							
							Support Tesseract 3.03 quirk: .html vs .hocr extension  
						
						 
						
						
						
						
					 
					
						2015-02-11 10:24:10 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							cc2af2bc15 
							
						 
					 
					
						
						
							
							Convert the final image to a JPEG if the original image was a JPEG  
						
						 
						
						
						
						
Of course, this introduces recompression artifacts, and is unnecessary
if no options are given that modify the final image (no -d, -c, -i).
But rather than worry about that, it would be better to ultimately find
a way to combine the original PDF page with the output PDF text in the
case where we want no changes to the original. This is good enough for
now.

The better option can apparently be achieved using pdftk background, or
probably better, PyPDF2's merge. If Tesseract PDF generation is used
then we need a way to remove the image. Tesseract PDF generation at 3.03
does layout better (I think) and also properly encodes the hidden layer,
which is less likely to give display issues (I think).
						
						
					 
					
						2015-02-11 10:23:45 -08:00  
					
					
						 
						
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jim Barlow 
							
						 
					 
					
						
						
						
						
							
						
						
							638c6db05d 
							
						 
					 
					
						
						
							
							Use the appropriate PNG renderer given the types of image present  
						
						 
						
						
						
						
					 
					
						2015-02-11 03:32:00 -08:00