2676 Commits

Author SHA1 Message Date
Jim Barlow
aa2baabfa9 Implement deskew and clean using unpaper 2015-07-24 15:19:37 -07:00
Jim Barlow
75c2b23efc Cleanup externals 2015-07-24 02:01:19 -07:00
Jim Barlow
6451017962 Implement oversample 2015-07-24 01:56:44 -07:00
Jim Barlow
0f857a6a34 Put .rendered.pdf files into temp folder 2015-07-24 01:56:19 -07:00
Jim Barlow
7638a88a6a Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does 2015-07-24 01:55:54 -07:00
Jim Barlow
bed12d2021 Remove 'pdftoppm' renderer
Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as the only open source tool that can produce
a PDF/A file, while Poppler could be removed.  pdftoppm has awkward
syntax with some special handling needed for different versions.  I have
found isolated rendering bugs with pdftoppm as well.

With that, I'm removing support for multiple rasterizers.

A minor advantage of pdftoppm is that its code produced JPEGs where
possible, but this can be achieved with gs.
2015-07-24 01:35:33 -07:00
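
The commit message above says gs can rasterize pages and also emit JPEGs. A minimal sketch of what such a call might look like follows; it is not the code from this commit. The helper names `gs_rasterize_args` and `rasterize_page` are hypothetical, and the defaults (300 dpi, the `jpeg` device) are assumptions chosen for illustration. The Ghostscript switches themselves are standard ones.

```python
import subprocess


def gs_rasterize_args(input_pdf, output_file, page, dpi=300, device="jpeg"):
    """Build a Ghostscript argument list that renders one page to an image.

    Hypothetical helper: the name, defaults, and argument order are
    assumptions for illustration, not taken from the repository.
    """
    return [
        "gs",
        "-dQUIET", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        f"-sDEVICE={device}",          # e.g. jpeg, png16m, pnggray
        f"-dFirstPage={page}",         # render only the requested page
        f"-dLastPage={page}",
        f"-r{dpi}",                    # output resolution in dpi
        f"-sOutputFile={output_file}",
        input_pdf,
    ]


def rasterize_page(input_pdf, output_file, page, dpi=300, device="jpeg"):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(
        gs_rasterize_args(input_pdf, output_file, page, dpi, device),
        check=True,
    )
```

Keeping the argument builder separate from the `subprocess.run` call makes the command line easy to inspect or log without Ghostscript installed.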
Jim Barlow
587569fcb6 Tidy up 2015-07-24 01:27:01 -07:00
Jim Barlow
8c0dc9a06d Platform independent search for iccprofiles for PDF/A 2015-07-24 01:18:46 -07:00
Jim Barlow
289e4025ad First successful PDF/A produced by new pipeline 2015-07-23 23:28:32 -07:00
Jim Barlow
5476eafe4c Rasterize PDF pages and generate .hocr files 2015-07-23 23:09:29 -07:00
Jim Barlow
df32f283cd Language checking 2015-07-23 18:38:59 -07:00
Jim Barlow
68ecaac9cc Add tesseract version check 2015-07-23 17:06:00 -07:00
Jim Barlow
cffd4623ca Add PDF/A validation 2015-07-23 17:05:34 -07:00
Jim Barlow
6dc2782e80 Can now generate PDF/A files, multipage and single page 2015-07-23 04:57:31 -07:00
Jim Barlow
5df187c086 Wrap a proxy around pdfinfo block so it can be passed around processes 2015-07-23 03:49:30 -07:00
Jim Barlow
7fd172e41e Get rid of chdir, replace deprecated @split with @subdivide 2015-07-23 03:09:03 -07:00
Jim Barlow
619528a1b5 Try a method for passing along the pdfinfo struct 2015-07-23 02:39:42 -07:00
Jim Barlow
596d468c14 Reinstate WrapperLogger with more multiprocessing fixes 2015-07-23 02:26:09 -07:00
Jim Barlow
eddbf1060a diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py
index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ import ruffus.cmdline as cmdline
 from .hocrtransform import HocrTransform

 import warnings
+import multiprocessing

 warnings.simplefilter('ignore', pypdf.utils.PdfReadWarning)

@@ -96,7 +97,7 @@ debugging.add_argument(
     '-k', '--keep-temporary-files', action='store_true',
     help="keep temporary files (helpful for debugging)")
 debugging.add_argument(
-    '-g' ,'--debug-rendering', action='store_true',
+    '-g', '--debug-rendering', action='store_true',
     help="render each page twice with debug information on second page")

@@ -106,51 +107,19 @@ if not options.temp_folder:
     options.temp_folder = 'tmp'

-_logger, _logger_mutex = cmdline.setup_logging(__name__, options.log_file,
-                                               options.verbose)
+log, log_mutex = cmdline.setup_logging(__name__, options.log_file,
+                                       options.verbose)

-class WrappedLogger:
-
-    def __init__(self, my_logger, my_mutex):
-        self.logger = my_logger
-        self.mutex = my_mutex
-
-    def log(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.log(*args, **kwargs)
-
-    def debug(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.debug(*args, **kwargs)
-
-    def info(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.info(*args, **kwargs)
-
-    def warning(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.warning(*args, **kwargs)
-
-    def error(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.error(*args, **kwargs)
-
-    def critical(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.critical(*args, **kwargs)
-
-log = WrappedLogger(_logger, _logger_mutex)
-
-
-def re_symlink(input_file, soft_link_name, log=log):
+def re_symlink(input_file, soft_link_name, log, mutex):
     """
     Helper function: relinks soft symbolic link if necessary
     """
     if input_file == soft_link_name:
-        log.debug("Warning: No symbolic link made. You are using " +
-                     "the original data directory as the working directory.")
+        with mutex:
+            log.debug("Warning: No symbolic link made. You are using " +
+                      "the original data directory as the working directory.")
         return

@@ -161,12 +130,14 @@ def re_symlink(input_file, soft_link_name, log=log):
         try:
             os.unlink(soft_link_name)
         except:
-            log.debug("Can't unlink %s" % (soft_link_name))
+            with mutex:
+                log.debug("Can't unlink %s" % (soft_link_name))

     if not os.path.exists(input_file):
         raise Exception("trying to create a broken symlink to %s" % input_file)

-    log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
+    with mutex:
+        log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))

     os.symlink(
2015-07-23 02:22:12 -07:00
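
The diff above drops the `WrappedLogger` class and instead passes the logger and its mutex to each helper, which takes the lock around every log call. A self-contained sketch of that pattern follows. It assumes a plain `threading.Lock` stands in for the mutex returned by `cmdline.setup_logging`; `ListHandler` and `describe_symlink` are hypothetical names used only to make the behavior observable.

```python
import logging
import threading


class ListHandler(logging.Handler):
    """Collects log messages in memory so the example can be checked."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def describe_symlink(input_file, soft_link_name, log, mutex):
    """Log the intended symlink while holding the shared mutex."""
    with mutex:
        log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))


# Stand-ins for the (logger, mutex) pair; in the diff these come from
# ruffus' cmdline.setup_logging, which is not reproduced here.
log = logging.getLogger("mutex_sketch")
log.setLevel(logging.DEBUG)
log.propagate = False
handler = ListHandler()
log.addHandler(handler)
mutex = threading.Lock()

describe_symlink("a.pdf", "tmp/a.pdf", log, mutex)
```

Passing the pair explicitly keeps each function free of module-level state, at the cost of repeating `with mutex:` at every call site.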
Jim Barlow
33731a6864 Move pageinfo code out of the pipeline 2015-07-23 02:17:13 -07:00
Jim Barlow
0c36cd2e24 Fix errors related to use of working directory
Mainly workaround lack of @split(...output_dir) in ruffus
2015-07-23 01:16:05 -07:00
Jim Barlow
5cef1be26d New pipeline runs, splits pages 2015-07-22 22:58:13 -07:00
Jim Barlow
e89f482c3d Fixes from early testing of new pipeline 2015-07-22 22:51:38 -07:00
Jim Barlow
fe3e40305d Learn to split PDF into pages 2015-07-22 22:46:00 -07:00
Jim Barlow
a92b5ceb6b Begin unifying main script and page script 2015-07-22 22:30:00 -07:00
Jim Barlow
0e7e7d8437 Suppress the xref warning for now 2015-07-22 11:24:14 -07:00
Jim Barlow
f47fa98f33 Fixes to colorspace and other inquiries 2015-07-22 11:24:06 -07:00
Jim Barlow
d3d5879911 Replace pdfimages -list call to poppler with PyPDF test for image
The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the parse string. It appears to trigger exponential
behavior in the underlying regex. In any case, replacing subprocesses
with native Python is usually better.
2015-07-22 11:22:12 -07:00
Jim Barlow
b2168e11db Require Py3 for tests 2015-07-22 11:21:33 -07:00
Jim Barlow
6d5d8be708 New test: check skew 2015-07-22 04:00:59 -07:00
Jim Barlow
ce2dbdf372 Add another test 2015-07-22 03:16:19 -07:00
Jim Barlow
ec8a35a7a6 Basic test cases 2015-07-22 02:59:25 -07:00
Jim Barlow
f6577c22c3 Complete wrapping of logger/logger_mutex 2015-07-22 02:57:13 -07:00
Jim Barlow
43d6c03093 Implement oversampling in ocrpage.py 2015-03-27 18:32:55 -07:00
Jim Barlow
1870f116bb More consistent spacing 2015-03-24 23:05:42 -07:00
Jim Barlow
8b87def013 Don't presume two jobs 2015-03-24 23:04:49 -07:00
Jim Barlow
de599d97b5 Tidy up readme 2015-03-24 23:04:33 -07:00
Jim Barlow
5d7e6b45c4 Cleanup logger 2015-03-24 22:46:33 -07:00
Jim Barlow
c6091bcfe1 Change python2 -> python3 for readlink() 2015-03-24 22:36:13 -07:00
Jim Barlow
466a8a1318 It's now py3 that uses lxml, reportlab 2015-03-19 17:12:32 -07:00
Jim Barlow
a99ba3b696 Add rudimentary support for combining OCR layer with existing content
It appears to be very fragile due to weaknesses in PyPDF. Better
option is probably to use pdftk's watermark feature.
2015-03-10 14:28:38 -07:00
Jim Barlow
9229f7c6cc Add option to render text as invisible OCR text
Prior to this change, hocrtransform would render printable text (black
on white) and then a fully opaque image on top of the text. According to
the PDF spec, text that is the output of OCR should be marked invisible,
so that PDF viewers /know/ it's OCR output in a document that might mix
OCR and text overlays. Another benefit is that PDF viewers would know
to skip rendering text if they are not smart enough to figure out the
image will completely overwrite it.

However, for debug, visible text is nice, so retain it as an option.
2015-02-22 12:43:27 -08:00
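
In PDF content streams, invisible text is expressed with the text rendering mode operator `Tr`: mode 3 means the glyphs are neither filled nor stroked, while mode 0 is ordinary filled text. The sketch below builds such a fragment by hand to show the difference. It is not hocrtransform's implementation; the function name `text_stream` and the font resource name `F1` are hypothetical.

```python
def text_stream(text, x, y, font="F1", size=12, invisible=True):
    """Return a PDF content-stream fragment that draws one line of text.

    Rendering mode 3 (neither fill nor stroke) makes the text invisible
    but still selectable; mode 0 fills the glyphs normally, which is
    useful for debugging. The font resource name is an assumption.
    """
    mode = 3 if invisible else 0
    # Escape the characters that are special inside a PDF literal string.
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return f"BT /{font} {size} Tf {mode} Tr {x} {y} Td ({escaped}) Tj ET"
```

Calling `text_stream("Hi", 72, 720, invisible=False)` yields the visible variant, matching the debug option the commit retains.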
Jim Barlow
bf114bb188 Clean up pixel transform logic with namedtuple 2015-02-21 14:14:34 -08:00
Jim Barlow
b8eed2f861 More PEP8/lint 2015-02-21 13:00:46 -08:00
Jim Barlow
ccb1e347be Call HocrTransform directly instead of through a subprocess 2015-02-20 17:20:48 -08:00
Jim Barlow
8698974f11 Rename hocrTransform -> hocrtransform 2015-02-20 16:47:36 -08:00
Jim Barlow
f2c79c4341 Convert hocrtransform to py3 2015-02-20 16:38:24 -08:00
Jim Barlow
4966d1346b Module marker for src folder 2015-02-20 15:43:05 -08:00
Jim Barlow
4a9337f757 PEP8 2015-02-20 15:42:06 -08:00
Jim Barlow
db311fb6a2 Add support for -b (skip big pages) 2015-02-20 15:26:33 -08:00