2895 Commits

Author SHA1 Message Date
James R. Barlow
d3088829af More packaging changes: move jhove, fix console script 2015-07-26 01:52:08 -07:00
James R. Barlow
9aaaba1714 Packaging stuff 2015-07-25 23:45:13 -07:00
Jim Barlow
9adb0d696f Prepare for Python packaging - move to ocrmypdf folder 2015-07-25 18:22:04 -07:00
Jim Barlow
c270f1ba5f Update release notes so far 2015-07-25 18:18:37 -07:00
Jim Barlow
7b255b575a Metadata override from command line 2015-07-25 18:12:25 -07:00
Jim Barlow
d7a9f3a2ab Transfer Unicode document information from input PDF to output PDF
What a pain getting Unicode right, but there it is.

I cannot find anything to confirm that it is acceptable to put the PDF/A
definition file at the end of the Ghostscript inputs.  I did this because
Ghostscript seems to copy document info from the last document on the
list so reportlab's information "wins" in normal order, so it fixes that
issue, and reportlab 'helpfully' fills in all of those fields even if it
does not have information.

It could also work to pass document information along to reportlab, and
set it in each output PDF: .debug.pdf, .rendered.pdf, and .page.pdf to
ensure that whatever page is last in the pipeline has the right
information. Or perhaps it's possible to write a Postscript trailer that
overwrites any previous docinfo with no side effects, but I can't find
any information on how to do that.  I don't think it's worth pursuing
unless this arrangement causes some problem with PDF/A generation.

On a minor note, Jhove misreads the way I have encoded the strings in
producing its validation log.  It reads them as UTF-16 little endian, so
will tend to produce a string of Asian characters in place of the real
data.
2015-07-25 18:05:25 -07:00
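The Jhove symptom described above is consistent with a byte-order mix-up. A minimal sketch, assuming the document info strings are written as UTF-16 big endian with a leading byte order mark (the form the PDF specification uses for non-ASCII text strings). The helper name `encode_pdf_text_string` is hypothetical and is not taken from the ocrmypdf source:

```python
import codecs


def encode_pdf_text_string(text):
    """Hypothetical helper: BOM followed by UTF-16BE, as used for PDF text strings."""
    return codecs.BOM_UTF16_BE + text.encode('utf-16-be')


raw = encode_pdf_text_string("Title")

# Correct reading: skip the two BOM bytes and decode as big endian.
correct = raw[2:].decode('utf-16-be')

# Misreading the same bytes as little endian swaps each byte pair,
# which moves plain ASCII letters into the CJK ideograph range.
misread = raw[2:].decode('utf-16-le')
```

Here `correct` round-trips to the original text, while every character of `misread` falls in the CJK Unified Ideographs block, which matches the "string of Asian characters" seen in the validation log.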
Jim Barlow
abf2e7e9bb Copy document metadata from source document into output (untested)
This works for ASCII only; will do Unicode version.
2015-07-25 15:31:02 -07:00
Jim Barlow
72e5fa9ba0 Reimplement debug pages 2015-07-25 14:14:02 -07:00
Jim Barlow
32c1078d2c Reimplement skip text pages 2015-07-25 14:13:32 -07:00
Jim Barlow
133f901a69 Change @subdivide to @split
@split is for "1 to many" operations, so it's the right tool for this
case.
2015-07-25 02:58:34 -07:00
Jim Barlow
42cd683ec0 Try to make pdfinfo less obnoxious by not printing too many decimals 2015-07-25 02:47:59 -07:00
Jim Barlow
151eb05377 For now, unpaper is the only deskew provider 2015-07-25 01:46:16 -07:00
Jim Barlow
16177d0a52 Remove ability to override temporary (working) folder
Little point to this feature - on most platforms the environment
variable can be overridden if desired to set a new root location.

At the same time, this change deletes all of the results on failure,
which removes the ability to resume a partially executed pipeline.  If
-k is provided then the temporary files will survive but there's no way
to resume from them.  Because resuming doesn't really work anyway and
would only be useful to users experiencing very specific problems, this
is probably not worth it, so no major loss.  The intent of -k is to
assist debugging.
2015-07-25 01:45:26 -07:00
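The environment variable override mentioned above can be illustrated with Python's standard `tempfile` module, which consults `TMPDIR` (then `TEMP` and `TMP`) when choosing its root. This is a sketch only: the function name `make_work_folder` and the `ocrmypdf.` prefix are assumptions for illustration, not the names used in the commit.

```python
import os
import shutil
import tempfile


def make_work_folder(prefix='ocrmypdf.'):
    """Hypothetical helper: create a private working folder under the temp root."""
    return tempfile.mkdtemp(prefix=prefix)


# Remember the current setting so it can be restored afterwards.
previous = os.environ.get('TMPDIR')

# Simulate a user exporting TMPDIR before launching the program.
custom_root = tempfile.mkdtemp()
os.environ['TMPDIR'] = custom_root
tempfile.tempdir = None  # drop the cached root so TMPDIR is read again

work_folder = make_work_folder()
under_custom_root = os.path.dirname(work_folder) == custom_root

# Clean up and restore the original environment.
shutil.rmtree(custom_root)
if previous is None:
    del os.environ['TMPDIR']
else:
    os.environ['TMPDIR'] = previous
tempfile.tempdir = None
```

With this approach no dedicated command line option is needed: the working folder simply follows whatever temp root the environment specifies.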
Jim Barlow
5ce544289f Automatically try to use all available CPUs 2015-07-25 01:10:14 -07:00
Jim Barlow
77bd35c3c7 Remove duplicate test folder 2015-07-25 01:00:40 -07:00
Jim Barlow
0c5c208db0 Goodbye, so long, farewell, shell... 2015-07-25 00:57:07 -07:00
Jim Barlow
60eb745331 Split selecting final image and render PDF result into separate tasks
Simplifies the logic - one deals with all images, the other deals
with an image and .hocr. Also add JPEG reconversion.
2015-07-25 00:54:00 -07:00
Jim Barlow
9f90b5cb0a Modularize unpaper; get -d and -c working again 2015-07-25 00:22:56 -07:00
Jim Barlow
5adff94545 Remove more dead/old code 2015-07-24 15:41:24 -07:00
Jim Barlow
aa2baabfa9 Implement deskew and clean using unpaper 2015-07-24 15:19:37 -07:00
Jim Barlow
75c2b23efc Cleanup externals 2015-07-24 02:01:19 -07:00
Jim Barlow
6451017962 Implement oversample 2015-07-24 01:56:44 -07:00
Jim Barlow
0f857a6a34 Put .rendered.pdf files into temp folder 2015-07-24 01:56:19 -07:00
Jim Barlow
7638a88a6a Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does 2015-07-24 01:55:54 -07:00
Jim Barlow
bed12d2021 Remove 'pdftoppm' renderer
Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as the only open source tool that can produce
a PDF/A file, while Poppler could be removed.  pdftoppm has awkward
syntax with some special handling needed for different versions.  I have
found isolated rendering bugs with pdftoppm as well.

With that, I'm removing support for multiple rasterizers.

A minor advantage of pdftoppm is that its code produced JPEGs where
possible, but this can be achieved with gs.
2015-07-24 01:35:33 -07:00
Jim Barlow
587569fcb6 Tidy up 2015-07-24 01:27:01 -07:00
Jim Barlow
8c0dc9a06d Platform independent search for iccprofiles for PDF/A 2015-07-24 01:18:46 -07:00
Jim Barlow
289e4025ad First successful PDF/A produced by new pipeline 2015-07-23 23:28:32 -07:00
Jim Barlow
5476eafe4c Rasterize PDF pages and generate .hocr files 2015-07-23 23:09:29 -07:00
Jim Barlow
df32f283cd Language checking 2015-07-23 18:38:59 -07:00
Jim Barlow
68ecaac9cc Add tesseract version check 2015-07-23 17:06:00 -07:00
Jim Barlow
cffd4623ca Add PDF/A validation 2015-07-23 17:05:34 -07:00
Jim Barlow
6dc2782e80 Can now generate PDF/A files, multipage and single page 2015-07-23 04:57:31 -07:00
Jim Barlow
5df187c086 Wrap a proxy around pdfinfo block so it can be passed around processes 2015-07-23 03:49:30 -07:00
Jim Barlow
7fd172e41e Get rid of chdir, replace deprecated @split with @subdivide 2015-07-23 03:09:03 -07:00
Jim Barlow
619528a1b5 Try a method for passing along the pdfinfo struct 2015-07-23 02:39:42 -07:00
Jim Barlow
596d468c14 Reinstate WrapperLogger with more multiprocessing fixes 2015-07-23 02:26:09 -07:00
Jim Barlow
eddbf1060a diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py
index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ import ruffus.cmdline as cmdline
 from .hocrtransform import HocrTransform

 import warnings
+import multiprocessing

 warnings.simplefilter('ignore', pypdf.utils.PdfReadWarning)

@@ -96,7 +97,7 @@ debugging.add_argument(
     '-k', '--keep-temporary-files', action='store_true',
     help="keep temporary files (helpful for debugging)")
 debugging.add_argument(
-    '-g' ,'--debug-rendering', action='store_true',
+    '-g', '--debug-rendering', action='store_true',
     help="render each page twice with debug information on second page")

@@ -106,51 +107,19 @@ if not options.temp_folder:
     options.temp_folder = 'tmp'

-_logger, _logger_mutex = cmdline.setup_logging(__name__, options.log_file,
-                                               options.verbose)
+log, log_mutex = cmdline.setup_logging(__name__, options.log_file,
+                                       options.verbose)

-class WrappedLogger:
-
-    def __init__(self, my_logger, my_mutex):
-        self.logger = my_logger
-        self.mutex = my_mutex
-
-    def log(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.log(*args, **kwargs)
-
-    def debug(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.debug(*args, **kwargs)
-
-    def info(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.info(*args, **kwargs)
-
-    def warning(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.warning(*args, **kwargs)
-
-    def error(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.error(*args, **kwargs)
-
-    def critical(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.critical(*args, **kwargs)
-
-log = WrappedLogger(_logger, _logger_mutex)
-
-
-def re_symlink(input_file, soft_link_name, log=log):
+def re_symlink(input_file, soft_link_name, log, mutex):
     """
     Helper function: relinks soft symbolic link if necessary
     """
     if input_file == soft_link_name:
-        log.debug("Warning: No symbolic link made. You are using " +
-                     "the original data directory as the working directory.")
+        with mutex:
+            log.debug("Warning: No symbolic link made. You are using " +
+                      "the original data directory as the working directory.")
         return

@@ -161,12 +130,14 @@ def re_symlink(input_file, soft_link_name, log=log):
         try:
             os.unlink(soft_link_name)
         except:
-            log.debug("Can't unlink %s" % (soft_link_name))
+            with mutex:
+                log.debug("Can't unlink %s" % (soft_link_name))

     if not os.path.exists(input_file):
         raise Exception("trying to create a broken symlink to %s" % input_file)

-    log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
+    with mutex:
+        log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))

     os.symlink(
2015-07-23 02:22:12 -07:00
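The pattern in the diff above, passing the logger and its mutex explicitly and taking the mutex around each call, can be sketched in isolation. The stand-ins here are assumptions: a plain `logging.Logger` and a `threading.Lock` replace the pair the diff obtains from `cmdline.setup_logging`, and `describe_symlink` and `ListHandler` are hypothetical names used only for illustration.

```python
import logging
import threading


class ListHandler(logging.Handler):
    """Collect log messages in a list so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def describe_symlink(input_file, soft_link_name, log, mutex):
    """Hypothetical helper: log the planned symlink while holding the mutex."""
    with mutex:
        log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))


log = logging.getLogger('sketch')
log.setLevel(logging.DEBUG)
log.propagate = False  # keep the sketch's output out of the root logger
handler = ListHandler()
log.addHandler(handler)
mutex = threading.Lock()

describe_symlink('input.pdf', 'work/input.pdf', log, mutex)
```

Compared with the removed wrapper class, this keeps the locking visible at each call site at the cost of an extra parameter on every function that logs.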
Jim Barlow
33731a6864 Move pageinfo code out of the pipeline 2015-07-23 02:17:13 -07:00
Jim Barlow
0c36cd2e24 Fix errors related to use of working directory
Mainly work around the lack of @split(...output_dir) in ruffus
2015-07-23 01:16:05 -07:00
Jim Barlow
5cef1be26d New pipeline runs, splits pages 2015-07-22 22:58:13 -07:00
Jim Barlow
e89f482c3d Fixes from early testing of new pipeline 2015-07-22 22:51:38 -07:00
Jim Barlow
fe3e40305d Learn to split PDF into pages 2015-07-22 22:46:00 -07:00
Jim Barlow
a92b5ceb6b Begin unifying main script and page script 2015-07-22 22:30:00 -07:00
Jim Barlow
0e7e7d8437 Suppress the xref warning for now 2015-07-22 11:24:14 -07:00
Jim Barlow
f47fa98f33 Fixes to colorspace and other inquiries 2015-07-22 11:24:06 -07:00
Jim Barlow
d3d5879911 Replace pdfimages -list call to poppler with PyPDF test for image
The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the parse string. It appears to trigger exponential
behavior in the underlying regex. In any case, replacing subprocesses
with native Python is usually better.
2015-07-22 11:22:12 -07:00
Jim Barlow
b2168e11db Require Py3 for tests 2015-07-22 11:21:33 -07:00
Jim Barlow
6d5d8be708 New test: check skew 2015-07-22 04:00:59 -07:00
Jim Barlow
ce2dbdf372 Add another test 2015-07-22 03:16:19 -07:00