James R. Barlow
d3088829af
More packaging changes: move jhove, fix console script
2015-07-26 01:52:08 -07:00
James R. Barlow
9aaaba1714
Packaging stuff
2015-07-25 23:45:13 -07:00
Jim Barlow
9adb0d696f
Prepare for Python packaging - move to ocrmypdf folder
2015-07-25 18:22:04 -07:00
Jim Barlow
c270f1ba5f
Update release notes so far
2015-07-25 18:18:37 -07:00
Jim Barlow
7b255b575a
Metadata override from command line
2015-07-25 18:12:25 -07:00
Jim Barlow
d7a9f3a2ab
Transfer Unicode document information from input PDF to output PDF
...
What a pain getting Unicode right, but there it is.
I cannot find anything to confirm that it is acceptable to put the PDF/A
definition file at the end of the Ghostscript inputs. I did this because
Ghostscript seems to copy document info from the last document on the
list so reportlab's information "wins" in normal order, so it fixes that
issue, and reportlab 'helpfully' fills in all of those fields even if it
does not have information.
It could also work to pass document information along to reportlab, and
set it in each output PDF: .debug.pdf, .rendered.pdf, and .page.pdf to
ensure that whatever page is last in the pipeline has the right
information. Or perhaps it's possible to write a PostScript trailer that
overwrites any previous docinfo with no side effects, but I can't find
any information on how to do that. I don't think it's worth pursuing
unless this arrangement causes some problem with PDF/A generation.
On a minor note, Jhove misreads the way I have encoded the strings in
producing its validation log. It reads them as UTF-16 little endian, so
will tend to produce a string of Asian characters in place of the real
data.
2015-07-25 18:05:25 -07:00
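The Jhove misreading described in d7a9f3a2ab is what one would expect if the document info strings are stored the way the PDF specification prescribes for text outside PDFDocEncoding: UTF-16 big endian preceded by the byte order mark FE FF. The following is a minimal sketch of that effect using only the Python standard library. The helper name `encode_pdf_text_string` and the sample title are hypothetical and are not taken from ocrmypdf.

```python
import codecs


def encode_pdf_text_string(text):
    """Encode text as UTF-16BE with a leading byte order mark.

    Hypothetical helper for illustration; not ocrmypdf's actual code.
    """
    return codecs.BOM_UTF16_BE + text.encode('utf-16-be')


title = 'Scan'  # sample value, assumed for the demonstration
raw = encode_pdf_text_string(title)

# A correct reader skips the 2-byte BOM and decodes the rest as big endian.
decoded_ok = raw[2:].decode('utf-16-be')

# A reader that wrongly assumes little endian swaps every byte pair,
# so plain ASCII letters land in the CJK code point range.
decoded_wrong = raw[2:].decode('utf-16-le')

print(decoded_ok)
print([hex(ord(ch)) for ch in decoded_wrong])
```

Each ASCII byte such as 0x53 becomes the code point 0x5300 under the wrong byte order, which is why the validation log shows Asian characters instead of the real data.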
Jim Barlow
abf2e7e9bb
Copy document metadata from source document into output (untested)
...
This works for ASCII only; will do Unicode version.
2015-07-25 15:31:02 -07:00
Jim Barlow
72e5fa9ba0
Reimplement debug pages
2015-07-25 14:14:02 -07:00
Jim Barlow
32c1078d2c
Reimplement skip text pages
2015-07-25 14:13:32 -07:00
Jim Barlow
133f901a69
Change @subdivide to @split
...
@split is for "1 to many" operations, so it's the right tool for this
case.
2015-07-25 02:58:34 -07:00
Jim Barlow
42cd683ec0
Try to make pdfinfo less obnoxious by not printing too many decimals
2015-07-25 02:47:59 -07:00
Jim Barlow
151eb05377
For now, unpaper is the only deskew provider
2015-07-25 01:46:16 -07:00
Jim Barlow
16177d0a52
Remove ability to override temporary (working) folder
...
Little point to this feature - on most platforms the environment
variable can be overridden if desired to set a new root location.
At the same time, this change removes the ability to resume a partially
executed pipeline, because all of the results are deleted on failure. If
-k is provided then the temporary files will survive, but there's no way
to resume from them. Because resuming doesn't really work anyway and
would only be useful to users experiencing very specific problems, it is
probably not worth keeping, so no major loss. The intent of -k is to
assist debugging.
2015-07-25 01:45:26 -07:00
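The environment variable override mentioned in 16177d0a52 can be sketched with the standard library: Python's tempfile module consults TMPDIR, then TEMP and TMP, when it first picks a temporary root. The names `make_work_folder` and `run_with_work_folder` and the `ocrmypdf-` prefix are hypothetical. The cleanup logic only approximates the -k behaviour the commit describes and is not the project's actual code.

```python
import os
import shutil
import tempfile


def make_work_folder(prefix='ocrmypdf-'):
    """Create a working folder under the platform's temporary root."""
    return tempfile.mkdtemp(prefix=prefix)


def run_with_work_folder(task, keep_temporary_files=False):
    """Run task(work_folder), then remove the folder unless asked to keep it."""
    work_folder = make_work_folder()
    try:
        return task(work_folder)
    finally:
        if not keep_temporary_files:
            shutil.rmtree(work_folder, ignore_errors=True)


# Override the temporary root the way a user could from the shell.
new_root = tempfile.mkdtemp()
os.environ['TMPDIR'] = new_root
tempfile.tempdir = None  # drop the cached value so TMPDIR is read again

seen = run_with_work_folder(lambda folder: folder)
print(os.path.dirname(seen) == os.path.abspath(new_root))
```

With `keep_temporary_files=True` the folder survives for inspection, but nothing in this sketch can resume from it, which matches the intent of -k as a debugging aid.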
Jim Barlow
5ce544289f
Automatically try to use all available CPUs
2015-07-25 01:10:14 -07:00
Jim Barlow
77bd35c3c7
Remove duplicate test folder
2015-07-25 01:00:40 -07:00
Jim Barlow
0c5c208db0
Goodbye, so long, farewell, shell...
2015-07-25 00:57:07 -07:00
Jim Barlow
60eb745331
Split selecting final image and render PDF result into separate tasks
...
Simplifies the logic - one deals with all images, the other deals
with an image and .hocr. Also add JPEG reconversion.
2015-07-25 00:54:00 -07:00
Jim Barlow
9f90b5cb0a
Modularize unpaper; get -d and -c working again
2015-07-25 00:22:56 -07:00
Jim Barlow
5adff94545
Remove more dead/old code
2015-07-24 15:41:24 -07:00
Jim Barlow
aa2baabfa9
Implement deskew and clean using unpaper
2015-07-24 15:19:37 -07:00
Jim Barlow
75c2b23efc
Cleanup externals
2015-07-24 02:01:19 -07:00
Jim Barlow
6451017962
Implement oversample
2015-07-24 01:56:44 -07:00
Jim Barlow
0f857a6a34
Put .rendered.pdf files into temp folder
2015-07-24 01:56:19 -07:00
Jim Barlow
7638a88a6a
Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does
2015-07-24 01:55:54 -07:00
Jim Barlow
bed12d2021
Remove 'pdftoppm' renderer
...
Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as the only open source tool that can produce
a PDF/A file, while Poppler could be removed. pdftoppm has awkward
syntax with some special handling needed for different versions. I have
found isolated rendering bugs with pdftoppm as well.
With that, I'm removing support for multiple rasterizers.
A minor advantage of pdftoppm is that its code produced JPEGs where
possible, but this can be achieved with gs.
2015-07-24 01:35:33 -07:00
Jim Barlow
587569fcb6
Tidy up
2015-07-24 01:27:01 -07:00
Jim Barlow
8c0dc9a06d
Platform independent search for iccprofiles for PDF/A
2015-07-24 01:18:46 -07:00
Jim Barlow
289e4025ad
First successful PDF/A produced by new pipeline
2015-07-23 23:28:32 -07:00
Jim Barlow
5476eafe4c
Rasterize PDF pages and generate .hocr files
2015-07-23 23:09:29 -07:00
Jim Barlow
df32f283cd
Language checking
2015-07-23 18:38:59 -07:00
Jim Barlow
68ecaac9cc
Add tesseract version check
2015-07-23 17:06:00 -07:00
Jim Barlow
cffd4623ca
Add PDF/A validation
2015-07-23 17:05:34 -07:00
Jim Barlow
6dc2782e80
Can now generate PDF/A files, multipage and single page
2015-07-23 04:57:31 -07:00
Jim Barlow
5df187c086
Wrap a proxy around pdfinfo block so it can be passed around processes
2015-07-23 03:49:30 -07:00
Jim Barlow
7fd172e41e
Get rid of chdir, replace deprecated @split with @subdivide
2015-07-23 03:09:03 -07:00
Jim Barlow
619528a1b5
Try a method for passing along the pdfinfo struct
2015-07-23 02:39:42 -07:00
Jim Barlow
596d468c14
Reinstate WrapperLogger with more multiprocessing fixes
2015-07-23 02:26:09 -07:00
Jim Barlow
eddbf1060a
diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py
...
index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ import ruffus.cmdline as cmdline
from .hocrtransform import HocrTransform
import warnings
+import multiprocessing
warnings.simplefilter('ignore', pypdf.utils.PdfReadWarning)
@@ -96,7 +97,7 @@ debugging.add_argument(
'-k', '--keep-temporary-files', action='store_true',
help="keep temporary files (helpful for debugging)")
debugging.add_argument(
- '-g' ,'--debug-rendering', action='store_true',
+ '-g', '--debug-rendering', action='store_true',
help="render each page twice with debug information on second page")
@@ -106,51 +107,19 @@ if not options.temp_folder:
options.temp_folder = 'tmp'
-_logger, _logger_mutex = cmdline.setup_logging(__name__, options.log_file,
- options.verbose)
+log, log_mutex = cmdline.setup_logging(__name__, options.log_file,
+ options.verbose)
-class WrappedLogger:
-
- def __init__(self, my_logger, my_mutex):
- self.logger = my_logger
- self.mutex = my_mutex
-
- def log(self, *args, **kwargs):
- with self.mutex:
- self.logger.log(*args, **kwargs)
-
- def debug(self, *args, **kwargs):
- with self.mutex:
- self.logger.debug(*args, **kwargs)
-
- def info(self, *args, **kwargs):
- with self.mutex:
- self.logger.info(*args, **kwargs)
-
- def warning(self, *args, **kwargs):
- with self.mutex:
- self.logger.warning(*args, **kwargs)
-
- def error(self, *args, **kwargs):
- with self.mutex:
- self.logger.error(*args, **kwargs)
-
- def critical(self, *args, **kwargs):
- with self.mutex:
- self.logger.critical(*args, **kwargs)
-
-log = WrappedLogger(_logger, _logger_mutex)
-
-
-def re_symlink(input_file, soft_link_name, log=log):
+def re_symlink(input_file, soft_link_name, log, mutex):
"""
Helper function: relinks soft symbolic link if necessary
"""
if input_file == soft_link_name:
- log.debug("Warning: No symbolic link made. You are using " +
- "the original data directory as the working directory.")
+ with mutex:
+ log.debug("Warning: No symbolic link made. You are using " +
+ "the original data directory as the working directory.")
return
@@ -161,12 +130,14 @@ def re_symlink(input_file, soft_link_name, log=log):
try:
os.unlink(soft_link_name)
except:
- log.debug("Can't unlink %s" % (soft_link_name))
+ with mutex:
+ log.debug("Can't unlink %s" % (soft_link_name))
if not os.path.exists(input_file):
raise Exception("trying to create a broken symlink to %s" % input_file)
- log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
+ with mutex:
+ log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
os.symlink(
2015-07-23 02:22:12 -07:00
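The pattern in the eddbf1060a diff, passing a logger and a mutex separately and taking the mutex around each logging call, can be sketched without ruffus. Everything here is an assumption for illustration: a standard `logging.Logger` stands in for whatever `cmdline.setup_logging` returns, `log_debug_locked` and `ListHandler` are hypothetical names, and the demo uses `threading.Lock` to stay in one process, where a lock shareable between processes would be used in a real pipeline.

```python
import logging
import threading


def log_debug_locked(log, mutex, message):
    """Emit one debug message while holding the shared mutex.

    mutex may be any lock usable in a with statement.
    """
    with mutex:
        log.debug(message)


class ListHandler(logging.Handler):
    """Collect messages in a list so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger('locked_logging_sketch')
log.setLevel(logging.DEBUG)
log.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
log.addHandler(handler)

mutex = threading.Lock()  # stand-in for a lock shared between processes
log_debug_locked(log, mutex,
                 "os.symlink(%s, %s)" % ('input.pdf', 'work/input.pdf'))
print(handler.messages)
```

Compared with the removed WrappedLogger class, this keeps the locking visible at each call site at the cost of repeating the `with mutex:` block.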
Jim Barlow
33731a6864
Move pageinfo code out of the pipeline
2015-07-23 02:17:13 -07:00
Jim Barlow
0c36cd2e24
Fix errors related to use of working directory
...
Mainly work around the lack of @split(...output_dir) in ruffus
2015-07-23 01:16:05 -07:00
Jim Barlow
5cef1be26d
New pipeline runs, splits pages
2015-07-22 22:58:13 -07:00
Jim Barlow
e89f482c3d
Fixes from early testing of new pipeline
2015-07-22 22:51:38 -07:00
Jim Barlow
fe3e40305d
Learn to split PDF into pages
2015-07-22 22:46:00 -07:00
Jim Barlow
a92b5ceb6b
Begin unifying main script and page script
2015-07-22 22:30:00 -07:00
Jim Barlow
0e7e7d8437
Suppress the xref warning for now
2015-07-22 11:24:14 -07:00
Jim Barlow
f47fa98f33
Fixes to colorspace and other inquiries
2015-07-22 11:24:06 -07:00
Jim Barlow
d3d5879911
Replace pdfimages -list call to poppler with PyPDF test for image
...
The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the parse string. It appears to trigger exponential
behavior in the underlying regex. In any case, replacing subprocesses
with native Python is usually better.
2015-07-22 11:22:12 -07:00
Jim Barlow
b2168e11db
Require Py3 for tests
2015-07-22 11:21:33 -07:00
Jim Barlow
6d5d8be708
New test: check skew
2015-07-22 04:00:59 -07:00
Jim Barlow
ce2dbdf372
Add another test
2015-07-22 03:16:19 -07:00