Jim Barlow
aa2baabfa9
Implement deskew and clean using unpaper
2015-07-24 15:19:37 -07:00
Jim Barlow
75c2b23efc
Cleanup externals
2015-07-24 02:01:19 -07:00
Jim Barlow
6451017962
Implement oversample
2015-07-24 01:56:44 -07:00
Jim Barlow
0f857a6a34
Put .rendered.pdf files into temp folder
2015-07-24 01:56:19 -07:00
Jim Barlow
7638a88a6a
Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does
2015-07-24 01:55:54 -07:00
Jim Barlow
bed12d2021
Remove 'pdftoppm' renderer
...
Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is
also a hard dependency, as the only open source tool that can produce
a PDF/A file, while Poppler could be removed. pdftoppm has awkward
syntax with some special handling needed for different versions. I have
found isolated rendering bugs with pdftoppm as well.
With that, I'm removing support for multiple rasterizers.
A minor advantage of pdftoppm is that its code produced JPEGs where
possible, but this can be achieved with gs.
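A minimal sketch of what single-page rasterization with gs might look like, assuming Ghostscript is on the PATH. The helper name and its parameters are hypothetical and not taken from this repository; the switches are standard Ghostscript options, and `jpeg` and `png16m` are two of its stock output devices.

```python
def ghostscript_rasterize_args(input_pdf, output_file, pageno,
                               dpi=300, device='png16m'):
    """Build a gs command line that renders one page of a PDF to an image.

    Hypothetical helper for illustration only. Pass device='jpeg' to get
    JPEG output instead of 24-bit PNG.
    """
    return [
        'gs',
        '-dQUIET', '-dBATCH', '-dNOPAUSE',   # run non-interactively
        '-sDEVICE=%s' % device,              # output format
        '-r%d' % dpi,                        # resolution in dots per inch
        '-dFirstPage=%d' % pageno,           # render only this page
        '-dLastPage=%d' % pageno,
        '-sOutputFile=%s' % output_file,
        input_pdf,
    ]
```

The list can be handed to `subprocess.check_call` unchanged; building it separately keeps the command easy to log and to test without invoking gs.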
2015-07-24 01:35:33 -07:00
Jim Barlow
587569fcb6
Tidy up
2015-07-24 01:27:01 -07:00
Jim Barlow
8c0dc9a06d
Platform independent search for iccprofiles for PDF/A
2015-07-24 01:18:46 -07:00
Jim Barlow
289e4025ad
First successful PDF/A produced by new pipeline
2015-07-23 23:28:32 -07:00
Jim Barlow
5476eafe4c
Rasterize PDF pages and generate .hocr files
2015-07-23 23:09:29 -07:00
Jim Barlow
df32f283cd
Language checking
2015-07-23 18:38:59 -07:00
Jim Barlow
68ecaac9cc
Add tesseract version check
2015-07-23 17:06:00 -07:00
Jim Barlow
cffd4623ca
Add PDF/A validation
2015-07-23 17:05:34 -07:00
Jim Barlow
6dc2782e80
Can now generate PDF/A files, multipage and single page
2015-07-23 04:57:31 -07:00
Jim Barlow
5df187c086
Wrap a proxy around pdfinfo block so it can be passed around processes
2015-07-23 03:49:30 -07:00
Jim Barlow
7fd172e41e
Get rid of chdir, replace deprecated @split with @subdivide
2015-07-23 03:09:03 -07:00
Jim Barlow
619528a1b5
Try a method for passing along the pdfinfo struct
2015-07-23 02:39:42 -07:00
Jim Barlow
596d468c14
Reinstate WrapperLogger with more multiprocessing fixes
2015-07-23 02:26:09 -07:00
Jim Barlow
eddbf1060a
diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py
...
index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ import ruffus.cmdline as cmdline
 from .hocrtransform import HocrTransform
 import warnings
+import multiprocessing
 warnings.simplefilter('ignore', pypdf.utils.PdfReadWarning)
@@ -96,7 +97,7 @@ debugging.add_argument(
     '-k', '--keep-temporary-files', action='store_true',
     help="keep temporary files (helpful for debugging)")
 debugging.add_argument(
-    '-g' ,'--debug-rendering', action='store_true',
+    '-g', '--debug-rendering', action='store_true',
     help="render each page twice with debug information on second page")
@@ -106,51 +107,19 @@ if not options.temp_folder:
     options.temp_folder = 'tmp'
-_logger, _logger_mutex = cmdline.setup_logging(__name__, options.log_file,
-                                               options.verbose)
+log, log_mutex = cmdline.setup_logging(__name__, options.log_file,
+                                       options.verbose)
-class WrappedLogger:
-
-    def __init__(self, my_logger, my_mutex):
-        self.logger = my_logger
-        self.mutex = my_mutex
-
-    def log(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.log(*args, **kwargs)
-
-    def debug(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.debug(*args, **kwargs)
-
-    def info(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.info(*args, **kwargs)
-
-    def warning(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.warning(*args, **kwargs)
-
-    def error(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.error(*args, **kwargs)
-
-    def critical(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.critical(*args, **kwargs)
-
-log = WrappedLogger(_logger, _logger_mutex)
-
-
-def re_symlink(input_file, soft_link_name, log=log):
+def re_symlink(input_file, soft_link_name, log, mutex):
     """
     Helper function: relinks soft symbolic link if necessary
     """
     if input_file == soft_link_name:
-        log.debug("Warning: No symbolic link made. You are using " +
-                  "the original data directory as the working directory.")
+        with mutex:
+            log.debug("Warning: No symbolic link made. You are using " +
+                      "the original data directory as the working directory.")
         return
@@ -161,12 +130,14 @@ def re_symlink(input_file, soft_link_name, log=log):
         try:
             os.unlink(soft_link_name)
         except:
-            log.debug("Can't unlink %s" % (soft_link_name))
+            with mutex:
+                log.debug("Can't unlink %s" % (soft_link_name))
     if not os.path.exists(input_file):
         raise Exception("trying to create a broken symlink to %s" % input_file)
-    log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
+    with mutex:
+        log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
     os.symlink(
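The pattern in the diff above, where each log call takes the mutex explicitly, can be sketched in isolation. This is only an illustration: the standard library `logging` and `threading.Lock` stand in for the proxy logger and mutex that ruffus's `cmdline.setup_logging` returns, and the helper name `debug_locked` is hypothetical.

```python
import logging
import threading


def debug_locked(log, mutex, message):
    """Emit a debug message while holding the shared mutex.

    Holding the lock for the duration of the call keeps messages from
    concurrent workers from interleaving. `log` and `mutex` are passed in
    explicitly, mirroring the new re_symlink() signature.
    """
    with mutex:
        log.debug(message)
```

A caller would pass the same logger and mutex pair to every task, for example `debug_locked(log, log_mutex, "Can't unlink %s" % name)`.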
2015-07-23 02:22:12 -07:00
Jim Barlow
33731a6864
Move pageinfo code out of the pipeline
2015-07-23 02:17:13 -07:00
Jim Barlow
0c36cd2e24
Fix errors related to use of working directory
...
Mainly work around the lack of @split(...output_dir) in ruffus
2015-07-23 01:16:05 -07:00
Jim Barlow
5cef1be26d
New pipeline runs, splits pages
2015-07-22 22:58:13 -07:00
Jim Barlow
e89f482c3d
Fixes from early testing of new pipeline
2015-07-22 22:51:38 -07:00
Jim Barlow
fe3e40305d
Learn to split PDF into pages
2015-07-22 22:46:00 -07:00
Jim Barlow
a92b5ceb6b
Begin unifying main script and page script
2015-07-22 22:30:00 -07:00
Jim Barlow
0e7e7d8437
Suppress the xref warning for now
2015-07-22 11:24:14 -07:00
Jim Barlow
f47fa98f33
Fixes to colorspace and other inquiries
2015-07-22 11:24:06 -07:00
Jim Barlow
d3d5879911
Replace pdfimages -list call to poppler with PyPDF test for image
...
The immediate reason for doing this is that (newer?) versions of parse()
seem to choke on the parse string. It appears to trigger exponential
behavior in the underlying regex. In any case, replacing subprocesses
with native Python is usually better.
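A rough sketch of the kind of native check this describes, not the code from this commit. It assumes the page's /Resources dictionary is available as a mapping, which is how PyPDF exposes it; here plain dicts stand in for PyPDF objects, and the function name is hypothetical. With real PyPDF objects, indirect references may need resolving first.

```python
def resources_have_image(resources):
    """Return True if a /Resources mapping contains an image XObject.

    In a PDF, images placed on a page are listed under /Resources
    /XObject, and each image dictionary has /Subtype set to /Image.
    """
    xobjects = resources.get('/XObject', {})
    for name in xobjects:
        if xobjects[name].get('/Subtype') == '/Image':
            return True
    return False
```

Form XObjects (/Subtype /Form) can themselves contain images; this sketch does not descend into them.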
2015-07-22 11:22:12 -07:00
Jim Barlow
b2168e11db
Require Py3 for tests
2015-07-22 11:21:33 -07:00
Jim Barlow
6d5d8be708
New test: check skew
2015-07-22 04:00:59 -07:00
Jim Barlow
ce2dbdf372
Add another test
2015-07-22 03:16:19 -07:00
Jim Barlow
ec8a35a7a6
Basic test cases
2015-07-22 02:59:25 -07:00
Jim Barlow
f6577c22c3
Complete wrapping of logger/logger_mutex
2015-07-22 02:57:13 -07:00
Jim Barlow
43d6c03093
Implement oversampling in ocrpage.py
2015-03-27 18:32:55 -07:00
Jim Barlow
1870f116bb
More consistent spacing
2015-03-24 23:05:42 -07:00
Jim Barlow
8b87def013
Don't presume two jobs
2015-03-24 23:04:49 -07:00
Jim Barlow
de599d97b5
Tidy up readme
2015-03-24 23:04:33 -07:00
Jim Barlow
5d7e6b45c4
Cleanup logger
2015-03-24 22:46:33 -07:00
Jim Barlow
c6091bcfe1
Change python2 -> python3 for readlink()
2015-03-24 22:36:13 -07:00
Jim Barlow
466a8a1318
It's now py3 that uses lxml, reportlab
2015-03-19 17:12:32 -07:00
Jim Barlow
a99ba3b696
Add rudimentary support for combining OCR layer with existing content
...
It appears to be very fragile due to weaknesses in PyPDF. Better
option is probably to use pdftk's watermark feature.
2015-03-10 14:28:38 -07:00
Jim Barlow
9229f7c6cc
Add option to render text as invisible OCR text
...
Prior to this change, hocrtransform would render printable text (black
on white) and then a fully opaque image on top of the text. According to
the PDF spec, text that is the output of OCR should be marked invisible,
so that PDF viewers /know/ it's OCR output in a document that might mix
OCR and text overlays. Another benefit is that PDF viewers would know
to skip rendering text if they are not smart enough to figure out the
image will completely overwrite it.
However, for debug, visible text is nice, so retain it as an option.
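To illustrate the distinction at the PDF level: the text rendering mode is set with the `Tr` operator, where mode 0 fills glyphs (visible) and mode 3 neither fills nor strokes them (invisible). The sketch below builds a bare content-stream fragment by hand; it is not how hocrtransform does it, and the function name, the `/F1` font resource and the default size are assumptions for the example.

```python
def text_stream(text, x, y, font='/F1', size=12, invisible=True):
    """Return a PDF content-stream fragment that draws one text string.

    invisible=True selects text rendering mode 3 (neither fill nor
    stroke), the mode used for OCR text; invisible=False selects mode 0
    (fill), which is handy when debugging text placement.
    """
    mode = 3 if invisible else 0
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = (text.replace('\\', '\\\\')
                   .replace('(', '\\(')
                   .replace(')', '\\)'))
    return 'BT %s %d Tf %d Tr %d %d Td (%s) Tj ET' % (
        font, size, mode, x, y, escaped)
```

Switching `invisible` changes only the `Tr` operand, so the same layout code serves both the production and the debug rendering.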
2015-02-22 12:43:27 -08:00
Jim Barlow
bf114bb188
Clean up pixel transform logic with namedtuple
2015-02-21 14:14:34 -08:00
Jim Barlow
b8eed2f861
More PEP8/lint
2015-02-21 13:00:46 -08:00
Jim Barlow
ccb1e347be
Call HocrTransform directly instead of through a subprocess
2015-02-20 17:20:48 -08:00
Jim Barlow
8698974f11
Rename hocrTransform -> hocrtransform
2015-02-20 16:47:36 -08:00
Jim Barlow
f2c79c4341
Convert hocrtransform to py3
2015-02-20 16:38:24 -08:00
Jim Barlow
4966d1346b
Module marker for src folder
2015-02-20 15:43:05 -08:00
Jim Barlow
4a9337f757
PEP8
2015-02-20 15:42:06 -08:00
Jim Barlow
db311fb6a2
Add support for -b (skip big pages)
2015-02-20 15:26:33 -08:00