OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-11-15 17:44:46 +00:00

Author	SHA1	Message	Date
Jim Barlow	aa2baabfa9	Implement deskew and clean using unpaper	2015-07-24 15:19:37 -07:00
Jim Barlow	75c2b23efc	Cleanup externals	2015-07-24 02:01:19 -07:00
Jim Barlow	6451017962	Implement oversample	2015-07-24 01:56:44 -07:00
Jim Barlow	0f857a6a34	Put .rendered.pdf files into temp folder	2015-07-24 01:56:19 -07:00
Jim Barlow	7638a88a6a	Change 'clean' to 'repair' for clarity since 'clean' is what unpaper does	2015-07-24 01:55:54 -07:00
Jim Barlow	bed12d2021	Remove 'pdftoppm' renderer. Ghostscript is more reliable than Poppler's pdftoppm renderer. gs is also a hard dependency, as the only open source tool that can produce a PDF/A file, while Poppler could be removed. pdftoppm has awkward syntax with some special handling needed for different versions. I have found isolated rendering bugs with pdftoppm as well. With that, I'm removing support for multiple rasterizers. A minor advantage of pdftoppm is that its code produced JPEGs where possible, but this can be achieved with gs.	2015-07-24 01:35:33 -07:00
Jim Barlow	587569fcb6	Tidy up	2015-07-24 01:27:01 -07:00
Jim Barlow	8c0dc9a06d	Platform independent search for iccprofiles for PDF/A	2015-07-24 01:18:46 -07:00
Jim Barlow	289e4025ad	First successful PDF/A produced by new pipeline	2015-07-23 23:28:32 -07:00
Jim Barlow	5476eafe4c	Rasterize PDF pages and generate .hocr files	2015-07-23 23:09:29 -07:00
Jim Barlow	df32f283cd	Language checking	2015-07-23 18:38:59 -07:00
Jim Barlow	68ecaac9cc	Add tesseract version check	2015-07-23 17:06:00 -07:00
Jim Barlow	cffd4623ca	Add PDF/A validation	2015-07-23 17:05:34 -07:00
Jim Barlow	6dc2782e80	Can now generate PDF/A files, multipage and single page	2015-07-23 04:57:31 -07:00
Jim Barlow	5df187c086	Wrap a proxy around pdfinfo block so it can be passed around processes	2015-07-23 03:49:30 -07:00
Jim Barlow	7fd172e41e	Get rid of chdir, replace deprecated @split with @subdivide	2015-07-23 03:09:03 -07:00
Jim Barlow	619528a1b5	Try a method for passing along the pdfinfo struct	2015-07-23 02:39:42 -07:00
Jim Barlow	596d468c14	Reinstate WrapperLogger with more multiprocessing fixes	2015-07-23 02:26:09 -07:00
Jim Barlow	eddbf1060a	diff --git a/src/ocrmypdf.py b/src/ocrmypdf.py
index 68d1591..95afa8f 100755
--- a/src/ocrmypdf.py
+++ b/src/ocrmypdf.py
@@ -24,6 +24,7 @@ import ruffus.cmdline as cmdline
 from .hocrtransform import HocrTransform
 import warnings
+import multiprocessing
 warnings.simplefilter('ignore', pypdf.utils.PdfReadWarning)
@@ -96,7 +97,7 @@ debugging.add_argument(
     '-k', '--keep-temporary-files', action='store_true',
     help="keep temporary files (helpful for debugging)")
 debugging.add_argument(
-    '-g' ,'--debug-rendering', action='store_true',
+    '-g', '--debug-rendering', action='store_true',
     help="render each page twice with debug information on second page")
@@ -106,51 +107,19 @@ if not options.temp_folder:
     options.temp_folder = 'tmp'
-_logger, _logger_mutex = cmdline.setup_logging(__name__, options.log_file,
-                                               options.verbose)
+log, log_mutex = cmdline.setup_logging(__name__, options.log_file,
+                                       options.verbose)
-class WrappedLogger:
-
-    def __init__(self, my_logger, my_mutex):
-        self.logger = my_logger
-        self.mutex = my_mutex
-
-    def log(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.log(*args, **kwargs)
-
-    def debug(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.debug(*args, **kwargs)
-
-    def info(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.info(*args, **kwargs)
-
-    def warning(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.warning(*args, **kwargs)
-
-    def error(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.error(*args, **kwargs)
-
-    def critical(self, *args, **kwargs):
-        with self.mutex:
-            self.logger.critical(*args, **kwargs)
-
-log = WrappedLogger(_logger, _logger_mutex)
-
-
-def re_symlink(input_file, soft_link_name, log=log):
+def re_symlink(input_file, soft_link_name, log, mutex):
     """
     Helper function: relinks soft symbolic link if necessary
     """
     if input_file == soft_link_name:
-        log.debug("Warning: No symbolic link made. You are using " +
-                  "the original data directory as the working directory.")
+        with mutex:
+            log.debug("Warning: No symbolic link made. You are using " +
+                      "the original data directory as the working directory.")
         return
@@ -161,12 +130,14 @@ def re_symlink(input_file, soft_link_name, log=log):
         try:
             os.unlink(soft_link_name)
         except:
-            log.debug("Can't unlink %s" % (soft_link_name))
+            with mutex:
+                log.debug("Can't unlink %s" % (soft_link_name))
     if not os.path.exists(input_file):
         raise Exception("trying to create a broken symlink to %s" % input_file)
-    log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
+    with mutex:
+        log.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
     os.symlink(	2015-07-23 02:22:12 -07:00
Jim Barlow	33731a6864	Move pageinfo code out of the pipeline	2015-07-23 02:17:13 -07:00
Jim Barlow	0c36cd2e24	Fix errors related to use of working directory. Mainly work around lack of @split(...output_dir) in ruffus	2015-07-23 01:16:05 -07:00
Jim Barlow	5cef1be26d	New pipeline runs, splits pages	2015-07-22 22:58:13 -07:00
Jim Barlow	e89f482c3d	Fixes from early testing of new pipeline	2015-07-22 22:51:38 -07:00
Jim Barlow	fe3e40305d	Learn to split PDF into pages	2015-07-22 22:46:00 -07:00
Jim Barlow	a92b5ceb6b	Begin unifying main script and page script	2015-07-22 22:30:00 -07:00
Jim Barlow	0e7e7d8437	Suppress the xref warning for now	2015-07-22 11:24:14 -07:00
Jim Barlow	f47fa98f33	Fixes to colorspace and other inquiries	2015-07-22 11:24:06 -07:00
Jim Barlow	d3d5879911	Replace pdfimages -list call to poppler with PyPDF test for image. The immediate reason for doing this is that (newer?) versions of parse() seem to choke on the parse string. It appears to trigger exponential behavior in the underlying regex. In any case, replacing subprocesses with native Python is usually better.	2015-07-22 11:22:12 -07:00
Jim Barlow	b2168e11db	Require Py3 for tests	2015-07-22 11:21:33 -07:00
Jim Barlow	6d5d8be708	New test: check skew	2015-07-22 04:00:59 -07:00
Jim Barlow	ce2dbdf372	Add another test	2015-07-22 03:16:19 -07:00
Jim Barlow	ec8a35a7a6	Basic test cases	2015-07-22 02:59:25 -07:00
Jim Barlow	f6577c22c3	Complete wrapping of logger/logger_mutex	2015-07-22 02:57:13 -07:00
Jim Barlow	43d6c03093	Implement oversampling in ocrpage.py	2015-03-27 18:32:55 -07:00
Jim Barlow	1870f116bb	More consistent spacing	2015-03-24 23:05:42 -07:00
Jim Barlow	8b87def013	Don't presume two jobs	2015-03-24 23:04:49 -07:00
Jim Barlow	de599d97b5	Tidy up readme	2015-03-24 23:04:33 -07:00
Jim Barlow	5d7e6b45c4	Cleanup logger	2015-03-24 22:46:33 -07:00
Jim Barlow	c6091bcfe1	Change python2 -> python3 for readlink()	2015-03-24 22:36:13 -07:00
Jim Barlow	466a8a1318	It's now py3 that uses lxml, reportlab	2015-03-19 17:12:32 -07:00
Jim Barlow	a99ba3b696	Add rudimentary support for combining OCR layer with existing content. It appears to be very fragile due to weaknesses in PyPDF. Better option is probably to use pdftk's watermark feature.	2015-03-10 14:28:38 -07:00
Jim Barlow	9229f7c6cc	Add option to render text as invisible OCR text. Prior to this change, hocrtransform would render printable text (black on white) and then a fully opaque image on top of the text. According to the PDF spec, text that is the output of OCR should be marked invisible, so that PDF viewers /know/ it's OCR output in a document that might mix OCR and text overlays. Another benefit is that PDF viewers would know to skip rendering text if they are not smart enough to figure out the image will completely overwrite it. However, for debug, visible text is nice, so retain it as an option.	2015-02-22 12:43:27 -08:00
Jim Barlow	bf114bb188	Clean up pixel transform logic with namedtuple	2015-02-21 14:14:34 -08:00
Jim Barlow	b8eed2f861	More PEP8/lint	2015-02-21 13:00:46 -08:00
Jim Barlow	ccb1e347be	Call HocrTransform directly instead of through a subprocess	2015-02-20 17:20:48 -08:00
Jim Barlow	8698974f11	Rename hocrTransform -> hocrtransform	2015-02-20 16:47:36 -08:00
Jim Barlow	f2c79c4341	Convert hocrtransform to py3	2015-02-20 16:38:24 -08:00
Jim Barlow	4966d1346b	Module marker for src folder	2015-02-20 15:43:05 -08:00
Jim Barlow	4a9337f757	PEP8	2015-02-20 15:42:06 -08:00
Jim Barlow	db311fb6a2	Add support for -b (skip big pages)	2015-02-20 15:26:33 -08:00
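
Several commits above (f6577c22c3, 596d468c14, eddbf1060a) deal with wrapping a logger and its mutex so that log calls from parallel jobs do not interleave. The following is a minimal sketch of that pattern, not the project's actual code: it assumes a plain `threading.Lock` in place of the mutex returned by `ruffus.cmdline.setup_logging`, and it forwards positional and keyword arguments with `*args` and `**kwargs`.

```python
import logging
import threading


class WrappedLogger:
    """Forward logging calls to a logger while holding a shared lock.

    Illustrative sketch only: the lock here is whatever object the caller
    passes in, assumed to support the context-manager protocol.
    """

    def __init__(self, my_logger, my_mutex):
        self.logger = my_logger
        self.mutex = my_mutex

    def log(self, level, *args, **kwargs):
        with self.mutex:
            self.logger.log(level, *args, **kwargs)

    def debug(self, *args, **kwargs):
        with self.mutex:
            self.logger.debug(*args, **kwargs)

    def info(self, *args, **kwargs):
        with self.mutex:
            self.logger.info(*args, **kwargs)

    def warning(self, *args, **kwargs):
        with self.mutex:
            self.logger.warning(*args, **kwargs)

    def error(self, *args, **kwargs):
        with self.mutex:
            self.logger.error(*args, **kwargs)

    def critical(self, *args, **kwargs):
        with self.mutex:
            self.logger.critical(*args, **kwargs)


if __name__ == "__main__":
    # Usage: every call acquires the lock before touching the logger.
    logging.basicConfig(level=logging.DEBUG)
    log = WrappedLogger(logging.getLogger("example"), threading.Lock())
    log.debug("os.symlink(%s, %s)", "input.pdf", "link.pdf")
```

Commit eddbf1060a takes the other route and passes `log` and `mutex` separately to each helper, which keeps the locking explicit at every call site at the cost of more boilerplate.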


2676 Commits