OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-10-26 15:29:23 +00:00

Author	SHA1	Message	Date
Jim Barlow	cc2af2bc15	Convert the final image to a JPEG if the original image was a JPEG. Of course, this introduces recompression artifacts, and is unnecessary if no options are given that modify the final image (no -d, -c, -i). But rather than worry about that, it would be better to ultimately find a way to combine the original PDF page with the output PDF text in the case where we want no changes to the original. This is good enough for now. The better option can apparently be achieved using pdftk background, or probably better, PyPDF2's merge. If Tesseract PDF generation is used then we need a way to remove the image. Tesseract PDF generation at 3.03 does layout better (I think) and also properly encodes the hidden layer, which is less likely to give display issues (I think).	2015-02-11 10:23:45 -08:00
Jim Barlow	638c6db05d	Use the appropriate PNG renderer given the types of image present	2015-02-11 03:32:00 -08:00
Jim Barlow	f7db8d9aff	Use Ghostscript -> PNG instead of pdftoppm for rendering. Ghostscript has the clunkiest imaginable syntax, obtuse documentation, quirky behavior, and poor diagnostics... but it actually works unlike pdftoppm/poppler which gets things wrong. In this case I observed poppler incorrectly decompresses certain CCITT encoded monochrome PDFs. So set up Ghostscript to do the job instead. For the moment this performs monochrome -> RGB conversion via reportlab.	2015-02-11 03:13:07 -08:00
Jim Barlow	564fb7a87e	Support Ghostscript 9.14's new color conversion engine (not portable). The flag -dUseCIEColor is now deprecated, as it invokes the old engine which introduces color errors. The new engine requires a PDF/A file header with hardcoded location of an ICC profile to use, now included in the project. Portable iterations should generate a PDFA_def.ps based on the target system; for now OS X with homebrew is presumed. I have selected sRGB since scanners tend to capture RGB and printing is not a major consideration for PDF/A. Also note all file paths given to gs must be absolute. May its creators be forever haunted for their failure to document this unexpected quirk.	2015-02-09 15:33:49 -08:00
Jim Barlow	4d88e64774	Standardize tmpfile prefix	2015-02-09 15:02:49 -08:00
Jim Barlow	26f1163b46	Handle case where a page contains no images - don't OCR. It doesn't make much sense to do anything with an all vector page except extract the page unmodified.	2015-02-08 20:05:54 -08:00
Jim Barlow	40058e99e0	Implement debug text only page option	2015-02-08 19:51:41 -08:00
Jim Barlow	bece4c3e02	Describe what decision was made based on -f and -s and presence of text	2015-02-08 19:51:18 -08:00
Jim Barlow	f0f6b57c87	When deciding on OCR, check for presence of text rather than a font. It appears to be possible to have a PDF with an embedded font that is either unused or used only for whitespace. So check for some amount of actual text instead.	2015-02-08 17:38:27 -08:00
Jim Barlow	dc2a4ab044	Logic error	2015-02-08 17:33:35 -08:00
Jim Barlow	b16d6f5b81	Implement skipping OCR when -s is specified. Appears to be necessary to disable each state of the pipeline that is inactive, not just initial and terminal stages of an inactive segment. If nothing else this makes what is going on more explicit.	2015-02-08 17:26:16 -08:00
Jim Barlow	69ce6ff7b5	Not a named param	2014-11-22 15:35:05 -08:00
Jim Barlow	32ba50b8dc	Add Tesseract timeout to keep things reasonable	2014-11-14 02:06:23 -08:00
Jim Barlow	36aca45f35	The -dci options now work (and valid combinations thereof)	2014-11-14 00:23:22 -08:00
Jim Barlow	925290342d	Leptonica deskew can handle .pnm input, unlike imagemagick	2014-11-13 23:20:25 -08:00
Jim Barlow	4dc0370c57	Add leptonica deskew	2014-11-13 16:53:26 -08:00
Jim Barlow	b92f8e43f2	Run as a module instead	2014-11-13 16:52:53 -08:00
Jim Barlow	22b0733a1d	Merge branch 'feature/findskew' into develop	2014-11-13 16:00:27 -08:00
Jim Barlow	6021684ab6	Attempt to fix multiprocessing pickling error	2014-11-13 15:58:57 -08:00
Jim Barlow	f4b1d0cdfe	Fix symlink error that occurs in multipage processing	2014-11-13 15:58:36 -08:00
Jim Barlow	d0d8048621	Comments	2014-10-17 17:28:31 -07:00
Jim Barlow	cfd119325d	Use abspath instead of relpath for temporary directory symlink	2014-10-11 17:48:56 -07:00
Jim Barlow	ad30833ffc	Support missing tess_cfg_files parameter when omitted by OCRmyPDF.sh	2014-10-11 17:48:33 -07:00
Jim Barlow	e5c79a6666	Use TIFFs as intermediates. pdftoppm in recent versions (0.26.4,5) seems to be incapable of producing valid TIFFs, so have it dump a .pnm file and let ImageMagick figure out how to convert it to TIFF. This is not ideal, but at least it works.	2014-10-10 01:54:16 -07:00
Jim Barlow	63dc753c1b	Standardize intermediate filenames better. convert .pnm -deskew <...> .pnm seems to have a bug that produces an invalid .pnm file which later causes tesseract (specifically, leptonica) to choke (using 3.02/1.71 as versions, respectively). Will change pipeline to use tiffs internally since they are less stupid.	2014-10-10 01:30:43 -07:00
Jim Barlow	017bc1f252	Basic error handling	2014-10-10 01:07:46 -07:00
Jim Barlow	bcd67c009d	Sort of working, but fragile; uses tmp folder properly now	2014-10-10 00:35:49 -07:00
Jim Barlow	2f6cfafdfc	Now produces a finished OCR-PDF page	2014-10-08 03:54:06 -07:00
Jim Barlow	25234fa30b	First crack at Ruffus, working well	2014-10-08 03:21:28 -07:00
Jim Barlow	dabbddb04e	deskew and clean	2014-09-27 15:03:07 -07:00
Jim Barlow	fccfb4589e	Moving quickly - we can now output .ppm files at correct resolution	2014-09-26 04:43:15 -07:00
Jim Barlow	5384c98013	Initial ocrpage.py rewrite into python3	2014-09-26 04:19:41 -07:00
Jim Barlow	d7130a1e56	Merge branch 'feature/keep-text-pages' into develop	2014-09-25 03:50:21 -07:00
Jim Barlow	f69054cb17	Fix parameter order problems. Put TESS_CFG_FILES last because it is optional and can be blank. If omitted it breaks the sequence of subsequent parameters. Also clean up text output in this new mode.	2014-09-25 03:50:01 -07:00
Jim Barlow	80dc6eca2c	Merge branches 'feature/readlink-osx' and 'feature/keep-text-pages' into develop. Conflicts: OCRmyPDF.sh	2014-09-25 03:14:10 -07:00
Jim Barlow	d250fbb3d6	Fix call to readlink on OS X. readlink -f is a GNU coreutils extension, so not available on OS X and other platforms.	2014-09-25 03:11:27 -07:00
Jim Barlow	09bbe92611	Add command line option to skip pages that contain font data. If a page contains font data, the script would abort, unless -f was given, in which case it would use pdftoppm to rasterize the font into a bitmap and then attempt to OCR it. -f is almost certainly not what users want unless they want to debug OCR or something. If a PDF already has fonts it either was OCR'd already, or it is a composite file containing, for example, some scanned documents appended to a text report. In the latter case, this -s option provides OCR on pages that don't have it without changing those that do, and if a PDF was completely OCR'd it will be converted to PDF/A. In batch jobs with a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.	2014-09-25 02:43:40 -07:00
Jim Barlow	69d922e096	Check for missing pdftoppm when poppler installed with --disable-splash-output. When I upgraded to poppler 0.24.5, pdftoppm was not compiled because the script had --disable-splash-output set for some reason. For OS X Homebrew the solution is: brew uninstall poppler; brew install poppler --with-splash-output	2014-09-25 02:30:29 -07:00
fritz-hh	d510e7e4ae	prevent new spurious jhove message to be displayed	2014-09-24 23:43:37 +02:00
fritz-hh	5893290dd9	update to jhove v1.11	2014-09-24 23:17:39 +02:00
fritz-hh	5c3bbc4031	typo in OCRmyPDF.sh	2014-09-22 21:22:38 +02:00
fritz-hh	27cd8cf0db	add link to heise open source	2014-09-20 20:47:02 +02:00
fritz-hh	b403016d5b	Release notes updated for v2.1-stable	2014-09-20 19:50:32 +02:00
fritz-hh	5a81823969	Merge pull request #82 from orbitcowboy/v2.x: Fixed typo	2014-09-20 19:02:33 +02:00
fritz-hh	17801401cd	Merge pull request #83 from DorianScholz/v2.x: small changes to make this work on Ubuntu 12.04 called via symlink; lowered minimum parallel version	2014-09-20 18:59:57 +02:00
Dorian Scholz	5c7b2a2a36	lowered minimum version for parallel to 20121122	2014-09-10 13:27:59 +02:00
Dorian Scholz	1db06de287	added BASEPATH to allow for execution via symlink	2014-09-10 13:26:14 +02:00
Martin Ettl	3904178d44	Fixed typo	2014-09-09 07:01:04 +02:00
fritz-hh	8bb9c3610c	Merge pull request #81 from MoritzFago/v2.x: fixed typo ghostcript to ghostscript	2014-09-08 18:31:00 +02:00
MoritzFago	7dcc382ccc	fixed typo ghostcript to ghostscript	2014-09-08 16:52:49 +02:00


246 Commits