OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-11-27 23:52:21 +00:00

Author	SHA1	Message	Date
fritz-hh	2612105d32	correct download path	2014-09-29 22:29:25 +02:00
fritz-hh	954fe13f54	update release notes for v2.2-stable (tag: v2.2-stable)	2014-09-29 22:25:02 +02:00
fritz-hh	bb5a00685e	Make clear this is a draft	2014-09-28 21:10:04 +02:00
Jim Barlow	dabbddb04e	deskew and clean	2014-09-27 15:03:07 -07:00
fritz-hh	5f173e5acb	return right return code. Python does not map the expression to its return code automatically, so this line returns success regardless of the reportlab version installed. (I also realized that hasattr is superfluous).	2014-09-27 00:53:10 +02:00
fritz-hh	b28ff40aea	remove reportlab patch; fixes #91. Remove the patch that was required for versions of reportlab <3.0 (fixed in 3.0 now). The patch was necessary in order to reduce the size of grayscale / b&w images in the PDF	2014-09-26 23:58:19 +02:00
Jim Barlow	fccfb4589e	Moving quickly - we can now output .ppm files at correct resolution	2014-09-26 04:43:15 -07:00
Jim Barlow	5384c98013	Initial ocrpage.py rewrite into python3	2014-09-26 04:19:41 -07:00
fritz-hh	2ed2307573	Merge pull request #89 from jbarlow83/feature/readlink-osx. More portable solution (also works on OS X) to get the OCRmyPDF.sh path (following symlinks)	2014-09-25 23:09:26 +02:00
Jim Barlow	3f8a2d8d3e	Eliminate readlink entirely and do the same thing on all platforms	2014-09-25 13:47:35 -07:00
fritz-hh	1a13b7c85f	Check if the input file exists. Previously I checked only whether the folder that should contain the input file exists	2014-09-25 22:03:45 +02:00
Jim Barlow	d7130a1e56	Merge branch 'feature/keep-text-pages' into develop	2014-09-25 03:50:21 -07:00
Jim Barlow	f69054cb17	Fix parameter order problems. Put TESS_CFG_FILES last because it is optional and can be blank. If omitted it breaks the sequence of subsequent parameters. Also clean up text output in this new mode.	2014-09-25 03:50:01 -07:00
Jim Barlow	80dc6eca2c	Merge branches 'feature/readlink-osx' and 'feature/keep-text-pages' into develop. Conflicts: OCRmyPDF.sh	2014-09-25 03:14:10 -07:00
Jim Barlow	d250fbb3d6	Fix call to readlink on OS X. readlink -f is a GNU coreutils extension, so it is not available on OS X and other platforms.	2014-09-25 03:11:27 -07:00
Jim Barlow	09bbe92611	Add command line option to skip pages that contain font data. If a page contains font data, the script would abort, unless -f was given, in which case it would use pdftoppm to rasterize the font into a bitmap and then attempt to OCR it. -f is almost certainly not what users want unless they want to debug OCR or something. If a PDF already has fonts it either was OCR'd already, or it is a composite file containing, for example, some scanned documents appended to a text report. In the latter case, this -s option provides OCR on pages that don't have it without changing those that do, and if a PDF was completely OCRed it will be converted to PDF/A. In batch jobs with a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.	2014-09-25 02:43:40 -07:00
Jim Barlow	69d922e096	Check for missing pdftoppm when poppler is installed with --disable-splash-output. When I upgraded to poppler 0.24.5, pdftoppm was not compiled because the script had --disable-splash-output set for some reason. For OS X Homebrew the solution is: brew uninstall poppler; brew install poppler --with-splash-output	2014-09-25 02:30:29 -07:00
fritz-hh	d510e7e4ae	prevent new spurious jhove message from being displayed	2014-09-24 23:43:37 +02:00
fritz-hh	5893290dd9	update to jhove v1.11	2014-09-24 23:17:39 +02:00
fritz-hh	5c3bbc4031	typo in OCRmyPDF.sh	2014-09-22 21:22:38 +02:00
fritz-hh	27cd8cf0db	add link to heise open source	2014-09-20 20:47:02 +02:00
fritz-hh	b403016d5b	Release notes updated for v2.1-stable (tag: v2.1-stable)	2014-09-20 19:50:32 +02:00
fritz-hh	5a81823969	Merge pull request #82 from orbitcowboy/v2.x Fixed typo	2014-09-20 19:02:33 +02:00
fritz-hh	17801401cd	Merge pull request #83 from DorianScholz/v2.x - small changes to make this work on Ubuntu 12.04 called via symlink - lowered minimum parallel version	2014-09-20 18:59:57 +02:00
Dorian Scholz	5c7b2a2a36	lowered minimum version for parallel to 20121122	2014-09-10 13:27:59 +02:00
Dorian Scholz	1db06de287	added BASEPATH to allow for execution via symlink	2014-09-10 13:26:14 +02:00
Martin Ettl	3904178d44	Fixed typo	2014-09-09 07:01:04 +02:00
fritz-hh	8bb9c3610c	Merge pull request #81 from MoritzFago/v2.x fixed typo ghostcript to ghostscript	2014-09-08 18:31:00 +02:00
MoritzFago	7dcc382ccc	fixed typo ghostcript to ghostscript	2014-09-08 16:52:49 +02:00
fritz-hh	b71fc807d2	Merge pull request #77 from andysigner/v2.x Fixed typo in help text	2014-05-23 19:51:20 +02:00
Andy Signer	15d28d970a	Fixed typo in help text	2014-05-23 12:41:31 +02:00
fritz-hh	e083a860e9	Merge pull request #73 from andreas-christ/v2.x Fixed typo in import of reportlab.	2014-04-27 23:20:48 +02:00
Andreas Christ	6463b9dd84	Fixed typo in import of reportlab.	2014-04-27 19:15:46 +02:00
fritz-hh	c873de6ca4	Consider that the hocr file does not always have the same name. Closes #72	2014-04-27 16:01:11 +02:00
fritz-hh	b70863b47e	support both older and newer versions of reportlab. Closes #71	2014-04-27 15:53:20 +02:00
fritz-hh	3546f84c6d	ignore *.pyc files	2014-04-27 15:46:55 +02:00
Jim Barlow	1d98917db9	Add command line option to skip pages that contain font data. If a page contains font data, the script would abort, unless -f was given, in which case it would use pdftoppm to rasterize the font into a bitmap and then attempt to OCR it. -f is almost certainly not what users want unless they want to debug OCR or something. If a PDF already has fonts it either was OCR'd already, or it is a composite file containing, for example, some scanned documents appended to a text report. In the latter case, this -s option provides OCR on pages that don't have it without changing those that do, and if a PDF was completely OCRed it will be converted to PDF/A. In batch jobs with a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.	2014-02-06 23:11:54 -08:00
fritz-hh	1c34fd69cf	RELEASE_NOTES update prior to delivery of v2.0-stable (tag: v2.0-stable)	2014-01-25 22:14:05 +01:00
fritz-hh	4cf38404cc	fixes #51: Allow tesseract 3.02.01 to be used. Even though 3.02.01 fails in a few cases (see issue #28), I decided to allow this version anyway because 3.02.02 is not yet available for some widespread Linux distributions	2014-01-25 21:58:50 +01:00
Jim Barlow	112fb5098b	Expose pixFindSkew API	2014-01-21 21:36:41 -08:00
Jim Barlow	5ace6906c7	Bug fix: leptonica generates .png when asked to produce .pbm/pgm/ppm. Leptonica does not interpret those extensions correctly. However, when asked to produce a .pnm file, it will produce the expected .pbm/pgm/ppm file depending on the input. So ask it to produce a .pnm and then adjust the extension. And add a test case.	2014-01-21 21:35:58 -08:00
Jim Barlow	8cfbdaf0d0	Fix a silly typo, and other minor cleanup	2014-01-19 19:06:19 -08:00
Jim Barlow	6703434976	Replace ImageMagick-convert with Leptonica	2014-01-19 14:47:51 -08:00
Jim Barlow	62edc15cd7	Implement ctypes wrapper around Leptonica to access its deskew function. A few design notes: Leptonica's deskew is far superior to ImageMagick's convert -deskew command -- around 30-40x faster. Subjectively the output appears to this contributor to be of higher quality as well. The difference is the algorithm: ImageMagick uses the complex Hough transform to find the skew angle, while Leptonica uses the simpler method, Postl's variance of differential line sums -- conceptually, shear the image and check for straight horizontal. In this case simplicity wins. Finding the skew angle is the bulk of the work. Leptonica's author explains the advantages of his approach here: http://www.leptonica.com/skew-measurement.html Leptonica is the low-level library that Tesseract depends on. Hence, this project already depends on Leptonica. Leptonica can read and write most common image file types on its own. Unfortunately its error handling is poor: it seldom returns any meaningful error codes. The best it manages is writing messages to stderr, which in the context of a verbose script is just confusing since the error's source is not indicated. The problem is compounded by Tesseract's use of Leptonica, which will produce exactly the same errors in some cases. So we trap stderr between calls to Leptonica and parse it for a few different types of error message. leptonica.py is Python 2/3 compatible and set up to provide access to other Leptonica functions as needed. Of particular interest is its orientation detection (including flip and rotation errors), which it does by comparing text ascenders to descenders. There is a PyPI "pyleptonica" package; however, it is out of date by a few years, and it implements all of Leptonica with Python wrappers -- so it is massive, with one .py file at 2.5 MB. This module is loosely inspired by pyleptonica but is more modern, up to date, and contains only limited functionality.	2014-01-19 14:28:52 -08:00
fritz-hh	be830ddc31	List supported languages. If the language is not supported, list the supported languages in the error message	2014-01-18 22:22:19 +01:00
fritz-hh	18322b424f	fixes #60: Check if the languages provided to tesseract (option -l) are supported	2014-01-18 21:38:22 +01:00
fritz-hh	6901c60db4	more robust way to check tesseract version. Better way of checking if the tesseract version is compatible with the script. If the required tess version is 3.02.02 and the actual version is 3.03, the script would previously have reported that the version is too old, because 303<30202; now it compares 3.03>3.0202	2014-01-18 21:02:15 +01:00
fritz-hh	e369ce6766	config file: version updated to v2.0-rc2 (tag: v2.0-rc2)	2014-01-16 21:22:24 +01:00
fritz-hh	64e4e5d91e	release notes updated for v2.0-rc2	2014-01-16 21:19:15 +01:00
fritz-hh	efce7de9ae	wording corrected	2014-01-15 23:08:26 +01:00
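
Commit 6901c60db4 in the list above describes a version check that failed because dotted versions were compared as digit strings (303 < 30202). The following is a minimal sketch of one way to avoid that class of bug, assuming Python. It compares versions component by component, which is a swapped-in technique and not the decimal comparison (3.03 > 3.0202) that the commit message mentions. The function names naive_digits, parse_version, and is_at_least are hypothetical and do not come from the repository.

```python
def naive_digits(text):
    """Reproduce the flawed approach: drop the dots and read the rest as one integer."""
    return int(text.replace(".", ""))


def parse_version(text):
    """Split a dotted version string such as '3.02.02' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(actual, required):
    """Return True if `actual` is the same as or newer than `required`.

    Shorter versions are padded with zeros so that '3.03' is treated as '3.03.0'.
    """
    a = parse_version(actual)
    r = parse_version(required)
    width = max(len(a), len(r))
    a += (0,) * (width - len(a))
    r += (0,) * (width - len(r))
    return a >= r


if __name__ == "__main__":
    # The digit-string comparison wrongly ranks 3.03 below 3.02.02.
    print(naive_digits("3.03") < naive_digits("3.02.02"))
    # The component-wise comparison ranks it correctly.
    print(is_at_least("3.03", "3.02.02"))
```

Tuple comparison in Python is lexicographic, so each component is compared in order and the first difference decides the result.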

2676 Commits