OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-10-24 06:20:17 +00:00

Author	SHA1	Message	Date
fritz-hh	e083a860e9	Merge pull request #73 from andreas-christ/v2.x Fixed typo in import of reportlab.	2014-04-27 23:20:48 +02:00
Andreas Christ	6463b9dd84	Fixed typo in import of reportlab.	2014-04-27 19:15:46 +02:00
fritz-hh	c873de6ca4	Consider that the hocr file does not always have the same name. Closes #72	2014-04-27 16:01:11 +02:00
fritz-hh	b70863b47e	support both older and newer versions of reportlab closes #71	2014-04-27 15:53:20 +02:00
fritz-hh	3546f84c6d	ignore *.pyc files	2014-04-27 15:46:55 +02:00
Jim Barlow	1d98917db9	Add command line option to skip pages that contain font data If a page contains font data, the script would abort, unless -f was given, in which case it would use pdftoppm to rasterize the font into a bitmap and then attempt to OCR it. -f is almost certainly not what users want unless they want to debug OCR or something. If a PDF already has fonts it either was OCR'd already, or it is a composite file containing, for example, some scanned documents appended to a text report. In the latter case, this -s option provides OCR on pages that don't have it without changing those that do, and if a PDF was completely OCRed it will be converted to PDF/A. In batch jobs with a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.	2014-02-06 23:11:54 -08:00
fritz-hh	1c34fd69cf	RELEASE_NOTES update prior delivery of v2.0-stable v2.0-stable	2014-01-25 22:14:05 +01:00
fritz-hh	4cf38404cc	fixes #51 Allow tesseract 3.02.01 to be used. Even 3.02.01 fails in a few cases (see issue #28). I decided to allow this version anyway because 3.02.02 is not yet available for some widespread Linux distributions	2014-01-25 21:58:50 +01:00
Jim Barlow	112fb5098b	Expose pixFindSkew API	2014-01-21 21:36:41 -08:00
Jim Barlow	5ace6906c7	Bug fix: leptonica generates .png when asked to produce .pbm/pgm/ppm Leptonica does not interpret those extensions correctly. However, when asked to produce a .pnm file, it will produce the expected .pbm/pgm/ppm file depending on the input. So ask it to produce a .pnm and then adjust the extension. And add a test case.	2014-01-21 21:35:58 -08:00
Jim Barlow	8cfbdaf0d0	Fix a silly typo, and other minor cleanup	2014-01-19 19:06:19 -08:00
Jim Barlow	6703434976	Replace ImageMagick-convert with Leptonica	2014-01-19 14:47:51 -08:00
Jim Barlow	62edc15cd7	Implement ctypes wrapper around Leptonica to access its deskew function A few design notes: Leptonica's deskew is far superior to ImageMagick's convert -deskew command -- around 30-40x faster. Subjectively the output appears to this contributor to be of higher quality as well. The difference is the algorithm: ImageMagick uses the complex Hough transform to find the skew angle, while Leptonica uses the simpler method, Postl's variance of differential line sums -- conceptually, shear the image and check for straight horizontal. In this case simplicity wins. Finding the skew angle is the bulk of the work. Leptonica's author explains the advantages of his approach here: http://www.leptonica.com/skew-measurement.html Leptonica is the low-level library that Tesseract depends on. Hence, this project already depends on Leptonica. Leptonica can read and write most common image file types on its own. Unfortunately its error handling is poor: it seldom returns any meaningful error codes. The best it manages is writing messages to stderr, which in the context of a verbose script is just confusing since the error's source is not indicated. The problem is compounded by Tesseract's use of Leptonica, which will produce exactly the same errors in some cases. So we trap stderr between calls to Leptonica and parse it for a few different types of error message. leptonica.py is Python 2/3 compatible and set up to provide access to other Leptonica functions as needed. Of particular interest are its orientation detection (including flip and rotation errors) which it does by comparing text ascenders to descenders. There is a PyPI "pylepthonica" package, however it is out of date by a few years, and it implements all of Leptonica with Python wrappers -- so it is massive, with one .py file at 2.5 MB. This module is loosely inspired by pyleptonica but more modern, up to date, and contains only limited functionality.	2014-01-19 14:28:52 -08:00
fritz-hh	be830ddc31	List supported languages: in case a language is not supported, list the supported languages in the error message	2014-01-18 22:22:19 +01:00
fritz-hh	18322b424f	fixes #60 Check if the languages option provided to tesseract (-l) are supported	2014-01-18 21:38:22 +01:00
fritz-hh	6901c60db4	more robust way to check tesseract version: better way of checking if the tesseract version is compatible with the script. If the required tess version is 3.02.02 and the actual version is 3.03, the script would previously have reported that the version is too old, because 303<30202; now it compares 3.03>3.0202	2014-01-18 21:02:15 +01:00
fritz-hh	e369ce6766	config file: version updated to v2.0-rc2 v2.0-rc2	2014-01-16 21:22:24 +01:00
fritz-hh	64e4e5d91e	release notes updated for v2.0-rc2	2014-01-16 21:19:15 +01:00
fritz-hh	efce7de9ae	wording corrected	2014-01-15 23:08:26 +01:00
fritz-hh	38c64ac689	dependency to pdftk removed concatenation is now done also with ghostscript	2014-01-15 21:23:42 +01:00
fritz-hh	6d203e3eee	portability improvements + minor changes	2014-01-15 21:23:41 +01:00
fritz-hh	81f461e557	disclaimer added	2014-01-14 23:46:33 +01:00
fritz-hh	988bde1387	tmpfiles to $TMPDIR + better portability (mktemp) mktemp: consider both FreeBSD/OSX and Linux OS having incompatible syntax From now on temporary files are saved in the folder specified by the environment variable $TMPDIR	2014-01-14 22:57:10 +01:00
fritz-hh	aedbabdbe8	merged pull request from oxplot	2014-01-14 22:29:41 +01:00
fritz-hh	6ed53e53c7	Readme improved	2014-01-14 19:47:28 +01:00
Mansour Behabadi	a78630ce99	Make src scripts executable Signed-off-by: Mansour Behabadi <mansour@oxplot.com>	2014-01-14 17:50:46 +11:00
Mansour Behabadi	6653066784	Use --gnu in parallel and XX for mktemp Signed-off-by: Mansour Behabadi <mansour@oxplot.com>	2014-01-14 17:49:24 +11:00
fritz-hh	e40f1fa081	better handling of ligatures: fixes #58	2014-01-13 23:13:15 +01:00
fritz-hh	a872ce751d	config file restructured to make clear which parameters are allowed to be changed by the user	2014-01-13 22:11:28 +01:00
fritz-hh	317846fbdc	Check if tmp folder creation was successful	2014-01-13 22:05:26 +01:00
fritz-hh	f581a55544	Merge pull request #57 from jbarlow83/for-upstream/tmpfolder Fix temporary folder name generation collisions	2014-01-13 12:31:02 -08:00
fritz-hh	447b291e70	minor changes	2014-01-13 18:03:44 +01:00
fritz-hh	01d07253e8	indicate python2 to be used in header	2014-01-13 18:03:43 +01:00
fritz-hh	034a466094	Merge pull request #56 from jbarlow83/for-upstream/hocr-selfwidth Fix AttributeError on self.width if Tesseract finds no OCR text	2014-01-13 08:44:16 -08:00
fritz-hh	c6211e2335	Merge pull request #55 from jbarlow83/for-upstream/check-poppler Verify that pdftoppm is the Poppler version, not xpdf version	2014-01-13 08:42:33 -08:00
Jim Barlow	1d03a6417d	Verify that pdftoppm is the Poppler version, not xpdf version	2014-01-12 22:12:09 -08:00
Jim Barlow	1d62ef27a2	Fix AttributeError on self.width if Tesseract finds no OCR text self.width remains undefined unless hOCR finds text. It might not, if a page contains only an image for example. Full error message is: AttributeError: ‘hocrTransform’ object has no attribute ‘width’	2014-01-12 22:10:15 -08:00
Jim Barlow	996048dc08	Fix temporary folder name generation collisions First, the regular expression matches everything after the first period in a filename. Adding the $ makes it match the last, so that filenames such as “Report.1.pdf” get trimmed to “Report.1”. Next use mktemp to get the OS to create a temporary folder. It will guarantee a unique directory name beginning with prefix, even if parallel processes are at work.	2014-01-12 22:05:11 -08:00
fritz-hh	bf02ee3bdc	Resolved conflicts with jbarlow83 pull request	2014-01-12 15:37:14 +01:00
fritz-hh	a3c7fba02d	minor changes (comments)	2014-01-11 22:26:29 +01:00
fritz-hh	a8cd7febf6	remove spurious space in img number Tell the script that "nbImg" is a number, so that leading/trailing spaces are removed	2014-01-11 22:15:53 +01:00
fritz-hh	20c008b84f	avoid spurious error msg if no image in pdf	2014-01-11 22:05:19 +01:00
fritz-hh	7cd73566be	check if python libs are installed Check if reportlab and lxml are installed, otherwise exit with an error	2014-01-11 17:08:26 +01:00
fritz-hh	e56fd53d06	poppler syntax (rather than xpdf syntax)	2014-01-11 16:19:52 +01:00
fritz-hh	810b1b3b3e	Merge pull request #48 from jbarlow83/for-upstream/osx-errors Fix pdffonts error when filename contains a space	2014-01-11 07:10:12 -08:00
fritz-hh	cb0b033fe7	Merge branch 'v2.x' of https://github.com/fritz-hh/OCRmyPDF into v2.x	2014-01-11 15:52:01 +01:00
fritz-hh	46f673a3b7	exit if bad parallel/tesseract version installed	2014-01-10 22:59:33 +01:00
fritz-hh	455303b3d4	parallel version added in RELEASE_NOTES	2014-01-10 22:12:58 +01:00
Jim Barlow	24a84d6380	Fix pdffonts error when filename contains a space	2014-01-09 16:44:24 -08:00
Jim Barlow	9aa2171052	Monkeypatch reportlab to output grayscale and monochrome colorspaces	2014-01-09 16:36:26 -08:00
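
Commit 6901c60db4 in the log above describes a version check that went wrong because digits were compared with the dots stripped (303 < 30202). The following is a minimal Python sketch of one way to avoid that pitfall, not the project's actual code: it compares versions as tuples of integers rather than as the decimal numbers the commit message mentions. The names `parse_version` and `is_version_at_least` are hypothetical and chosen only for illustration.

```python
def parse_version(text):
    """Split a dotted version string such as '3.02.02' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_version_at_least(actual, required):
    """Return True if `actual` is the same as or newer than `required`."""
    a = parse_version(actual)
    r = parse_version(required)
    # Pad the shorter tuple with zeros so '3.03' is treated as '3.03.0'.
    width = max(len(a), len(r))
    a += (0,) * (width - len(a))
    r += (0,) * (width - len(r))
    return a >= r


def naive_is_version_at_least(actual, required):
    """The failure mode described in the commit: drop the dots, compare ints."""
    return int(actual.replace(".", "")) >= int(required.replace(".", ""))
```

With these helpers, `is_version_at_least("3.03", "3.02.02")` accepts the newer release, while the naive comparison rejects it because 303 is smaller than 30202.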

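Commit 996048dc08 in the log above describes two fixes: anchoring the extension-stripping regular expression with `$` so that only the last extension is removed, and letting the OS create a uniquely named temporary folder. The sketch below illustrates both ideas in Python under stated assumptions; the commit itself refers to `mktemp`, so `tempfile.mkdtemp` is a swapped-in equivalent, and the function names and the exact pattern are hypothetical.

```python
import os
import re
import tempfile


def strip_last_extension(filename):
    """Remove only the final extension: 'Report.1.pdf' becomes 'Report.1'.

    The trailing '$' anchors the match to the end of the name, so the
    pattern cannot match from the first period onwards.
    """
    return re.sub(r"\.[^.]*$", "", filename)


def make_tmp_folder(pdf_path, parent=None):
    """Create a unique temporary folder whose name starts with the PDF's stem.

    tempfile.mkdtemp creates the directory atomically, so two processes
    working on files with the same name still get distinct folders.
    """
    prefix = strip_last_extension(os.path.basename(pdf_path)) + "."
    return tempfile.mkdtemp(prefix=prefix, dir=parent)
```

Calling `make_tmp_folder("Report.1.pdf")` twice returns two different directories, both beginning with `Report.1.`; when `parent` is left as `None`, Python places them in the default temporary location, which honours `$TMPDIR`.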

2895 Commits