2895 Commits

Author SHA1 Message Date
fritz-hh
e083a860e9 Merge pull request #73 from andreas-christ/v2.x
Fixed typo in import of reportlab.
2014-04-27 23:20:48 +02:00
Andreas Christ
6463b9dd84 Fixed typo in import of reportlab. 2014-04-27 19:15:46 +02:00
fritz-hh
c873de6ca4 Account for the hOCR file not always having the same name
Closes #72
2014-04-27 16:01:11 +02:00
fritz-hh
b70863b47e support both older and newer versions of reportlab
closes #71
2014-04-27 15:53:20 +02:00
fritz-hh
3546f84c6d ignore *.pyc files 2014-04-27 15:46:55 +02:00
Jim Barlow
1d98917db9 Add command line option to skip pages that contain font data
If a page contains font data, the script would abort, unless -f was given,
in which case it would use pdftoppm to rasterize the font into a bitmap
and then attempt to OCR it. -f is almost certainly not what users want
unless they want to debug OCR or something.

If a PDF already has fonts it either was OCR'd already, or it is
a composite file containing, for example, some scanned documents appended
to a text report.  In the latter case, this -s option provides OCR on
pages that don't have it without changing those that do, and if a PDF
was completely OCRed it will be converted to PDF/A.  In batch jobs with
a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.
2014-02-06 23:11:54 -08:00
fritz-hh
1c34fd69cf RELEASE_NOTES update prior to delivery of v2.0-stable v2.0-stable 2014-01-25 22:14:05 +01:00
fritz-hh
4cf38404cc fixes #51
Allow tesseract 3.02.01 to be used.
Version 3.02.01 still fails in a few cases (see issue #28). I decided to allow
this version anyway because 3.02.02 is not yet available for some widespread
Linux distributions.
2014-01-25 21:58:50 +01:00
Jim Barlow
112fb5098b Expose pixFindSkew API 2014-01-21 21:36:41 -08:00
Jim Barlow
5ace6906c7 Bug fix: leptonica generates .png when asked to produce .pbm/pgm/ppm
Leptonica does not interpret those extensions correctly.  However, when
asked to produce a .pnm file, it will produce the expected .pbm/pgm/ppm
file depending on the input.  So ask it to produce a .pnm and then
adjust the extension.

And add a test case.
2014-01-21 21:35:58 -08:00
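The "adjust the extension" step from this commit can be sketched as follows. This is an illustration only, not the project's code: `fix_pnm_extension` is a hypothetical helper that assumes a `.pnm` file has already been written, and it chooses the final extension from the standard Netpbm magic number at the start of the file.

```python
import os

# Netpbm magic numbers mapped to the conventional file extension.
_MAGIC_TO_EXT = {
    b"P1": ".pbm", b"P4": ".pbm",  # bitmap (plain / raw)
    b"P2": ".pgm", b"P5": ".pgm",  # graymap (plain / raw)
    b"P3": ".ppm", b"P6": ".ppm",  # pixmap (plain / raw)
}


def fix_pnm_extension(pnm_path):
    """Rename a generic .pnm file to .pbm/.pgm/.ppm based on its header.

    Hypothetical helper: returns the new path, or raises ValueError if the
    file does not start with a known Netpbm magic number.
    """
    with open(pnm_path, "rb") as f:
        magic = f.read(2)
    if magic not in _MAGIC_TO_EXT:
        raise ValueError("not a Netpbm file: %r" % pnm_path)
    new_path = os.path.splitext(pnm_path)[0] + _MAGIC_TO_EXT[magic]
    os.replace(pnm_path, new_path)
    return new_path
```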
Jim Barlow
8cfbdaf0d0 Fix a silly typo, and other minor cleanup 2014-01-19 19:06:19 -08:00
Jim Barlow
6703434976 Replace ImageMagick-convert with Leptonica 2014-01-19 14:47:51 -08:00
Jim Barlow
62edc15cd7 Implement ctypes wrapper around Leptonica to access its deskew function
A few design notes:
Leptonica's deskew is far superior to ImageMagick's convert -deskew command --
around 30-40x faster.  Subjectively the output appears to this contributor to
be of higher quality as well.  The difference is the algorithm: ImageMagick
uses the complex Hough transform to find the skew angle, while Leptonica uses
the simpler method, Postl's variance of differential line sums -- conceptually,
shear the image and check for straight horizontal lines.  In this case
simplicity wins.  Finding the skew angle is the bulk of the work.

Leptonica's author explains the advantages of his approach here:
http://www.leptonica.com/skew-measurement.html

Leptonica is the low-level library that Tesseract depends on.  Hence, this
project already depends on Leptonica.  Leptonica can read and write most
common image file types on its own.

Unfortunately its error handling is poor: it seldom returns any meaningful
error codes.  The best it manages is writing messages to stderr, which in
the context of a verbose script is just confusing since the error's source
is not indicated.  The problem is compounded by Tesseract's use of Leptonica,
which will produce exactly the same errors in some cases.  So we trap stderr
between calls to Leptonica and parse it for a few different types of error
message.

leptonica.py is Python 2/3 compatible and set up to provide access to other
Leptonica functions as needed.  Of particular interest are its orientation
detection (including flip and rotation errors) which it does by comparing
text ascenders to descenders.

There is a PyPI "pylepthonica" package, however it is out of date by a few
years, and it implements all of Leptonica with Python wrappers -- so it is
massive, with one .py file at 2.5 MB.  This module is loosely inspired by
pylepthonica but is more modern and up to date, and contains only limited
functionality.
2014-01-19 14:28:52 -08:00
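The stderr trapping described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in leptonica.py: `captured_stderr_fd` is a hypothetical name, and the sketch only shows the general technique of redirecting file descriptor 2 so that messages written by a C library loaded through ctypes can be read back and parsed afterwards.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def captured_stderr_fd():
    """Redirect file descriptor 2 to a temporary file while the block runs.

    Yields a list; when the block exits, the captured text is appended to
    it. Redirecting the descriptor itself (rather than replacing
    sys.stderr) is what lets output from C code be caught.
    """
    result = []
    sys.stderr.flush()
    saved_fd = os.dup(2)  # keep a handle on the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)  # fd 2 now points at the temp file
        try:
            yield result
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            result.append(tmp.read().decode("utf-8", errors="replace"))
```

A caller would wrap each library call in `with captured_stderr_fd() as captured:` and inspect `captured[0]` afterwards. Which message patterns to look for depends on the library version and is left out here.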
fritz-hh
be830ddc31 List supported languages
If a requested language is not supported, list the supported languages in the
error message
2014-01-18 22:22:19 +01:00
fritz-hh
18322b424f fixes #60
Check whether the languages passed to tesseract (-l) are supported
2014-01-18 21:38:22 +01:00
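The language check in this commit and the improved error message in the one after it can be sketched as follows. This is an illustration under assumptions, not the script's implementation: both function names are hypothetical, and the set of supported languages is passed in as a parameter instead of being queried from the installed tesseract.

```python
def unsupported_languages(requested, supported):
    """Return the requested language codes that are not in `supported`.

    `requested` uses Tesseract's -l syntax, where several languages are
    joined with '+', for example 'eng+deu'.
    """
    wanted = [code for code in requested.split("+") if code]
    return [code for code in wanted if code not in supported]


def language_error(requested, supported):
    """Build an error message that lists the supported languages.

    Returns None when every requested language is supported.
    """
    missing = unsupported_languages(requested, supported)
    if not missing:
        return None
    return "Language(s) not supported: %s. Supported languages: %s" % (
        ", ".join(missing),
        ", ".join(sorted(supported)),
    )
```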
fritz-hh
6901c60db4 more robust way to check tesseract version
A better way of checking whether the tesseract version is compatible with the
script.
Previously, if the required version was 3.02.02 and the actual version was
3.03, the script reported the version as too old, because it compared
303 < 30202. It now compares 3.03 > 3.0202.
2014-01-18 21:02:15 +01:00
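The pitfall described in this commit comes from stripping the dots and comparing the remaining digits as one integer. The commit fixes it by comparing decimal numbers. The sketch below uses a different technique, component-wise tuple comparison, to avoid the same problem. It is an illustration, not the script's code, and the function names are hypothetical.

```python
def parse_version(text):
    """Split a dotted version string into a tuple of ints.

    For example, '3.02.02' becomes (3, 2, 2).
    """
    return tuple(int(part) for part in text.strip().split("."))


def is_at_least(actual, required):
    """Return True if version `actual` is not older than `required`."""
    a, r = parse_version(actual), parse_version(required)
    # Pad the shorter tuple with zeros so '3.03' compares as (3, 3, 0).
    width = max(len(a), len(r))
    a += (0,) * (width - len(a))
    r += (0,) * (width - len(r))
    return a >= r
```

With this approach `is_at_least("3.03", "3.02.02")` is true, whereas the naive integer comparison of 303 against 30202 would report the newer version as too old.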
fritz-hh
e369ce6766 config file: version updated to v2.0-rc2 v2.0-rc2 2014-01-16 21:22:24 +01:00
fritz-hh
64e4e5d91e release notes updated for v2.0-rc2 2014-01-16 21:19:15 +01:00
fritz-hh
efce7de9ae wording corrected 2014-01-15 23:08:26 +01:00
fritz-hh
38c64ac689 dependency to pdftk removed
concatenation is now also done with ghostscript
2014-01-15 21:23:42 +01:00
fritz-hh
6d203e3eee portability improvements + minor changes 2014-01-15 21:23:41 +01:00
fritz-hh
81f461e557 disclaimer added 2014-01-14 23:46:33 +01:00
fritz-hh
988bde1387 tmpfiles to $TMPDIR + better portability (mktemp)
mktemp: handle both FreeBSD/OS X and Linux, whose mktemp syntaxes are
incompatible
From now on, temporary files are saved in the folder specified by the
environment variable $TMPDIR
2014-01-14 22:57:10 +01:00
fritz-hh
aedbabdbe8 merged pull request from oxplot 2014-01-14 22:29:41 +01:00
fritz-hh
6ed53e53c7 Readme improved 2014-01-14 19:47:28 +01:00
Mansour Behabadi
a78630ce99 Make src scripts executable
Signed-off-by: Mansour Behabadi <mansour@oxplot.com>
2014-01-14 17:50:46 +11:00
Mansour Behabadi
6653066784 Use --gnu in parallel and XX for mktemp
Signed-off-by: Mansour Behabadi <mansour@oxplot.com>
2014-01-14 17:49:24 +11:00
fritz-hh
e40f1fa081 better handling of ligatures: fixes #58 2014-01-13 23:13:15 +01:00
fritz-hh
a872ce751d config file restructured
to make clear which parameters the user is allowed to change
2014-01-13 22:11:28 +01:00
fritz-hh
317846fbdc Check if tmp folder creation was successful 2014-01-13 22:05:26 +01:00
fritz-hh
f581a55544 Merge pull request #57 from jbarlow83/for-upstream/tmpfolder
Fix temporary folder name generation collisions
2014-01-13 12:31:02 -08:00
fritz-hh
447b291e70 minor changes 2014-01-13 18:03:44 +01:00
fritz-hh
01d07253e8 indicate python2 to be used in header 2014-01-13 18:03:43 +01:00
fritz-hh
034a466094 Merge pull request #56 from jbarlow83/for-upstream/hocr-selfwidth
Fix AttributeError on self.width if Tesseract finds no OCR text
2014-01-13 08:44:16 -08:00
fritz-hh
c6211e2335 Merge pull request #55 from jbarlow83/for-upstream/check-poppler
Verify that pdftoppm is the Poppler version, not xpdf version
2014-01-13 08:42:33 -08:00
Jim Barlow
1d03a6417d Verify that pdftoppm is the Poppler version, not xpdf version 2014-01-12 22:12:09 -08:00
Jim Barlow
1d62ef27a2 Fix AttributeError on self.width if Tesseract finds no OCR text
self.width remains undefined unless hOCR finds text.  It might not, if
a page contains only an image for example.

Full error message is:
AttributeError: 'hocrTransform' object has no attribute 'width'
2014-01-12 22:10:15 -08:00
Jim Barlow
996048dc08 Fix temporary folder name generation collisions
First, the regular expression matches everything after the first period
in a filename.  Adding the $ makes it match the last, so that filenames
such as “Report.1.pdf” get trimmed to “Report.1”.

Next use mktemp to get the OS to create a temporary folder.  It will
guarantee a unique directory name beginning with prefix, even if parallel
processes are at work.
2014-01-12 22:05:11 -08:00
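Both parts of this fix can be sketched in Python as follows. The original change was made in a shell script with mktemp, so this is an analogous illustration only: the function names are hypothetical, and `tempfile.mkdtemp` stands in for the shell command.

```python
import os
import re
import tempfile


def strip_last_extension(filename):
    """Remove only the final '.ext' from a filename."""
    # The trailing $ anchors the match to the last period and what follows.
    return re.sub(r"\.[^.]*$", "", filename)


def make_work_dir(input_path, parent=None):
    """Create a uniquely named temporary directory for one input file.

    The directory name starts with the input's base name. `parent`
    defaults to the system temporary directory.
    """
    prefix = strip_last_extension(os.path.basename(input_path)) + "."
    # mkdtemp creates the directory itself and picks a fresh name, so
    # parallel processes working on the same input do not collide.
    return tempfile.mkdtemp(prefix=prefix, dir=parent)
```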
fritz-hh
bf02ee3bdc Resolved conflicts with jbarlow83 pull request 2014-01-12 15:37:14 +01:00
fritz-hh
a3c7fba02d minor changes (comments) 2014-01-11 22:26:29 +01:00
fritz-hh
a8cd7febf6 remove spurious space in img number
Tell the script that "nbImg" is a number, so that leading/trailing
spaces are removed
2014-01-11 22:15:53 +01:00
fritz-hh
20c008b84f avoid spurious error msg if no image in pdf 2014-01-11 22:05:19 +01:00
fritz-hh
7cd73566be check if python libs are installed
Check if reportlab and lxml are installed, otherwise exit with an error
2014-01-11 17:08:26 +01:00
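A dependency check of this kind can be sketched as follows. This is an illustration, not the project's code: `missing_modules` is a hypothetical helper, and it uses `importlib.util.find_spec`, which is only available in Python 3, whereas the script at this point in the history targeted Python 2.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found.

    Looks each module up without importing it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

A script would call `missing_modules(["reportlab", "lxml"])` at startup and exit with an error message naming any entries in the returned list.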
fritz-hh
e56fd53d06 poppler syntax (rather than xpdf syntax) 2014-01-11 16:19:52 +01:00
fritz-hh
810b1b3b3e Merge pull request #48 from jbarlow83/for-upstream/osx-errors
Fix pdffonts error when filename contains a space
2014-01-11 07:10:12 -08:00
fritz-hh
cb0b033fe7 Merge branch 'v2.x' of https://github.com/fritz-hh/OCRmyPDF into v2.x 2014-01-11 15:52:01 +01:00
fritz-hh
46f673a3b7 exit if bad parallel/tesseract version installed 2014-01-10 22:59:33 +01:00
fritz-hh
455303b3d4 parallel version added in RELEASE_NOTES 2014-01-10 22:12:58 +01:00
Jim Barlow
24a84d6380 Fix pdffonts error when filename contains a space 2014-01-09 16:44:24 -08:00
Jim Barlow
9aa2171052 Monkeypatch reportlab to output grayscale and monochrome colorspaces 2014-01-09 16:36:26 -08:00