Python does not map the expression to its return code automatically, so
this line returns success regardless of the reportlab version installed.
(I also realized that hasattr is superfluous).
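
A minimal sketch of the pitfall, not the script's actual check: a bare comparison passed to `python -c` is evaluated and discarded, so the process still exits with 0. The helper name `exit_code_for` and the `1 >= 2` comparison are illustrative assumptions standing in for the real version test.

```python
import subprocess
import sys


def exit_code_for(condition):
    """Map a boolean to a shell-style exit status (0 means success)."""
    return 0 if condition else 1


# A bare expression is evaluated and thrown away, so the exit status is 0
# even though the comparison is false.
ignored = subprocess.call([sys.executable, "-c", "1 >= 2"])

# Passing the mapped value to sys.exit() makes the status reflect the result.
explicit = subprocess.call(
    [sys.executable, "-c", "import sys; sys.exit(0 if 1 >= 2 else 1)"]
)
```

Here `ignored` reports success while `explicit` reports failure, which is the behaviour a shell `if` test needs.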
Remove the patch that was required for reportlab versions <3.0 (fixed in
3.0 now).
The patch was necessary to reduce the size of grayscale / b&w images
in the PDF.
Put TESS_CFG_FILES last because it is optional and can be blank. If it is
blank in any earlier position, it breaks the sequence of the parameters
that follow. Also clean up text output in this new mode.
If a page contains font data, the script would abort, unless -f was given,
in which case it would use pdftoppm to rasterize the font into a bitmap
and then attempt to OCR it. -f is almost certainly not what users want
unless they are debugging OCR.
If a PDF already has fonts it either was OCR'd already, or it is
a composite file containing, for example, some scanned documents appended
to a text report. In the latter case, this -s option provides OCR on
pages that don't have it without changing those that do, and if a PDF
was completely OCR'd it will be converted to PDF/A. In batch jobs with
a mix of OCR'd and non-OCR'd files the implicit conversion to PDF/A is
also useful.
When I upgraded to poppler 0.24.5, pdftoppm was not compiled because the
script had --disable-splash-output set for some reason.
For OS X Homebrew the solution is:
brew uninstall poppler
brew install poppler --with-splash-output
Allow tesseract 3.02.01 to be used.
Even 3.02.01 fails in a few cases (see issue #28). I decided to allow this
version anyway because 3.02.02 is not yet available for some widespread
Linux distributions.
Leptonica does not interpret the .pbm/.pgm/.ppm extensions correctly.
However, when asked to produce a .pnm file, it will produce the expected
.pbm/.pgm/.ppm file depending on the input. So ask it to produce a .pnm
and then adjust the extension.
And add a test case.
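
One way the extension adjustment could be done, sketched under assumptions: the commit does not say how the script picks the final extension, so this hypothetical helper (`fix_pnm_extension`) inspects the Netpbm magic number at the start of the file and renames accordingly.

```python
import os

# Netpbm magic numbers: P1/P4 = bitmap, P2/P5 = graymap, P3/P6 = pixmap.
PNM_EXTENSIONS = {
    b"P1": ".pbm", b"P4": ".pbm",
    b"P2": ".pgm", b"P5": ".pgm",
    b"P3": ".ppm", b"P6": ".ppm",
}


def fix_pnm_extension(path):
    """Rename a generic .pnm file to .pbm/.pgm/.ppm based on its header.

    Hypothetical helper: returns the new path, or the original path
    unchanged if the magic number is not recognized.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    ext = PNM_EXTENSIONS.get(magic)
    if ext is None:
        return path
    new_path = os.path.splitext(path)[0] + ext
    os.rename(path, new_path)
    return new_path
```

Reading the header avoids having to know in advance whether the input was bilevel, grayscale, or color.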
A few design notes:
Leptonica's deskew is far superior to ImageMagick's convert -deskew command --
around 30-40x faster. Subjectively, the output appears to this contributor to
be of higher quality as well. The difference is the algorithm: ImageMagick
uses the complex Hough transform to find the skew angle, while Leptonica uses
a simpler method, Postl's variance of differential line sums -- conceptually,
shear the image and check for straight horizontal lines. In this case
simplicity wins. Finding the skew angle is the bulk of the work.
Leptonica's author explains the advantages of his approach here:
http://www.leptonica.com/skew-measurement.html
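
The "shear and check for straight horizontal lines" idea can be illustrated with a toy sketch. This is not Leptonica's implementation, which is considerably more refined; the function names, the list-of-rows image representation, and the brute-force angle sweep are all assumptions made for illustration.

```python
import math
from collections import Counter


def shear_score(image, slope):
    """Score how well ink lines up in rows after a vertical shear by `slope`.

    `image` is a list of rows of 0/1 values (1 = ink). The score is the sum
    of squared differences between adjacent row sums; it peaks when text
    lines become horizontal.
    """
    sums = Counter()
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                sums[y - int(round(slope * x))] += 1
    if not sums:
        return 0
    lo, hi = min(sums), max(sums)
    # Include the empty rows just outside the ink so edges are counted.
    return sum((sums[r + 1] - sums[r]) ** 2 for r in range(lo - 1, hi + 1))


def find_skew(image, max_degrees=5.0, step=0.5):
    """Return the candidate angle (degrees) whose shear gives the best score."""
    steps = int(max_degrees / step)
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(
        candidates,
        key=lambda a: shear_score(image, math.tan(math.radians(a))),
    )
```

Each candidate costs one pass over the ink pixels and a handful of additions, which gives a feel for why the approach can be cheap compared with a Hough transform.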
Leptonica is the low-level library that Tesseract depends on. Hence, this
project already depends on Leptonica. Leptonica can read and write most
common image file types on its own.
Unfortunately its error handling is poor: it seldom returns any meaningful
error codes. The best it manages is writing messages to stderr, which in
the context of a verbose script is just confusing since the error's source
is not indicated. The problem is compounded by Tesseract's use of Leptonica,
which will produce exactly the same errors in some cases. So we trap stderr
between calls to Leptonica and parse it for a few different types of error
message.
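
A minimal sketch of trapping stderr around a call into a C library, assuming a POSIX-style file descriptor 2. The context manager name `captured_stderr` and the simulated message are illustrative, not taken from leptonica.py. Redirecting at the descriptor level matters because a C library writes to fd 2 directly and bypasses Python's `sys.stderr` object.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def captured_stderr():
    """Redirect file descriptor 2 to a temporary file inside the block.

    Yields a list; the captured text is appended to it when the block exits.
    """
    result = []
    sys.stderr.flush()
    saved_fd = os.dup(2)  # remember the real stderr
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), 2)  # fd 2 now points at the temp file
        try:
            yield result
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            result.append(tmp.read().decode("utf-8", "replace"))


with captured_stderr() as messages:
    # Stand-in for a C library call that writes straight to fd 2.
    os.write(2, b"simulated library error\n")

library_failed = "error" in messages[0].lower()
```

Wrapping each library call separately keeps messages from one call from being attributed to another.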
leptonica.py is Python 2/3 compatible and set up to provide access to other
Leptonica functions as needed. Of particular interest is its orientation
detection (including flip and rotation errors), which it does by comparing
text ascenders to descenders.
There is a PyPI "pyleptonica" package, however it is out of date by a few
years, and it implements all of Leptonica with Python wrappers -- so it is
massive, with one .py file at 2.5 MB. This module is loosely inspired by
pyleptonica but is more modern and up to date, and contains only limited
functionality.
Better way of checking whether the tesseract version is compatible with the
script.
If the required tesseract version was 3.02.02 and the actual version was 3.03,
the script previously reported that the version was too old, because it
compared 303 < 30202. Now it compares 3.03 > 3.0202.
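
For comparison, a sketch of an alternative that avoids the digit-concatenation trap altogether by comparing version fields as integer tuples. This is not the method the commit uses (it compares decimal-style numbers); the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '3.02.02' into (3, 2, 2) so each field compares numerically."""
    return tuple(int(part) for part in text.split("."))


def version_at_least(actual, required):
    """True if `actual` is the same as or newer than `required`."""
    return parse_version(actual) >= parse_version(required)


# The old check stripped the dots and compared integers, which goes wrong
# when the two versions have a different number of fields.
naive_says_too_old = int("3.03".replace(".", "")) < int("3.02.02".replace(".", ""))
tuple_says_ok = version_at_least("3.03", "3.02.02")
```

Tuples compare field by field, so 3.03 is correctly treated as newer than 3.02.02 while 3.02.01 is still treated as older.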