2676 Commits

Author SHA1 Message Date
fritz-hh
2612105d32 correct download path 2014-09-29 22:29:25 +02:00
fritz-hh
954fe13f54 update release notes for v2.2-stable v2.2-stable 2014-09-29 22:25:02 +02:00
fritz-hh
bb5a00685e Make clear this is a draft 2014-09-28 21:10:04 +02:00
Jim Barlow
dabbddb04e deskew and clean 2014-09-27 15:03:07 -07:00
fritz-hh
5f173e5acb return right return code
Python does not map the expression to its return code automatically, so
this line returns success regardless of the reportlab version installed.
(I also realized that hasattr is superfluous).
2014-09-27 00:53:10 +02:00
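A minimal sketch of the problem this commit describes, not the project's actual code: a boolean expression has to be converted into an exit status explicitly. The helper names `version_ok` and `exit_code` and the default minimum of 3.0 are assumptions made for illustration.

```python
def version_ok(installed: str, minimum: str = "3.0") -> bool:
    """Return True when the installed version is at least the minimum."""
    def as_tuple(version):
        # Keep only purely numeric components, e.g. "3.1.8" -> (3, 1, 8).
        return tuple(int(p) for p in version.split(".") if p.isdigit())
    return as_tuple(installed) >= as_tuple(minimum)


def exit_code(installed: str) -> int:
    # A bare expression such as `version_ok(installed)` at the end of a
    # script is evaluated and then discarded, so the interpreter would
    # still exit with status 0. The result must be mapped explicitly.
    return 0 if version_ok(installed) else 1
```

A script would then end with `sys.exit(exit_code(...))` so that the calling shell sees a non-zero status when the check fails.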
fritz-hh
b28ff40aea remove reportlab patch. fixes #91
remove patch that was required for versions of reportlab <3.0 (fixed in
3.0 now)
patch was necessary in order to reduce size of grayscale / b&w images
in pdf
2014-09-26 23:58:19 +02:00
Jim Barlow
fccfb4589e Moving quickly - we can now output .ppm files at correct resolution 2014-09-26 04:43:15 -07:00
Jim Barlow
5384c98013 Initial ocrpage.py rewrite into python3 2014-09-26 04:19:41 -07:00
fritz-hh
2ed2307573 Merge pull request #89 from jbarlow83/feature/readlink-osx
More portable solution (works also on OS X) to get OCRmyPDF.sh path (following symlinks)
2014-09-25 23:09:26 +02:00
Jim Barlow
3f8a2d8d3e Eliminate readlink entirely and do the same thing on all platforms 2014-09-25 13:47:35 -07:00
fritz-hh
1a13b7c85f Check if the input file exists
Previously I only checked whether the folder that should contain the
input file exists
2014-09-25 22:03:45 +02:00
Jim Barlow
d7130a1e56 Merge branch 'feature/keep-text-pages' into develop 2014-09-25 03:50:21 -07:00
Jim Barlow
f69054cb17 Fix parameter order problems
Put TESS_CFG_FILES last because it is optional and can be blank. If
omitted it breaks the sequence of subsequent parameters. Also clean up
text output in this new mode.
2014-09-25 03:50:01 -07:00
Jim Barlow
80dc6eca2c Merge branches 'feature/readlink-osx' and 'feature/keep-text-pages' into develop
Conflicts:
	OCRmyPDF.sh
2014-09-25 03:14:10 -07:00
Jim Barlow
d250fbb3d6 Fix call to readlink on OS X
readlink -f is a GNU coreutils extension, so not available on OS X and
other platforms.
2014-09-25 03:11:27 -07:00
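The commit above concerns a shell script, and the source does not show the replacement it used. As an illustration only, the same idea (resolve symlinks without depending on GNU `readlink -f`) can be sketched in Python with `os.path.realpath`. The function name `script_dir` is hypothetical.

```python
import os


def script_dir(script_path: str) -> str:
    """Return the directory containing script_path after resolving symlinks."""
    # os.path.realpath follows symbolic links on Linux and OS X alike,
    # so no external readlink binary is involved.
    return os.path.dirname(os.path.realpath(script_path))
```

Calling `script_dir` on a symlink that points at the real script returns the directory of the real script, which is what a launcher needs in order to find its helper files.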
Jim Barlow
09bbe92611 Add command line option to skip pages that contain font data
If a page contains font data, the script would abort, unless -f was given,
in which case it would use pdftoppm to rasterize the font into a bitmap
and then attempt to OCR it. -f is almost certainly not what users want
unless they want to debug OCR or something.

If a PDF already has fonts it either was OCR'd already, or it is
a composite file containing, for example, some scanned documents appended
to a text report.  In the latter case, this -s option provides OCR on
pages that don't have it without changing those that do, and if a PDF
was completely OCRed it will be converted to PDF/A.  In batch jobs with
a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.
2014-09-25 02:43:40 -07:00
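The per-page decision described in this commit message can be sketched as a small function. This is a hedged illustration, not the script's implementation: the function and parameter names are hypothetical, and the precedence of `-s` over `-f` when both are given is an assumption the message does not settle.

```python
def page_action(has_fonts: bool, force_ocr: bool = False,
                skip_text: bool = False) -> str:
    """Decide what to do with one page: 'ocr', 'skip' or 'abort'."""
    if not has_fonts:
        return "ocr"    # scanned page without text: OCR as usual
    if skip_text:
        return "skip"   # -s: leave the existing text page unchanged
    if force_ocr:
        return "ocr"    # -f: rasterize the page and OCR it anyway
    return "abort"      # default: refuse to process a page with fonts
```

For a composite file, pages with fonts map to `'skip'` and scanned pages map to `'ocr'`, so only the pages that lack text are changed.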
Jim Barlow
69d922e096 Check for missing pdftoppm when poppler installed with --disable-splash-output
When I upgraded to poppler 0.24.5, pdftoppm was not compiled because the
script had --disable-splash-output set for some reason.

For OS X Homebrew the solution is:
brew uninstall poppler
brew install poppler --with-splash-output
2014-09-25 02:30:29 -07:00
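A check of this kind can be sketched with `shutil.which` from the Python standard library. The helper name `missing_tools` and its injectable `which` parameter are assumptions for illustration; the actual script is a shell script and its check is not shown in the source.

```python
import shutil


def missing_tools(required=("pdftoppm",), which=shutil.which):
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if which(name) is None]
```

If `missing_tools()` returns a non-empty list, the caller can stop early with a message that names the missing programs instead of failing later in the pipeline.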
fritz-hh
d510e7e4ae prevent new spurious jhove message from being displayed 2014-09-24 23:43:37 +02:00
fritz-hh
5893290dd9 update to jhove v1.11 2014-09-24 23:17:39 +02:00
fritz-hh
5c3bbc4031 typo in OCRmyPDF.sh 2014-09-22 21:22:38 +02:00
fritz-hh
27cd8cf0db add link to heise open source 2014-09-20 20:47:02 +02:00
fritz-hh
b403016d5b Release notes updated for v2.1-stable v2.1-stable 2014-09-20 19:50:32 +02:00
fritz-hh
5a81823969 Merge pull request #82 from orbitcowboy/v2.x
Fixed typo
2014-09-20 19:02:33 +02:00
fritz-hh
17801401cd Merge pull request #83 from DorianScholz/v2.x
- small changes to make this work on Ubuntu 12.04 called via symlink
- lowered minimum parallel version
2014-09-20 18:59:57 +02:00
Dorian Scholz
5c7b2a2a36 lowered minimum version for parallel to 20121122 2014-09-10 13:27:59 +02:00
Dorian Scholz
1db06de287 added BASEPATH to allow for execution via symlink 2014-09-10 13:26:14 +02:00
Martin Ettl
3904178d44 Fixed typo 2014-09-09 07:01:04 +02:00
fritz-hh
8bb9c3610c Merge pull request #81 from MoritzFago/v2.x
fixed typo ghostcript to ghostscript
2014-09-08 18:31:00 +02:00
MoritzFago
7dcc382ccc fixed typo ghostcript to ghostscript 2014-09-08 16:52:49 +02:00
fritz-hh
b71fc807d2 Merge pull request #77 from andysigner/v2.x
Fixed typo in help text
2014-05-23 19:51:20 +02:00
Andy Signer
15d28d970a Fixed typo in help text 2014-05-23 12:41:31 +02:00
fritz-hh
e083a860e9 Merge pull request #73 from andreas-christ/v2.x
Fixed typo in import of reportlab.
2014-04-27 23:20:48 +02:00
Andreas Christ
6463b9dd84 Fixed typo in import of reportlab. 2014-04-27 19:15:46 +02:00
fritz-hh
c873de6ca4 Consider that the hocr file does not always have the same name
Closes #72
2014-04-27 16:01:11 +02:00
fritz-hh
b70863b47e support both older and newer versions of reportlab
closes #71
2014-04-27 15:53:20 +02:00
fritz-hh
3546f84c6d ignore *.pyc files 2014-04-27 15:46:55 +02:00
Jim Barlow
1d98917db9 Add command line option to skip pages that contain font data
If a page contains font data, the script would abort, unless -f was given,
in which case it would use pdftoppm to rasterize the font into a bitmap
and then attempt to OCR it. -f is almost certainly not what users want
unless they want to debug OCR or something.

If a PDF already has fonts it either was OCR'd already, or it is
a composite file containing, for example, some scanned documents appended
to a text report.  In the latter case, this -s option provides OCR on
pages that don't have it without changing those that do, and if a PDF
was completely OCRed it will be converted to PDF/A.  In batch jobs with
a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.
2014-02-06 23:11:54 -08:00
fritz-hh
1c34fd69cf RELEASE_NOTES update prior to delivery of v2.0-stable v2.0-stable 2014-01-25 22:14:05 +01:00
fritz-hh
4cf38404cc fixes #51
Allow tesseract 3.02.01 to be used.
Even 3.02.01 fails in a few cases (see issue #28). I decided to allow this
version anyway because 3.02.02 is not yet available for some widespread
Linux distributions
2014-01-25 21:58:50 +01:00
Jim Barlow
112fb5098b Expose pixFindSkew API 2014-01-21 21:36:41 -08:00
Jim Barlow
5ace6906c7 Bug fix: leptonica generates .png when asked to produce .pbm/pgm/ppm
Leptonica does not interpret those extensions correctly.  However, when
asked to produce a .pnm file, it will produce the expected .pbm/pgm/ppm
file depending on the input.  So ask it to produce a .pnm and then
adjust the extension.

And add a test case.
2014-01-21 21:35:58 -08:00
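The "adjust the extension" step can be sketched as follows. The commit does not say how the correct extension is chosen, so reading the PNM magic number (P1/P4 for bitmaps, P2/P5 for graymaps, P3/P6 for pixmaps) is an assumption of this sketch, and `fix_pnm_extension` is a hypothetical name.

```python
import os

# Magic numbers defined by the Netpbm formats (ASCII and binary variants).
_PNM_EXTENSIONS = {
    b"P1": ".pbm", b"P4": ".pbm",
    b"P2": ".pgm", b"P5": ".pgm",
    b"P3": ".ppm", b"P6": ".ppm",
}


def fix_pnm_extension(path: str) -> str:
    """Rename a .pnm file to .pbm/.pgm/.ppm according to its magic number."""
    with open(path, "rb") as f:
        magic = f.read(2)
    ext = _PNM_EXTENSIONS.get(magic)
    if ext is None:
        raise ValueError("not a PNM file: %r" % path)
    new_path = os.path.splitext(path)[0] + ext
    os.replace(path, new_path)  # overwrites an existing file of that name
    return new_path
```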
Jim Barlow
8cfbdaf0d0 Fix a silly typo, and other minor cleanup 2014-01-19 19:06:19 -08:00
Jim Barlow
6703434976 Replace ImageMagick-convert with Leptonica 2014-01-19 14:47:51 -08:00
Jim Barlow
62edc15cd7 Implement ctypes wrapper around Leptonica to access its deskew function
A few design notes:
Leptonica's deskew is far superior to ImageMagick's convert -deskew command --
around 30-40x faster.  Subjectively the output appears to this contributor to
be of higher quality as well.  The difference is the algorithm: ImageMagick
uses the complex Hough transform to find the skew angle, while Leptonica uses
the simpler method, Postl's variance of differential line sums -- conceptually,
shear the image and check for straight horizontal lines.  In this case
simplicity wins.  Finding the skew angle is the bulk of the work.

Leptonica's author explains the advantages of his approach here:
http://www.leptonica.com/skew-measurement.html
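To make the idea of "variance of differential line sums" concrete, here is a toy sketch in pure Python. It is not Leptonica's implementation and does not call Leptonica: it scores each candidate angle by shearing the foreground pixel coordinates, summing pixels per row, and adding up the squared differences between adjacent row sums. The brute-force sweep, the function names, and the sign convention (a positive angle means text lines drift downward as x increases) are assumptions of this sketch.

```python
import math
from collections import Counter


def skew_score(pixels, angle_deg):
    """Sum of squared differences of adjacent row sums after shearing.

    `pixels` is a non-empty list of (x, y) foreground coordinates.
    """
    t = math.tan(math.radians(angle_deg))
    rows = Counter(round(y - x * t) for x, y in pixels)
    lo, hi = min(rows), max(rows)
    # Include one empty row on each side so the outer edges count too.
    sums = [rows.get(r, 0) for r in range(lo - 1, hi + 2)]
    return sum((b - a) ** 2 for a, b in zip(sums, sums[1:]))


def find_skew(pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle in degrees with the highest score."""
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: skew_score(pixels, a))
```

When the shear cancels the skew, every text line collapses into a single row, so row sums jump sharply between text and background and the score peaks. At any other angle the same pixels are smeared over several rows and the differences shrink.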

Leptonica is the low-level library that Tesseract depends on.  Hence, this
project already depends on Leptonica.  Leptonica can read and write most
common image file types on its own.

Unfortunately its error handling is poor: it seldom returns any meaningful
error codes.  The best it manages is writing messages to stderr, which in
the context of a verbose script is just confusing since the error's source
is not indicated.  The problem is compounded by Tesseract's use of Leptonica,
which will produce exactly the same errors in some cases.  So we trap stderr
between calls to Leptonica and parse it for a few different types of error
message.
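Trapping stderr around a call into a C library has to happen at the file-descriptor level, because the library writes to descriptor 2 directly and bypasses Python's `sys.stderr` object. The following is a minimal sketch of that technique under stated assumptions: `captured_stderr` and `find_known_errors` are hypothetical names, and the message prefixes are example strings, not a list taken from the project.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def captured_stderr():
    """Redirect file descriptor 2 to a temporary file.

    Yields a list; the captured text is appended to it on exit.
    """
    captured = []
    sys.stderr.flush()
    saved_fd = os.dup(2)  # remember the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)
        try:
            yield captured
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            captured.append(tmp.read().decode("utf-8", "replace"))


def find_known_errors(text, known=("Error in pixRead", "Error in pixWrite")):
    """Return the known message prefixes (example values) found in text."""
    return [prefix for prefix in known if prefix in text]
```

Wrapping each library call in `with captured_stderr() as out:` lets the caller attribute a message to that specific call and turn it into a proper exception.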

leptonica.py is Python 2/3 compatible and set up to provide access to other
Leptonica functions as needed.  Of particular interest is its orientation
detection (including flip and rotation errors), which it does by comparing
text ascenders to descenders.

There is a PyPI "pylepthonica" package, however it is out of date by a few
years, and it implements all of Leptonica with Python wrappers -- so it is
massive, with one .py file at 2.5 MB.  This module is loosely inspired by
pylepthonica but is more modern and up to date, and contains only limited
functionality.
2014-01-19 14:28:52 -08:00
fritz-hh
be830ddc31 List supported languages
If a requested language is not supported, list the supported languages in
the error message
2014-01-18 22:22:19 +01:00
fritz-hh
18322b424f fixes #60
Check whether the languages passed to tesseract (-l) are supported
2014-01-18 21:38:22 +01:00
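The language check from this commit, together with the follow-up that lists the supported languages in the error message, can be sketched as two small functions. The names are hypothetical, and splitting the `-l` value on `+` (tesseract's separator for multiple languages) is an assumption about what the script accepts. The list of installed languages is passed in rather than queried, to keep the sketch self-contained.

```python
def unsupported_languages(requested: str, available) -> list:
    """Return the requested language codes that are not installed."""
    installed = set(available)
    return [lang for lang in requested.split("+") if lang not in installed]


def language_error(requested: str, available) -> str:
    """Build an error message, or return '' if every language is supported."""
    missing = unsupported_languages(requested, available)
    if not missing:
        return ""
    return "Unsupported language(s): %s. Supported: %s" % (
        ", ".join(missing), ", ".join(sorted(available)))
```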
fritz-hh
6901c60db4 more robust way to check tesseract version
Better way of checking whether the tesseract version is compatible with
the script.
If the required tesseract version is 3.02.02 and the actual version is 3.03,
the script previously reported the version as too old, because
303<30202. Now it compares 3.03>3.0202
2014-01-18 21:02:15 +01:00
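The two comparison strategies named in this commit message can be reproduced in a few lines. This is a sketch of the arithmetic only, with hypothetical function names; the script's own code is not shown in the source.

```python
def naive_key(version: str) -> int:
    """Old approach: drop the dots and read the digits as one integer."""
    return int(version.replace(".", ""))


def decimal_key(version: str) -> float:
    """New approach: keep the first dot and join the remaining digits."""
    head, *rest = version.split(".")
    return float(head + "." + "".join(rest))


required, actual = "3.02.02", "3.03"

# 303 < 30202, so the old check wrongly rejects 3.03 as too old.
old_check_passes = naive_key(actual) >= naive_key(required)

# 3.03 > 3.0202, so the new check accepts it.
new_check_passes = decimal_key(actual) >= decimal_key(required)
```

Comparing tuples of integers, for example `(3, 3) >= (3, 2, 2)`, is a common alternative that also handles components with differing digit counts.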
fritz-hh
e369ce6766 config file: version updated to v2.0-rc2 v2.0-rc2 2014-01-16 21:22:24 +01:00
fritz-hh
64e4e5d91e release notes updated for v2.0-rc2 2014-01-16 21:19:15 +01:00
fritz-hh
efce7de9ae wording corrected 2014-01-15 23:08:26 +01:00