If a page contains font data, the script would abort unless -f was given,
in which case it would use pdftoppm to rasterize the page into a bitmap
and then attempt to OCR it. -f is almost certainly not what users want
unless they are debugging OCR.
If a PDF already has fonts, it either was OCRed already or it is
a composite file containing, for example, some scanned documents appended
to a text report. In the latter case, the -s option performs OCR on
pages that lack it without changing those that have it, and a PDF that
was already completely OCRed is simply converted to PDF/A. In batch jobs
with a mix of OCRed and non-OCRed files, the implicit conversion to PDF/A
is also useful.
Allow tesseract 3.02.01 to be used.
Even 3.02.01 fails in a few cases (see issue #28). I decided to allow this
version anyway because 3.02.02 is not yet available for some widespread
Linux distributions.
Better way of checking whether the tesseract version is compatible with
the script.
If the required tess version is 3.02.02 and the actual version is 3.03,
the script previously reported that the version was too old, because it
compared 303 < 30202. Now it compares 3.03 > 3.0202.
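The comparison described above can be sketched as follows. This is only an
illustration, not the script's actual code: the helper names
version_to_decimal and version_at_least are hypothetical, and the approach
assumes every component after the first is zero-padded to two digits, as in
tesseract's 3.02.02 numbering.

```shell
#!/bin/sh
# Hypothetical helper: keep the first dot and drop the others,
# so "3.02.02" becomes "3.0202" and "3.03" stays "3.03".
version_to_decimal() {
    printf '%s\n' "$1" | sed 's/\./#/; s/\.//g; s/#/./'
}

# Hypothetical helper: succeed (exit status 0) when the actual version ($1)
# is at least the required version ($2), compared as decimal numbers.
version_at_least() {
    awk -v a="$(version_to_decimal "$1")" \
        -v r="$(version_to_decimal "$2")" \
        'BEGIN { exit !(a + 0 >= r + 0) }'
}

if version_at_least "3.03" "3.02.02"; then
    echo "tesseract version is recent enough"
else
    echo "tesseract version is too old"
fi
```

Comparing decimals rather than concatenated digits avoids the 303 < 30202
mistake, but it would misorder versions whose components are not
zero-padded (for example 3.2 against 3.10).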
mktemp: handle both FreeBSD/OSX and Linux, whose mktemp syntaxes are
incompatible.
From now on, temporary files are saved in the folder specified by the
environment variable $TMPDIR.
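One way to stay portable is to pass mktemp an explicit template, a form
accepted by both the GNU and the BSD/OSX implementations. The sketch below
is an assumption about how this could look, not the script's actual code;
the "ocr-sketch" prefix and the /tmp fallback for an unset $TMPDIR are
illustrative choices.

```shell
#!/bin/sh
# Use $TMPDIR when set, otherwise fall back to /tmp (assumed default).
tmp_root="${TMPDIR:-/tmp}"

# An explicit template ending in X characters works with both GNU and BSD
# mktemp. "${tmp_root%/}" drops a trailing slash, which OSX commonly sets.
work_dir=$(mktemp -d "${tmp_root%/}/ocr-sketch.XXXXXX") || exit 1

# Remove the folder when the shell exits.
trap 'rm -rf "$work_dir"' EXIT

echo "temporary folder: $work_dir"
```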
First, the regular expression matched everything after the first period
in a filename. Adding the $ makes it match only after the last one, so
that filenames such as “Report.1.pdf” get trimmed to “Report.1”.
Next, use mktemp to get the OS to create a temporary folder. It will
guarantee a unique directory name beginning with the prefix, even if
parallel processes are at work.
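The two steps can be illustrated with a short sketch. The sed expressions
are assumptions, since the actual regular expression is not quoted in this
message; they only show the difference an end-of-line anchor makes. The
variable names are hypothetical as well.

```shell
#!/bin/sh
input='Report.1.pdf'

# Unanchored: the match starts at the first period, leaving "Report".
first_cut=$(printf '%s\n' "$input" | sed 's/\..*//')

# Anchored with $ and limited to non-period characters: only the last
# suffix is removed, leaving "Report.1".
prefix=$(printf '%s\n' "$input" | sed 's/\.[^.]*$//')

# Let the OS pick a unique folder name that begins with the prefix.
page_dir=$(mktemp -d "${TMPDIR:-/tmp}/${prefix}.XXXXXX") || exit 1
trap 'rm -rf "$page_dir"' EXIT

echo "prefix: $prefix"
echo "folder: $page_dir"
```

Because mktemp creates the directory atomically, two processes working on
the same input file still receive different folders.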
- Oversampling resolution can now be set from the cmd line (-o option)
- If a page contains more than one image, warn the user but process the
page anyway with a default resolution
Tell GNU parallel to protect the command against evaluation by the sub
shell (-q flag).
This is required in case the file name passed as an argument contains
special characters like "#".
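The underlying problem can be reproduced without GNU parallel. The sketch
below uses plain sh -c to stand in for the sub shell; the file name is a
made-up example. When the name is pasted into a command string that gets
re-evaluated, " #" starts a comment and the rest of the name is lost.

```shell
#!/bin/sh
name='scan #1.pdf'

# Unprotected: the name becomes part of a string that a sub shell parses
# again, so everything from " #" onward is treated as a comment.
unprotected=$(sh -c "printf '%s\n' $name")

# Protected: the name is handed over as a positional argument and is never
# re-parsed, which is the effect the -q flag is meant to achieve.
protected=$(sh -c 'printf "%s\n" "$1"' sh "$name")

echo "unprotected: $unprotected"
echo "protected:   $protected"
```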
- In debug mode: compute and echo time required for processing
- Resolutions (x/y) that are nearly equal are not supported (because the
test did not take into account imprecision due to truncation)
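A tolerance-based comparison is one way to treat nearly equal x/y
resolutions as equal. The sketch below is an assumption, not the script's
actual test: the helper name resolutions_match and the default tolerance of
1 dpi are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the x resolution ($1) and the
# y resolution ($2) differ by at most the tolerance ($3, default 1 dpi).
resolutions_match() {
    awk -v x="$1" -v y="$2" -v tol="${3:-1}" \
        'BEGIN { d = x - y; if (d < 0) d = -d; exit !(d <= tol) }'
}

if resolutions_match 299.99 300; then
    echo "x/y resolutions are treated as equal"
else
    echo "x/y resolutions differ"
fi
```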