88 Commits

Author SHA1 Message Date
Jim Barlow
09bbe92611 Add command line option to skip pages that contain font data
If a page contains font data, the script would abort, unless -f was given,
in which case it would use pdftoppm to rasterize the font into a bitmap
and then attempt to OCR it. -f is almost certainly not what users want
unless they want to debug OCR or something.

If a PDF already has fonts it either was OCR'd already, or it is
a composite file containing, for example, some scanned documents appended
to a text report.  In the latter case, this -s option provides OCR on
pages that don't have it without changing those that do, and if a PDF
was completely OCRed it will be converted to PDF/A.  In batch jobs with
a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.
2014-09-25 02:43:40 -07:00
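The per-page decision described in this commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code from OCRmyPDF.sh: the function name `page_action`, its 0/1 arguments, and the action labels are hypothetical, and giving -s precedence over -f when both are set is an assumption the commit message does not confirm.

```shell
# page_action HAS_FONTS FORCE_FLAG SKIP_FLAG   (each argument is 0 or 1)
# Prints the action to take for one page. All names here are hypothetical.
page_action() {
    if [ "$1" -eq 0 ]; then
        # No font data on the page: plain scanned page, OCR it as usual.
        echo "ocr"
    elif [ "$3" -eq 1 ]; then
        # -s given: leave pages that already contain fonts untouched.
        # (Assumption: -s wins if both -s and -f are given.)
        echo "skip"
    elif [ "$2" -eq 1 ]; then
        # -f given: rasterize the page, fonts included, and OCR the bitmap.
        echo "rasterize-and-ocr"
    else
        # Neither flag given: the page contains fonts, so abort.
        echo "abort"
    fi
}
```

For example, `page_action 1 0 1` prints `skip`, while `page_action 1 0 0` prints `abort`.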
fritz-hh
d510e7e4ae prevent new spurious jhove message from being displayed 2014-09-24 23:43:37 +02:00
fritz-hh
5c3bbc4031 typo in OCRmyPDF.sh 2014-09-22 21:22:38 +02:00
fritz-hh
5a81823969 Merge pull request #82 from orbitcowboy/v2.x
Fixed typo
2014-09-20 19:02:33 +02:00
Dorian Scholz
5c7b2a2a36 lowered minimum version for parallel to 20121122 2014-09-10 13:27:59 +02:00
Dorian Scholz
1db06de287 added BASEPATH to allow for execution via symlink 2014-09-10 13:26:14 +02:00
Martin Ettl
3904178d44 Fixed typo 2014-09-09 07:01:04 +02:00
MoritzFago
7dcc382ccc fixed typo ghostcript to ghostscript 2014-09-08 16:52:49 +02:00
Andy Signer
15d28d970a Fixed typo in help text 2014-05-23 12:41:31 +02:00
fritz-hh
4cf38404cc fixes #51
Allow tesseract 3.02.01 to be used.
Even 3.02.01 fails in a few cases (see issue #28). I decided to allow this
version anyway because 3.02.02 is not yet available for some widespread
Linux distributions
2014-01-25 21:58:50 +01:00
fritz-hh
be830ddc31 List supported languages
If a language is not supported, list the supported languages in the error
message
2014-01-18 22:22:19 +01:00
fritz-hh
18322b424f fixes #60
Check whether the languages passed to tesseract (-l option) are supported
2014-01-18 21:38:22 +01:00
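A check of this kind might look roughly like the sketch below. It is not the script's actual code: the function name `unsupported_langs` is hypothetical, and the supported list is assumed to be passed in by the caller as a space-separated string (how the real script obtains that list from tesseract is not shown in this log). The only behaviour taken as given is that the -l argument joins several languages with "+".

```shell
# unsupported_langs REQUESTED SUPPORTED
# REQUESTED: languages joined by "+", e.g. "eng+deu"
# SUPPORTED: space-separated list supplied by the caller (assumption)
# Prints the requested languages that are not in the supported list.
unsupported_langs() {
    ul_missing=
    ul_old_ifs=$IFS
    IFS=+                      # split the requested string on "+"
    for ul_lang in $1; do
        case " $2 " in
            *" $ul_lang "*) ;;                          # supported
            *) ul_missing="$ul_missing $ul_lang" ;;     # not supported
        esac
    done
    IFS=$ul_old_ifs
    # Drop the leading space before printing.
    printf '%s\n' "${ul_missing# }"
}
```

A caller could abort, and print the supported list, whenever the output is non-empty.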
fritz-hh
6901c60db4 more robust way to check tesseract version
better way of checking if the tesseract version is compatible with the
script.
Previously, if the required tess version was 3.02.02 and the actual version
was 3.03, the script reported the version as too old because it compared
303<30202. It now compares 3.03>3.0202
2014-01-18 21:02:15 +01:00
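The pitfall described above (303<30202 after removing the dots) can also be avoided by comparing the version field by field. The sketch below shows that alternative technique; it is not the decimal comparison this commit introduced, and the function name `version_ge` is hypothetical.

```shell
# version_ge ACTUAL REQUIRED
# Succeeds (exit status 0) when ACTUAL >= REQUIRED, comparing the
# dot-separated fields numerically from left to right.
version_ge() {
    vg_actual=$1
    vg_required=$2
    while [ -n "$vg_actual" ] || [ -n "$vg_required" ]; do
        # Take the leading field of each version string.
        vg_a=${vg_actual%%.*}
        vg_r=${vg_required%%.*}
        # Drop that field; once no dot is left, the string is exhausted.
        case $vg_actual in *.*) vg_actual=${vg_actual#*.} ;; *) vg_actual= ;; esac
        case $vg_required in *.*) vg_required=${vg_required#*.} ;; *) vg_required= ;; esac
        # Strip leading zeros so "02" compares as 2; an empty field counts as 0.
        vg_a=${vg_a#"${vg_a%%[!0]*}"}; vg_a=${vg_a:-0}
        vg_r=${vg_r#"${vg_r%%[!0]*}"}; vg_r=${vg_r:-0}
        if [ "$vg_a" -gt "$vg_r" ]; then return 0; fi
        if [ "$vg_a" -lt "$vg_r" ]; then return 1; fi
    done
    return 0
}
```

With this sketch, `version_ge 3.03 3.02.02` succeeds and `version_ge 3.02.01 3.02.02` fails, which matches the outcome the commit message asks for.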
fritz-hh
38c64ac689 dependency to pdftk removed
concatenation is now also done with ghostscript
2014-01-15 21:23:42 +01:00
fritz-hh
988bde1387 tmpfiles to $TMPDIR + better portability (mktemp)
mktemp: handle both FreeBSD/OS X and Linux, whose mktemp syntax is
incompatible
From now on temporary files are saved in the folder specified by the
environment variable $TMPDIR
2014-01-14 22:57:10 +01:00
fritz-hh
aedbabdbe8 merged pull request from oxplot 2014-01-14 22:29:41 +01:00
Mansour Behabadi
6653066784 Use --gnu in parallel and XX for mktemp
Signed-off-by: Mansour Behabadi <mansour@oxplot.com>
2014-01-14 17:49:24 +11:00
fritz-hh
e40f1fa081 better handling of ligatures: fixes #58 2014-01-13 23:13:15 +01:00
fritz-hh
317846fbdc Check if tmp folder creation was successful 2014-01-13 22:05:26 +01:00
fritz-hh
f581a55544 Merge pull request #57 from jbarlow83/for-upstream/tmpfolder
Fix temporary folder name generation collisions
2014-01-13 12:31:02 -08:00
Jim Barlow
1d03a6417d Verify that pdftoppm is the Poppler version, not xpdf version 2014-01-12 22:12:09 -08:00
Jim Barlow
996048dc08 Fix temporary folder name generation collisions
First, the regular expression matched everything after the first period
in a filename.  Adding the $ makes it match only after the last period, so
that filenames such as “Report.1.pdf” get trimmed to “Report.1”.

Next use mktemp to get the OS to create a temporary folder.  It will
guarantee a unique directory name beginning with prefix, even if parallel
processes are at work.
2014-01-12 22:05:11 -08:00
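Both steps of this fix can be illustrated with a short sketch. It is written under assumptions rather than copied from the script: the function names `strip_extension` and `make_tmp_dir` are hypothetical, as is the fallback to /tmp when $TMPDIR is unset.

```shell
# strip_extension NAME
# Removes only the last ".ext" from NAME. The trailing $ anchors the match
# to the end of the string, and [^.]* cannot cross an earlier period.
strip_extension() {
    printf '%s\n' "$1" | sed 's/\.[^.]*$//'
}

# make_tmp_dir PDF_PATH
# Creates a unique temporary directory named after the input file and
# prints its path. mktemp replaces the trailing XXXXXX with random
# characters and fails rather than reuse an existing directory.
make_tmp_dir() {
    mt_prefix=$(strip_extension "$(basename "$1")")
    mktemp -d "${TMPDIR:-/tmp}/${mt_prefix}.XXXXXX"
}
```

`strip_extension Report.1.pdf` prints `Report.1`; two concurrent calls to `make_tmp_dir` on the same file yield two different directories.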
fritz-hh
7cd73566be check if python libs are installed
Check if reportlab and lxml are installed, otherwise exit with an error
2014-01-11 17:08:26 +01:00
fritz-hh
46f673a3b7 exit if bad parallel/tesseract version installed 2014-01-10 22:59:33 +01:00
fritz-hh
828f195071 erroneous exit code corrected 2014-01-07 21:57:18 +01:00
fritz-hh
c1103c0248 check tesseract version
fixes #41
versions older than 3.02.02 are known to produce invalid hocr output (in
some cases)
2014-01-07 21:04:28 +01:00
fritz-hh
54f47ab89b Minor change 2014-01-06 22:41:43 +01:00
fritz-hh
7eab052e0f Improved consistency of tmp file names 2014-01-06 22:00:58 +01:00
fritz-hh
6ef4ba31e2 help and documentation improved 2014-01-05 22:02:12 +01:00
fritz-hh
71593421ed minor change 2014-01-05 21:22:31 +01:00
fritz-hh
2754970f37 Echo arguments of script in debug mode 2014-01-04 21:43:41 +01:00
fritz-hh
5945454597 Support for -f option
Fixes #16
2014-01-04 21:24:33 +01:00
fritz-hh
7d76c46731 Check if page already contains a font 2014-01-04 18:05:21 +01:00
fritz-hh
f8ccf42c06 path to tmp folder now defined in config.sh 2014-01-04 17:24:35 +01:00
fritz-hh
ee8a5d80ff echo also java version in debug mode 2014-01-03 16:27:11 +01:00
fritz-hh
41cd88506e Echo version of the used tools
Fixes #35
2014-01-03 15:59:51 +01:00
fritz-hh
95fe7cd3bc Oversampling + more than 1 img
- Oversampling resolution can now be set from the cmd line (-o option)
- If a page contains more than one image, warn the user but process the
page anyway with a default resolution
2013-12-30 23:44:38 +01:00
fritz-hh
407670e1f3 Minor change 2013-11-29 10:34:05 +01:00
fritz-hh
b4a23c005d fixes #34
tell GNU parallel to protect against evaluation by the sub shell (-q
flag).
This is required in case the file name passed as argument contains
special characters like "#"
2013-11-27 23:15:54 +01:00
fritz-hh
5e0f8be4b1 Various improvements
-Constants moved to config.sh
- Use "python2" cmd instead of "python"
- few other minor changes
2013-11-27 22:34:21 +01:00
fritz-hh
88ddeb1fb6 OCRmyPDF.sh: added dependency to GNU parallel 2013-05-06 21:54:05 +02:00
fritz-hh
f9e2e74bf3 Merge remote-tracking branch 'origin/v1.x' into v2.x 2013-05-06 21:35:34 +02:00
fritz-hh
7e8481186a OCRmyPDF.sh: metadata not added anymore
Removed feature to add metadata in final pdf file (because it led to a
final PDF file that does not comply with the PDF/A-1 format)
2013-05-06 21:26:33 +02:00
fritz-hh
2b0103a4e6 basic implementation of parallel page processing
- basic implementation of parallel page processing using GNU parallel
- processing around 40% faster on dual core processor
2013-05-05 22:33:54 +02:00
fritz-hh
064d4be83c Merge remote-tracking branch 'origin/v1.x' into v2.x
Conflicts:
	OCRmyPDF.sh

Fixes #31
2013-05-05 21:01:17 +02:00
fritz-hh
ab536d5678 OCRmyPDF.sh: fixes issue for files having spaces
fixes #31
2013-05-05 20:56:45 +02:00
fritz-hh
f7923a9761 OCRmyPDF.sh: few variables renamed for clarity 2013-05-05 20:44:03 +02:00
fritz-hh
e4ffb58269 OCRmyPDF.sh: provision for parallel pages processing 2013-05-02 22:06:16 +02:00
fritz-hh
9271fe73a8 OCRmyPDF.sh: fixes #27
The fix should now be compatible with most implementations of grep
2013-05-02 16:51:46 +02:00
fritz-hh
edaa70b97f OCRmyPDF.sh: fixes #25 and fixes #26
- In debug mode: compute and echo time required for processing
- Resolutions (x/y) that are nearly equal were not supported (because the
test did not take into account imprecision due to truncation)
2013-05-01 15:58:55 +02:00
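A tolerance-based comparison of the x and y resolutions could look like the sketch below. The function name `resolutions_match` and the default tolerance of 1 dpi are hypothetical; the commit message does not state how the script performs this test.

```shell
# resolutions_match X_DPI Y_DPI [TOLERANCE]
# Succeeds when the two integer resolutions differ by at most TOLERANCE
# (default 1), so values that differ only through truncation still match.
resolutions_match() {
    rm_diff=$(( $1 - $2 ))
    if [ "$rm_diff" -lt 0 ]; then
        rm_diff=$(( -rm_diff ))    # absolute value
    fi
    [ "$rm_diff" -le "${3:-1}" ]
}
```

For instance, `resolutions_match 300 299` succeeds, while `resolutions_match 300 150` fails.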