246 Commits

Author SHA1 Message Date
Jim Barlow
cc2af2bc15 Convert the final image to a JPEG if the original image was a JPEG
Of course, this introduces recompression artifacts, and is unnecessary
if no options are given that modify the final image (no -d, -c, -i).
But rather than worry about that, it would be better to ultimately find
a way to combine the original PDF page with the output PDF text in the
case where we want no changes to the original. This is good enough for
now.

The better option can apparently be achieved using pdftk background, or
probably better, PyPDF2's merge. If Tesseract PDF generation is used
then we need a way to remove the image. Tesseract PDF generation at 3.03
does layout better (I think) and also properly encodes the hidden layer,
which is less likely to give display issues (I think).
2015-02-11 10:23:45 -08:00
Jim Barlow
638c6db05d Use the appropriate PNG renderer given the types of image present 2015-02-11 03:32:00 -08:00
Jim Barlow
f7db8d9aff Use Ghostscript -> PNG instead of pdftoppm for rendering
Ghostscript has the clunkiest imaginable syntax, obtuse documentation,
quirky behavior, and poor diagnostics... but it *actually works*, unlike
pdftoppm/poppler, which gets things wrong.

In this case I observed poppler incorrectly decompresses certain CCITT
encoded monochrome PDFs. So set up Ghostscript to do the job instead.

For the moment this performs monochrome -> RGB conversion via reportlab.
2015-02-11 03:13:07 -08:00
Jim Barlow
564fb7a87e Support Ghostscript 9.14's new color conversion engine (not portable)
The flag -dUseCIEColor is now deprecated, as it invokes the old engine
which introduces color errors. The new engine requires a PDF/A file
header with the hardcoded location of an ICC profile to use, now included in
the project. Portable iterations should generate a PDFA_def.ps based on
the target system; for now OS X with homebrew is presumed.

I have selected sRGB since scanners tend to capture RGB and printing
is not a major consideration for PDF/A.

Also note all file paths given to gs must be absolute. May its creators
be forever haunted for their failure to document this unexpected quirk.
2015-02-09 15:33:49 -08:00
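The absolute-path requirement noted above can be handled once, where the command line is assembled. The following is a minimal sketch, not the project's actual code: the function name `build_gs_pdfa_command` and its parameters are hypothetical, and the flag set is a commonly documented Ghostscript PDF/A invocation rather than the exact one this commit uses. It only builds the argument list and does not run gs.

```python
import os


def build_gs_pdfa_command(input_pdf, output_pdf, pdfa_def):
    """Build a Ghostscript argument list in which every file path is absolute.

    Hypothetical helper for illustration; the flags are an assumption,
    not necessarily those used by the project.
    """
    # Resolve every path against the current working directory up front,
    # so gs never sees a relative path.
    input_abs = os.path.abspath(input_pdf)
    output_abs = os.path.abspath(output_pdf)
    pdfa_def_abs = os.path.abspath(pdfa_def)

    return [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-dPDFA",
        "-sDEVICE=pdfwrite",
        "-sOutputFile={}".format(output_abs),
        pdfa_def_abs,  # PDF/A definition file comes before the input
        input_abs,
    ]


if __name__ == "__main__":
    print(build_gs_pdfa_command("in.pdf", "out.pdf", "PDFA_def.ps"))
```

The resulting list could be passed to `subprocess.run` without `shell=True`, which also avoids quoting problems with paths that contain spaces.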
Jim Barlow
4d88e64774 Standardize tmpfile prefix 2015-02-09 15:02:49 -08:00
Jim Barlow
26f1163b46 Handle case where a page contains no images - don't OCR
It doesn't make much sense to do anything with an all vector page
except extract the page unmodified.
2015-02-08 20:05:54 -08:00
Jim Barlow
40058e99e0 Implement debug text only page option 2015-02-08 19:51:41 -08:00
Jim Barlow
bece4c3e02 Describe what decision was made based on -f and -s and presence of text 2015-02-08 19:51:18 -08:00
Jim Barlow
f0f6b57c87 When deciding on OCR, check for presence of text rather than a font
It appears to be possible to have a PDF with an embedded font that is
either unused or used only for whitespace. So check for some amount of
actual text instead.
2015-02-08 17:38:27 -08:00
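The check described above amounts to asking whether a page's extracted text contains anything other than whitespace. A minimal sketch follows, assuming the page text has already been extracted to a string by some other tool; the function name `has_real_text` and the `min_chars` threshold are hypothetical and not taken from the project.

```python
def has_real_text(page_text, min_chars=1):
    """Return True if the page text contains at least `min_chars`
    non-whitespace characters.

    A page whose embedded font is unused, or used only for spaces and
    newlines, yields an empty or whitespace-only string and returns False.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces leaves only the visible characters.
    visible = "".join(page_text.split())
    return len(visible) >= min_chars


if __name__ == "__main__":
    print(has_real_text("  \n\t "))        # whitespace only
    print(has_real_text("Invoice 2015"))   # actual text
```

Raising `min_chars` would let a caller treat pages with only a stray character or two as effectively textless; whether that is desirable is a judgment call the commit does not settle.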
Jim Barlow
dc2a4ab044 Logic error 2015-02-08 17:33:35 -08:00
Jim Barlow
b16d6f5b81 Implement skipping OCR when -s is specified
Appears to be necessary to disable each stage of the pipeline that is
inactive, not just initial and terminal stages of an inactive segment.
If nothing else this makes what is going on more explicit.
2015-02-08 17:26:16 -08:00
Jim Barlow
69ce6ff7b5 Not a named param 2014-11-22 15:35:05 -08:00
Jim Barlow
32ba50b8dc Add Tesseract timeout to keep things reasonable 2014-11-14 02:06:23 -08:00
Jim Barlow
36aca45f35 The -dci options now work (and valid combinations thereof) 2014-11-14 00:23:22 -08:00
Jim Barlow
925290342d Leptonica deskew can handle .pnm input, unlike imagemagick 2014-11-13 23:20:25 -08:00
Jim Barlow
4dc0370c57 Add leptonica deskew 2014-11-13 16:53:26 -08:00
Jim Barlow
b92f8e43f2 Run as a module instead 2014-11-13 16:52:53 -08:00
Jim Barlow
22b0733a1d Merge branch 'feature/findskew' into develop 2014-11-13 16:00:27 -08:00
Jim Barlow
6021684ab6 Attempt to fix multiprocessing pickling error 2014-11-13 15:58:57 -08:00
Jim Barlow
f4b1d0cdfe Fix symlink error that occurs in multipage processing 2014-11-13 15:58:36 -08:00
Jim Barlow
d0d8048621 Comments 2014-10-17 17:28:31 -07:00
Jim Barlow
cfd119325d Use abspath instead of relpath for temporary directory symlink 2014-10-11 17:48:56 -07:00
Jim Barlow
ad30833ffc Support missing tess_cfg_files parameter when omitted by OCRmyPDF.sh 2014-10-11 17:48:33 -07:00
Jim Barlow
e5c79a6666 Use TIFFs as intermediates
pdftoppm in recent versions (0.26.4,5) seems to be incapable of
producing valid TIFFs, so have it dump a .pnm file and let ImageMagick
figure out how to convert it to TIFF. This is not ideal, but at least
it works.
2014-10-10 01:54:16 -07:00
Jim Barlow
63dc753c1b Standardize intermediate filenames better
convert .pnm -deskew <...> .pnm seems to have a bug that produces an
invalid .pnm file which later causes tesseract (specifically,
leptonica) to choke (using 3.02/1.71 as versions, respectively). Will
change pipeline to use tiffs internally since they are less stupid.
2014-10-10 01:30:43 -07:00
Jim Barlow
017bc1f252 Basic error handling 2014-10-10 01:07:46 -07:00
Jim Barlow
bcd67c009d Sort of working, but fragile; uses tmp folder properly now 2014-10-10 00:35:49 -07:00
Jim Barlow
2f6cfafdfc Now produces a finished OCR-PDF page 2014-10-08 03:54:06 -07:00
Jim Barlow
25234fa30b First crack at Ruffus, working well 2014-10-08 03:21:28 -07:00
Jim Barlow
dabbddb04e deskew and clean 2014-09-27 15:03:07 -07:00
Jim Barlow
fccfb4589e Moving quickly - we can now output .ppm files at correct resolution 2014-09-26 04:43:15 -07:00
Jim Barlow
5384c98013 Initial ocrpage.py rewrite into python3 2014-09-26 04:19:41 -07:00
Jim Barlow
d7130a1e56 Merge branch 'feature/keep-text-pages' into develop 2014-09-25 03:50:21 -07:00
Jim Barlow
f69054cb17 Fix parameter order problems
Put TESS_CFG_FILES last because it is optional and can be blank. If
omitted it breaks the sequence of subsequent parameters. Also cleanup
text output in this new mode.
2014-09-25 03:50:01 -07:00
Jim Barlow
80dc6eca2c Merge branches 'feature/readlink-osx' and 'feature/keep-text-pages' into develop
Conflicts:
	OCRmyPDF.sh
2014-09-25 03:14:10 -07:00
Jim Barlow
d250fbb3d6 Fix call to readlink on OS X
readlink -f is a GNU coreutils extension, so not available on OS X and
other platforms.
2014-09-25 03:11:27 -07:00
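For the Python side of the project, the standard library offers a rough equivalent of `readlink -f` that does not depend on GNU coreutils. This is a sketch of that alternative, not the fix applied to OCRmyPDF.sh in this commit; the helper name `resolve_path` is hypothetical.

```python
import os
import tempfile


def resolve_path(path):
    """Return an absolute path with symbolic links resolved.

    os.path.realpath is part of the Python standard library, so it
    behaves the same on OS X and Linux.
    """
    return os.path.realpath(path)


if __name__ == "__main__":
    # Demonstrate on a throwaway symlink (assumes a POSIX system).
    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "real_script.sh")
        open(target, "w").close()
        link = os.path.join(tmpdir, "link_to_script.sh")
        os.symlink(target, link)
        print(resolve_path(link))
```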
Jim Barlow
09bbe92611 Add command line option to skip pages that contain font data
If a page contains font data, the script would abort, unless -f was given,
in which case it would use pdftoppm to rasterize the font into a bitmap
and then attempt to OCR it. -f is almost certainly not what users want
unless they want to debug OCR or something.

If a PDF already has fonts it either was OCR'd already, or it is
a composite file containing, for example, some scanned documents appended
to a text report.  In the latter case, this -s option provides OCR on
pages that don't have it without changing those that do, and if a PDF
was completely OCR'd it will be converted to PDF/A.  In batch jobs with
a mix of OCR and non-OCR the implicit conversion to PDF/A is also useful.
2014-09-25 02:43:40 -07:00
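The per-page behaviour described above can be summarised as a small decision function. This is a sketch under stated assumptions: the names `PageAction` and `decide_page_action` are hypothetical, and the commit message does not say what happens when -f and -s are given together, so rejecting that combination here is an assumption rather than documented behaviour.

```python
from enum import Enum


class PageAction(Enum):
    OCR = "ocr"      # rasterize the page (if needed) and run OCR
    SKIP = "skip"    # leave the page unchanged
    ABORT = "abort"  # stop with an error


def decide_page_action(has_font_data, force_ocr=False, skip_text=False):
    """Choose what to do with one page.

    force_ocr corresponds to -f and skip_text to -s.
    """
    if force_ocr and skip_text:
        # Assumption: the source does not define this combination.
        raise ValueError("-f and -s cannot be combined in this sketch")
    if not has_font_data:
        return PageAction.OCR    # plain scanned page
    if force_ocr:
        return PageAction.OCR    # rasterize the fonts and OCR anyway
    if skip_text:
        return PageAction.SKIP   # keep the existing text page as-is
    return PageAction.ABORT      # default when font data is found


if __name__ == "__main__":
    for flags in ({}, {"force_ocr": True}, {"skip_text": True}):
        print(flags, decide_page_action(True, **flags))
```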
Jim Barlow
69d922e096 Check for missing pdftoppm when poppler installed with --disable-splash-output
When I upgraded to poppler 0.24.5, pdftoppm was not compiled because the
script had --disable-splash-output set for some reason.

For OS X Homebrew the solution is:
brew uninstall poppler
brew install poppler --with-splash-output
2014-09-25 02:30:29 -07:00
fritz-hh
d510e7e4ae prevent new spurious jhove message from being displayed 2014-09-24 23:43:37 +02:00
fritz-hh
5893290dd9 update to jhove v1.11 2014-09-24 23:17:39 +02:00
fritz-hh
5c3bbc4031 typo in OCRmyPDF.sh 2014-09-22 21:22:38 +02:00
fritz-hh
27cd8cf0db add link to heise open source 2014-09-20 20:47:02 +02:00
fritz-hh
b403016d5b Release notes updated for v2.1-stable v2.1-stable 2014-09-20 19:50:32 +02:00
fritz-hh
5a81823969 Merge pull request #82 from orbitcowboy/v2.x
Fixed typo
2014-09-20 19:02:33 +02:00
fritz-hh
17801401cd Merge pull request #83 from DorianScholz/v2.x
- small changes to make this work on Ubuntu 12.04 called via symlink
- lowered minimum parallel version
2014-09-20 18:59:57 +02:00
Dorian Scholz
5c7b2a2a36 lowered minimum version for parallel to 20121122 2014-09-10 13:27:59 +02:00
Dorian Scholz
1db06de287 added BASEPATH to allow for execution via symlink 2014-09-10 13:26:14 +02:00
Martin Ettl
3904178d44 Fixed typo 2014-09-09 07:01:04 +02:00
fritz-hh
8bb9c3610c Merge pull request #81 from MoritzFago/v2.x
fixed typo ghostcript to ghostscript
2014-09-08 18:31:00 +02:00
MoritzFago
7dcc382ccc fixed typo ghostcript to ghostscript 2014-09-08 16:52:49 +02:00