OCRmyPDF/ocrmypdf/tesseract.py

#!/usr/bin/env python3
# © 2015 James R. Barlow: github.com/jbarlow83

from subprocess import STDOUT, CalledProcessError, check_output
import sys
import os
import re
from functools import lru_cache
from . import ExitCode


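@lru_cache(maxsize=1)
def version():
    """Return the version string reported by `tesseract --version`.

    The result is cached, so the subprocess runs at most once per process.
    """
    args_tess = [
        'tesseract',
        '--version'
    ]
    try:
        versions = check_output(
                args_tess, close_fds=True, universal_newlines=True,
                stderr=STDOUT)
    except (CalledProcessError, FileNotFoundError):
        # A missing executable raises FileNotFoundError rather than
        # CalledProcessError, so both are handled here.
        print("Could not find Tesseract executable on system PATH.")
        sys.exit(ExitCode.missing_dependency)

    # The output is expected to begin with "tesseract <version>".
    match = re.match(r'tesseract\s(.+)', versions)
    if match is None:
        print("Could not parse Tesseract version from output:")
        print(versions)
        sys.exit(ExitCode.missing_dependency)
    return match.group(1)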


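@lru_cache(maxsize=1)
def languages():
    """Return the set of language codes reported by `tesseract --list-langs`.

    The result is cached, so the subprocess runs at most once per process.
    """
    args_tess = [
        'tesseract',
        '--list-langs'
    ]
    try:
        langs = check_output(
                args_tess, close_fds=True, universal_newlines=True,
                stderr=STDOUT)
    except FileNotFoundError:
        # Raised when the executable itself is missing.
        print("Could not find Tesseract executable on system PATH.")
        sys.exit(ExitCode.missing_dependency)
    except CalledProcessError as e:
        print("Tesseract failed to report available languages.")
        print("Output from Tesseract:")
        print("-" * 40)
        print(e.output)
        sys.exit(ExitCode.missing_dependency)
    # The first line of output is a header, not a language code; skip it.
    return {lang.strip() for lang in langs.splitlines()[1:]}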


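# Minimal hOCR page containing a single blank word. The placeholders {0} and
# {1} fill the right and bottom coordinates of each bounding box, i.e. the
# page width and height; fill them with HOCR_TEMPLATE.format(width, height).
HOCR_TEMPLATE = '''<?xml version="1.0" encoding="UTF-8"?>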
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.02.02' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "x.tif"; bbox 0 0 {0} {1}; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 0 1 {0} {1}">
    <p class='ocr_par' dir='ltr' id='par_1' title="bbox 0 1 {0} {1}">
     <span class='ocr_line' id='line_1' title="bbox 0 1 {0} {1}"><span class='ocrx_word' id='word_1' title="bbox 0 1 {0} {1}"> </span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>'''