Update documentation on other languages, multilingual documents

This commit is contained in:
James R. Barlow 2016-11-07 14:12:37 -08:00
parent fdd9b8b8ce
commit a72b8caf47
4 changed files with 34 additions and 9 deletions


@ -49,10 +49,23 @@ OCR will attempt to automatic correct the rotation of each page. This can help f
You can increase (decrease) the parameter ``--rotate-pages-threshold`` to make page rotation more (less) aggressive.
OCR languages other than English
""""""""""""""""""""""""""""""""
By default OCRmyPDF assumes the document is English.
.. code-block:: bash
ocrmypdf -l fre LeParisien.pdf LeParisien.pdf
ocrmypdf -l eng+fre Bilingual-English-French.pdf Bilingual-English-French.pdf
Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <lang-packs>`.
OCR images, not PDFs
--------------------
Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the resutls to run ocrmypdf:
Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the results to run ocrmypdf:
.. code-block:: bash
@ -107,6 +120,7 @@ watchdog installs the command line program ``watchmedo``, which can be told to r
mkdir out
watchmedo shell-command \
--patterns="*.pdf" \
--ignore-directories \
--command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
. # don't forget the final dot
@ -114,12 +128,12 @@ For more complex behavior you can write a Python script around to use the watchd
On file servers, you could configure watchmedo as a system service so it will run all the time.
Caveats
"""""""
* ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
* This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
* If the source and destination directory are the same, watchmedo may create an infinite loop.
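The caveats above suggest filtering events before handing a file to ocrmypdf. The following is a minimal sketch of such a filter, not part of OCRmyPDF or watchdog: the function name ``should_ocr`` and its parameters are hypothetical, and it assumes the event type strings that watchdog reports (``created``, ``modified``, ``moved``, ``deleted``). It accepts only newly created PDF files that do not already live inside the output directory, which avoids the infinite loop described in the last caveat.

```python
import os


def should_ocr(event_type, src_path, is_directory, output_dir):
    """Decide whether a file system event should trigger OCR.

    Hypothetical helper: the arguments mirror attributes that a watchdog
    event is assumed to carry (event_type, src_path, is_directory).
    """
    # Directory operations are never OCR candidates.
    if is_directory:
        return False
    # Only react to new files; deletes, moves and modifications are ignored.
    if event_type != 'created':
        return False
    # Only PDFs are of interest.
    if not src_path.lower().endswith('.pdf'):
        return False
    # Skip anything already inside the output directory, so that the
    # output of one run is not picked up as input for the next.
    src = os.path.abspath(src_path)
    out = os.path.abspath(output_dir)
    return os.path.commonpath([src, out]) != out
```

A script built around the watchdog API could call this from its event handler and run ocrmypdf only when it returns ``True``. A ``created`` event may arrive before a large file has finished copying, so a real script would probably also need to wait until the file is complete.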
Batch jobs


@ -28,11 +28,16 @@ Rasterizing a PDF is the process of generating an image suitable for display or
About PDF/A
-----------
`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is a standardized subset of the full PDF specification that is designed for archiving. PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded Javascript or references to external fonts. All fonts and resources needed to interpret the PDF must be contained within it. Generally speaking, scanned documents should be converted to PDF/A. There are various conformance levels and versions, such as "PDF/A-2b".
`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is an ISO-standardized subset of the full PDF specification that is designed for archiving (the 'A' stands for Archive). PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded Javascript, video, audio and references to external fonts. All fonts and resources needed to interpret the PDF must be contained within it. Because PDF/A disables Javascript and other types of embedded content, it is probably more secure.
Since most people who scan documents are interested in reading them in the future, OCRmyPDF generates PDF/A-2b by default.
There are various conformance levels and versions, such as "PDF/A-2b".
Generally speaking, the best format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, `mandate the use of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`_ for scanned documents.
Since most people who scan documents are interested in reading them indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.
PDF/A has a few drawbacks. Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users. It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. PDF/A files can be digitally signed, but may not be encrypted, to ensure they can be read in the future. Fortunately, converting from PDF/A to a regular PDF is trivial, and any PDF viewer can view PDF/A.
PDF/A has a few drawbacks. Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users. It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available.
What OCRmyPDF does
------------------


@ -1,3 +1,5 @@
.. _lang-packs:
Installing additional language packs
====================================
@ -19,7 +21,7 @@ Debian and Ubuntu users
apt-get install tesseract-ocr-chi-sim # Example: Install Chinese Simplified language pack
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
languages can be requested.
languages can be requested using either ``-l eng+fre`` (English and French) or ``-l eng -l fre``.
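To illustrate why the two spellings are equivalent, here is a minimal sketch, assuming an argparse option declared with ``action='append'`` like the ``-l`` option. The helper names ``normalize_languages`` and ``parse_languages`` are hypothetical and are not OCRmyPDF's actual implementation; they only show one way both forms could collapse to the same list of language codes.

```python
import argparse


def normalize_languages(values):
    """Flatten values such as ['eng+fre', 'deu'] into ['eng', 'fre', 'deu'].

    Hypothetical helper, shown for illustration only.
    """
    languages = []
    for value in values or []:
        # Each -l occurrence may itself contain several codes joined by '+'.
        for code in value.split('+'):
            code = code.strip()
            if code and code not in languages:
                languages.append(code)
    return languages


def parse_languages(argv):
    """Parse only the -l/--language option from an argument list."""
    parser = argparse.ArgumentParser()
    # action='append' collects every occurrence of -l into one list.
    parser.add_argument('-l', '--language', action='append')
    args = parser.parse_args(argv)
    return normalize_languages(args.language)
```

With this sketch, ``parse_languages(['-l', 'eng+fre'])`` and ``parse_languages(['-l', 'eng', '-l', 'fre'])`` both produce the same list.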
Mac OS X (macOS) users
----------------------
@ -38,7 +40,7 @@ As of v4.2, users of ocrmypdf working languages outside the Latin alphabet shoul
.. code-block:: bash
ocrmypdf --output-type pdf --pdf-renderer tesseract
ocrmypdf -l eng+gre --output-type pdf --pdf-renderer tesseract
The reasons for this are:


@ -166,8 +166,10 @@ parser.add_argument(
help="output searchable PDF file (or '-' to write to standard output)")
parser.add_argument(
'-l', '--language', action='append',
help="languages of the file to be OCRed (see tesseract --list-langs for "
"all language packs installed in your system)")
help="Language(s) of the file to be OCRed (see tesseract --list-langs for "
"all language packs installed in your system). To specify multiple "
"languages, join them with '+' or issue this argument once for each "
"language.")
parser.add_argument(
'-j', '--jobs', metavar='N', type=int,
help="Use up to N CPU cores simultaneously (default: use all)")
@ -1011,9 +1013,11 @@ def select_image_layer(
with open(image, 'rb') as imfile, \
open(output_file, 'wb') as pdf:
rawdata = imfile.read()
log.debug('{:4d}: convert'.format(page_number(page_pdf)))
img2pdf.convert(
rawdata, with_pdfrw=False,
layout_fun=layout_fun, outputstream=pdf)
log.debug('{:4d}: convert done'.format(page_number(page_pdf)))
@posttask(partial(done_task, 'render_hocr_page'))