Update documentation on other languages, multilingual documents

2025-12-29 08:01:04 +00:00 · 2016-11-07 14:12:37 -08:00 · 2016-11-07 14:12:37 -08:00 · a72b8caf47
commit a72b8caf47
parent fdd9b8b8ce
4 changed files with 34 additions and 9 deletions
--- a/docs/cookbook.rst
+++ b/docs/cookbook.rst
@ -49,10 +49,23 @@ OCR will attempt to automatic correct the rotation of each page. This can help f
 You can increase (decrease) the parameter ``--rotate-pages-threshold`` to make page rotation more (less) aggressive.


+OCR languages other than English
+""""""""""""""""""""""""""""""""
+
+By default OCRmyPDF assumes the document is in English. Use ``-l`` to specify one or more other languages:
+
+.. code-block:: bash
+
+	ocrmypdf -l fre LeParisien.pdf LeParisien.pdf
+	ocrmypdf -l eng+fre Bilingual-English-French.pdf Bilingual-English-French.pdf
+
+Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <lang-packs>`.
+
+
 OCR images, not PDFs
 --------------------

-Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the resutls to run ocrmypdf:
+Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the results to run ocrmypdf:

 .. code-block:: bash

@ -107,6 +120,7 @@ watchdog installs the command line program ``watchmedo``, which can be told to r
 	mkdir out
 	watchmedo shell-command \
 		--patterns="*.pdf" \
+		--ignore-directories \
 		--command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
 		.  # don't forget the final dot

@ -114,12 +128,12 @@ For more complex behavior you can write a Python script around to use the watchd

 On file servers, you could configure watchmedo as a system service so it will run all the time.

-
 Caveats
 """""""

 * ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
 * This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
+* If the source and destination directory are the same, watchmedo may create an infinite loop.


 Batch jobs
--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@ -28,11 +28,16 @@ Rasterizing a PDF is the process of generating an image suitable for display or
 About PDF/A
 -----------

-`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is a standardized subset of the full PDF specification that is designed for archiving.  PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded Javascript or references to external fonts.  All fonts and resources needed to interpret the PDF must be contained within it.  Generally speaking, scanned documents should be converted to PDF/A. There are various conformance levels and versions, such as "PDF/A-2b".
+`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is an ISO-standardized subset of the full PDF specification that is designed for archiving (the 'A' stands for Archive).  PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded JavaScript, video, audio and references to external fonts.  All fonts and resources needed to interpret the PDF must be contained within it. Because PDF/A disallows JavaScript and other types of embedded content, it is probably more secure.

-Since most people who scan documents are interested in reading them in the future, OCRmyPDF generates PDF/A-2b by default.
+There are various conformance levels and versions, such as "PDF/A-2b".
+
+Generally speaking, the best format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, `mandate the use of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`_ for scanned documents.
+
+Since most people who scan documents are interested in reading them indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.
+
+PDF/A has a few drawbacks.  Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users.  It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. PDF/A files can be digitally signed, but may not be encrypted, to ensure they can be read in the future.  Fortunately, converting from PDF/A to a regular PDF is trivial, and any PDF viewer can view PDF/A.

-PDF/A has a few drawbacks.  Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users.  It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. 

 What OCRmyPDF does
 ------------------
--- a/docs/languages.rst
+++ b/docs/languages.rst
@ -1,3 +1,5 @@
+.. _lang-packs:
+
 Installing additional language packs
 ====================================

@ -19,7 +21,7 @@ Debian and Ubuntu users
   apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack
   
 You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
-languages can be requested.
+languages can be requested using either ``-l eng+fre`` (English and French) or ``-l eng -l fre``.

 Mac OS X (macOS) users
 ----------------------
@ -38,7 +40,7 @@ As of v4.2, users of ocrmypdf working languages outside the Latin alphabet shoul

 .. code-block:: bash

-	ocrmypdf --output-type pdf --pdf-renderer tesseract
+	ocrmypdf -l eng+gre --output-type pdf --pdf-renderer tesseract

 The reasons for this are:

--- a/ocrmypdf/main.py
+++ b/ocrmypdf/main.py
@ -166,8 +166,10 @@ parser.add_argument(
    help="output searchable PDF file (or '-' to write to standard output)")
 parser.add_argument(
    '-l', '--language', action='append',
-    help="languages of the file to be OCRed (see tesseract --list-langs for "
-         "all language packs installed in your system)")
+    help="Language(s) of the file to be OCRed (see tesseract --list-langs for "
+         "all language packs installed in your system). To specify multiple "
+         "languages, join them with '+' or issue this argument once for each "
+         "language.")
 parser.add_argument(
    '-j', '--jobs', metavar='N', type=int,
    help="Use up to N CPU cores simultaneously (default: use all)")
@ -1011,9 +1013,11 @@ def select_image_layer(
        with open(image, 'rb') as imfile, \
                open(output_file, 'wb') as pdf:
            rawdata = imfile.read()
+            log.debug('{:4d}: convert'.format(page_number(page_pdf)))
            img2pdf.convert(
                rawdata, with_pdfrw=False,
                layout_fun=layout_fun, outputstream=pdf)
+            log.debug('{:4d}: convert done'.format(page_number(page_pdf)))


@posttask(partial(done_task, 'render_hocr_page'))
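
The updated ``-l`` help text says that multiple languages can be given either joined with ``+`` or by repeating the argument. Because the parser uses ``action='append'``, both forms arrive as a list of strings that still needs flattening. The following is a minimal sketch of one way to do that, not the code OCRmyPDF itself uses: ``build_parser`` and ``normalize_languages`` are hypothetical helper names, and the ``eng`` fallback is an assumption based on the documentation's statement that English is the default.

```python
import argparse


def build_parser():
    # Mirrors only the '-l' argument from the diff above; everything else is omitted.
    parser = argparse.ArgumentParser(prog="ocrmypdf-sketch")
    parser.add_argument(
        '-l', '--language', action='append',
        help="Language(s) of the file to be OCRed")
    return parser


def normalize_languages(values, default="eng"):
    """Flatten repeated -l values and '+'-joined values into one ordered list.

    Hypothetical helper: duplicates are dropped and the first occurrence wins.
    """
    if not values:
        # No -l given: fall back to the assumed default language.
        return [default]
    langs = []
    for value in values:
        for lang in value.split('+'):
            lang = lang.strip()
            if lang and lang not in langs:
                langs.append(lang)
    return langs or [default]


if __name__ == "__main__":
    parser = build_parser()
    for argv in (['-l', 'eng+fre'], ['-l', 'eng', '-l', 'fre'], []):
        args = parser.parse_args(argv)
        print(argv, '->', '+'.join(normalize_languages(args.language)))
```

Joining the normalized list with ``+`` yields a single string in the same form as the ``-l eng+fre`` examples in the documentation, regardless of which style the user typed.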