v5.6.0 release notes, docs

James R. Barlow 2018-02-07 16:48:04 -08:00
parent 1dfc32d7e6
commit fa2c0296d6
3 changed files with 15 additions and 7 deletions

@ -285,6 +285,8 @@ Now we need to install ``pip`` and let it install ocrmypdf:
wget -O - -o /dev/null https://bootstrap.pypa.io/get-pip.py | python3.6
pip3.6 install ocrmypdf
The ``wget`` command will download a program and run it.
These installation instructions omit the optional dependency ``unpaper``, which is only available at version 0.4.2 in Ubuntu 14.04. The author could not find a backport of ``unpaper``, so created a .deb package that installs unpaper 6.1 (x86 64-bit only):
.. code-block:: bash

@ -16,7 +16,7 @@ OCRmyPDF uses `Tesseract <https://github.com/tesseract-ocr/tesseract>`_, the bes
About PDFs
----------
PDFs are page description files that attempt to preserve a layout exactly. They can contain `vector graphic files <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`_ that can contain raster objects such as scanned images. Because PDFs can contain multiple pages (unlike many image formats) and can contain fonts and text, they are a good format for exchanging scanned documents.
PDFs are page description files that attempt to preserve a layout exactly. They contain `vector graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`_ that can contain raster objects such as scanned images. Because PDFs can contain multiple pages (unlike many image formats) and can contain fonts and text, they are a good format for exchanging scanned documents.
.. image:: bitmap_vs_svg.svg
@ -42,15 +42,13 @@ PDF/A has a few drawbacks. Some PDF viewers include an alert that the file is a
What OCRmyPDF does
------------------
OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content. It uses `Ghostscript <http://ghostscript.com/>`_ to rasterize the page, and then performs OCR on the rasterized image. It is not enough to simply extract the images from each page and run OCR on them individually. Of course one could use Ghostscript or another PDF rasterizer and then pass the image to Tesseract. OCRmyPDF automates this process and produces a minimally changed output file that contains the same information, colorspace and resolution.
OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content. It uses `Ghostscript <http://ghostscript.com/>`_ to rasterize the page, and then performs OCR on the rasterized image to create an OCR "layer". The layer is then grafted back onto the original PDF.
The Tesseract OCR engine can output 'hOCR' files, which are XML files that contain a description of the text it found on the page. OCRmyPDF will render a new PDF that contains only the hidden text layer, and merge this with the original page.
While one can use a program like Ghostscript or ImageMagick to get an image and put the image through Tesseract, that actually creates a new PDF and many details may be lost. OCRmyPDF can produce a minimally changed PDF as output.
Alternately, OCRmyPDF can use the Tesseract OCR engine to directly output PDFs for each page, then merge them.
OCRmyPDF also has some image processing options, such as deskew, which improve the appearance of files and the quality of OCR. When these are used, the OCR layer is grafted onto the processed image instead.
By default, OCRmyPDF will convert the file to a PDF/A. This behavior can be disabled with the ``--output-type pdf`` argument.
Depending on the settings selected, OCRmyPDF may "graft" the OCR layer into the existing PDF, or reconstruct a visually equivalent new PDF.
By default, OCRmyPDF produces archival PDFs (PDF/A), a stricter subset of PDF features designed for long-term archives. If regular PDFs are desired, this can be disabled with ``--output-type pdf``.
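As a rough illustration of the default and the override described above, the sketch below assembles an ``ocrmypdf`` command line in Python. The helper name ``build_ocrmypdf_command`` and the file names are hypothetical; only the PDF/A default and the ``--output-type pdf`` argument come from the text above.

```python
def build_ocrmypdf_command(input_pdf, output_pdf, pdfa=True):
    """Return the argument list for an ocrmypdf run (hypothetical helper)."""
    command = ["ocrmypdf"]
    if not pdfa:
        # PDF/A is the default output; request a regular PDF instead.
        command += ["--output-type", "pdf"]
    # The input file comes first, then the output file.
    command += [input_pdf, output_pdf]
    return command


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", pdfa=False)))
```

The returned list can be passed to ``subprocess.run(command, check=True)`` on a system where ``ocrmypdf`` is installed.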
Why you shouldn't do this manually

@ -5,6 +5,14 @@ OCRmyPDF uses `semantic versioning <http://semver.org/>`_ for its command line i
The OCRmyPDF package itself does not contain a public API, although it is fairly stable and breaking changes are usually timed with a major release. A future release will clearly define the stable public API.
v5.6.0
------
- Fix issue #216: preserve "text as curves" PDFs without rasterizing file
- Related to the above, messages about rasterizing are more consistent
- For consistency, minor releases will now get the trailing .0 they always should have had.
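
For scripts that compare release labels across this change, older two-part labels such as ``v5.5`` can be padded to the three-part form. This is a minimal sketch; the function name ``normalize_version`` is hypothetical and not part of OCRmyPDF.

```python
def normalize_version(label):
    """Pad a version label to three numeric parts (hypothetical helper)."""
    # Keep an optional leading "v" so the label style is preserved.
    prefix = "v" if label.startswith("v") else ""
    parts = label[len(prefix):].split(".")
    # Append ".0" components until major, minor and patch are all present.
    while len(parts) < 3:
        parts.append("0")
    return prefix + ".".join(parts)


if __name__ == "__main__":
    for label in ("v5.5", "v5.6.0"):
        print(label, "->", normalize_version(label))
```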
v5.5
----