=============
Release notes
=============

OCRmyPDF uses `semantic versioning <http://semver.org/>`__ for its
command line interface and its public API.

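As a rough illustration of what semantic versioning means for scripts that depend on OCRmyPDF, the sketch below parses version strings written like the headings in these notes and flags an upgrade as potentially breaking when the major number increases. The helper names ``parse_version`` and ``is_breaking_upgrade`` are hypothetical and are not part of OCRmyPDF.

```python
def parse_version(text):
    """Parse a heading such as 'v9.6.0' into a tuple of ints, e.g. (9, 6, 0)."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def is_breaking_upgrade(old, new):
    """Under semantic versioning, only a major version bump may break the CLI or API."""
    return parse_version(new)[0] > parse_version(old)[0]


print(is_breaking_upgrade("v8.3.2", "v9.0.0"))  # True
print(is_breaking_upgrade("v9.5.0", "v9.6.0"))  # False
```

A check like this only reflects the versioning policy; the release notes for each version remain the authoritative list of changes.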

The ``ocrmypdf`` package may now be imported. The public API may be
useful in scripts that launch OCRmyPDF processes or that wish to use
some of its features for working with PDFs.

Note that it is licensed under GPLv3, so scripts that
``import ocrmypdf`` and are released publicly should probably also be
licensed under GPLv3.

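For scripts that launch OCRmyPDF as a separate process instead of importing it, a minimal sketch might look like the following. It uses only the standard library and only the ``--pages`` and ``--rotate-pages`` options that appear in these notes. The function names ``build_command`` and ``run_ocr`` are hypothetical helpers, not part of OCRmyPDF.

```python
import shutil
import subprocess


def build_command(input_pdf, output_pdf, pages=None, rotate_pages=False):
    """Assemble an ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf"]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if pages:
        cmd.extend(["--pages", pages])
    cmd.extend([input_pdf, output_pdf])
    return cmd


def run_ocr(input_pdf, output_pdf, **options):
    """Run ocrmypdf if it is on PATH; return its exit code, or None if missing."""
    if shutil.which("ocrmypdf") is None:
        return None
    cmd = build_command(input_pdf, output_pdf, **options)
    return subprocess.run(cmd).returncode


print(build_command("in.pdf", "out.pdf", pages="1,3,5-11", rotate_pages=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.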

v9.6.0
======

- Fixed a regression with transferring metadata from the input PDF to the output
  PDF in certain situations.
- pdfminer.six is now supported up to version 2020-01-24.
- Messages explaining page rotation decisions are now shown at the standard
  verbosity level again when ``--rotate-pages`` is used. In some previous
  versions they were set to debug level messages that only appeared with the
  parameter ``-v1``.
- Improvements to ``misc/watcher.py``. Thanks to @ianalexander and @svenihoney.
- Documentation improvements.

v9.5.0
======

- Added API functions to measure OCR quality.
- Modest improvements to handling PDFs with difficult/non-compliant metadata.

v9.4.0
======

- Updated recommended dependency versions.
- Improvements to test coverage and changes to facilitate better measurement of
  test coverage, such as when tests run in subprocesses.
- Improvements to error messages when Leptonica is not installed correctly.
- Fixed use of pytest "session scope" that may have caused some intermittent
  CI failures.
- When the argument ``--keep-temporary-files`` is given or verbosity is set to
  ``-v1``, a debug log file is generated in the working temporary folder.

v9.3.0
======

- Improved native Windows support: we now check in the obvious places in
  the "Program Files" folders for installations of Tesseract and Ghostscript,
  rather than relying on the user to edit ``PATH`` to specify their location.
  The ``PATH`` environment variable can still be used to differentiate when
  multiple installations are present or the programs are installed to
  non-standard locations.
- Fixed an exception on parsing Ghostscript error messages.
- Added an improved example demonstrating how to set up a watched folder
  for automated OCR processing (thanks to @ianalexander for the contribution).
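
The Windows lookup described above can be pictured with a small sketch. This is not OCRmyPDF's actual implementation: the function name ``find_program``, the search depth, and the choice to consult ``PATH`` first are all assumptions made for illustration.

```python
import os
import shutil
from pathlib import Path


def find_program(name, search_dirs, max_depth=3):
    """Return a path to `name`, preferring PATH, then scanning `search_dirs`."""
    found = shutil.which(name)
    if found:
        return found
    exe = name + ".exe"
    for base in map(Path, search_dirs):
        if not base.is_dir():
            continue
        # Try <base>/*/<exe>, then <base>/*/*/<exe>, up to max_depth levels.
        for depth in range(1, max_depth + 1):
            pattern = "/".join(["*"] * depth + [exe])
            matches = sorted(base.glob(pattern))
            if matches:
                return str(matches[0])
    return None


# Hypothetical search roots taken from the usual Windows environment variables.
program_files = [
    os.environ[var]
    for var in ("ProgramFiles", "ProgramFiles(x86)")
    if var in os.environ
]
print(find_program("tesseract", program_files))
```

Checking ``PATH`` first keeps an explicit user choice in control when several installations exist.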

v9.2.0
======

- Native Windows is now supported.
- Continuous integration moved to Azure Pipelines.
- Improved test coverage and speed of tests.
- Fixed an issue where a page that was originally a JPEG would be saved as a
  PNG, increasing file size. This occurred only when a preprocessing option
  was selected along with ``--output-type=pdf`` and all images on the original
  page were JPEGs. Regression since v7.0.0.
- OCRmyPDF no longer depends on the QPDF executable ``qpdf`` or ``libqpdf``.
  It uses pikepdf (which in turn depends on ``libqpdf``). Package maintainers
  should adjust dependencies so that OCRmyPDF no longer calls for libqpdf on
  its own. For users of Python binary wheels, this change means a separate
  installation of QPDF is no longer necessary. This change is mainly to
  simplify installation on Windows.
- Fixed a rare case where log messages from Tesseract would be discarded.
- Fixed incorrect function signature for pixFindPageForeground, causing
  exceptions on certain platforms/Leptonica versions.

v9.1.1
======

- Expand the range of pdfminer.six versions that are supported.
- Fixed Docker build when using pikepdf 1.7.0.
- Fixed documentation to recommend using pip from get-pip.py.

v9.1.0
======

- Improved diagnostics when file size increases at output. Now warns if JBIG2
  or pngquant were not available.
- pikepdf 1.7.0 is now required, to pick up changes that remove the need for
  a source install on Linux systems running Python 3.8.

v9.0.5
======

- The Alpine Docker image (jbarlow83/ocrmypdf-alpine) has been dropped due to
  the difficulties of supporting Alpine Linux.
- The primary Docker image (jbarlow83/ocrmypdf) has been improved to take on
  the extra features that used to be exclusive to the Alpine image.
- No changes to application code.
- pdfminer.six version 20191020 is now supported.

v9.0.4
======

- Fixed compatibility with Python 3.8 (but requires source install for the moment).
- Fixed Tesseract settings for ``--user-words`` and ``--user-patterns``.
- Changed to pikepdf 1.6.5 (for Python 3.8).
- Changed to Pillow 6.2.0 (to mitigate a security vulnerability in earlier Pillow).
- A debug message now mentions when English is automatically selected if the locale
  is not English.

v9.0.3
======

- Embed an encoded version of the sRGB ICC profile in the intermediate
  PostScript file (used for PDF/A conversion). Previously we included the
  filename, which required PostScript to run with file access enabled. For
  security, Ghostscript 9.28 enables ``-dSAFER`` and as such, no longer
  permits access to any file by default. This fix is necessary for
  compatibility with Ghostscript 9.28.
- Exclude a test that sometimes times out and fails in continuous integration
  from the standard test suite.

v9.0.2
======

- The image optimizer now skips optimizing flate (PNG) encoded images in some
  situations where the optimization effort was likely wasted.
- The image optimizer now ignores images that specify arbitrary decode arrays,
  since these are rare.
- Fixed an issue that caused inversion of black and white in monochrome images.
  We are not certain but the problem seems to be linked to Leptonica 1.76.0 and
  older.
- Fixed some cases where the test suite failed if
  English or German Tesseract language packs were not installed.
- Fixed a runtime error if the Tesseract English language is not installed.
- Improved explicit closing of Pillow images after use.
- Actually fixed the Alpine Docker image build.
- Changed to pikepdf 1.6.3.

v9.0.1
======

- Fixed test suite failing when either of the optional dependencies unpaper or
  pngquant was missing.
- Attempted fix of Alpine Docker image build.
- Documented that FreeBSD ports are now available.
- Changed to pikepdf 1.6.1.

v9.0.0
======

**Breaking changes**

- The ``--mask-barcodes`` experimental feature has been dropped due to poor
  reliability and occasional crashes, both due to the underlying library that
  implements this feature (Leptonica).
- The ``-v`` (verbosity level) parameter now accepts only ``0``, ``1``, and
  ``2``.
- Dropped support for Tesseract 4.00.00-alpha releases. Tesseract 4.0 beta and
  later remain supported.
- Dropped the ``ocrmypdf-polyglot`` and ``ocrmypdf-webservice`` images.

**New features**

- Added a high level API for applications that want to integrate OCRmyPDF.
  Special thanks to Martin Wind (@mawi1988) who made significant contributions
  to this effort. OCRmyPDF is GPLv3-licensed.
- Added progress bars for long-running steps. ■■■■■■■□□
- We now create linearized ("fast web view") PDFs by default. The new parameter
  ``--fast-web-view`` provides control over when this feature is applied.
- Added a new ``--pages`` feature to limit OCR to only a specific page range.
  The list may contain single pages or ranges separated by commas, such as
  ``1, 3, 5-11``.
- When the number of pages is small compared to the number of allowed jobs, we
  run Tesseract in multithreaded (OpenMP) mode when available. This should
  improve performance on files with low page counts.
- Removed dependency on ``ruffus``, and with that, the non-reentrancy
  restrictions that previously made an API impossible.
- Output and logging messages overhauled so that ocrmypdf may be integrated
  into applications that use the logging module.
- pikepdf 1.6.0 is required.
- Added a logo. 😊
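
To make the ``--pages`` syntax concrete, here is a minimal sketch of how a list such as ``1, 3, 5-11`` could be expanded into individual page numbers. It is an illustration only and is not OCRmyPDF's own parser. The function name ``parse_page_ranges`` is hypothetical.

```python
def parse_page_ranges(spec):
    """Expand a string such as '1, 3, 5-11' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_ranges("1, 3, 5-11"))  # [1, 3, 5, 6, 7, 8, 9, 10, 11]
```

Collecting the pages in a set means overlapping entries such as ``3, 2-4`` are counted once.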

**Bug fixes**

- Pages with vector artwork are treated as full color. Previously, vectors
  were ignored when considering the colorspace needed to cover a page, which
  could cause loss of color under certain settings.
- Test suite now spawns processes less frequently, allowing more accurate
  measurement of code coverage.
- Improved test coverage.
- Fixed a rare division by zero (if optimization produced an invalid file).
- Updated Docker images to use newer versions.
- Fixed an issue where images encoded as JBIG2 with a colorspace other than
  ``/DeviceGray`` were not interpreted correctly.
- Fixed an OCR text-image registration (i.e. alignment) problem when the
  page's MediaBox had a nonzero corner.

v8.3.2
======

- Dropped the workaround for macOS that allowed it to work without
  pdfminer.six, now that a proper sdist release of pdfminer.six is available.

- pikepdf 1.5.0 is now required.

v8.3.1
======

- Fixed an issue where PDFs with malformed metadata would be rendered as
  blank pages. `#398 <https://github.com/jbarlow83/OCRmyPDF/issues/398>`_.

v8.3.0
======

- Improved the strategy for updating pages when a new image of the page
  was produced. We now attempt to preserve more content from the
  original file, for annotations in particular.
- For PDFs with more than 100 pages and a sequence where one PDF page
  was replaced and one or more subsequent ones were skipped, an
  intermediate file would be corrupted while grafting OCR text, causing
  processing to fail. This is a regression, likely introduced in
  v8.2.4.
- Previously, we resized the images produced by Ghostscript by a small
  number of pixels to ensure the output image size was exactly what
  we wanted. Having discovered a way to get Ghostscript to produce the
  exact image sizes we require, we eliminated the resizing step.
- Command line completions for ``bash`` are now available, in addition
  to ``fish``, both in ``misc/completion``. Package maintainers, please
  install these so users can take advantage.
- Updated requirements.
- pikepdf 1.3.0 is now required.

v8.2.4
======

- Fixed a false positive while checking for a certain type of PDF that
  only Acrobat can read. We now more accurately detect Acrobat-only
  PDFs.
- OCRmyPDF holds fewer open file handles and is more prompt about
  releasing those it no longer needs.
- Minor optimization: we no longer traverse the table of contents to
  ensure all references in it are resolved, as changes to libqpdf have
  made this unnecessary.
- pikepdf 1.2.0 is now required.

v8.2.3
======

- Fixed that ``--mask-barcodes`` would occasionally leave an unwanted
  temporary file named ``junkpixt`` in the current working folder.
- Fixed (hopefully) handling of Leptonica errors in an environment
  where a non-standard ``sys.stderr`` is present.
- Improved help text for ``--verbose``.

v8.2.2
======

- Fixed a regression from v8.2.0, an exception that occurred while
  attempting to report that ``unpaper`` or another optional dependency
  was unavailable.
- In some cases, ``ocrmypdf [-c|--clean]`` failed to exit with an error
  when ``unpaper`` was not installed.

v8.2.1
======

- This release was canceled.

v8.2.0
======

- A major improvement to our Docker image is now available thanks to
  hard work contributed by @mawi12345. The new Docker image,
  ocrmypdf-alpine, is based on Alpine Linux, and includes most of the
  functionality of three existing images in a smaller package. This
  image will replace the main Docker image eventually but for now all
  are being built. `See documentation for
  details <https://ocrmypdf.readthedocs.io/en/latest/docker.html>`__.
- Documentation reorganized especially around the use of Docker images.
- Fixed a problem with PDF image optimization, where the optimizer
  would unnecessarily decompress and recompress PNG images, in some
  cases losing the benefits of the quantization it had just
  performed. The optimizer is now capable of embedding PNG images into
  PDFs without transcoding them.
- Fixed a minor regression with lossy JBIG2 image optimization. All
  JBIG2 candidate images were incorrectly placed into a single
  optimization group for the whole file, instead of grouping pages
  together. This usually makes a larger JBIG2Globals dictionary and
  results in inferior compression, so it worked less well than
  designed. However, quality would not be impacted. Lossless JBIG2 was
  entirely unaffected.
- Updated dependencies, including pikepdf to 1.1.0. This fixes
  `#358 <https://github.com/jbarlow83/OCRmyPDF/issues/358>`__.
- The install-time version checks for certain external programs have
  been removed from setup.py. These tests are now performed at
  run-time.
- The non-standard option to override install-time checks
  (``setup.py install --force``) is now deprecated and prints a
  warning. It will be removed in a future release.

v8.1.0
======

- Added a feature, ``--unpaper-args``, which allows passing arbitrary
  arguments to ``unpaper`` when using ``--clean`` or ``--clean-final``.
  The default, very conservative unpaper settings are suppressed.
- The argument ``--clean-final`` now implies ``--clean``. It was
  possible to issue ``--clean-final`` on its own before this, but it would
  have no useful effect.
- Fixed an exception on traversing corrupt table of contents entries
  (specifically, those with invalid destination objects).
- Fixed an issue when using ``--tesseract-timeout`` and image
  processing features on a file with more than 100 pages.
  `#347 <https://github.com/jbarlow83/OCRmyPDF/issues/347>`__
- OCRmyPDF now always calls ``os.nice(5)`` to signal to operating
  systems that it is a background process.
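
For readers unfamiliar with ``os.nice``, the sketch below shows one way a script could lower its own priority in the same spirit. The wrapper name ``lower_priority`` and the fallback behavior on platforms without ``os.nice`` are assumptions for illustration, not a description of OCRmyPDF's code.

```python
import os


def lower_priority(increment=5):
    """Raise this process's niceness by `increment`; return the new value or None."""
    if not hasattr(os, "nice"):
        # os.nice is unavailable on some platforms, such as Windows.
        return None
    try:
        return os.nice(increment)
    except OSError:
        return None


# An increment of 0 leaves the priority unchanged and reports the current niceness.
print(lower_priority(0))
```

A higher niceness asks the scheduler to favor other processes, which suits long-running batch OCR.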

v8.0.1
======

- Fixed an exception when parsing PDFs that are missing a required
  field. `#325 <https://github.com/jbarlow83/OCRmyPDF/issues/325>`__
- pikepdf 1.0.5 is now required, to address some other PDF parsing
  issues.

v8.0.0
======

No major features. The intent of this release is to sever support for
older versions of certain dependencies.

**Breaking changes**

- Dropped support for Tesseract 3.x. Tesseract 4.0 or newer is now
  required.
- Dropped support for Python 3.5.
- Some ``ocrmypdf.pdfa`` APIs that were deprecated in v7.x were
  removed. This functionality has been moved to pikepdf.

**Other changes**

- Fixed an unhandled exception when attempting to mask barcodes.
  `#322 <https://github.com/jbarlow83/OCRmyPDF/issues/322>`__
- It is now possible to use ocrmypdf without pdfminer.six, to support
  distributions that do not have it or cannot currently use it (e.g.
  Homebrew). Downstream maintainers should include pdfminer.six if
  possible.
- A warning is now issued when PDF/A conversion removes some XMP
  metadata from the input PDF. (Only a "whitelist" of certain XMP
  metadata types are allowed in PDF/A.)
- Fixed several issues that caused PDF/As to be produced with
  nonconforming XMP metadata (would fail validation with veraPDF).
- Fixed some instances where invalid DocumentInfo from a PDF caused XMP
  metadata creation to fail.
- Fixed a few documentation problems.
- pikepdf 1.0.2 is now required.

v7.4.0
======

- ``--force-ocr`` may now be used with the new ``--threshold`` and
  ``--mask-barcodes`` features.
- pikepdf >= 0.9.1 is now required.
- Changed metadata handling to pikepdf 0.9.1. As a result, metadata
  handling of non-ASCII characters in Ghostscript 9.25 or later is
  fixed.
- chardet >= 3.0.4 is temporarily listed as required. pdfminer.six
  depends on it, but the most recent release does not specify this
  requirement.
  (`#326 <https://github.com/jbarlow83/OCRmyPDF/issues/326>`__)
- python-xmp-toolkit and libexempi are no longer required.
- A new Docker image is now being provided for users who wish to access
  OCRmyPDF over a simple HTTP interface, instead of the command line.
- Increase tolerance of PDFs that overflow or underflow the PDF
  graphics stack.
  (`#325 <https://github.com/jbarlow83/OCRmyPDF/issues/325>`__)

v7.3.1
======

- Fixed performance regression from v7.3.0; fast page analysis was not
  selected when it should be.
- Fixed a few exceptions related to the new ``--mask-barcodes`` feature
  and improved argument checking.
- Added missing detection of TrueType fonts that lack a Unicode mapping.

v7.3.0
======

- Added a new feature ``--redo-ocr`` to detect existing OCR in a file,
  remove it, and redo the OCR. This may be particularly helpful for
  anyone who wants to take advantage of OCR quality improvements in
  Tesseract 4.0. Note that OCR added by OCRmyPDF before version 3.0
  cannot be detected since it was not properly marked as invisible text
  in the earliest versions. OCR that constructs a font from visible
  text, such as Adobe Acrobat's ClearScan, also cannot be detected.
- OCRmyPDF's content detection is generally more sophisticated. It
  learns more about the contents of each PDF and makes better
  recommendations:

  - OCRmyPDF can now detect when a PDF contains text that cannot be
    mapped to Unicode (meaning it is readable to human eyes but
    copy-pastes as gibberish). In these cases it recommends
    ``--force-ocr`` to make the text searchable.
  - PDFs containing vector objects are now rendered at more
    appropriate resolution for OCR.
  - We now exit with an error for PDFs that contain Adobe LiveCycle
    Designer's dynamic XFA forms. Currently the open source community
    does not have tools to work with these files.
  - OCRmyPDF now warns when a PDF contains Adobe AcroForms, since
    such files probably do not need OCR. It can work with these files.

- Added three new **experimental** features to improve OCR quality in
  certain conditions. The name, syntax and behavior of these arguments
  are subject to change. They may also be incompatible with some other
  features.

  - ``--remove-vectors`` which strips out vector graphics. This can
    improve OCR quality since OCR will not search artwork for readable
    text; however, it currently removes "text as curves" as well.
  - ``--mask-barcodes`` to detect and suppress barcodes in files. We
    have observed that barcodes can interfere with OCR because they
    are "text-like" but not actually textual.
  - ``--threshold`` which uses a more sophisticated thresholding
    algorithm than is currently in use in Tesseract OCR. This works
    around a `known issue in Tesseract
    4.0 <https://github.com/tesseract-ocr/tesseract/issues/1990>`__
    with dark text on bright backgrounds.

- Fixed an issue where an error message was not reported when the
  installed Ghostscript was very old.
- The PDF optimizer now saves files with object streams enabled when
  the optimization level is ``--optimize 1`` or higher (the default).
  This makes files a little bit smaller, but requires PDF 1.5. PDF 1.5
  was first released in 2003 and is broadly supported by PDF viewers,
  but some rudimentary PDF parsers such as PyPDF2 do not understand
  object streams. You can use the command line tool
  ``qpdf --object-streams=disable`` or the
  `pikepdf <https://github.com/pikepdf/pikepdf>`__ library to remove
  them.
- New dependency: pdfminer.six 20181108. Note this is a fork of the
  Python 2-only pdfminer.
- Deprecation notice: At the end of 2018, we will be ending support for
  Python 3.5 and Tesseract 3.x. OCRmyPDF v7 will continue to work with
  older versions.
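
For the object streams item above, a script could wrap the ``qpdf --object-streams=disable`` command roughly as follows. This is a sketch using only the standard library. The function names and the file names in the example are hypothetical, and ``qpdf`` must be installed separately for the second function to do anything.

```python
import shutil
import subprocess


def qpdf_disable_object_streams(input_pdf, output_pdf):
    """Build the qpdf command that rewrites a PDF without object streams."""
    return ["qpdf", "--object-streams=disable", input_pdf, output_pdf]


def remove_object_streams(input_pdf, output_pdf):
    """Run qpdf if available; return its exit code, or None when qpdf is missing."""
    if shutil.which("qpdf") is None:
        return None
    cmd = qpdf_disable_object_streams(input_pdf, output_pdf)
    return subprocess.run(cmd).returncode


print(qpdf_disable_object_streams("optimized.pdf", "plain.pdf"))
```

The rewritten file is typically somewhat larger, since object streams are what provide the size saving.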

v7.2.1
======

- Fix compatibility with an API change in pikepdf 0.3.5.
- A kludge to support Leptonica versions older than 1.72 in the test
  suite was dropped. Older versions of Leptonica are likely still
  compatible. The only impact is that a portion of the test suite will
  be skipped.

v7.2.0
======

**Lossy JBIG2 behavior change**

A user reported that ocrmypdf was in fact using JBIG2 in **lossy**
compression mode. This was not the intended behavior. Users should
`review the technical concerns with JBIG2 in lossy
mode <https://abbyy.technology/en:kb:tip:jbig2_compression_and_ocr>`__
and decide if this is a concern for their use case.

JBIG2 lossy mode does achieve higher compression ratios than any other
monochrome compression technology; for large text documents the savings
are considerable. JBIG2 lossless still gives great compression ratios
and is a major improvement over the older CCITT G4 standard.

Only users who have reviewed the concerns with JBIG2 in lossy mode
should opt-in. As such, lossy mode JBIG2 is only turned on when the new
argument ``--jbig2-lossy`` is issued. This is independent of the setting
for ``--optimize``.

Users who did not install an optional JBIG2 encoder are unaffected.

(Thanks to user 'bsdice' for reporting this issue.)

**Other issues**

- When the image optimizer quantizes an image to 1 bit per pixel, it
  will now attempt to further optimize that image as CCITT or JBIG2,
  instead of keeping it in the "flate" encoding which is not efficient
  for 1 bpp images.
  (`#297 <https://github.com/jbarlow83/OCRmyPDF/issues/297>`__)
- Images in PDFs that are used as soft masks (i.e. transparency masks
  or alpha channels) are now excluded from optimization.
- Fixed handling of Tesseract 4.0-rc1 which now accepts invalid
  Tesseract configuration files, which broke the test suite.

v7.1.0
======

- Improve the performance of initial text extraction, which is done to
  determine if a file contains existing text of some kind or not. On
  large files, this initial processing is now about 20 times faster.
  (`#299 <https://github.com/jbarlow83/OCRmyPDF/issues/299>`__)
- pikepdf 0.3.3 is now required.
- Fixed issue
  `#231 <https://github.com/jbarlow83/OCRmyPDF/issues/231>`__, a
  problem with JPEG2000 images where image metadata was only available
  inside the JPEG2000 file.
- Fixed some additional Ghostscript 9.25 compatibility issues.
- Improved handling of KeyboardInterrupt error messages.
  (`#301 <https://github.com/jbarlow83/OCRmyPDF/issues/301>`__)
- README.md is now served in GitHub markdown instead of
  reStructuredText.

v7.0.6
======

- Blacklist Ghostscript 9.24, now that 9.25 is available and fixes many
  regressions in 9.24.

v7.0.5
======

- Improve compatibility with Ghostscript 9.24, and enable the JPEG
  passthrough feature when this version is installed.
- Ghostscript 9.24 lost the ability to set PDF title, author, subject
  and keyword metadata to Unicode strings. OCRmyPDF will set ASCII
  strings and warn when Unicode is suppressed. Other software may be
  used to update metadata. This is a short-term workaround.
- PDFs generated by Kodak Capture Desktop, or generally PDFs that
  contain indirect references to null objects in their table of
  contents, would have an invalid table of contents after processing by
  OCRmyPDF that might interfere with other viewers. This has been
  fixed.
- Detect PDFs generated by Adobe LiveCycle, which can only be displayed
  in Adobe Acrobat and Reader currently. When these are encountered,
  exit with an error instead of performing OCR on the "Please wait"
  error message page.

v7.0.4
======

- Fix exception thrown when trying to optimize a certain type of PNG
  embedded in a PDF with the ``-O2`` option.
- Update to pikepdf 0.3.2, to gain support for optimizing some
  additional image types that were previously excluded from
  optimization (CMYK and grayscale). Fixes
  `#285 <https://github.com/jbarlow83/OCRmyPDF/issues/285>`__.

v7.0.3
======

- Fix issue
  `#284 <https://github.com/jbarlow83/OCRmyPDF/issues/284>`__, an error
  when parsing inline images that are also image masks, by
  upgrading pikepdf to 0.3.1.

2018-08-03 13:37:18 -07:00
|
|
|
v7.0.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-08-03 13:37:18 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fix a regression with ``--rotate-pages`` on pages that already had
|
|
|
|
rotations applied.
|
|
|
|
(`#279 <https://github.com/jbarlow83/OCRmyPDF/issues/279>`__)
|
|
|
|
- Improve quality of page rotation in some cases by rasterizing a
|
|
|
|
higher quality preview image.
|
|
|
|
(`#281 <https://github.com/jbarlow83/OCRmyPDF/issues/281>`__)

v7.0.1
======

- Fix compatibility with img2pdf >= 0.3.0 by rejecting input images
  that have an alpha channel
- Add forward compatibility for pikepdf 0.3.0 (unrelated to img2pdf)
- Various documentation updates for v7.0.0 changes

v7.0.0
======

- The core algorithm for combining OCR layers with existing PDF pages
  has been rewritten and improved considerably. PDFs are no longer
  split into single page PDFs for processing; instead, images are
  rendered and the OCR results are grafted onto the input PDF. The new
  algorithm uses less temporary disk space and is much more performant,
  especially for large files.
- New dependency: `pikepdf <https://github.com/pikepdf/pikepdf>`__.
  pikepdf is a powerful new Python PDF library driving the latest
  OCRmyPDF features, built on the QPDF C++ library (libqpdf).
- New feature: PDF optimization with ``-O`` or ``--optimize``. After
  OCR, OCRmyPDF will perform image optimizations relevant to OCR PDFs.

  - If a JBIG2 encoder is available, then monochrome images will be
    converted, with the potential for huge savings on large black and
    white images, since JBIG2 is far more efficient than any other
    monochrome (bi-level) compression. (All known US patents related
    to JBIG2 have probably expired, but it remains the responsibility
    of the user to supply a JBIG2 encoder such as
    `jbig2enc <https://github.com/agl/jbig2enc>`__. OCRmyPDF does not
    implement JBIG2 encoding.)
  - If ``pngquant`` is installed, OCRmyPDF will optionally use it to
    perform lossy quantization and compression of PNG images.
  - The quality of JPEGs can also be lowered, on the assumption that a
    lower quality image may be suitable for storage after OCR.
  - This image optimization component will eventually be offered as an
    independent command line utility.
  - Optimization ranges from ``-O0`` through ``-O3``, where ``0``
    disables optimization and ``3`` implements all options. ``1``, the
    default, performs only safe and lossless optimizations. (This is
    similar to GCC's optimization parameter.) The exact type of
    optimizations performed will vary over time.

- Small amounts of text in the margins of a page, such as watermarks,
  page numbers, or digital stamps, will no longer prevent the rest of a
  page from being OCRed when ``--skip-text`` is issued. This behavior
  is based on a heuristic.
- Removed features

  - The deprecated ``--pdf-renderer tesseract`` PDF renderer was
    removed.
  - ``-g``, the option to generate debug text pages, was removed
    because it was a maintenance burden and only worked in isolated
    cases. HOCR pages can still be previewed by running
    hocrtransform.py with appropriate settings.

- Removed dependencies

  - ``PyPDF2``
  - ``defusedxml``
  - ``PyMuPDF``

- The ``sandwich`` PDF renderer can be used with all supported versions
  of Tesseract, including those prior to v3.05, which don't support
  ``-c textonly``. (Tesseract v4.0.0 is recommended and more
  efficient.)
- The ``--pdf-renderer auto`` option and the diagnostics used to select
  a PDF renderer now work better with old versions, but may make
  different decisions than past versions.
- If everything succeeds but PDF/A conversion fails, a distinct return
  code is now returned (``ExitCode.pdfa_conversion_failed (10)``) where
  this situation previously returned
  ``ExitCode.invalid_output_pdf (4)``. The latter is now returned only
  if there is some indication that the output file is invalid.
- Notes for downstream packagers

  - There is also a new dependency on ``python-xmp-toolkit`` which in
    turn depends on ``libexempi3``.
  - It may be necessary to separately ``pip install pycparser`` to
    avoid `another Python 3.7
    issue <https://github.com/eliben/pycparser/pull/135>`__.
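
The optimization levels and the two return codes described for v7.0.0 can be combined in a small wrapper script. The following is a minimal sketch, not OCRmyPDF's own code: the names ``NotedExitCode``, ``build_command`` and ``describe_exit`` are hypothetical, the numeric values 4 and 10 are taken from the notes above, and treating 0 as success is an assumption based on the usual process convention.

```python
import enum
from typing import List


class NotedExitCode(enum.IntEnum):
    """Return codes named in the v7.0.0 notes (hypothetical mirror)."""

    ok = 0  # assumption: conventional success status
    invalid_output_pdf = 4
    pdfa_conversion_failed = 10


def build_command(input_pdf: str, output_pdf: str, optimize: int = 1) -> List[str]:
    """Build an argv list that requests optimization level 0 through 3."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    return ["ocrmypdf", f"-O{optimize}", input_pdf, output_pdf]


def describe_exit(returncode: int) -> str:
    """Map a return code to a short human-readable explanation."""
    if returncode == NotedExitCode.ok:
        return "success"
    if returncode == NotedExitCode.pdfa_conversion_failed:
        return "OCR succeeded but PDF/A conversion failed"
    if returncode == NotedExitCode.invalid_output_pdf:
        return "the output file appears to be invalid"
    return f"other failure (return code {returncode})"
```

A caller might pass ``build_command("in.pdf", "out.pdf", 2)`` to ``subprocess.run`` and feed the resulting ``returncode`` to ``describe_exit`` to decide whether a non-PDF/A output is still acceptable.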

v6.2.5
======

- Disable a failing test due to Tesseract 4.0rc1 behavior change.
  Previously, Tesseract would exit with an error message if its
  configuration was invalid, and OCRmyPDF would intercept this message.
  Now Tesseract issues a warning, which OCRmyPDF v6.2.5 may relay or
  ignore. (In v7.x, OCRmyPDF will respond to the warning.)
- This release branch no longer supports using the optional PyMuPDF
  installation, since it was removed in v7.x.
- This release branch no longer supports macOS. macOS users should
  upgrade to v7.x.

v6.2.4
======

- Backport Ghostscript 9.25 compatibility fixes, which remove support
  for setting Unicode metadata
- Backport blacklisting Ghostscript 9.24
- Older versions of Ghostscript are still supported

v6.2.3
======

- Fix compatibility with img2pdf >= 0.3.0 by rejecting input images
  that have an alpha channel
- This version will be included in Ubuntu 18.10

v6.2.2
======

- Backport compatibility fixes for Python 3.7 and ruffus 2.7.0 from
  v7.0.0
- Backport fix to ignore masks when deciding what colors are on a page
- Backport some minor improvements from v7.0.0: better argument
  validation and warnings about the Tesseract 4.0.0 ``--user-words``
  regression

v6.2.1
======

- Fix recent versions of Tesseract (after 4.0.0-beta1) not being
  detected as supporting the ``sandwich`` renderer
  (`#271 <https://github.com/jbarlow83/OCRmyPDF/issues/271>`__).

v6.2.0
======

- **Docker**: The Docker image ``ocrmypdf-tess4`` has been removed. The
  main Docker images, ``ocrmypdf`` and ``ocrmypdf-polyglot``, now use
  Ubuntu 18.04 as a base image, and as such Tesseract 4.0.0-beta1 is
  now the Tesseract version they use. There is no Docker image based on
  Tesseract 3.05 anymore.
- Creation of PDF/A-3 is now supported. However, there is no ability to
  attach files to PDF/A-3.
- Lists more reasons why the file size might grow.
- Fix issue
  `#262 <https://github.com/jbarlow83/OCRmyPDF/issues/262>`__,
  ``--remove-background`` error on PDFs containing colormapped
  (paletted) images.
- Fix another XMP metadata validation issue, in cases where the input
  file's creation date has no timezone and the creation date is not
  overridden.

v6.1.5
======

- Fix issue
  `#253 <https://github.com/jbarlow83/OCRmyPDF/issues/253>`__, a
  possible division by zero when using the ``hocr`` renderer.
- Fix incorrectly formatted ``<xmp:ModifyDate>`` field inside XMP
  metadata for PDF/As. veraPDF flags this as a PDF/A validation
  failure. The error caused the timezone and final digit of the
  seconds of modified time to be omitted, so at worst the modification
  time stamp is rounded to the nearest 10 seconds.
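
For illustration, a complete timestamp of the kind the ``<xmp:ModifyDate>`` fix is concerned with carries full seconds and a UTC offset. The sketch below uses only the Python standard library and is not taken from OCRmyPDF; the helper name ``xmp_date`` is hypothetical, and treating naive datetimes as UTC is an assumption made purely to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone


def xmp_date(dt: datetime) -> str:
    """Format a datetime as ISO 8601 with whole seconds and a UTC offset."""
    if dt.tzinfo is None:
        # Assumption for this sketch: interpret naive datetimes as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat(timespec="seconds")


pacific = timezone(timedelta(hours=-7))
stamp = xmp_date(datetime(2018, 4, 17, 15, 23, 35, tzinfo=pacific))
print(stamp)
```

A string that lacks the trailing offset, or the last digit of the seconds, is the kind of value a validator such as veraPDF would reject.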

v6.1.4
======

- Fix issue `#248 <https://github.com/jbarlow83/OCRmyPDF/issues/248>`__:
  ``--clean`` argument may remove OCR from left column of text on
  certain documents. We now set ``--layout none`` to suppress this.
- The test cache was updated to reflect the change above.
- Change test suite to accommodate Ghostscript 9.23's new ability to
  insert JPEGs into PDFs without transcoding.
- XMP metadata in PDFs is now examined using ``defusedxml`` for safety.
- If an external process exits with a signal when asked to report its
  version, we now print the system error message instead of suppressing
  it. This occurred when the required executable was found but was
  missing a shared library.
- qpdf 7.0.0 or newer is now required as the test suite can no longer
  pass without it.
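
As background for the signal-related item above: on POSIX systems, Python's ``subprocess`` module reports a child killed by signal N as a return code of ``-N``. The sketch below shows one way a caller could turn that into a readable message; ``describe_returncode`` is a hypothetical helper, not a function from OCRmyPDF.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess return code, including death by signal."""
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"
```

A version probe could call this on ``subprocess.run([...]).returncode`` and print the result rather than discarding it.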

Notes
-----

- An apparent `regression in Ghostscript
  9.23 <https://bugs.ghostscript.com/show_bug.cgi?id=699216>`__ will
  cause some ocrmypdf output files to become invalid in rare cases; the
  workaround for the moment is to set ``--force-ocr``.

v6.1.3
======

- Fix issue
  `#247 <https://github.com/jbarlow83/OCRmyPDF/issues/247>`__,
  ``/CreationDate`` metadata not copied from input to output.
- A warning is now issued when Python 3.5 is used on files with a large
  page count, as this case is known to regress to single-core
  performance. The cause of this problem is unknown.

v6.1.2
======

- Upgrade to PyMuPDF v1.12.5 which includes a more complete fix to
  `#239 <https://github.com/jbarlow83/OCRmyPDF/issues/239>`__.
- Add ``defusedxml`` dependency.

v6.1.1
======

- Fix text being reported as found on all pages if PyMuPDF is not
  installed.

v6.1.0
======

- PyMuPDF is now an optional but recommended dependency, to alleviate
  installation difficulties on platforms that have less access to
  PyMuPDF than the author anticipated. (For version 6.x only) install
  OCRmyPDF with ``pip install ocrmypdf[fitz]`` to use it to its full
  potential.
- Fix ``FileExistsError`` that could occur if OCR timed out while it
  was generating the output file.
  (`#218 <https://github.com/jbarlow83/OCRmyPDF/issues/218>`__)
- Fix table of contents/bookmarks all being redirected to page 1 when
  generating a PDF/A (with PyMuPDF). (Without PyMuPDF the table of
  contents is removed in PDF/A mode.)
- Fix "RuntimeError: invalid key in dict" when table of
  contents/bookmarks titles contained the character ``)``.
  (`#239 <https://github.com/jbarlow83/OCRmyPDF/issues/239>`__)
- Added a new argument ``--skip-repair`` to skip the initial PDF repair
  step if the PDF is already well-formed (because another program
  repaired it).

v6.0.0
======

- The software license has been changed to GPLv3. Test resource files
  and some individual sources may have other licenses.
- OCRmyPDF now depends on
  `PyMuPDF <https://pymupdf.readthedocs.io/en/latest/installation/>`__.
  Including PyMuPDF is the primary reason for the change to GPLv3.
- Other backward incompatible changes

  - The ``OCRMYPDF_TESSERACT``, ``OCRMYPDF_QPDF``, ``OCRMYPDF_GS`` and
    ``OCRMYPDF_UNPAPER`` environment variables are no longer used.
    Change ``PATH`` if you need to override the external programs
    OCRmyPDF uses.
  - The ``ocrmypdf`` package has been moved to ``src/ocrmypdf`` to
    avoid issues with accidental import.
  - The function ``ocrmypdf.exec.get_program`` was removed.
  - The deprecated module ``ocrmypdf.pageinfo`` was removed.
  - The ``--pdf-renderer tess4`` alias for ``sandwich`` was removed.

- Fixed an issue where OCRmyPDF failed to detect existing text on
  pages, depending on how the text and fonts were encoded within the
  PDF. (`#233 <https://github.com/jbarlow83/OCRmyPDF/issues/233>`__,
  `#232 <https://github.com/jbarlow83/OCRmyPDF/issues/232>`__)
- Fixed an issue that caused dramatic inflation of file sizes when
  ``--skip-text --output-type pdf`` was used. OCRmyPDF now removes
  duplicate resources such as fonts, images and other objects that it
  generates.
  (`#237 <https://github.com/jbarlow83/OCRmyPDF/issues/237>`__)
- Improved performance of the initial page splitting step. Originally
  this step was not believed to be expensive and ran in a single
  process. Large file testing revealed it to be a bottleneck, so it is
  now parallelized. On a 700-page file with a quad-core machine, this
  change saves about 2 minutes.
  (`#234 <https://github.com/jbarlow83/OCRmyPDF/issues/234>`__)
- The test suite now includes a cache that can be used to speed up test
  runs across platforms. This also does not require computing
  checksums, so it's faster.
  (`#217 <https://github.com/jbarlow83/OCRmyPDF/issues/217>`__)
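
Since the ``OCRMYPDF_*`` environment variables were dropped in v6.0.0 in favor of ``PATH``, a script that needs a specific build of an external program can prepend its directory to ``PATH`` for the child process only. This is a minimal sketch under that reading of the note; the helper name ``env_with_tool_dir`` and the directory used in the usage note are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_tool_dir(
    tool_dir: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with tool_dir first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    old_path = env.get("PATH", "")
    # Entries earlier on PATH win, so the override directory goes first.
    env["PATH"] = tool_dir + (os.pathsep + old_path if old_path else "")
    return env
```

The returned mapping can be passed as ``env=`` to ``subprocess.run``, for example ``env_with_tool_dir("/opt/tesseract/bin")``, leaving the parent process's own ``PATH`` untouched.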

v5.7.0
======

- Fixed an issue that caused poor CPU utilization on machines with more
  than 4 cores when running Tesseract 4. (Related to issue
  `#217 <https://github.com/jbarlow83/OCRmyPDF/issues/217>`__.)
- The 'hocr' renderer has been improved. The 'sandwich' and 'tesseract'
  renderers are still better for most use cases, but 'hocr' may be
  useful for people who work with the PDF.js renderer in English/ASCII
  languages.
  (`#225 <https://github.com/jbarlow83/OCRmyPDF/issues/225>`__)

  - It now formats text in a manner that is easier for certain PDF
    viewers to select and extract for copy and paste. This should
    help macOS Preview and PDF.js in particular.
  - The appearance of selected text and behavior of selecting text is
    improved.
  - The PDF content stream now uses relative moves, making it more
    compact and easier for viewers to determine when two words are on
    the same line.
  - It can now deal with text on a skewed baseline.
  - Thanks to @cforcey for the pull request, @jbreiden for many
    helpful suggestions, @ctbarbour for another round of improvements,
    and @acaloiaro for an independent review.

v5.6.3
======

- Suppress two debug messages that were too verbose

v5.6.2
======

- Development branch accidentally tagged as release. Do not use.

v5.6.1
======

- Fix issue
  `#219 <https://github.com/jbarlow83/OCRmyPDF/issues/219>`__: change
  how the final output file is created to avoid triggering permission
  errors when the output is a special file such as ``/dev/null``
- Fix test suite failures due to a qpdf 8.0.0 regression and Python
  3.5's handling of symlinks
- The "encrypted PDF" error message was different depending on the type
  of PDF encryption. Now a single clear message appears for all types
  of PDF encryption.
- ocrmypdf is now in Homebrew. Homebrew users are advised to use the
  version of ocrmypdf in the official homebrew-core formulas rather
  than the private tap.
- Some linting

v5.6.0
======

- Fix issue
  `#216 <https://github.com/jbarlow83/OCRmyPDF/issues/216>`__: preserve
  "text as curves" PDFs without rasterizing file
- Related to the above, messages about rasterizing are more consistent
- For consistency, minor releases will now get the trailing .0
  they always should have had.

v5.5
====

- Add new argument ``--max-image-mpixels``. Pillow 5.0 now raises an
  exception when images may be decompression bombs. This argument can
  be used to override the limit Pillow sets.
- Fix output page cropped when using the sandwich renderer and OCR is
  skipped on a rotated and image-processed page
- A warning is now issued when old versions of Ghostscript are used in
  cases known to cause issues with non-Latin characters
- Fix a few parameter validation checks for ``--output-type pdfa-1`` and
  ``pdfa-2``

v5.4.4
======

- Fix issue
  `#181 <https://github.com/jbarlow83/OCRmyPDF/issues/181>`__: fix
  final merge failure for PDFs with more pages than the system file
  handle limit (``ulimit -n``)
- Fix issue
  `#200 <https://github.com/jbarlow83/OCRmyPDF/issues/200>`__: an
  uncommon syntax for formatting decimal numbers in a PDF would cause
  qpdf to issue a warning, which ocrmypdf treated as an error. Now
  the warning is relayed.
- Fix an issue where intermediate PDFs would be created at version 1.3
  instead of the version of the original file. It's possible but
  unlikely this had side effects.
- A warning is now issued when older versions of qpdf are used since
  issues like
  `#200 <https://github.com/jbarlow83/OCRmyPDF/issues/200>`__ cause
  qpdf to infinite-loop
- Address issue
  `#140 <https://github.com/jbarlow83/OCRmyPDF/issues/140>`__: if
  Tesseract outputs invalid UTF-8, escape it and print its message
  instead of aborting with a Unicode error
- Added previously unlisted setup requirement, pytest-runner
- Update documentation: fix an error in the example script for Synology
  with Docker images, improved security guidance, advised
  ``pip install --user``
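
The approach described for issue #140, escaping invalid UTF-8 rather than aborting, can be illustrated with Python's built-in ``backslashreplace`` error handler. This is a sketch of the general technique and not necessarily the exact handler OCRmyPDF uses; ``decode_tool_output`` is a hypothetical name.

```python
def decode_tool_output(raw: bytes) -> str:
    """Decode subprocess output as UTF-8, escaping any invalid bytes."""
    # Invalid bytes become literal \xNN sequences instead of raising
    # UnicodeDecodeError, so the message can still be printed.
    return raw.decode("utf-8", errors="backslashreplace")


message = decode_tool_output(b"Warning: caf\xe9")
print(message)
```

Valid UTF-8 passes through unchanged, so the same function can be applied to all captured output.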

v5.4.3
======

- If a subprocess fails to report its version when queried, exit
  cleanly with an error instead of throwing an exception
- Added test to confirm that the system locale is Unicode-aware and
  fail early if it's not
- Clarified some copyright information
- Updated pinned requirements.txt so the homebrew formula captures more
  recent versions
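
A fail-early locale check of the kind mentioned above might look like the following sketch. It narrows "Unicode-aware" to "the preferred encoding is UTF-8", which is an assumption for illustration and may be stricter than OCRmyPDF's actual test; both function names are hypothetical.

```python
import codecs
import locale


def is_utf8_encoding(name: str) -> bool:
    """Return True if the named codec is an alias of UTF-8."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        # Unknown codec names cannot be UTF-8.
        return False


def locale_is_unicode_aware() -> bool:
    """Check the preferred encoding of the current locale."""
    return is_utf8_encoding(locale.getpreferredencoding(False))
```

A program could call ``locale_is_unicode_aware()`` at startup and exit with a clear message when it returns ``False``, rather than failing later with an encoding error.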

v5.4.2
======

- Fixed a regression from v5.4.1 that caused sidecar files to be
  created as empty files

v5.4.1
======

- Add workaround for Tesseract v4.00alpha crash when trying to obtain
  orientation and the latest language packs are installed

v5.4
====

- Change wording of a deprecation warning to improve clarity
- Added option to generate PDF/A-1b output if desired
  (``--output-type pdfa-1``); default remains PDF/A-2b generation
- Update documentation

v5.3.3
======

- Fixed missing error message that should occur when trying to force
  ``--pdf-renderer sandwich`` on old versions of Tesseract
- Update copyright information in test files
- Set system ``LANG`` to UTF-8 in Dockerfiles to avoid UTF-8 encoding
  errors

v5.3.2
======

- Fixed a broken test case related to language packs

v5.3.1
======

- Fixed wrong return code given for missing Tesseract language packs
- Fixed "brew audit" crashing on Travis when trying to auto-brew

v5.3
====

- Added ``--user-words`` and ``--user-patterns`` arguments which are
  forwarded to Tesseract OCR as words and regular expressions
  respectively, to guide OCR. Supplying a list of subject-domain
  words should assist Tesseract with resolving words.
  (`#165 <https://github.com/jbarlow83/OCRmyPDF/issues/165>`__)
- Using a non-Latin-1 language with the "hocr" renderer now warns about
  possible OCR quality and recommends workarounds
  (`#176 <https://github.com/jbarlow83/OCRmyPDF/issues/176>`__)
- Output file path added to error message when that location is not
  writable
  (`#175 <https://github.com/jbarlow83/OCRmyPDF/issues/175>`__)
- Otherwise valid PDFs with leading whitespace at the beginning of the
  file are now accepted

v5.2
====

- When using Tesseract 3.05.01 or newer, OCRmyPDF will select the
  "sandwich" PDF renderer by default, unless another PDF renderer is
  specified with the ``--pdf-renderer`` argument. The previous behavior
  was to select ``--pdf-renderer=hocr``.
- The "tesseract" PDF renderer is now deprecated, since it can cause
  problems with Ghostscript on Tesseract 3.05.00
- The "tess4" PDF renderer has been renamed to "sandwich". "tess4" is
  now a deprecated alias for "sandwich".

v5.1
====

- Files with pages larger than 200" (5080 mm) in either dimension are
  now supported with ``--output-type=pdf`` with the page size preserved
  (in the PDF specification this feature is called UserUnit scaling).
  Due to Ghostscript limitations this is not available in conjunction
  with PDF/A output.
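
The arithmetic behind UserUnit scaling can be sketched as follows. By default one PDF unit is 1/72 inch, so the 200 inch limit corresponds to 14400 units; a page's ``UserUnit`` value multiplies the unit size so that larger pages fit within that range. The function names here are hypothetical and the code is not taken from OCRmyPDF.

```python
from typing import Tuple

UNITS_PER_INCH = 72    # default PDF user space: 1 unit = 1/72 inch
MAX_UNITS = 14400      # 200 inches at the default unit size


def user_unit_for(width_in: float, height_in: float) -> float:
    """Smallest UserUnit (at least 1.0) that keeps both sides in range."""
    largest_units = max(width_in, height_in) * UNITS_PER_INCH
    return max(1.0, largest_units / MAX_UNITS)


def scaled_page(width_in: float, height_in: float) -> Tuple[float, Tuple[float, float]]:
    """Return (UserUnit, (width, height)) with the box in scaled units."""
    unit = user_unit_for(width_in, height_in)
    box = (width_in * UNITS_PER_INCH / unit, height_in * UNITS_PER_INCH / unit)
    return unit, box
```

For a 400 by 100 inch page this gives a ``UserUnit`` of 2.0 and a box of 14400 by 3600 units, while an ordinary letter-size page keeps the default of 1.0.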

v5.0.1
======

- Fixed issue
  `#169 <https://github.com/jbarlow83/OCRmyPDF/issues/169>`__,
  exception due to failure to create sidecar text files on some
  versions of Tesseract 3.04, including the jbarlow83/ocrmypdf Docker
  image

v5.0
====

- Backward incompatible changes

  - Support for Python 3.4 dropped. Python 3.5 is now required.
  - Support for Tesseract 3.02 and 3.03 dropped. Tesseract 3.04 or
    newer is required. Tesseract 4.00 (alpha) is supported.
  - The OCRmyPDF.sh script was removed.

- Add a new feature, ``--sidecar``, which allows creating "sidecar"
  text files which contain the OCR results in plain text. This OCR
  text is more reliable than extracting text from PDFs. Closes
  `#126 <https://github.com/jbarlow83/OCRmyPDF/issues/126>`__.

- New feature: ``--pdfa-image-compression``, which allows overriding
  Ghostscript's lossy-or-lossless image encoding heuristic and making
  all images JPEG encoded or lossless encoded as desired. Fixes
  `#163 <https://github.com/jbarlow83/OCRmyPDF/issues/163>`__.

- Fixed issue
  `#143 <https://github.com/jbarlow83/OCRmyPDF/issues/143>`__, added
  ``--quiet`` to suppress "INFO" messages

- Fixed issue
  `#164 <https://github.com/jbarlow83/OCRmyPDF/issues/164>`__, a typo

- Removed the command line parameters ``-n`` and ``--just-print`` since
  they have not worked for some time (reported as Ubuntu bug
  `#1687308 <https://bugs.launchpad.net/ubuntu/+source/ocrmypdf/+bug/1687308>`__)

v4.5.6
======

- Fixed issue
  `#156 <https://github.com/jbarlow83/OCRmyPDF/issues/156>`__,
  'NoneType' object has no attribute 'getObject' on pages with no
  optional /Contents record. This should resolve all issues related to
  pages with no /Contents record.
- Fixed issue
  `#158 <https://github.com/jbarlow83/OCRmyPDF/issues/158>`__, ocrmypdf
  now stops and terminates if Ghostscript fails on an intermediate
  step, as it is not possible to proceed.
- Fixed issue
  `#160 <https://github.com/jbarlow83/OCRmyPDF/issues/160>`__,
  exception thrown on certain invalid arguments instead of error
  message

v4.5.5
======

- Automated update of macOS homebrew tap
- Fixed issue
  `#154 <https://github.com/jbarlow83/OCRmyPDF/issues/154>`__, KeyError
  '/Contents' when searching for text on blank pages that have no
  /Contents record. Note: incomplete fix for this issue.

v4.5.4
======

- Fix ``--skip-big`` raising an exception if a page contains no images
  (`#152 <https://github.com/jbarlow83/OCRmyPDF/issues/152>`__) (thanks
  to @TomRaz)
- Fix an issue where pages with no images might trigger "cannot write
  mode P as JPEG"
  (`#151 <https://github.com/jbarlow83/OCRmyPDF/issues/151>`__)

v4.5.3
======

- Added a workaround for Ghostscript 9.21 and probably earlier versions
  that would fail with the error message "VMerror -25", due to a
  Ghostscript bug in XMP metadata handling
- High Unicode characters (U+10000 and up) are no longer accepted for
  setting metadata on the command line, as Ghostscript may not handle
  them correctly.
- Fixed an issue where the ``tess4`` renderer would duplicate content
  onto output pages if tesseract failed or timed out
- Fixed ``tess4`` renderer not recognized when lossless reconstruction
  is possible

v4.5.2
======

- Fix issue
  `#147 <https://github.com/jbarlow83/OCRmyPDF/issues/147>`__.
  ``--pdf-renderer tess4 --clean`` will produce an oversized page
  containing the original image in the bottom left corner, due to loss
  of DPI information.
- Make "using Tesseract 4.0" warning less ominous
- Set up machinery for homebrew OCRmyPDF tap

v4.5.1
======

- Fix issue
  `#137 <https://github.com/jbarlow83/OCRmyPDF/issues/137>`__,
  proportions of images with a non-square pixel aspect ratio would be
  distorted in output for ``--force-ocr`` and some other combinations
  of flags
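
To see why a non-square pixel aspect ratio can distort a page, consider an image whose horizontal and vertical resolutions differ: the physical size has to be computed per axis. The sketch below is a generic illustration of that arithmetic with hypothetical names, not the code that was fixed in OCRmyPDF.

```python
from typing import Tuple


def physical_size_inches(
    width_px: int, height_px: int, xdpi: float, ydpi: float
) -> Tuple[float, float]:
    """Physical size of an image, using a separate resolution per axis."""
    return width_px / xdpi, height_px / ydpi


# A scan stored at 200 DPI horizontally but only 100 DPI vertically.
correct = physical_size_inches(1700, 1100, 200, 100)
# Applying the horizontal resolution to both axes squashes the page.
distorted = physical_size_inches(1700, 1100, 200, 200)
print(correct, distorted)
```

Here the per-axis calculation recovers a letter-size page, whereas using one resolution for both axes halves its height.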

v4.5
====

- PDFs containing "Form XObjects" are now supported (issue
  `#134 <https://github.com/jbarlow83/OCRmyPDF/issues/134>`__; PDF
  reference manual 8.10), and images they contain are taken into
  account when determining the resolution for rasterizing
- The Tesseract 4 Docker image no longer includes all languages,
  because it took so long to build that something would tend to fail
- OCRmyPDF now warns about using ``--pdf-renderer tesseract`` with
  Tesseract 3.04 or lower due to issues with Ghostscript corrupting the
  OCR text in these cases

v4.4.2
======

- The Docker images (ocrmypdf, ocrmypdf-polyglot, ocrmypdf-tess4) are
  now based on Ubuntu 16.10 instead of Debian stretch

  - This makes supporting the Tesseract 4 image easier
  - This could be a disruptive change for any Docker users who
    customized these images with their own changes, and made those
    changes in a way that depends on Debian and not Ubuntu

- OCRmyPDF now prevents running the Tesseract 4 renderer with Tesseract
  3.04, which was permitted in v4.4 and v4.4.1 but will not work

v4.4.1
======

- To prevent a `TIFF output
  error <https://github.com/python-pillow/Pillow/issues/2206>`__ caused
  by img2pdf >= 0.2.1 and Pillow <= 3.4.2, dependencies have been
  tightened
- The Tesseract 4.00 simultaneous process limit was increased from 1 to
  2, since it was observed that 1 lowers performance
- Documentation improvements to describe the ``--tesseract-config``
  feature
- Added test cases and fixed error handling for ``--tesseract-config``
- Tweaks to setup.py to deal with issues in the v4.4 release

v4.4
====

- Tesseract 4.00 is now supported on an experimental basis.

  - A new rendering option ``--pdf-renderer tess4`` exploits Tesseract
    4's new text-only output PDF mode. See the documentation on PDF
    Renderers for details.
  - The ``--tesseract-oem`` argument allows control over the Tesseract
    4 OCR engine mode (tesseract's ``--oem``). Use
    ``--tesseract-oem 2`` to enforce the new LSTM mode.
  - Fixed poor performance with Tesseract 4.00 on Linux

- Fixed an issue that caused corruption of output to stdout in some
  cases
- Removed test for Pillow JPEG and PNG support, as the minimum
  supported version of Pillow now enforces this
- OCRmyPDF now tests that the intended destination file is writable
  before proceeding
- The test suite now requires ``pytest-helpers-namespace`` to run (but
  not install)
- Significant code reorganization to make OCRmyPDF re-entrant and
  improve performance. All changes should be backward compatible for
  the v4.x series.

  - However, OCRmyPDF's dependency "ruffus" is not re-entrant, so no
    Python API is available. Scripts should continue to use the
    command line interface.

v4.3.5
======

- Update documentation to confirm Python 3.6.0 compatibility. No code
  changes were needed, so many earlier versions are likely supported.

v4.3.4
======

- Fixed "decimal.InvalidOperation: quantize result has too many digits"
  for high DPI images

v4.3.3
======

- Fixed PDF/A creation with Ghostscript 9.20 properly
- Fixed an exception on inline stencil masks with a missing optional
  parameter

v4.3.2
======

- Fixed a PDF/A creation issue with Ghostscript 9.20 (note: this fix
  did not actually work)

v4.3.1
======

- Fixed an issue where pages produced by the "hocr" renderer after a
  Tesseract timeout would be rotated incorrectly if the input page was
  rotated with a /Rotate marker
- Fixed a file handle leak in LeptonicaErrorTrap that would cause a
  "too many open files" error for files around a hundred pages long
  when ``--deskew``, ``--remove-background`` or other Leptonica-based
  image processing features were in use, depending on the system
  value of ``ulimit -n``
- Ability to specify multiple languages for multilingual documents is
  now advertised in documentation
- Reduced the file sizes of some test resources
- Cleaned up debug output
- Tesseract caching in test cases is now more cautious about false
  cache hits and reproducing exact output, not that any problems were
  observed
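
The ``ulimit -n`` value mentioned in the file handle leak fix is the per-process limit on open file descriptors. The sketch below shows one way to read that limit from Python with the standard ``resource`` module (Unix only). The helper names, the ``already_open`` default and the leak model are illustrative assumptions and are not part of OCRmyPDF.

```python
import resource


def open_file_limit() -> int:
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def leak_would_exhaust(handles_leaked_per_page: int, pages: int,
                       limit: int, already_open: int = 16) -> bool:
    # Hypothetical model: a leak that never closes its handles fails once
    # the leaked handles plus those already open reach the limit.
    return already_open + handles_leaked_per_page * pages >= limit


if __name__ == "__main__":
    print("soft limit:", open_file_limit())
    print(leak_would_exhaust(handles_leaked_per_page=2, pages=150, limit=256))
```

A small limit explains why a leak of only a few handles per page can surface on documents of moderate length, while the same file may process cleanly on a system with a larger limit.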

v4.3
====

- New feature ``--remove-background`` to detect and erase the
  background of color and grayscale images
- Better documentation
- Fixed an issue with PDFs that draw images when the raster stack depth
  is zero
- ocrmypdf can now redirect its output to stdout for use in a shell
  pipeline

  - This does not improve performance since temporary files are still
    used for buffering
  - Some output validation is disabled in this mode

v4.2.5
======

- Fixed an issue
  (`#100 <https://github.com/jbarlow83/OCRmyPDF/issues/100>`__) with
  PDFs that omit the optional /BitsPerComponent parameter on images
- Removed non-free file milk.pdf

v4.2.4
======

- Fixed an error
  (`#90 <https://github.com/jbarlow83/OCRmyPDF/issues/90>`__) caused by
  PDFs that use stencil masks properly
- Fixed handling of PDFs that try to draw images or stencil masks
  without properly setting up the graphics state (such images are now
  ignored for the purposes of calculating DPI)

v4.2.3
======

- Fixed an issue with PDFs that store page rotation (/Rotate) in an
  indirect object
- Integrated a few fixes to simplify downstream packaging (Debian)

  - The test suite no longer assumes it is installed
  - If running Linux, skip a test that passes Unicode on the command
    line

- Added a test case to check explicit masks and stencil masks
- Added a test case for indirect objects and linearized PDFs
- Deprecated the OCRmyPDF.sh shell script

v4.2.2
======

- Improvements to documentation

v4.2.1
======

- Fixed an issue where PDF pages that contained stencil masks would
  report an incorrect DPI and cause Ghostscript to abort
- Implemented stdin streaming

v4.2
====

- ocrmypdf will now try to convert single image files to PDFs if they
  are provided as input
  (`#15 <https://github.com/jbarlow83/OCRmyPDF/issues/15>`__)

  - This is a basic convenience feature. It only supports a single
    image and always makes the image fill the whole page.
  - For better control over image to PDF conversion, use ``img2pdf``
    (one of ocrmypdf's dependencies)

- New argument ``--output-type {pdf|pdfa}`` allows disabling
  Ghostscript PDF/A generation

  - ``pdfa`` is the default, consistent with past behavior
  - ``pdf`` provides a workaround for users concerned about the
    increase in file size from Ghostscript forcing JBIG2 images to
    CCITT and transcoding JPEGs
  - ``pdf`` preserves as much as it can about the original file,
    including problems that PDF/A conversion fixes

- PDFs containing images with "non-square" pixel aspect ratios, such as
  200x100 DPI, are now handled and converted properly (fixing a bug
  that caused them to be cropped)
- ``--force-ocr`` rasterizes pages even if they contain no images

  - supports users who want to use OCRmyPDF to reconstruct text
    information in PDFs with damaged Unicode maps (copy and paste text
    does not match displayed text)
  - supports reinterpreting PDFs where text was rendered as curves for
    printing, and text needs to be recovered
  - fixes issue
    `#82 <https://github.com/jbarlow83/OCRmyPDF/issues/82>`__

- Fixes an issue where, with certain settings, monochrome images in
  PDFs would be converted to 8-bit grayscale, increasing file size
  (`#79 <https://github.com/jbarlow83/OCRmyPDF/issues/79>`__)
- Support for Ubuntu 12.04 LTS "precise" has been dropped in favor of
  (roughly) Ubuntu 14.04 LTS "trusty"

  - Some Ubuntu "PPAs" (backports) are needed to make it work

- Support for some older dependencies dropped

  - Ghostscript 9.15 or later is now required (available in Ubuntu
    trusty with backports)
  - Tesseract 3.03 or later is now required (available in Ubuntu
    trusty)

- Ghostscript now runs in "safer" mode where possible
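
The "non-square" pixel case in the v4.2 notes comes down to using a separate resolution for each axis when converting pixel dimensions to page units. The following is a minimal sketch of that arithmetic, assuming PDF's 72 points per inch. The helper name ``page_size_points`` is hypothetical and the code does not reproduce OCRmyPDF's internals.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # size of one inch in PDF user space units


def page_size_points(width_px: int, height_px: int,
                     xdpi: float, ydpi: float) -> Tuple[float, float]:
    """Physical page size in PDF points for an image with per-axis DPI."""
    if xdpi <= 0 or ydpi <= 0:
        raise ValueError("DPI must be positive")
    return (width_px / xdpi * POINTS_PER_INCH,
            height_px / ydpi * POINTS_PER_INCH)


# A US Letter page scanned at 200x100 DPI is 1700x1100 pixels.
correct = page_size_points(1700, 1100, xdpi=200, ydpi=100)

# Treating the pixels as square (200 DPI on both axes) halves the height.
wrong = page_size_points(1700, 1100, xdpi=200, ydpi=200)

if __name__ == "__main__":
    print(correct, wrong)
```

Keeping the two resolutions apart until the final conversion keeps the physical page size correct; collapsing them into a single DPI value distorts one axis.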

v4.1.4
======

- Bug fix: monochrome images with an ICC profile attached were
  incorrectly converted to full color images if lossless reconstruction
  was not possible due to other settings; consequence was increased
  file size for these images

v4.1.3
======

- More helpful error message for PDFs with version 4 security handler
- Update usage instructions for Windows/Docker users
- Fix order of operations for matrix multiplication (no effect on most
  users)
- Add a few leptonica wrapper functions (no effect on most users)

v4.1.2
======

- Replace IEC sRGB ICC profile with Debian's sRGB (from
  icc-profiles-free) which is more compatible with the MIT license
- More helpful error message for an error related to certain types of
  malformed PDFs

v4.1
====

- ``--rotate-pages`` now only rotates pages when reasonably confident
  in the orientation. This behavior can be adjusted with the new
  argument ``--rotate-pages-threshold``
- Fixed problems in error checking if ``unpaper`` is uninstalled or
  missing at run-time
- Fixed problems with "RethrownJobError" errors during error handling
  that suppressed the useful error messages

v4.0.7
======

- Minor correction to Ghostscript output settings

v4.0.6
======

- Update install instructions
- Provide an sRGB profile instead of using Ghostscript's

v4.0.5
======

- Remove some verbose debug messages from v4.0.4
- Fixed a temporary file that wasn't being deleted
- DPI is now calculated correctly for cropped images, along with other
  image transformations
- Inline images are now checked during DPI calculation instead of
  rejecting the image

v4.0.4
======

Released with verbose debug messages turned on. Do not use. Skip to
v4.0.5.

v4.0.3
======

New features
------------

- Page orientations detected are now reported in a summary comment

Fixes
-----

- Show stack trace if unexpected errors occur
- Treat "too few characters" error message from Tesseract as a reason
  to skip that page rather than abort the file
- Docker: fix blank JPEG2000 issue by insisting on Ghostscript versions
  that have this fixed

v4.0.2
======

Fixes
-----

- Fixed compatibility with Tesseract 3.04.01 release, particularly its
  different way of outputting orientation information
- Improved handling of Tesseract errors and crashes
- Fixed use of chmod on Docker that broke most test cases

v4.0.1
======

Fixes
-----

- Fixed a KeyError if tesseract fails to find page orientation
  information

v4.0
====

New features
------------

- Automatic page rotation (``-r``) is now available. It ignores
  any prior rotation information on PDFs and sets rotation based on the
  dominant orientation of detectable text. This feature is fairly
  reliable but some false positives occur especially if there is not
  much text to work with.
  (`#4 <https://github.com/jbarlow83/OCRmyPDF/issues/4>`__)
- Deskewing is now performed using Leptonica instead of unpaper.
  Leptonica is faster and more reliable at image deskewing than
  unpaper.

Fixes
-----

- Fixed an issue where lossless reconstruction could cause some pages
  to appear incorrectly if the page was rotated by the user in
  Acrobat after being scanned (specifically if it has a /Rotate tag)
- Fixed an issue where lossless reconstruction could misalign the
  graphics layer with respect to text layer if the page had been
  cropped such that its origin is not (0, 0)
  (`#49 <https://github.com/jbarlow83/OCRmyPDF/issues/49>`__)

Changes
-------

- Logging output is now much easier to read
- ``--deskew`` is now performed by Leptonica instead of unpaper
  (`#25 <https://github.com/jbarlow83/OCRmyPDF/issues/25>`__)
- libffi is now required
- Some changes were made to the Docker and Travis build environments to
  support libffi
- ``--pdf-renderer=tesseract`` now displays a warning if the Tesseract
  version is less than 3.04.01, the planned release that will include
  fixes to an important OCR text rendering bug in Tesseract 3.04.00.
  You can also manually install ./share/sharp2.ttf on top of pdf.ttf in
  your Tesseract tessdata folder to correct the problem.

v3.2.1
======

Changes
-------

- Fixed issue `#47 <https://github.com/jbarlow83/OCRmyPDF/issues/47>`__
  "convert() got an unexpected keyword argument 'dpi'" by upgrading to
  img2pdf 0.2
- Tweaked the Dockerfiles

v3.2
====

New features
------------

- Lossless reconstruction: when possible, OCRmyPDF will inject text
  layers without otherwise manipulating the content and layout of a PDF
  page. For example, a PDF containing a mix of vector and raster
  content would see the vector content preserved. Images may still be
  transcoded during PDF/A conversion. (``--deskew`` and
  ``--clean-final`` disable this mode, necessarily.)
- New argument ``--tesseract-pagesegmode`` allows you to pass page
  segmentation arguments to Tesseract OCR. This helps for two column
  text and other situations that confuse Tesseract.
- Added a new "polyglot" version of the Docker image, that generates
  Tesseract with all language packs installed, for the polyglots among
  us. It is much larger.

Changes
-------

- JPEG transcoding quality is now 95 instead of the default 75. Bigger
  file sizes for less degradation.

v3.1.1
======

Changes
-------

- Fixed bug that caused incorrect page size and DPI calculations on
  documents with mixed page sizes

v3.1
====

Changes
-------

- Default output format is now PDF/A-2b instead of PDF/A-1b
- Python 3.5 and macOS El Capitan are now supported platforms - no
  changes were needed to implement support
- Improved some error messages related to missing input files
- Fixed issue `#20 <https://github.com/jbarlow83/OCRmyPDF/issues/20>`__
  - uppercase .PDF extension not accepted
- Fixed an issue where OCRmyPDF failed to detect that certain pages
  contained previously OCR'ed text, such as OCR text produced by
  Tesseract 3.04
- Inserts /Creator tag into PDFs so that errors can be traced back to
  this project
- Added new option ``--pdf-renderer=auto``, to let OCRmyPDF pick the
  best PDF renderer. Currently it always chooses the 'hocrtransform'
  renderer but that behavior may change.
- Set up Travis CI automatic integration testing

v3.0
====

New features
------------

- Easier installation with a Docker container or Python's ``pip``
  package manager
- Eliminated many external dependencies, so it's easier to set up
- Now installs ``ocrmypdf`` to ``/usr/local/bin`` or equivalent for
  system-wide access and easier typing
- Improved command line syntax and usage help (``--help``)
- Tesseract 3.03+ PDF page rendering can be used instead for better
  positioning of recognized text (``--pdf-renderer tesseract``)
- PDF metadata (title, author, keywords) are now transferred to the
  output PDF
- PDF metadata can also be set from the command line (``--title``,
  etc.)
- Automatically repairs malformed input PDFs if possible
- Added test cases to confirm everything is working
- Added option to skip extremely large pages that take too long to OCR
  and are often not OCRable (e.g. large scanned maps or diagrams);
  other pages are still processed (``--skip-big``)
- Added option to kill Tesseract OCR process if it seems to be taking
  too long on a page, while still processing other pages
  (``--tesseract-timeout``)
- Less common colorspaces (CMYK, palette) are now supported by
  conversion to RGB
- Multiple images on the same PDF page are now supported

Changes
-------

- New, robust rewrite in Python 3.4+ with
  `ruffus <http://www.ruffus.org.uk/index.html>`__ pipelines
- Now uses Ghostscript 9.14's improved color conversion model to
  preserve PDF colors
- OCR text is now rendered in the PDF as invisible text. Previous
  versions of OCRmyPDF incorrectly rendered visible text with an image
  on top.
- All "tasks" in the pipeline can be executed in parallel on any
  available CPUs, increasing performance
- The ``-o DPI`` argument has been phased out, in favor of
  ``--oversample DPI``, in case we need ``-o OUTPUTFILE`` in the future
- Removed several dependencies, so it's easier to install. We no longer
  use:

  - GNU `parallel <https://www.gnu.org/software/parallel/>`__
  - `ImageMagick <http://www.imagemagick.org/script/index.php>`__
  - Python 2.7
  - Poppler
  - `MuPDF <http://mupdf.com/docs/>`__ tools
  - shell scripts
  - Java and `JHOVE <http://jhove.sourceforge.net/>`__
  - libxml2

- Some new external dependencies are required or optional, compared to
  v2.x:

  - Ghostscript 9.14+
  - `qpdf <http://qpdf.sourceforge.net/>`__ 5.0.0+
  - `Unpaper <https://github.com/Flameeyes/unpaper>`__ 6.1 (optional)
  - some automatically managed Python packages

Release candidates
------------------

- rc9:

  - fix issue
    `#118 <https://github.com/jbarlow83/OCRmyPDF/issues/118>`__:
    report error if ghostscript iccprofiles are missing
  - fixed another issue related to
    `#111 <https://github.com/jbarlow83/OCRmyPDF/issues/111>`__: PDF
    rasterized to palette file
  - add support for image files with a palette
  - don't try to validate PDF file after an exception occurs

- rc8:

  - fix issue
    `#111 <https://github.com/jbarlow83/OCRmyPDF/issues/111>`__:
    exception thrown if PDF is missing DocumentInfo dictionary

- rc7:

  - fix error when installing direct from pip, "no such file
    'requirements.txt'"

- rc6:

  - dropped libxml2 (Python lxml) since Python 3's internal XML parser
    is sufficient
  - set up Docker container
  - fix Unicode errors if recognized text contains Unicode characters
    and system locale is not UTF-8

- rc5:

  - dropped Java and JHOVE in favour of qpdf
  - improved command line error output
  - additional tests and bug fixes
  - tested on Ubuntu 14.04 LTS

- rc4:

  - dropped MuPDF in favour of qpdf
  - fixed some installer issues and errors in installation
    instructions
  - improve performance: run Ghostscript with multithreaded rendering
  - improve performance: use multiple cores by default
  - bug fix: checking for wrong exception on process timeout

- rc3: skipping version number intentionally to avoid confusion with
  Tesseract
- rc2: first release for public testing to test-PyPI, GitHub
- rc1: testing release process

Compatibility notes
===================

- ``./OCRmyPDF.sh`` script is still available for now
- Stacking the verbosity option like ``-vvv`` is no longer supported
- The configuration file ``config.sh`` has been removed. Instead, you
  can feed a file to the arguments for common settings:

  ::

     ocrmypdf input.pdf output.pdf @settings.txt

  where ``settings.txt`` contains *one argument per line*, for example:

  ::

     -l
     deu
     --author
     A. Merkel
     --pdf-renderer
     tesseract
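
The ``@settings.txt`` convention matches what Python's ``argparse`` offers through its ``fromfile_prefix_chars`` parameter. The sketch below is not OCRmyPDF's actual parser: it assumes a stand-in parser that knows only the three options from the example, and it shows why each option and each value has to sit on its own line.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser, not OCRmyPDF's real one.
    # fromfile_prefix_chars="@" makes argparse replace "@file" with the
    # lines of that file, treating every line as exactly one argument.
    parser = argparse.ArgumentParser(prog="demo", fromfile_prefix_chars="@")
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    parser.add_argument("-l", "--language")
    parser.add_argument("--author")
    parser.add_argument("--pdf-renderer")
    return parser


def parse_with_settings(settings_text: str) -> argparse.Namespace:
    """Write settings_text to a temporary file and parse it via @file."""
    with tempfile.TemporaryDirectory() as tmp:
        settings = Path(tmp) / "settings.txt"
        settings.write_text(settings_text, encoding="utf-8")
        return build_parser().parse_args(
            ["input.pdf", "output.pdf", f"@{settings}"]
        )


SETTINGS = "-l\ndeu\n--author\nA. Merkel\n--pdf-renderer\ntesseract\n"
args = parse_with_settings(SETTINGS)

if __name__ == "__main__":
    print(args.language, args.author, args.pdf_renderer)
```

Because a whole line becomes a single argument, ``A. Merkel`` needs no quoting in the file, whereas writing ``--author A. Merkel`` on one line would be read as one unknown argument.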

Fixes
-----

- Handling of filenames containing spaces: fixed

Notes and known issues
----------------------

- Some dependencies may work with lower versions than tested, so try
  overriding dependencies if they are "in the way" to see if they work.
- ``--pdf-renderer tesseract`` will output files with an incorrect page
  size in Tesseract 3.03, due to a bug in Tesseract.
- PDF files containing "inline images" are not supported and won't be
  for the 3.0 release. Scanned images almost never contain inline
  images.

v2.2-stable (2014-09-29)
========================

OCRmyPDF versions 1 and 2 were implemented as shell scripts. OCRmyPDF
3.0+ is a fork that gradually replaced all shell scripts with Python
while maintaining the existing command line arguments. No one is
maintaining old versions.

For details on older versions, see the `final version of its release
notes <https://github.com/fritz-hh/OCRmyPDF/blob/7fd3dbdf42ca53a619412ce8add7532c5e81a9d1/RELEASE_NOTES.md>`__.