=============
Release notes
=============

OCRmyPDF uses `semantic versioning <http://semver.org/>`__ for its
command line interface and its public API.

OCRmyPDF's output messages are not considered part of the stable interface -
that is, output messages may be improved at any release level, so parsing them
may be unreliable. Use the API to depend on precise behavior.

The public API may be useful in scripts that launch OCRmyPDF processes or that
wish to use some of its features for working with PDFs.

.. note::

   Python 3.6 reaches end of life on December 23, 2021. We will end support
   for Python 3.6 around that time. The change will be marked with a major
   release.

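As a rough illustration of relying on the API rather than on parsed output messages, a script might wrap the call as sketched below. The wrapper name ``run_ocr`` and the file names are hypothetical; the sketch assumes the third-party ``ocrmypdf`` package is installed and only imports it when the function is actually called.

```python
def run_ocr(input_file, output_file):
    """Run OCR through the Python API instead of parsing console output.

    ``run_ocr`` is a hypothetical helper name used for illustration only.
    """
    # Imported lazily so that this module can be loaded without ocrmypdf.
    import ocrmypdf

    # ``deskew`` is passed as a keyword argument; input and output are
    # positional.
    return ocrmypdf.ocr(input_file, output_file, deskew=True)


if __name__ == "__main__":
    # Hypothetical file names.
    run_ocr("input.pdf", "output.pdf")
```

Errors then surface as Python exceptions, which are easier to depend on than the wording of a log message.
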
v12.5.0
=======

- Fixed build failure for the combination of PyPy 3.6 and pikepdf 3.0. This
  combination can work in a source build but does not work with wheels.
- Accepted bot that wanted to upgrade our deprecated requirements.txt.
- Documentation updates.
- Replace pkg_resources and install dependency on setuptools with
  importlib-metadata and importlib-resources.
- Fixed regression in hocrtransform causing text to be omitted when this
  renderer was used.
- Fixed some typing errors.

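The move away from pkg_resources can be pictured with a small sketch. It assumes Python 3.8 or newer, where ``importlib.metadata`` is part of the standard library (older interpreters need the ``importlib-metadata`` backport); the helper name ``installed_version`` is hypothetical.

```python
from importlib import metadata  # standard library on Python 3.8+


def installed_version(distribution: str) -> str:
    """Return the installed version of a distribution, or "unknown"."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return "unknown"


print(installed_version("ocrmypdf"))
```

Unlike pkg_resources, this lookup does not require setuptools to be installed at run time.
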
v12.4.0
=======

- When grafting text layers, use pikepdf's ``unparse_content_stream`` if available.
- Confirmed support for pluggy 1.0. (Thanks @QuLogic.)
- Fixed some typing issues, improved pre-commit settings, and fixed issues
  flagged by linters.
- PyPy 7.3.3 (=Python 3.6) is now supported. Note that PyPy does not necessarily
  run faster, because the vast majority of OCRmyPDF's execution time is spent
  running OCR or generally executing native code. However, PyPy may bring speed
  improvements in some areas.

v12.3.3
=======

- watcher.py: fixed interpretation of boolean env vars (:issue:`821`).
- Adjust CI scripts to test Tesseract 5 betas.
- Document our support for the Tesseract 5 betas.

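Boolean environment variables are easy to misread because any non-empty string, including ``"false"``, is truthy in Python. The following is a minimal sketch of one way to interpret them, not the code used by watcher.py; the helper ``getenv_bool``, the accepted spellings, and the variable name ``EXAMPLE_FLAG`` are all assumptions made for illustration.

```python
import os

# Spellings treated as "true" in this sketch (an assumption).
_TRUE_VALUES = {"1", "true", "yes", "y", "on"}


def getenv_bool(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean."""
    value = os.environ.get(name)
    if value is None:
        return default
    # bool("false") would be True, so compare against known spellings.
    return value.strip().lower() in _TRUE_VALUES


os.environ["EXAMPLE_FLAG"] = "false"  # hypothetical variable name
print(getenv_bool("EXAMPLE_FLAG"))
```
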
v12.3.2
=======

- Indicate support for flask 2.x, watcher 2.x (:issue:`815, 816`).

v12.3.1
=======

- Fixed issue with selection of text when using the hOCR renderer (:issue:`813`).
- Fixed build errors with the Docker image by upgrading to a newer Ubuntu.
  Also set the timezone of this image to UTC.

v12.3.0
=======

- Fixed a regression introduced in Pillow 8.3.0. Pillow no longer rounds DPI
  for image resolutions. We now account for this (:issue:`802`).
- We no longer use some API calls that are deprecated in the latest versions of
  pikepdf.
- Improved error message when a language is requested that doesn't look like a
  typical ISO 639-2 code.
- Fixed some tests that attempted to symlink on Windows, breaking tests on a
  Windows desktop but not usually on CI.
- Documentation fixes (thanks to @mara004)

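A check for "looks like a typical ISO 639-2 code" could be as simple as the sketch below. This is an illustrative assumption, not OCRmyPDF's actual rule: it only accepts three lowercase ASCII letters, whereas real Tesseract language names also include variants such as ``chi_sim``. The function name is hypothetical.

```python
import re

# Three lowercase ASCII letters, e.g. "eng" or "deu" (simplified assumption).
_ISO_639_2_LIKE = re.compile(r"[a-z]{3}")


def looks_like_iso639_2(code: str) -> bool:
    """Return True if ``code`` has the shape of a typical ISO 639-2 code."""
    return _ISO_639_2_LIKE.fullmatch(code) is not None


for candidate in ("eng", "en", "english"):
    print(candidate, looks_like_iso639_2(candidate))
```
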
v12.2.0
=======

- Fixed invalid Tesseract version number on Windows (:issue:`795`).
- Documentation tweaks. Documentation build now depends on sphinx-issues package.

v12.1.0
=======

- For security reasons we now require Pillow >= 8.2.x. (Older versions will continue
  to work if upgrading is not an option.)
- The build system was reorganized to rely on ``setup.cfg`` instead of ``setup.py``.
  All changes should work with previously supported versions of setuptools.
- The files in ``requirements/*`` are now considered deprecated but will be retained for v12.
  Use ``pip install ocrmypdf[test]`` instead of ``requirements/test.txt``, etc.
  These files will be removed in v13.

v12.0.3
=======

- Expand the list of languages supported by the hocr PDF renderer.
  Several languages were previously considered not supported, particularly those
  non-European languages that use the Latin alphabet.
- Fixed a case where the exception stack trace was suppressed in verbose mode.
- Improved documentation around commercial OCR.

v12.0.2
=======

- Fixed exception thrown when using ``--remove-background`` on files containing small
  images (:issue:`769`).
- Improved documentation describing how to add language packs to the Docker image
  and corrected the name of the French language pack.

v12.0.1
=======

- Fixed "invalid version number" for untagged tesseract versions (:issue:`770`).

v12.0.0
=======

**Breaking changes**

- Due to recent security issues in pikepdf, Pillow and reportlab, we now require
  newer versions of these libraries and some of their dependencies. (If necessary,
  package maintainers may override these versions at their discretion; lower
  versions will often work.)
- We now use the "LeaveColorUnchanged" color conversion strategy when directing
  Ghostscript to create a PDF/A. Generally this is faster than performing a
  color conversion, which is not always necessary.
- OCR text is now packaged in a Form XObject. This makes it easier to isolate
  OCR from other document content. However, some poorly implemented PDF text
  extraction algorithms may fail to detect the text.
- Many API functions have stricter parameter checking or expect keyword arguments
  where they previously did not.
- Some deprecated functions in ``ocrmypdf.optimize`` were removed.
- The ``ocrmypdf.leptonica`` module is now deprecated, due to difficulties with
  the current strategy of ABI binding on newer platforms like Apple Silicon.
  It will be removed and replaced, either by repackaging Leptonica as an
  independent library or by using a different image processing library.
- Continuous integration moved to GitHub Actions.
- We no longer depend on ``pytest_helpers_namespace`` for testing.

**New features**

- New plugin hook: ``get_progressbar_class``, for progress reporting,
  allowing developers to replace the standard console progress bar with some
  other mechanism, such as updating a GUI progress bar.
- New plugin hook: ``get_executor``, for replacing the concurrency model.
  This is primarily to support execution on AWS Lambda, which does not support
  standard Python ``multiprocessing`` due to its lack of shared memory.
- New plugin hook: ``get_logging_console``, for replacing the standard
  way OCRmyPDF outputs its messages.
- New plugin hook: ``filter_pdf_page``, for modifying individual PDF
  pages produced by OCRmyPDF.
- OCRmyPDF now runs on nonstandard execution environments that do not have
  interprocess semaphores, such as AWS Lambda and Android Termux. If the environment
  does not have semaphores, OCRmyPDF will automatically select an alternate
  process executor that does not use semaphores.
- We now generate an ARM64-compatible Docker image alongside the x64 image.
  Thanks to @andkrause for doing most of the work in a pull request several months
  ago, which we were finally able to integrate now. Also thanks to @0x326 for
  review comments.

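The semaphore fallback can be pictured roughly as follows. This is a sketch under stated assumptions, not OCRmyPDF's implementation: it probes for semaphore support by trying to create one, and it substitutes a thread pool as the alternate executor for simplicity. The function names are hypothetical.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def semaphores_available() -> bool:
    """Probe whether this platform supports interprocess semaphores."""
    try:
        multiprocessing.Semaphore(1)
    except (ImportError, OSError, NotImplementedError):
        # Platforms without a working sem_open or shared memory fail here.
        return False
    return True


def choose_executor_class():
    """Pick a process pool when possible, otherwise fall back to threads."""
    if semaphores_available():
        return ProcessPoolExecutor
    return ThreadPoolExecutor


print(choose_executor_class().__name__)
```

Probing once at startup keeps the rest of the pipeline unaware of which executor is in use.
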
**Fixes**

- Fixed a possible deadlock on attempting to flush ``sys.stderr`` when older
  versions of Leptonica are in use.
- Some worker processes inherited resources from their parents such as log
  handlers that may have also led to deadlocks. These resources are now released.
- Improvements to test coverage.
- Removed vestiges of support for Tesseract versions older than 4.0.0-beta1
  (which ships with Ubuntu 18.04).
- OCRmyPDF can now parse all of Tesseract's version numbers, since several
  schemes have been in use.
- Fixed an issue with parsing PDFs that contain images drawn at a scale of 0. (:issue:`761`)
- Removed a frequently repeated message about disabling mmap.

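To illustrate why several version schemes complicate parsing, here is a minimal sketch that extracts only the numeric release from a version string and ignores any suffix. The sample strings and the function name ``parse_release`` are assumptions for illustration; this is not the parser OCRmyPDF uses, and it makes no attempt to order pre-releases.

```python
import re
from typing import Tuple

# Optional leading "v", then major.minor and an optional patch number.
_RELEASE_RE = re.compile(r"v?(\d+)\.(\d+)(?:\.(\d+))?")


def parse_release(version: str) -> Tuple[int, int, int]:
    """Return (major, minor, patch), ignoring any pre-release suffix."""
    match = _RELEASE_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


for sample in ("4.1.1", "4.0.0-beta.1", "v5.0.0-alpha.20210401"):
    print(sample, parse_release(sample))
```
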
v11.7.3
=======

- Exclude CCITT Group 3 images from being optimized. Some libraries
  OCRmyPDF uses do not seem to handle this obscure compression format properly.
  You may get errors or possibly corrupted output images without this fix.

v11.7.2
=======

- Updated pinned versions in main.txt, primarily to upgrade Pillow to 8.1.2, due
  to recently disclosed security vulnerabilities in that software.
- The ``--sidecar`` parameter now causes an exception if set to the same file as
  the input or output PDF.

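The sidecar check amounts to comparing resolved paths. A minimal sketch follows; the function name ``check_sidecar`` and the use of ``ValueError`` are assumptions for illustration, and the exception type OCRmyPDF actually raises may differ.

```python
from pathlib import Path


def check_sidecar(input_file, output_file, sidecar) -> None:
    """Raise if the sidecar path is the same file as the input or output."""
    sidecar_path = Path(sidecar).resolve()
    for label, other in (("input", input_file), ("output", output_file)):
        # Resolve both sides so "./out.pdf" and "out.pdf" compare equal.
        if sidecar_path == Path(other).resolve():
            raise ValueError(
                f"--sidecar must not be the same file as the {label} file"
            )


check_sidecar("in.pdf", "out.pdf", "out.txt")  # distinct files: no error
```
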
v11.7.1
=======

- Some exceptions while attempting image optimization were only logged at the debug
  level, causing them to be suppressed. These errors are now logged appropriately.
- Improved the error message related to ``--unpaper-args``.
- Updated documentation to mention the new conda distribution.

v11.7.0
=======

- We now support using ``--sidecar`` in conjunction with ``--pages``; these arguments
  used to be mutually exclusive. (:issue:`735`)
- Fixed a possible issue with PDF/A-1b generation. Acrobat complained that our PDFs use
  object streams. More robust PDF/A validators like veraPDF don't consider this a
  problem, but we'll honor Acrobat's objection from here on. This may increase file
  size of PDF/A-1b files. PDF/A-2b files will not be affected.

v11.6.2
=======

- Fixed a regression where the wrong page orientation would be produced when using
  arguments such as ``--deskew --rotate-pages`` (:issue:`730`).

v11.6.1
=======

- Fixed an issue with attempting to optimize unusually narrow-width images by excluding
  these images from optimization (:issue:`732`).
- Remove an obsolete compatibility shim for a version of pikepdf that is no longer
  supported.

v11.6.0
=======

- OCRmyPDF will now automatically register plugins from the same virtual environment
  with an appropriate setuptools entrypoint.
- Refactor the plugin manager to remove unnecessary complications and make plugin
  registration more automatic.
- ``PageContext`` and ``PdfContext`` are now formally part of the API, as they
  should have been, since they were part of ``ocrmypdf.pluginspec``.

v11.5.0
=======

- Fixed an issue where the output page size might differ by a fractional amount
  due to rounding, when ``--force-ocr`` was used and the page contained objects
  with multiple resolutions.
- When determining the resolution at which to rasterize a page, we now consider
  printed text on the page as requiring a higher resolution. This fixes issues
  with certain pages being rendered with unacceptably low resolution text, but
  may increase output file sizes in some workflows where low resolution text
  is acceptable.
- Added a workaround to fix an exception that occurs when trying to
  ``import ocrmypdf.leptonica`` on Apple ARM silicon (or potentially, other
  platforms that do not permit write+executable memory).

v11.4.5
=======

- Fixed an issue where files may not be closed when the API is used.
- Improved ``setup.cfg`` with better settings for test coverage.

v11.4.4
=======

- Fixed ``AttributeError: 'NoneType' object has no attribute 'userunit'`` (:issue:`700`),
  related to OCRmyPDF not properly forwarding an error message from pdfminer.six.
- Adjusted typing of some arguments.
- ``ocrmypdf.ocr`` now takes a ``threading.Lock`` for reasons outlined in the
  documentation.

v11.4.3
=======

- Removed a redundant debug message.
- Test suite now asserts that most patched functions are called when they should be.
- Test suite now skips a test that fails on two particular versions of pikepdf.

v11.4.2
=======

- Fixed support for Cygwin, hopefully.
- watcher.py: Fixed an issue with ``OCR_LOGLEVEL`` not being interpreted.

v11.4.1
=======

- Fixed an issue where invalid page ranges passed using the ``pages`` argument,
  such as "1-0", would cause unhandled exceptions.
- Accepted a user contribution to the Synology demo script in misc/synology.py.
- Clarified documentation about change of temporary file location ``ocrmypdf.io``.
- Fixed Python wheel tag which was incorrectly set to py35 even though we have long
  since dropped support for Python 3.5.

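Rejecting a range such as "1-0" up front is straightforward. The sketch below is a simplified, hypothetical parser (it does not reproduce every syntax the ``pages`` argument accepts) that turns a page specification into a set of 1-based page numbers and raises ``ValueError`` for reversed or non-positive ranges.

```python
from typing import Set


def parse_page_ranges(spec: str) -> Set[int]:
    """Parse a spec like "1,3-5" into a set of 1-based page numbers."""
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty page range in {spec!r}")
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError for non-numbers
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            # Catches reversed ranges such as "1-0".
            raise ValueError(f"invalid page range {part!r}")
        pages.update(range(start, end + 1))
    return pages


print(sorted(parse_page_ranges("1, 3-5")))
```
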
v11.4.0
=======

- When looking for Tesseract and Ghostscript, we now check the Windows Registry to
  see if their installers registered the location of their executables. This should
  help Windows users who have installed these programs to non-standard
  locations.
- We now report on the progress of PDF/A conversion, since this operation is
  sometimes slow.
- Improved command line completions.
- The prefix of the temporary folder OCRmyPDF creates has been changed from
  ``com.github.ocrmypdf`` to ``ocrmypdf.io``. Scripts that chose to depend on this
  prefix may need to be adjusted. (This has always been an implementation detail so is
  not considered part of the semantic versioning "contract".)
- Fixed :issue:`692`, where a particular file with malformed fonts would flood an
  internal message queue by generating so many debug messages.
- Fixed an exception on processing hOCR files with no page record. Tesseract
  is not known to generate such files.

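A script that looks for OCRmyPDF's temporary folders would need to match the new prefix. The following sketch shows the idea using only the standard library; the helper name ``find_work_folders`` is hypothetical, and the characters that follow the prefix in real folder names are not specified here.

```python
import os
import shutil
import tempfile

WORK_FOLDER_PREFIX = "ocrmypdf.io"  # previously "com.github.ocrmypdf"


def find_work_folders(tmp_root: str) -> list:
    """List entries in ``tmp_root`` whose names start with the prefix."""
    return sorted(
        name for name in os.listdir(tmp_root)
        if name.startswith(WORK_FOLDER_PREFIX)
    )


# Demonstration in a scratch directory so nothing real is touched.
scratch = tempfile.mkdtemp()
try:
    tempfile.mkdtemp(prefix=WORK_FOLDER_PREFIX, dir=scratch)
    print(len(find_work_folders(scratch)))
finally:
    shutil.rmtree(scratch)
```
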
v11.3.4
=======

- Fixed an error message 'called readLinearizationData for file that is not
  linearized' that may occur when pikepdf 2.1.0 is used. (Upgrading to pikepdf
  2.1.1 also fixes the issue.)
- File watcher now automatically includes ``.PDF`` in addition to ``.pdf`` to
  better support case sensitive file systems.
- Some documentation and comment improvements.

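The effect of listing both spellings can be seen with case-sensitive pattern matching from the standard library. This is an illustrative sketch, not the watcher's code; the pattern tuple and function name are assumptions.

```python
from fnmatch import fnmatchcase

# Both spellings are listed because matching is case sensitive.
PDF_PATTERNS = ("*.pdf", "*.PDF")


def is_watched(filename: str) -> bool:
    """Return True if the file name matches one of the PDF patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in PDF_PATTERNS)


for name in ("scan.pdf", "SCAN.PDF", "scan.Pdf", "notes.txt"):
    print(name, is_watched(name))
```

Mixed-case suffixes such as ``.Pdf`` still do not match in this sketch; comparing a lowercased suffix would cover them as well.
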
v11.3.3
=======

- If unpaper outputs non-UTF-8 data, quietly fix this rather than choke on the
  conversion. (Possibly addresses :issue:`671`.)

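One common way to tolerate non-UTF-8 output from an external tool is to decode with a replacement policy instead of the default strict mode. The sketch below shows that general technique; whether OCRmyPDF uses exactly this approach is an assumption, and the function name is hypothetical.

```python
def decode_tool_output(raw: bytes) -> str:
    """Decode subprocess output, replacing bytes that are not valid UTF-8."""
    # errors="replace" substitutes U+FFFD instead of raising
    # UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


print(decode_tool_output(b"processed page 1 \xff"))
```
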
v11.3.2
=======

- Explicitly require pikepdf 2.0.0 or newer when running on Python 3.9. (There are
  concerns about the stability of pybind11 2.5.x with Python 3.9, which is used in
  pikepdf 1.x.)
- Fixed another issue related to page rotation.
- Fixed an issue where images marked as image masks were not properly considered
  as optimization candidates.
- On some systems, unpaper seems to be unable to process the PNGs we offer it
  as input. We now convert the input to PNM format, which unpaper always accepts.
  Fixes :issue:`665` and :issue:`667`.
- DPI sent to unpaper is now rounded to a more reasonable number of decimal digits.
- Debug and error messages from unpaper were being suppressed.
- Some documentation tweaks.

v11.3.1
=======

- Declare support for new versions: pdfminer.six 20201018 and pikepdf 2.x
- Fixed warning related to ``--pdfa-image-compression`` that appears at the wrong
  time.

v11.3.0
=======

- The "OCR" step is described as "Image processing" in the output messages when
  OCR is disabled, to better explain the application's behavior.
- Debug logs are now only created when run as a command line, and not when OCR
  is performed for an API call. It is the calling application's responsibility
  to set up logging.
- For PDFs with a low number of pages, we gather information about the input PDF
  in a thread rather than in a process (which is used when there are more pages).
  When run as a thread, we did not close the file handle to the working PDF,
  leaking one file handle per call of ``ocrmypdf.ocr``. This leak is now fixed.
- Fixed an issue where debug messages sent by child worker processes did not match
  the log settings of the parent process, causing messages to be dropped. This affected
  only macOS and Windows, where the parent process is not forked.
- Fixed the hookspec of ``rasterize_pdf_page`` to remove default parameters that
  were not handled in an expected way by pluggy.
- Fixed another issue with automatic page rotation (:issue:`658`) due to the issue above.

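Since API callers are responsible for logging, an application might configure a handler before calling into the library. The sketch below uses only the standard ``logging`` module; the logger name ``"ocrmypdf"`` is an assumption based on the package name, and ``setup_logging`` is a hypothetical helper.

```python
import logging


def setup_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a console handler to the library's logger, once."""
    logger = logging.getLogger("ocrmypdf")  # assumed logger name
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = setup_logging(logging.DEBUG)
log.debug("logging configured by the calling application")
```
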
v11.2.1
=======

- Fixed an issue where optimization of a 1-bit image with a color palette or
  associated ICC that was optimized to JBIG2 could have its colors inverted.

v11.2.0
=======

- Fixed an issue with optimizing PNG-type images that had soft masks or image masks.
  This is a regression introduced in (or about) v11.1.0.
- Improved type checking of the ``plugins`` parameter for the ``ocrmypdf.ocr``
  API call.

v11.1.2
=======

- Fixed hOCR renderer writing the text in roughly reverse order. This should not
  affect reasonably smart PDF readers that properly locate the position of all
  text, but may confuse those that rely on the order of objects in the content
  stream. (:issue:`642`)

v11.1.1
=======

- We now avoid using named temporary files when using pngquant allowing containerized
  pngquant installs to be used.
- Clarified an error message.
- Highest number of 1's in a release ever!

v11.1.0
=======

- Fixed page rotation issues: :issue:`634,589`.
- Fixed some cases where optimization created an invalid image such as a
  1-bit "RGB" image: :issue:`629,620`.
- Page numbers are now displayed in debug logs when pages are being grafted.
- ``ocrmypdf.optimize.rewrite_png`` and ``ocrmypdf.optimize.rewrite_png_as_g4`` were
  marked deprecated. Strictly speaking these should have been internal APIs,
  but they were never hidden.
- As a precaution, pikepdf mmap-based file access has been disabled due to a
  rare race condition that causes a crash when certain objects are deallocated.
  The problem is likely in pikepdf's dependency pybind11.
- Extended the example plugin to demonstrate conversion to mono.

v11.0.2
=======

- Fixed :issue:`612`, TypeError exception. Fixed by eliminating unnecessary repair of
  input PDF metadata in memory.

v11.0.1
=======

- Blacklist pdfminer.six 20200720, which has a regression fixed in 20200726.
- Approve img2pdf 0.4 as it passes tests.
- Clarify that the GPL-3 portion of pdfa.py was removed with the changes in v11.0.0;
  the debian/copyright file did not properly annotate this change.

v11.0.0
=======

- Project license changed to Mozilla Public License 2.0. Some miscellaneous
  code is now under MIT license and non-code content/media remains under
  CC-BY-SA 4.0. License changed with approval of all people who were found
  to have contributed to GPLv3 licensed sections of the project. (:issue:`600`)
- Because the license changed, this is being treated as a major version number
  change; however, there are no known breaking changes in functional behavior
  or API compared to v10.x.

v10.3.3
=======

- Fixed a "KeyError: 'dpi'" error message when using ``--threshold`` on an image.
  (:issue:`607`)

v10.3.2
=======

- Fixed a case where we reported "no reason" for a file size increase, when we
  could determine the reason.
- Enabled support for pdfminer.six 20200726.

v10.3.1
=======

- Fixed a number of test suite failures with pdfminer.six older than version 20200402.
- Enabled support for pdfminer.six 20200720.

v10.3.0
=======

- Fixed an issue where we would consider images that were already JBIG2-encoded
  for optimization, potentially producing a less optimized image than the original.
  We do not believe this issue would ever cause an image to lose fidelity.
- Where available, pikepdf memory mapping is now used. This improves performance.
- When Leptonica 1.79+ is installed, use its new error handling API to avoid
  a "messy" redirection of stderr which was necessary to capture its error
  messages.
- For older versions of Leptonica, added a new thread level lock. This fixes a
  possible race condition in handling error conditions in Leptonica (although
  there is no evidence it ever caused issues in practice).
- Documentation improvements and more type hinting.

v10.2.1
=======

- Disabled calculation of text box order with pdfminer. We never needed this result
  and it is expensive to calculate on files with complex pre-existing text.
- Fixed plugin manager to accept ``Path(plugin)`` as a path to a plugin.
- Fixed some typing errors.
- Documentation improvements.

v10.2.0
=======

- Update Docker image to use Ubuntu 20.04.
- Fixed an issue where a PDF/A acquires the title "Untitled" after conversion. (:issue:`582`)
- Fixed a problem where, when using ``--pdf-renderer hocr``, some text would
  be missing from the output when using a more recent version of Tesseract.
  Tesseract began adding more detailed markup about the semantics of text
  that our HOCR transform did not recognize, so it ignored them. This option is
  not the default. If necessary, ``--redo-ocr`` also allows redoing OCR to fix such issues.
- Fixed an error in Python 3.9 beta, due to removal of deprecated
  ``Element.getchildren()``. (:issue:`584`)
- Implemented support for using the API with ``BytesIO`` and other file stream objects.
  (:issue:`545`)

v10.1.1
=======

- Fixed ``OMP_THREAD_LIMIT`` set to invalid value error messages on some input
  files. (The error was harmless, apart from less than optimal performance in
  some cases.)

v10.1.0
=======

- Previously, ``--clean-final`` would cause an unpaper-cleaned page image to
  be produced twice, which was necessary in some cases but not in general. We
  now take this optimization opportunity and reuse the image if possible.
- We now provide PNG files as input to unpaper, since it accepts them, instead
  of generating PPM files which can be very large. This can improve performance
  and temporary disk usage.
- Documentation updated for plugins.

v10.0.1
=======

- Fixed regression when ``-l lang1+lang2`` is used from command line.

v10.0.0
=======

**Breaking changes**

- Support for pdfminer.six version 20181108 has been dropped, along with a
  monkeypatch that made this version work.
- Output messages are now displayed in color (when supported by the terminal)
  and prefixes describing the severity of the message are removed. As such,
  programs that parse OCRmyPDF's log messages will need to be revised. (Please
  consider using OCRmyPDF as a library instead.)
- The minimum version for certain dependencies has increased.
- Many API changes; see developer changes.
- The Python libraries pluggy and coloredlogs are now required.

**New features and improvements**

- PDF page scanning is now parallelized across CPUs, speeding up this phase
  dramatically for files with high page counts.
- PDF page scanning is optimized, addressing some performance regressions.
- PDF page scanning is no longer run on pages that are not selected when the
  ``--pages`` argument is used.
- PDF page scanning is now independent of Ghostscript, ending our past reliance
  on this occasionally unstable feature in Ghostscript.
- A plugin architecture has been added, currently allowing one to more easily
  use a different OCR engine or PDF renderer from Tesseract and Ghostscript,
  respectively. A plugin can also override some decisions, such as changing
  the OCR settings after initial scanning.
- Colored log messages.

**Developer changes**

- The test spoofing mechanism, used to test correct handling of failures in
  Tesseract and Ghostscript, has been removed in favor of using plugins for
  testing. The spoofing mechanism was fairly complex and required many special
  hacks for Windows.
- Code describing the resolution in DPI of images was refactored into a
  ``ocrmypdf.helpers.Resolution`` class.
- The module ``ocrmypdf._exec`` is now private to OCRmyPDF.
- The ``ocrmypdf.hocrtransform`` module has been updated to follow PEP8 naming
  conventions.
- Ghostscript is no longer used for finding the location of text in PDFs, and
  APIs related to this feature have been removed.
- Lots of internal reorganization to support plugins.

v9.8.2
======

- Fixed an issue where OCRmyPDF would ignore text inside Form XObject when
  making certain decisions about whether a document already had text.
- Fixed file size increase warning to take overhead of small files into account.
- Added instructions for installing on Cygwin.

v9.8.1
======

- Fixed an issue where unexpected files in the ``%PROGRAMFILES%\gs`` directory
  (Windows) caused an exception.
- Mark pdfminer.six 20200517 as supported.
- If jbig2enc is missing and optimization is requested, a warning is issued
  instead of an error, which was the intended behavior.
- Documentation updates.

v9.8.0
======

- Fixed issue where only the first PNG (FlateDecode) image in a file would be
  considered for optimization. File sizes should be improved from here on.
- Fixed a startup crash when the chosen language was Japanese (:issue:`543`).
- Added options to configure polling and log level to watcher.py.

|
2020-04-15 02:56:46 -07:00
|
|
|
v9.7.2
|
|
|
|
======
|
|
|
|
|
|
|
|
- Fixed an issue with ``ocrmypdf.ocr(...language=)`` not accepting a list of
|
|
|
|
languages as documented.
|
|
|
|
- Updated setup.py to confirm that pdfminer.six version 20200402 is supported.
|
2020-03-03 03:37:39 -08:00
|
|
|
|
2020-04-10 12:53:24 -07:00
|
|
|
v9.7.1
|
|
|
|
======
|
|
|
|
|
|
|
|
- Fixed version check failing when used with qpdf 10.0.0.
|
|
|
|
- Added some missing type annotations.
|
|
|
|
- Updated documentation to warn about need for "ifmain" guard and Windows.
|
|
|
|
|
2020-03-29 22:45:25 -07:00
|
|
|
v9.7.0
|
|
|
|
======
|
|
|
|
|
|
|
|
- Fixed an error in watcher.py if ``OCR_JSON_SETTINGS`` was not defined.
|
|
|
|
- Ghostscript 9.51 is now blacklisted, due to numerous problems with this version.
|
|
|
|
- Added a workaround for a problem with "txtwrite" in Ghostscript 9.52.
|
|
|
|
- Fixed an issue where the incorrect number of threads used was shown when
|
|
|
|
``OMP_THREAD_LIMIT`` was manipulated.
|
|
|
|
- Removed a possible performance bottleneck for files that use hundreds to
|
|
|
|
thousands of images on the same page.
|
|
|
|
- Documentation improvements.
|
|
|
|
- Optimization will now be applied to some monochrome images that have a color
|
|
|
|
profile defined instead of only black and white.
|
|
|
|
- ICC profiles are consulted when determining the simplified colorspace of an
|
|
|
|
image.
|
|
|
|
|
2020-03-03 03:37:39 -08:00
|
|
|
v9.6.1
|
|
|
|
======
|
|
|
|
|
|
|
|
- Documentation improvements - thanks to many users for their contributions!
|
|
|
|
|
|
|
|
- Fixed installation instructions for ArchLinux (@pigmonkey)
|
|
|
|
- Updated installation instructions for FreeBSD and other OSes (@knobix)
|
|
|
|
- Added instructions for using Docker Compose with watchdog (@ianalexander,
|
|
|
|
@deisi)
|
|
|
|
- Other miscellany (@mb720, @toy, @caiofacchinato)
|
|
|
|
- Some scripts provided in the documentation have been migrated out so that
|
|
|
|
they can be copied out as whole files, and to ensure syntax checking
|
|
|
|
is maintained.
|
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed an error that caused bash completions to fail on macOS. (:issue:`502,504`;
|
2020-03-03 03:37:39 -08:00
|
|
|
@AlexanderWillner)
|
|
|
|
- Fixed a rare case where OCRmyPDF threw an exception while processing a PDF
|
|
|
|
with the wrong object type in its ``/Trailer /Info``. The error is now logged
|
2021-06-16 00:39:40 -07:00
|
|
|
and the incorrect object is ignored. (:issue:`497`)
|
2020-03-03 03:37:39 -08:00
|
|
|
- Removed potentially non-free file ``enron1.pdf`` and simplified the test that
|
|
|
|
used it.
|
|
|
|
- Removed potentially non-free file ``misc/media/logo.afdesign``.
|
|
|
|
|
2020-02-10 01:01:49 -08:00
|
|
|
v9.6.0
|
|
|
|
======
|
|
|
|
|
2020-02-10 01:13:28 -08:00
|
|
|
- Fixed a regression with transferring metadata from the input PDF to the output
|
|
|
|
PDF in certain situations.
|
2020-02-10 01:01:49 -08:00
|
|
|
- pdfminer.six is now supported up to version 2020-01-24.
|
|
|
|
- Messages explaining page rotation decisions are now shown at the standard
|
|
|
|
verbosity level again when ``--rotate-pages`` is used. In some previous versions they
|
|
|
|
were set to debug level messages that only appeared with the parameter ``-v1``.
|
2020-02-10 01:13:28 -08:00
|
|
|
- Improvements to ``misc/watcher.py``. Thanks to @ianalexander and @svenihoney.
|
2020-02-10 01:01:49 -08:00
|
|
|
- Documentation improvements.
|
|
|
|
|
2020-01-17 03:11:33 -08:00
|
|
|
v9.5.0
|
|
|
|
======
|
|
|
|
|
|
|
|
- Added API functions to measure OCR quality.
|
2020-01-18 01:48:33 -08:00
|
|
|
- Modest improvements to handling PDFs with difficult/non-compliant metadata.
|
2020-01-17 03:11:33 -08:00
|
|
|
|
2020-01-05 21:35:52 -08:00
|
|
|
v9.4.0
|
|
|
|
======
|
|
|
|
|
|
|
|
- Updated recommended dependency versions.
|
|
|
|
- Improvements to test coverage and changes to facilitate better measurement of
|
|
|
|
test coverage, such as when tests run in subprocesses.
|
|
|
|
- Improvements to error messages when Leptonica is not installed correctly.
|
|
|
|
- Fixed use of pytest "session scope" that may have caused some intermittent
|
|
|
|
CI failures.
|
|
|
|
- When the argument ``--keep-temporary-files`` or verbosity is set to ``-v1``,
|
|
|
|
a debug log file is generated in the working temporary folder.
|
|
|
|
|
2019-12-28 15:42:24 -08:00
|
|
|
v9.3.0
|
|
|
|
======
|
|
|
|
|
|
|
|
- Improved native Windows support: we now check in the obvious places in
|
|
|
|
the "Program Files" folders for installations of Tesseract and Ghostscript,
|
|
|
|
rather than relying on the user to edit ``PATH`` to specify their location.
|
|
|
|
The ``PATH`` environment variable can still be used to differentiate when
|
|
|
|
multiple installations are present or the programs are installed to non-
|
|
|
|
standard locations.
|
|
|
|
- Fixed an exception on parsing Ghostscript error messages.
|
|
|
|
- Added an improved example demonstrating how to set up a watched folder
|
|
|
|
for automated OCR processing (thanks to @ianalexander for the contribution).
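
The lookup order described in the first item above can be sketched roughly as follows. This is a minimal illustration, not OCRmyPDF's actual implementation: the function name ``find_program`` and the idea of passing the search roots as a list are assumptions made for the example. ``PATH`` is consulted first, so it can still be used to choose between multiple installations.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def find_program(name: str, search_roots: Iterable[Path]) -> Optional[Path]:
    """Return the path to ``name``, preferring PATH over the search roots.

    Hypothetical helper for illustration only.
    """
    on_path = shutil.which(name)
    if on_path:
        return Path(on_path)
    for root in search_roots:
        if not root.is_dir():
            continue
        # Search below the root, e.g. <root>/gs/<version>/bin/<name>
        matches = sorted(root.rglob(name))
        if matches:
            # Lexicographically last match: a crude guess at "newest"
            return matches[-1]
    return None
```

On Windows the search roots would typically come from environment variables such as ``%PROGRAMFILES%``; which executable names to look for is left to the caller.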
|
|
|
|
|
2019-12-11 13:13:51 -08:00
|
|
|
v9.2.0
|
|
|
|
======
|
|
|
|
|
|
|
|
- Native Windows is now supported.
|
|
|
|
- Continuous integration moved to Azure Pipelines.
|
|
|
|
- Improved test coverage and speed of tests.
|
|
|
|
- Fixed an issue where a page that was originally a JPEG would be saved as a
|
|
|
|
PNG, increasing file size. This occurred only when a preprocessing option
|
|
|
|
was selected along with ``--output-type=pdf`` and all images on the original
|
|
|
|
page were JPEGs. Regression since v7.0.0.
|
|
|
|
- OCRmyPDF no longer depends on the QPDF executable ``qpdf`` or ``libqpdf``.
|
|
|
|
It uses pikepdf (which in turn depends on ``libqpdf``). Package maintainers
|
|
|
|
should adjust dependencies so that OCRmyPDF no longer calls for libqpdf on
|
|
|
|
its own. For users of Python binary wheels, this change means a separate
|
|
|
|
installation of QPDF is no longer necessary. This change is mainly to
|
|
|
|
simplify installation on Windows.
|
|
|
|
- Fixed a rare case where log messages from Tesseract would be discarded.
|
|
|
|
- Fixed incorrect function signature for pixFindPageForeground, causing
|
|
|
|
exceptions on certain platforms/Leptonica versions.
|
|
|
|
|
2019-11-18 15:17:00 -08:00
|
|
|
v9.1.1
|
|
|
|
======
|
|
|
|
|
|
|
|
- Expand the range of pdfminer.six versions that are supported.
|
|
|
|
- Fixed Docker build when using pikepdf 1.7.0.
|
|
|
|
- Fixed documentation to recommend using pip from get-pip.py.
|
|
|
|
|
2019-11-11 22:39:33 -08:00
|
|
|
v9.1.0
|
|
|
|
======
|
|
|
|
|
|
|
|
- Improved diagnostics when file size increases at output. Now warns if JBIG2
|
|
|
|
or pngquant were not available.
|
|
|
|
- pikepdf 1.7.0 is now required, to pick up changes that remove the need for
|
|
|
|
a source install on Linux systems running Python 3.8.
|
|
|
|
|
2019-11-04 03:00:15 -08:00
|
|
|
v9.0.5
|
|
|
|
======
|
|
|
|
|
|
|
|
- The Alpine Docker image (jbarlow83/ocrmypdf-alpine) has been dropped due to
|
|
|
|
the difficulties of supporting Alpine Linux.
|
|
|
|
- The primary Docker image (jbarlow83/ocrmypdf) has been improved to take on
|
|
|
|
the extra features that used to be exclusive to the Alpine image.
|
|
|
|
- No changes to application code.
|
2019-11-04 03:15:59 -08:00
|
|
|
- pdfminer.six version 20191020 is now supported.
|
2019-11-04 03:00:15 -08:00
|
|
|
|
2019-10-20 03:20:54 -07:00
|
|
|
v9.0.4
|
|
|
|
======
|
|
|
|
|
2019-11-03 01:49:36 -08:00
|
|
|
- Fixed compatibility with Python 3.8 (but requires source install for the moment).
|
2019-10-20 03:20:54 -07:00
|
|
|
- Fixed Tesseract settings for ``--user-words`` and ``--user-patterns``.
|
2019-10-24 16:58:39 -07:00
|
|
|
- Changed to pikepdf 1.6.5 (for Python 3.8).
|
|
|
|
- Changed to Pillow 6.2.0 (to mitigate a security vulnerability in earlier Pillow).
|
|
|
|
- A debug message now mentions when English is automatically selected if the locale
|
|
|
|
is not English.
|
2019-10-20 03:20:54 -07:00
|
|
|
|
2019-09-05 13:17:26 -07:00
|
|
|
v9.0.3
|
|
|
|
======
|
|
|
|
|
|
|
|
- Embed an encoded version of the sRGB ICC profile in the intermediate
|
|
|
|
Postscript file (used for PDF/A conversion). Previously we included the
|
|
|
|
filename, which required Postscript to run with file access enabled. For
|
|
|
|
security, Ghostscript 9.28 enables ``-dSAFER`` and as such, no longer
|
|
|
|
permits access to any file by default. This fix is necessary for
|
|
|
|
compatibility with Ghostscript 9.28.
|
2019-09-05 13:39:43 -07:00
|
|
|
- Exclude a test that sometimes times out and fails in continuous integration
|
|
|
|
from the standard test suite.
|
2019-09-05 13:17:26 -07:00
|
|
|
|
2019-09-04 02:34:21 -07:00
|
|
|
v9.0.2
|
|
|
|
======
|
|
|
|
|
|
|
|
- The image optimizer now skips optimizing flate (PNG) encoded images in some
|
|
|
|
situations where the optimization effort was likely wasted.
|
|
|
|
- The image optimizer now ignores images that specify arbitrary decode arrays,
|
|
|
|
since these are rare.
|
|
|
|
- Fixed an issue that caused inversion of black and white in monochrome images.
|
|
|
|
We are not certain but the problem seems to be linked to Leptonica 1.76.0 and
|
|
|
|
older.
|
2019-09-05 13:39:43 -07:00
|
|
|
- Fixed some cases where the test suite failed if
|
2019-09-04 02:34:21 -07:00
|
|
|
English or German Tesseract language packs were not installed.
|
|
|
|
- Fixed a runtime error if the Tesseract English language is not installed.
|
|
|
|
- Improved explicit closing of Pillow images after use.
|
|
|
|
- Actually fixed the Alpine Docker image build.
|
|
|
|
- Changed to pikepdf 1.6.3.
|
|
|
|
|
2019-08-11 17:14:11 -07:00
|
|
|
v9.0.1
|
|
|
|
======
|
|
|
|
|
|
|
|
- Fixed test suite failing when either of the optional dependencies unpaper and
|
|
|
|
pngquant were missing.
|
2019-09-04 02:34:21 -07:00
|
|
|
- Attempted fix of Alpine Docker image build.
|
2019-08-11 17:14:11 -07:00
|
|
|
- Documented that FreeBSD ports are now available.
|
2019-09-04 02:34:21 -07:00
|
|
|
- Changed to pikepdf 1.6.1.
|
2019-08-11 17:14:11 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
v9.0.0
|
|
|
|
======
|
2017-06-13 10:15:11 -07:00
|
|
|
|
2019-06-23 16:54:53 -07:00
|
|
|
**Breaking changes**
|
2018-04-06 14:52:40 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
- The ``--mask-barcodes`` experimental feature has been dropped due to poor
|
|
|
|
reliability and occasional crashes, both due to the underlying library that
|
|
|
|
implements this feature (Leptonica).
|
2019-07-27 03:23:56 -07:00
|
|
|
- The ``-v`` (verbosity level) parameter now accepts only ``0``, ``1``, and
|
|
|
|
``2``.
|
2019-07-27 16:15:48 -07:00
|
|
|
- Dropped support for Tesseract 4.00.00-alpha releases. Tesseract 4.0 beta and
|
2019-07-27 03:23:56 -07:00
|
|
|
later remain supported.
|
2019-07-27 16:15:48 -07:00
|
|
|
- Dropped the ``ocrmypdf-polyglot`` and ``ocrmypdf-webservice`` images.
|
2019-07-10 13:36:06 -07:00
|
|
|
|
2019-07-27 16:15:48 -07:00
|
|
|
**New features**
|
2019-07-10 13:36:06 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
- Added a high level API for applications that want to integrate OCRmyPDF.
|
|
|
|
Special thanks to Martin Wind (@mawi1988) who made significant contributions
|
2020-08-05 00:44:42 -07:00
|
|
|
to this effort.
|
2019-07-27 03:23:56 -07:00
|
|
|
- Added progress bars for long-running steps. ■■■■■■■□□
|
2019-07-27 16:15:48 -07:00
|
|
|
- We now create linearized ("fast web view") PDFs by default. The new parameter
|
|
|
|
``--fast-web-view`` provides control over when this feature is applied.
|
|
|
|
- Added a new ``--pages`` feature to limit OCR to only a specific page range.
|
|
|
|
The list may contain commas or single pages, such as ``1, 3, 5-11``.
|
2019-07-27 03:23:56 -07:00
|
|
|
- When the number of pages is small compared to the number of allowed jobs, we
|
|
|
|
run Tesseract in multithreaded (OpenMP) mode when available. This should
|
|
|
|
improve performance on files with low page counts.
|
2019-07-27 02:03:42 -07:00
|
|
|
- Removed dependency on ``ruffus``, and with that, the non-reentrancy
|
|
|
|
restrictions that previously made an API impossible.
|
2019-07-27 03:23:56 -07:00
|
|
|
- Output and logging messages overhauled so that ocrmypdf may be integrated
|
|
|
|
into applications that use the logging module.
|
|
|
|
- pikepdf 1.6.0 is required.
|
2019-07-27 16:15:48 -07:00
|
|
|
- Added a logo. 😊
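
As an illustration of the ``--pages`` syntax mentioned above, the following sketch expands a specification such as ``1, 3, 5-11`` into a set of page numbers. It is not OCRmyPDF's own parser; the function name ``parse_page_ranges`` and its error handling are assumptions made for the example.

```python
from typing import Set


def parse_page_ranges(spec: str) -> Set[int]:
    """Expand a string such as "1, 3, 5-11" into a set of 1-based page numbers.

    Hypothetical helper for illustration only.
    """
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return pages
```

A set is used so that overlapping entries such as ``3, 2-4`` collapse naturally.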
|
2019-07-10 13:36:06 -07:00
|
|
|
|
2019-07-27 16:15:48 -07:00
|
|
|
**Bug fixes**
|
2019-07-03 02:22:50 -07:00
|
|
|
|
2019-07-27 16:15:48 -07:00
|
|
|
- Pages with vector artwork are treated as full color. Previously, vectors
|
|
|
|
were ignored when considering the colorspace needed to cover a page, which
|
|
|
|
could cause loss of color under certain settings.
|
2019-07-27 02:03:42 -07:00
|
|
|
- Test suite now spawns processes less frequently, allowing more accurate
|
|
|
|
measurement of code coverage.
|
2019-07-27 03:23:56 -07:00
|
|
|
- Improved test coverage.
|
|
|
|
- Fixed a rare division by zero (if optimization produced an invalid file).
|
2019-07-27 02:03:42 -07:00
|
|
|
- Updated Docker images to use newer versions.
|
2019-07-27 03:23:56 -07:00
|
|
|
- Fixed an issue where images encoded as JBIG2 with a colorspace other than ``/DeviceGray``
|
|
|
|
were not interpreted correctly.
|
2019-07-30 00:39:14 -07:00
|
|
|
- Fixed an OCR text-image registration (i.e. alignment) problem when the page
|
|
|
|
MediaBox had a nonzero corner.
|
2019-07-03 02:22:50 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
v8.3.2
|
|
|
|
======
|
2019-05-11 12:50:44 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
- Dropped workaround for macOS that allowed it to work without pdfminer.six,
|
|
|
|
now that a proper sdist release of pdfminer.six is available.
|
2019-05-11 12:50:44 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
- pikepdf 1.5.0 is now required.
|
2019-05-11 12:50:44 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
v8.3.1
|
|
|
|
======
|
2019-05-11 12:50:44 -07:00
|
|
|
|
2019-07-27 02:03:42 -07:00
|
|
|
- Fixed an issue where PDFs with malformed metadata would be rendered as
|
2021-06-16 00:39:40 -07:00
|
|
|
blank pages. :issue:`398`.
|
2019-05-11 12:50:44 -07:00
|
|
|
|
|
|
|
v8.3.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Improved the strategy for updating pages when a new image of the page
|
|
|
|
was produced. We now attempt to preserve more content from the
|
|
|
|
original file, for annotations in particular.
|
|
|
|
- For PDFs with more than 100 pages and a sequence where one PDF page
|
|
|
|
was replaced and one or more subsequent ones were skipped, an
|
|
|
|
intermediate file would be corrupted while grafting OCR text, causing
|
|
|
|
processing to fail. This is a regression, likely introduced in
|
|
|
|
v8.2.4.
|
|
|
|
- Previously, we resized the images produced by Ghostscript by a small
|
|
|
|
number of pixels to ensure the output image size was exactly what
|
|
|
|
we wanted. Having discovered a way to get Ghostscript to produce the
|
|
|
|
exact image sizes we require, we eliminated the resizing step.
|
|
|
|
- Command line completions for ``bash`` are now available, in addition
|
|
|
|
to ``fish``, both in ``misc/completion``. Package maintainers, please
|
|
|
|
install these so users can take advantage.
|
|
|
|
- Updated requirements.
|
|
|
|
- pikepdf 1.3.0 is now required.
|
2019-05-11 12:50:44 -07:00
|
|
|
|
2019-04-23 00:07:12 -07:00
|
|
|
v8.2.4
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Fixed a false positive while checking for a certain type of PDF that
|
|
|
|
only Acrobat can read. We now more accurately detect Acrobat-only
|
|
|
|
PDFs.
|
|
|
|
- OCRmyPDF holds fewer open file handles and is more prompt about
|
|
|
|
releasing those it no longer needs.
|
|
|
|
- Minor optimization: we no longer traverse the table of contents to
|
|
|
|
ensure all references in it are resolved, as changes to libqpdf have
|
|
|
|
made this unnecessary.
|
|
|
|
- pikepdf 1.2.0 is now required.
|
2019-04-23 00:07:12 -07:00
|
|
|
|
2019-04-03 01:19:12 -07:00
|
|
|
v8.2.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2019-04-03 01:19:12 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed that ``--mask-barcodes`` would occasionally leave an unwanted
|
|
|
|
temporary file named ``junkpixt`` in the current working folder.
|
|
|
|
- Fixed (hopefully) handling of Leptonica errors in an environment
|
|
|
|
where a non-standard ``sys.stderr`` is present.
|
|
|
|
- Improved help text for ``--verbose``.
|
2019-04-03 01:19:12 -07:00
|
|
|
|
2019-03-07 14:27:16 -08:00
|
|
|
v8.2.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2019-03-06 22:22:50 -08:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed a regression from v8.2.0, an exception that occurred while
|
|
|
|
attempting to report that ``unpaper`` or another optional dependency
|
|
|
|
was unavailable.
|
|
|
|
- In some cases, ``ocrmypdf [-c|--clean]`` failed to exit with an error
|
|
|
|
when ``unpaper`` is not installed.
|
2019-03-06 22:22:50 -08:00
|
|
|
|
2019-03-07 14:27:16 -08:00
|
|
|
v8.2.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2019-03-07 14:27:16 -08:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- This release was canceled.
|
2019-03-07 14:27:16 -08:00
|
|
|
|
2019-03-03 14:15:20 -08:00
|
|
|
v8.2.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- A major improvement to our Docker image is now available thanks to
|
|
|
|
hard work contributed by @mawi12345. The new Docker image,
|
|
|
|
ocrmypdf-alpine, is based on Alpine Linux, and includes most of the
|
|
|
|
functionality of three existing images in a smaller package. This
|
|
|
|
image will replace the main Docker image eventually but for now all
|
|
|
|
are being built. `See documentation for
|
|
|
|
details <https://ocrmypdf.readthedocs.io/en/latest/docker.html>`__.
|
|
|
|
- Documentation reorganized especially around the use of Docker images.
|
|
|
|
- Fixed a problem with PDF image optimization, where the optimizer
|
|
|
|
would unnecessarily decompress and recompress PNG images, in some
|
|
|
|
cases losing the benefits of the quantization it had just
|
|
|
|
performed. The optimizer is now capable of embedding PNG images into
|
|
|
|
PDFs without transcoding them.
|
|
|
|
- Fixed a minor regression with lossy JBIG2 image optimization. All
|
|
|
|
JBIG2 candidate images were incorrectly placed into a single
|
|
|
|
optimization group for the whole file, instead of grouping pages
|
|
|
|
together. This usually makes a larger JBIG2Globals dictionary and
|
|
|
|
results in inferior compression, so it worked less well than
|
|
|
|
designed. However, quality would not be impacted. Lossless JBIG2 was
|
|
|
|
entirely unaffected.
|
|
|
|
- Updated dependencies, including pikepdf to 1.1.0. This fixes
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`358`.
|
2019-06-22 17:29:26 -07:00
|
|
|
- The install-time version checks for certain external programs have
|
|
|
|
been removed from setup.py. These tests are now performed at
|
|
|
|
run-time.
|
|
|
|
- The non-standard option to override install-time checks
|
|
|
|
(``setup.py install --force``) is now deprecated and prints a
|
|
|
|
warning. It will be removed in a future release.
|
2019-03-03 14:15:20 -08:00
|
|
|
|
2019-02-07 17:06:51 -08:00
|
|
|
v8.1.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Added a feature, ``--unpaper-args``, which allows passing arbitrary
|
|
|
|
arguments to ``unpaper`` when using ``--clean`` or ``--clean-final``.
|
|
|
|
The default, very conservative unpaper settings are suppressed.
|
|
|
|
- The argument ``--clean-final`` now implies ``--clean``. It was
|
|
|
|
possible to issue ``--clean-final`` on its own before this, but it would
|
|
|
|
have no useful effect.
|
|
|
|
- Fixed an exception on traversing corrupt table of contents entries
|
|
|
|
(specifically, those with invalid destination objects).
|
|
|
|
- Fixed an issue when using ``--tesseract-timeout`` and image
|
|
|
|
processing features on a file with more than 100 pages.
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`347`
|
2019-06-22 17:29:26 -07:00
|
|
|
- OCRmyPDF now always calls ``os.nice(5)`` to signal to operating
|
|
|
|
systems that it is a background process.
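
A minimal sketch of the ``os.nice(5)`` call mentioned above. The wrapper name ``lower_priority`` and the guard for platforms without ``os.nice`` are assumptions added for the example; the release note only states that ``os.nice(5)`` is called.

```python
import os
from typing import Optional


def lower_priority(increment: int = 5) -> Optional[int]:
    """Raise this process's niceness so the OS treats it as background work.

    Returns the new niceness, or None if the call is unsupported or fails.
    Hypothetical wrapper for illustration only.
    """
    if not hasattr(os, "nice"):
        return None  # os.nice is only available on Unix-like systems
    try:
        return os.nice(increment)
    except OSError:
        return None
```

Niceness is inherited by child processes, so calling this once at startup would also affect any subprocesses launched afterwards.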
|
2019-02-10 02:10:48 -08:00
|
|
|
|
2019-01-17 00:57:28 -08:00
|
|
|
v8.0.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2019-01-17 00:57:28 -08:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed an exception when parsing PDFs that are missing a required
|
2021-06-16 00:39:40 -07:00
|
|
|
field. :issue:`325`
|
2019-06-22 17:29:26 -07:00
|
|
|
- pikepdf 1.0.5 is now required, to address some other PDF parsing
|
|
|
|
issues.
|
2019-01-17 00:57:28 -08:00
|
|
|
|
2019-01-05 23:35:47 -08:00
|
|
|
v8.0.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-12-19 16:41:09 -08:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
No major features. The intent of this release is to sever support for
|
|
|
|
older versions of certain dependencies.
|
2019-01-05 23:35:47 -08:00
|
|
|
|
|
|
|
**Breaking changes**
|
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Dropped support for Tesseract 3.x. Tesseract 4.0 or newer is now
|
|
|
|
required.
|
|
|
|
- Dropped support for Python 3.5.
|
|
|
|
- Some ``ocrmypdf.pdfa`` APIs that were deprecated in v7.x were
|
|
|
|
removed. This functionality has been moved to pikepdf.
|
2019-01-05 23:35:47 -08:00
|
|
|
|
|
|
|
**Other changes**
|
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed an unhandled exception when attempting to mask barcodes.
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`322`
|
2019-06-22 17:29:26 -07:00
|
|
|
- It is now possible to use ocrmypdf without pdfminer.six, to support
|
|
|
|
distributions that do not have it or cannot currently use it (e.g.
|
|
|
|
Homebrew). Downstream maintainers should include pdfminer.six if
|
|
|
|
possible.
|
|
|
|
- A warning is now issued when PDF/A conversion removes some XMP
|
|
|
|
metadata from the input PDF. (Only a "whitelist" of certain XMP
|
|
|
|
metadata types are allowed in PDF/A.)
|
|
|
|
- Fixed several issues that caused PDF/As to be produced with
|
|
|
|
nonconforming XMP metadata (would fail validation with veraPDF).
|
|
|
|
- Fixed some instances where invalid DocumentInfo from a PDF caused XMP
|
|
|
|
metadata creation to fail.
|
|
|
|
- Fixed a few documentation problems.
|
|
|
|
- pikepdf 1.0.2 is now required.
|
2018-12-19 16:41:09 -08:00
|
|
|
|
2018-12-15 15:27:23 -08:00
|
|
|
v7.4.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- ``--force-ocr`` may now be used with the new ``--threshold`` and
|
|
|
|
``--mask-barcodes`` features
|
|
|
|
- pikepdf >= 0.9.1 is now required.
|
|
|
|
- Changed metadata handling to pikepdf 0.9.1. As a result, metadata
|
|
|
|
handling of non-ASCII characters in Ghostscript 9.25 or later is
|
|
|
|
fixed.
|
|
|
|
- chardet >= 3.0.4 is temporarily listed as required. pdfminer.six
|
|
|
|
depends on it, but the most recent release does not specify this
|
|
|
|
requirement.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`326`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- python-xmp-toolkit and libexempi are no longer required.
|
|
|
|
- A new Docker image is now being provided for users who wish to access
|
|
|
|
OCRmyPDF over a simple HTTP interface, instead of the command line.
|
|
|
|
- Increase tolerance of PDFs that overflow or underflow the PDF
|
|
|
|
graphics stack.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`325`)
|
2018-10-04 01:21:17 -07:00
|
|
|
|
2018-11-16 02:13:41 -08:00
|
|
|
v7.3.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-11-16 02:13:41 -08:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed performance regression from v7.3.0; fast page analysis was not
|
|
|
|
selected when it should be.
|
|
|
|
- Fixed a few exceptions related to the new ``--mask-barcodes`` feature
|
|
|
|
and improved argument checking
|
|
|
|
- Added missing detection of TrueType fonts that lack a Unicode mapping
|
2018-11-16 02:13:41 -08:00
|
|
|
|
2018-11-10 01:09:19 -08:00
|
|
|
v7.3.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Added a new feature ``--redo-ocr`` to detect existing OCR in a file,
|
|
|
|
remove it, and redo the OCR. This may be particularly helpful for
|
|
|
|
anyone who wants to take advantage of OCR quality improvements in
|
|
|
|
Tesseract 4.0. Note that OCR added by OCRmyPDF before version 3.0
|
|
|
|
cannot be detected since it was not properly marked as invisible text
|
|
|
|
in the earliest versions. OCR that constructs a font from visible
|
|
|
|
text, such as Adobe Acrobat's ClearScan, also cannot be detected.
|
|
|
|
- OCRmyPDF's content detection is generally more sophisticated. It
|
|
|
|
learns more about the contents of each PDF and makes better
|
|
|
|
recommendations:
|
|
|
|
|
|
|
|
- OCRmyPDF can now detect when a PDF contains text that cannot be
|
|
|
|
mapped to Unicode (meaning it is readable to human eyes but
|
|
|
|
copy-pastes as gibberish). In these cases it recommends
|
|
|
|
``--force-ocr`` to make the text searchable.
|
|
|
|
- PDFs containing vector objects are now rendered at more
|
|
|
|
appropriate resolution for OCR.
|
|
|
|
- We now exit with an error for PDFs that contain Adobe LiveCycle
|
|
|
|
Designer's dynamic XFA forms. Currently the open source community
|
|
|
|
does not have tools to work with these files.
|
|
|
|
- OCRmyPDF now warns when a PDF contains Adobe AcroForms, since
|
|
|
|
such files probably do not need OCR. It can work with these files.
|
|
|
|
|
|
|
|
- Added three new **experimental** features to improve OCR quality in
|
|
|
|
certain conditions. The name, syntax and behavior of these arguments
|
|
|
|
are subject to change. They may also be incompatible with some other
|
|
|
|
features.
|
|
|
|
|
|
|
|
- ``--remove-vectors`` which strips out vector graphics. This can
|
|
|
|
improve OCR quality since OCR will not search artwork for readable
|
|
|
|
text; however, it currently removes "text as curves" as well.
|
|
|
|
- ``--mask-barcodes`` to detect and suppress barcodes in files. We
|
|
|
|
have observed that barcodes can interfere with OCR because they
|
|
|
|
are "text-like" but not actually textual.
|
|
|
|
- ``--threshold`` which uses a more sophisticated thresholding
|
|
|
|
algorithm than is currently in use in Tesseract OCR. This works
|
|
|
|
around a `known issue in Tesseract
|
|
|
|
4.0 <https://github.com/tesseract-ocr/tesseract/issues/1990>`__
|
|
|
|
with dark text on bright backgrounds.
|
|
|
|
|
|
|
|
- Fixed an issue where an error message was not reported when the
|
|
|
|
installed Ghostscript was very old.
|
|
|
|
- The PDF optimizer now saves files with object streams enabled when
|
|
|
|
the optimization level is ``--optimize 1`` or higher (the default).
|
|
|
|
This makes files a little bit smaller, but requires PDF 1.5. PDF 1.5
|
|
|
|
was first released in 2003 and is broadly supported by PDF viewers,
|
|
|
|
but some rudimentary PDF parsers such as PyPDF2 do not understand
|
|
|
|
object streams. You can use the command line tool
|
|
|
|
``qpdf --object-streams=disable`` or
|
|
|
|
`pikepdf <https://github.com/pikepdf/pikepdf>`__ library to remove
|
|
|
|
them.
|
|
|
|
- New dependency: pdfminer.six 20181108. Note this is a fork of the
|
|
|
|
Python 2-only pdfminer.
|
|
|
|
- Deprecation notice: At the end of 2018, we will be ending support for
|
|
|
|
Python 3.5 and Tesseract 3.x. OCRmyPDF v7 will continue to work with
|
|
|
|
older versions.
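
For readers who need to strip object streams for an older PDF parser, the ``qpdf --object-streams=disable`` invocation mentioned above can be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and ``qpdf`` must be installed and on ``PATH`` for the second function to do anything.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def qpdf_disable_object_streams_cmd(src: Path, dst: Path) -> List[str]:
    """Build the qpdf command line that rewrites ``src`` without object streams."""
    return ["qpdf", "--object-streams=disable", str(src), str(dst)]


def remove_object_streams(src: Path, dst: Path) -> Optional[int]:
    """Run qpdf if it is installed; return its exit code, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    if shutil.which("qpdf") is None:
        return None
    completed = subprocess.run(qpdf_disable_object_streams_cmd(src, dst))
    return completed.returncode
```

Building the argument list separately keeps the command easy to inspect and avoids shell quoting issues with unusual file names.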
|
2018-11-10 01:09:19 -08:00
|
|
|
|
2018-10-11 15:55:01 -07:00
|
|
|
v7.2.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-10-11 15:55:01 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed compatibility with an API change in pikepdf 0.3.5.
|
2019-06-22 17:29:26 -07:00
|
|
|
- A kludge to support Leptonica versions older than 1.72 in the test
|
|
|
|
suite was dropped. Older versions of Leptonica are likely still
|
|
|
|
compatible. The only impact is that a portion of the test suite will
|
|
|
|
be skipped.
|
2018-10-11 15:55:01 -07:00
|
|
|
|
2018-10-04 01:21:17 -07:00
|
|
|
v7.2.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-10-04 01:21:17 -07:00
|
|
|
|
2018-10-05 01:27:00 -07:00
|
|
|
**Lossy JBIG2 behavior change**
|
2018-10-04 01:21:17 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
A user reported that ocrmypdf was in fact using JBIG2 in **lossy**
|
|
|
|
compression mode. This was not the intended behavior. Users should
|
|
|
|
`review the technical concerns with JBIG2 in lossy
|
|
|
|
mode <https://abbyy.technology/en:kb:tip:jbig2_compression_and_ocr>`__
|
|
|
|
and decide if this is a concern for their use case.
|
2018-10-04 01:21:17 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
JBIG2 lossy mode does achieve higher compression ratios than any other
|
|
|
|
monochrome compression technology; for large text documents the savings
|
|
|
|
are considerable. JBIG2 lossless still gives great compression ratios
|
|
|
|
and is a major improvement over the older CCITT G4 standard.
|
2018-10-04 01:21:17 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
Only users who have reviewed the concerns with JBIG2 in lossy mode
|
|
|
|
should opt-in. As such, lossy mode JBIG2 is only turned on when the new
|
|
|
|
argument ``--jbig2-lossy`` is issued. This is independent of the setting
|
|
|
|
for ``--optimize``.
|
2018-10-04 01:21:17 -07:00
|
|
|
|
|
|
|
Users who did not install an optional JBIG2 encoder are unaffected.
|
|
|
|
|
|
|
|
(Thanks to user 'bsdice' for reporting this issue.)
|
|
|
|
|
|
|
|
**Other issues**
|
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- When the image optimizer quantizes an image to 1 bit per pixel, it
|
|
|
|
will now attempt to further optimize that image as CCITT or JBIG2,
|
|
|
|
instead of keeping it in the "flate" encoding which is not efficient
|
|
|
|
for 1 bpp images.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`297`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Images in PDFs that are used as soft masks (i.e. transparency masks
|
|
|
|
or alpha channels) are now excluded from optimization.
|
|
|
|
- Fixed handling of Tesseract 4.0-rc1 which now accepts invalid
|
|
|
|
Tesseract configuration files, which broke the test suite.
|
2018-10-04 01:21:17 -07:00
|
|
|
|
2018-09-19 20:57:18 -07:00
|
|
|
v7.1.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Improve the performance of initial text extraction, which is done to
|
|
|
|
determine if a file contains existing text of some kind or not. On
|
|
|
|
large files, this initial processing is now about 20 times faster.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`299`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- pikepdf 0.3.3 is now required.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`231`, a
|
2019-06-22 17:29:26 -07:00
|
|
|
problem with JPEG2000 images where image metadata was only available
|
|
|
|
inside the JPEG2000 file.
|
|
|
|
- Fixed some additional Ghostscript 9.25 compatibility issues.
|
|
|
|
- Improved handling of KeyboardInterrupt error messages.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`301`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- README.md is now served in GitHub markdown instead of
|
|
|
|
reStructuredText.
|
2018-09-19 21:01:24 -07:00
|
|
|
|
2018-09-14 15:53:26 -07:00
|
|
|
v7.0.6
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-09-14 15:53:26 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Blacklist Ghostscript 9.24, now that 9.25 is available and fixes many
|
|
|
|
regressions in 9.24.
|
2018-09-14 15:53:26 -07:00
|
|
|
|
2018-09-13 23:29:54 -07:00
|
|
|
v7.0.5
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Improve compatibility with Ghostscript 9.24, and enable the JPEG
|
|
|
|
passthrough feature when this version is installed.
|
|
|
|
- Ghostscript 9.24 lost the ability to set PDF title, author, subject
|
|
|
|
and keyword metadata to Unicode strings. OCRmyPDF will set ASCII
|
|
|
|
strings and warn when Unicode is suppressed. Other software may be
|
|
|
|
used to update metadata. This is a short-term workaround.
|
|
|
|
- PDFs generated by Kodak Capture Desktop, or generally PDFs that
|
|
|
|
contain indirect references to null objects in their table of
|
|
|
|
contents, would have an invalid table of contents after processing by
|
|
|
|
OCRmyPDF that might interfere with other viewers. This has been
|
|
|
|
fixed.
|
|
|
|
- Detect PDFs generated by Adobe LiveCycle, which can only be displayed
|
|
|
|
in Adobe Acrobat and Reader currently. When these are encountered,
|
|
|
|
exit with an error instead of performing OCR on the "Please wait"
|
|
|
|
error message page.
|
2018-09-13 23:29:54 -07:00
|
|
|
|
2018-08-24 12:41:53 -07:00
|
|
|
v7.0.4
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-08-24 12:41:53 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed exception thrown when trying to optimize a certain type of PNG
|
2019-06-22 17:29:26 -07:00
|
|
|
embedded in a PDF with the ``-O2`` option.
|
|
|
|
- Update to pikepdf 0.3.2, to gain support for optimizing some
|
|
|
|
additional image types that were previously excluded from
|
|
|
|
optimization (CMYK and grayscale). Fixes
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`285`.
|
2018-08-24 12:41:53 -07:00
|
|
|
|
2018-08-10 16:59:08 -07:00
|
|
|
v7.0.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-08-10 16:59:08 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`284`, an error
|
2019-06-22 17:29:26 -07:00
|
|
|
when parsing inline images that are also image masks, by
|
|
|
|
upgrading pikepdf to 0.3.1
|
2018-08-10 16:59:08 -07:00
|
|
|
|
2018-08-03 13:37:18 -07:00
|
|
|
v7.0.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-08-03 13:37:18 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed a regression with ``--rotate-pages`` on pages that already had
|
2019-06-22 17:29:26 -07:00
|
|
|
rotations applied.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`279`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Improve quality of page rotation in some cases by rasterizing a
|
|
|
|
higher quality preview image.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`281`)
|
2018-08-03 13:37:18 -07:00
|
|
|
|
2018-08-01 15:17:33 -07:00
|
|
|
v7.0.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-08-01 15:17:33 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed compatibility with img2pdf >= 0.3.0 by rejecting input images
|
2019-06-22 17:29:26 -07:00
|
|
|
that have an alpha channel
|
|
|
|
- Add forward compatibility for pikepdf 0.3.0 (unrelated to img2pdf)
|
|
|
|
- Various documentation updates for v7.0.0 changes
|
2018-08-01 15:17:49 -07:00
|
|
|
|
2018-07-09 12:51:56 -07:00
|
|
|
v7.0.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- The core algorithm for combining OCR layers with existing PDF pages
|
|
|
|
has been rewritten and improved considerably. PDFs are no longer
|
|
|
|
split into single page PDFs for processing; instead, images are
|
|
|
|
rendered and the OCR results are grafted onto the input PDF. The new
|
|
|
|
algorithm uses less temporary disk space and is much more performant
|
|
|
|
especially for large files.
|
|
|
|
- New dependency: `pikepdf <https://github.com/pikepdf/pikepdf>`__.
|
|
|
|
pikepdf is a powerful new Python PDF library driving the latest
|
|
|
|
OCRmyPDF features, built on the QPDF C++ library (libqpdf).
|
|
|
|
- New feature: PDF optimization with ``-O`` or ``--optimize``. After
|
|
|
|
OCR, OCRmyPDF will perform image optimizations relevant to OCR PDFs.
|
|
|
|
|
|
|
|
- If a JBIG2 encoder is available, then monochrome images will be
|
|
|
|
converted, with the potential for huge savings on large black and
|
|
|
|
white images, since JBIG2 is far more efficient than any other
|
|
|
|
monochrome (bi-level) compression. (All known US patents related
|
|
|
|
to JBIG2 have probably expired, but it remains the responsibility
|
|
|
|
of the user to supply a JBIG2 encoder such as
|
|
|
|
`jbig2enc <https://github.com/agl/jbig2enc>`__. OCRmyPDF does not
|
|
|
|
implement JBIG2 encoding.)
|
|
|
|
- If ``pngquant`` is installed, OCRmyPDF will optionally use it to
|
|
|
|
perform lossy quantization and compression of PNG images.
|
|
|
|
- The quality of JPEGs can also be lowered, on the assumption that a
|
|
|
|
lower quality image may be suitable for storage after OCR.
|
|
|
|
- This image optimization component will eventually be offered as an
|
|
|
|
independent command line utility.
|
|
|
|
- Optimization ranges from ``-O0`` through ``-O3``, where ``0``
|
|
|
|
disables optimization and ``3`` implements all options. ``1``, the
|
|
|
|
default, performs only safe and lossless optimizations. (This is
|
|
|
|
similar to GCC's optimization parameter.) The exact type of
|
|
|
|
optimizations performed will vary over time.
|
|
|
|
|
|
|
|
- Small amounts of text in the margins of a page, such as watermarks,
|
|
|
|
page numbers, or digital stamps, will no longer prevent the rest of a
|
|
|
|
page from being OCRed when ``--skip-text`` is issued. This behavior
|
|
|
|
is based on a heuristic.
|
|
|
|
- Removed features
|
|
|
|
|
|
|
|
- The deprecated ``--pdf-renderer tesseract`` PDF renderer was
|
|
|
|
removed.
|
|
|
|
- ``-g``, the option to generate debug text pages, was removed
|
|
|
|
because it was a maintenance burden and only worked in isolated
|
|
|
|
cases. HOCR pages can still be previewed by running the
|
|
|
|
hocrtransform.py with appropriate settings.
|
|
|
|
|
|
|
|
- Removed dependencies
|
|
|
|
|
|
|
|
- ``PyPDF2``
|
|
|
|
- ``defusedxml``
|
|
|
|
- ``PyMuPDF``
|
|
|
|
|
|
|
|
- The ``sandwich`` PDF renderer can be used with all supported versions
|
|
|
|
of Tesseract, including those prior to v3.05 which don't support
|
|
|
|
``-c textonly``. (Tesseract v4.0.0 is recommended and more
|
|
|
|
efficient.)
|
|
|
|
- ``--pdf-renderer auto`` option and the diagnostics used to select a
|
|
|
|
PDF renderer now work better with old versions, but may make
|
|
|
|
different decisions than past versions.
|
|
|
|
- If everything succeeds but PDF/A conversion fails, a distinct return
|
|
|
|
code is now returned (``ExitCode.pdfa_conversion_failed (10)``) where
|
|
|
|
this situation previously returned
|
|
|
|
``ExitCode.invalid_output_pdf (4)``. The latter is now returned only
|
|
|
|
if there is some indication that the output file is invalid.
|
|
|
|
- Notes for downstream packagers
|
|
|
|
|
|
|
|
- There is also a new dependency on ``python-xmp-toolkit`` which in
|
|
|
|
turn depends on ``libexempi3``.
|
|
|
|
- It may be necessary to separately ``pip install pycparser`` to
|
|
|
|
avoid `another Python 3.7
|
|
|
|
issue <https://github.com/eliben/pycparser/pull/135>`__.
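For scripts that launch OCRmyPDF as a subprocess, the two return codes described in the v7.0.0 notes above might be told apart roughly as follows. This is a minimal sketch: the values 10 and 4 come from the notes, while the function name ``describe_exit`` and the message strings are hypothetical and not part of OCRmyPDF.

```python
# Return codes named in the v7.0.0 notes; other codes are not covered here.
PDFA_CONVERSION_FAILED = 10
INVALID_OUTPUT_PDF = 4


def describe_exit(returncode: int) -> str:
    """Map an ocrmypdf return code to a short, illustrative description."""
    if returncode == 0:
        return "success"
    if returncode == PDFA_CONVERSION_FAILED:
        return "OCR succeeded but PDF/A conversion failed"
    if returncode == INVALID_OUTPUT_PDF:
        return "output file appears to be invalid"
    return f"other failure ({returncode})"
```

A caller would typically pass the ``returncode`` attribute of a completed ``subprocess.run(...)`` call to this helper.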
|
2018-06-23 03:01:01 -07:00
|
|
|
|
2018-10-28 16:19:37 -07:00
|
|
|
v6.2.5
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Disable a failing test due to Tesseract 4.0rc1 behavior change.
|
|
|
|
Previously, Tesseract would exit with an error message if its
|
|
|
|
configuration was invalid, and OCRmyPDF would intercept this message.
|
|
|
|
Now Tesseract issues a warning, which OCRmyPDF v6.2.5 may relay or
|
|
|
|
ignore. (In v7.x, OCRmyPDF will respond to the warning.)
|
|
|
|
- This release branch no longer supports using the optional PyMuPDF
|
|
|
|
installation, since it was removed in v7.x.
|
|
|
|
- This release branch no longer supports macOS. macOS users should
|
|
|
|
upgrade to v7.x.
|
2018-10-28 16:19:37 -07:00
|
|
|
|
2018-09-16 15:07:53 -07:00
|
|
|
v6.2.4
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-09-16 15:07:53 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Backport Ghostscript 9.25 compatibility fixes, which remove support
|
|
|
|
for setting Unicode metadata
|
|
|
|
- Backport blacklisting Ghostscript 9.24
|
|
|
|
- Older versions of Ghostscript are still supported
|
2018-09-16 15:07:53 -07:00
|
|
|
|
2018-07-31 23:45:28 -07:00
|
|
|
v6.2.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-07-31 23:45:28 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed compatibility with img2pdf >= 0.3.0 by rejecting input images
|
2019-06-22 17:29:26 -07:00
|
|
|
that have an alpha channel
|
|
|
|
- This version will be included in Ubuntu 18.10
|
2018-07-31 23:45:28 -07:00
|
|
|
|
2018-07-09 13:56:23 -07:00
|
|
|
v6.2.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-07-09 13:56:23 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Backport compatibility fixes for Python 3.7 and ruffus 2.7.0 from
|
|
|
|
v7.0.0
|
|
|
|
- Backport fix to ignore masks when deciding what colors are on a page
|
|
|
|
- Backport some minor improvements from v7.0.0: better argument
|
|
|
|
validation and warnings about the Tesseract 4.0.0 ``--user-words``
|
|
|
|
regression
|
2018-07-09 13:56:23 -07:00
|
|
|
|
2018-06-23 03:01:01 -07:00
|
|
|
v6.2.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-06-23 03:01:01 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed recent versions of Tesseract (after 4.0.0-beta1) not being
|
|
|
|
detected as supporting the ``sandwich`` renderer (:issue:`271`).
|
2018-06-23 03:01:01 -07:00
|
|
|
|
2018-05-03 16:47:21 -07:00
|
|
|
v6.2.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- **Docker**: The Docker image ``ocrmypdf-tess4`` has been removed. The
|
|
|
|
main Docker images, ``ocrmypdf`` and ``ocrmypdf-polyglot`` now use
|
|
|
|
Ubuntu 18.04 as a base image, and as such Tesseract 4.0.0-beta1 is
|
|
|
|
now the Tesseract version they use. There is no Docker image based on
|
|
|
|
Tesseract 3.05 anymore.
|
|
|
|
- Creation of PDF/A-3 is now supported. However, there is no ability to
|
|
|
|
attach files to PDF/A-3.
|
|
|
|
- Lists more reasons why the file size might grow.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`262`,
|
2019-06-22 17:29:26 -07:00
|
|
|
``--remove-background`` error on PDFs containing colormapped
|
|
|
|
(paletted) images.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed another XMP metadata validation issue, in cases where the input
|
2019-06-22 17:29:26 -07:00
|
|
|
file's creation date has no timezone and the creation date is not
|
|
|
|
overridden.
|
2018-05-03 16:47:21 -07:00
|
|
|
|
2018-04-17 15:23:35 -07:00
|
|
|
v6.1.5
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-04-17 15:23:35 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`253`, a
|
2019-06-22 17:29:26 -07:00
|
|
|
possible division by zero when using the ``hocr`` renderer.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed incorrectly formatted ``<xmp:ModifyDate>`` field inside XMP
|
2019-06-22 17:29:26 -07:00
|
|
|
metadata for PDF/As. veraPDF flags this as a PDF/A validation
|
|
|
|
failure. The error caused the timezone and final digit of the
|
|
|
|
seconds of modified time to be omitted, so at worst the modification
|
|
|
|
time stamp is rounded to the nearest 10 seconds.
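As a rough illustration of the kind of timestamp the v6.1.5 fix is about (not OCRmyPDF's actual code), Python's standard library can serialize a date with both the full seconds field and the timezone offset. The helper name ``xmp_date`` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def xmp_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as ISO 8601 with seconds and offset."""
    if moment.tzinfo is None:
        # A naive datetime would silently drop the offset, so reject it.
        raise ValueError("expected a timezone-aware datetime")
    return moment.isoformat(timespec="seconds")


pacific = timezone(timedelta(hours=-7))
stamp = xmp_date(datetime(2018, 4, 17, 15, 23, 35, tzinfo=pacific))
```

Rejecting naive datetimes up front is a design choice in this sketch; it makes a missing timezone an explicit error rather than a silently shortened field.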
|
2018-04-17 15:23:35 -07:00
|
|
|
|
2018-04-05 02:14:33 -07:00
|
|
|
v6.1.4
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`248`:
|
2019-06-22 17:29:26 -07:00
|
|
|
``--clean`` argument may remove OCR from left column of text on
|
|
|
|
certain documents. We now set ``--layout none`` to suppress this.
|
|
|
|
- The test cache was updated to reflect the change above.
|
|
|
|
- Change test suite to accommodate Ghostscript 9.23's new ability to
|
|
|
|
insert JPEGs into PDFs without transcoding.
|
|
|
|
- XMP metadata in PDFs is now examined using ``defusedxml`` for safety.
|
|
|
|
- If an external process exits with a signal when asked to report its
|
|
|
|
version, we now print the system error message instead of suppressing
|
|
|
|
it. This occurred when the required executable was found but was
|
|
|
|
missing a shared library.
|
|
|
|
- qpdf 7.0.0 or newer is now required as the test suite can no longer
|
|
|
|
pass without it.
|
2018-04-10 15:53:02 -07:00
|
|
|
|
2018-04-12 16:28:48 -07:00
|
|
|
Notes
|
2019-06-22 17:29:26 -07:00
|
|
|
-----
|
2018-04-12 16:28:48 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- An apparent `regression in Ghostscript
|
|
|
|
9.23 <https://bugs.ghostscript.com/show_bug.cgi?id=699216>`__ will
|
|
|
|
cause some ocrmypdf output files to become invalid in rare cases; the
|
|
|
|
workaround for the moment is to set ``--force-ocr``.
|
2018-04-12 00:55:45 -07:00
|
|
|
|
2018-04-03 00:11:20 -07:00
|
|
|
v6.1.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-04-10 15:53:02 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`247`,
|
2019-06-22 17:29:26 -07:00
|
|
|
``/CreationDate`` metadata not copied from input to output.
|
|
|
|
- A warning is now issued when Python 3.5 is used on files with a large
|
|
|
|
page count, as this case is known to regress to single core
|
|
|
|
performance. The cause of this problem is unknown.
|
2018-04-03 00:11:20 -07:00
|
|
|
|
2018-03-30 12:39:33 -07:00
|
|
|
v6.1.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-03-30 14:00:36 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Upgrade to PyMuPDF v1.12.5 which includes a more complete fix to
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`239`.
|
2019-06-22 17:29:26 -07:00
|
|
|
- Add ``defusedxml`` dependency.
|
2018-03-30 12:39:33 -07:00
|
|
|
|
2018-03-30 00:11:52 -07:00
|
|
|
v6.1.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-03-30 00:11:52 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed text being reported as found on all pages if PyMuPDF is not
|
2019-06-22 17:29:26 -07:00
|
|
|
installed.
|
2018-03-30 00:11:52 -07:00
|
|
|
|
2018-03-28 00:39:32 -07:00
|
|
|
v6.1.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- PyMuPDF is now an optional but recommended dependency, to alleviate
|
|
|
|
installation difficulties on platforms that have less access to
|
|
|
|
PyMuPDF than the author anticipated. (For version 6.x only) install
|
|
|
|
OCRmyPDF with ``pip install ocrmypdf[fitz]`` to use it to its full
|
|
|
|
potential.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed ``FileExistsError`` that could occur if OCR timed out while it
|
2019-06-22 17:29:26 -07:00
|
|
|
was generating the output file.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`218`)
|
|
|
|
- Fixed table of contents/bookmarks all being redirected to page 1 when
|
2019-06-22 17:29:26 -07:00
|
|
|
generating a PDF/A (with PyMuPDF). (Without PyMuPDF the table of
|
|
|
|
contents is removed in PDF/A mode.)
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed "RuntimeError: invalid key in dict" when table of
|
2019-06-22 17:29:26 -07:00
|
|
|
contents/bookmarks titles contained the character ``)``.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`239`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Added a new argument ``--skip-repair`` to skip the initial PDF repair
|
|
|
|
step if the PDF is already well-formed (because another program
|
|
|
|
repaired it).
|
2018-03-26 01:44:01 -07:00
|
|
|
|
|
|
|
v6.0.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
2020-08-05 00:44:42 -07:00
|
|
|
- The software license has been changed to GPLv3 [it has since changed again].
|
|
|
|
Test resource files and some individual sources may have other licenses.
|
2019-06-22 17:29:26 -07:00
|
|
|
- OCRmyPDF now depends on
|
|
|
|
`PyMuPDF <https://pymupdf.readthedocs.io/en/latest/installation/>`__.
|
|
|
|
Including PyMuPDF is the primary reason for the change to GPLv3.
|
|
|
|
- Other backward incompatible changes
|
|
|
|
|
|
|
|
- The ``OCRMYPDF_TESSERACT``, ``OCRMYPDF_QPDF``, ``OCRMYPDF_GS`` and
|
|
|
|
``OCRMYPDF_UNPAPER`` environment variables are no longer used.
|
|
|
|
Change ``PATH`` if you need to override the external programs
|
|
|
|
OCRmyPDF uses.
|
|
|
|
- The ``ocrmypdf`` package has been moved to ``src/ocrmypdf`` to
|
|
|
|
avoid issues with accidental import.
|
|
|
|
- The function ``ocrmypdf.exec.get_program`` was removed.
|
|
|
|
- The deprecated module ``ocrmypdf.pageinfo`` was removed.
|
|
|
|
- The ``--pdf-renderer tess4`` alias for ``sandwich`` was removed.
|
|
|
|
|
|
|
|
- Fixed an issue where OCRmyPDF failed to detect existing text on
|
|
|
|
pages, depending on how the text and fonts were encoded within the
|
2021-06-16 00:39:40 -07:00
|
|
|
PDF. (:issue:`233,232`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed an issue that caused dramatic inflation of file sizes when
|
|
|
|
``--skip-text --output-type pdf`` was used. OCRmyPDF now removes
|
|
|
|
duplicate resources such as fonts, images and other objects that it
|
2021-06-16 00:39:40 -07:00
|
|
|
generates. (:issue:`237`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Improved performance of the initial page splitting step. Originally
|
|
|
|
this step was not believed to be expensive and ran in a single process.
|
|
|
|
Large file testing revealed it to be a bottleneck, so it is now
|
|
|
|
parallelized. On a 700-page file on a quad-core machine, this change
|
2021-06-16 00:39:40 -07:00
|
|
|
saves about 2 minutes. (:issue:`234`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- The test suite now includes a cache that can be used to speed up test
|
|
|
|
runs across platforms. This also does not require computing
|
2021-06-16 00:39:40 -07:00
|
|
|
checksums, so it's faster. (:issue:`217`)
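Since v6.0.0 no longer reads the ``OCRMYPDF_*`` environment variables, a script that needs a specific build of an external program can adjust ``PATH`` for the child process instead. The sketch below assumes a hypothetical helper name, ``env_with_tool_dir``, and a hypothetical directory; it only shows how such an environment could be built.

```python
import os


def env_with_tool_dir(tool_dir: str, base_env=None) -> dict:
    """Return a copy of the environment with tool_dir first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    old_path = env.get("PATH", "")
    # Prepend so executables in tool_dir take priority over system copies.
    env["PATH"] = tool_dir + (os.pathsep + old_path if old_path else "")
    return env
```

The returned dictionary can be passed as the ``env`` argument of ``subprocess.run`` when launching ``ocrmypdf``.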
|
2018-03-24 02:52:56 -07:00
|
|
|
|
2018-03-15 16:59:59 -07:00
|
|
|
v5.7.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Fixed an issue that caused poor CPU utilization on machines with more
|
2021-06-16 00:39:40 -07:00
|
|
|
than 4 cores when running Tesseract 4. (Related to :issue:`217`.)
|
2019-06-22 17:29:26 -07:00
|
|
|
- The 'hocr' renderer has been improved. The 'sandwich' and 'tesseract'
|
|
|
|
renderers are still better for most use cases, but 'hocr' may be
|
|
|
|
useful for people who work with the PDF.js renderer in English/ASCII
|
2021-06-16 00:39:40 -07:00
|
|
|
languages. (:issue:`225`)
|
2019-06-22 17:29:26 -07:00
|
|
|
|
|
|
|
- It now formats text in a manner that is easier for certain PDF
|
|
|
|
viewers to select, copy, and paste text. This should
|
|
|
|
help macOS Preview and PDF.js in particular.
|
|
|
|
- The appearance of selected text and behavior of selecting text is
|
|
|
|
improved.
|
|
|
|
- The PDF content stream now uses relative moves, making it more
|
|
|
|
compact and easier for viewers to determine when two words are on the
|
|
|
|
same line.
|
|
|
|
- It can now deal with text on a skewed baseline.
|
|
|
|
- Thanks to @cforcey for the pull request, @jbreiden for many
|
|
|
|
helpful suggestions, @ctbarbour for another round of improvements,
|
|
|
|
and @acaloiaro for an independent review.
|
2018-03-15 16:59:59 -07:00
|
|
|
|
2018-03-12 03:41:12 -07:00
|
|
|
v5.6.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-03-09 15:37:08 -08:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Suppress two debug messages that were too verbose
|
2018-03-09 15:37:08 -08:00
|
|
|
|
2018-03-12 03:41:12 -07:00
|
|
|
v5.6.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-03-12 03:41:12 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Development branch accidentally tagged as release. Do not use.
|
2018-03-12 03:41:12 -07:00
|
|
|
|
2018-03-09 08:00:42 -08:00
|
|
|
v5.6.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`219`: change
|
2019-06-22 17:29:26 -07:00
|
|
|
how the final output file is created to avoid triggering permission
|
|
|
|
errors when the output is a special file such as ``/dev/null``
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed test suite failures due to a qpdf 8.0.0 regression and Python
|
2019-06-22 17:29:26 -07:00
|
|
|
3.5's handling of symlinks
|
|
|
|
- The "encrypted PDF" error message was different depending on the type
|
|
|
|
of PDF encryption. Now a single clear message appears for all types
|
|
|
|
of PDF encryption.
|
|
|
|
- ocrmypdf is now in Homebrew. Homebrew users are advised to use the
|
|
|
|
version of ocrmypdf in the official homebrew-core formulas rather
|
|
|
|
than the private tap.
|
|
|
|
- Some linting
|
2018-02-27 15:08:22 -08:00
|
|
|
|
2018-02-07 16:48:04 -08:00
|
|
|
v5.6.0
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2018-02-07 16:48:04 -08:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`216`: preserve
|
2019-06-22 17:29:26 -07:00
|
|
|
"text as curves" PDFs without rasterizing file
|
|
|
|
- Related to the above, messages about rasterizing are more consistent
|
|
|
|
- For consistency, minor releases will now get the trailing .0
|
|
|
|
they always should have had.
|
2018-02-07 16:48:04 -08:00
|
|
|
|
2018-01-10 15:43:59 -08:00
|
|
|
v5.5
|
2019-06-22 17:29:26 -07:00
|
|
|
====
|
|
|
|
|
|
|
|
- Add new argument ``--max-image-mpixels``. Pillow 5.0 now raises an
|
|
|
|
exception when images may be decompression bombs. This argument can
|
|
|
|
be used to override the limit Pillow sets.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed output page cropped when using the sandwich renderer and OCR is
|
2019-06-22 17:29:26 -07:00
|
|
|
skipped on a rotated and image-processed page
|
|
|
|
- A warning is now issued when old versions of Ghostscript are used in
|
|
|
|
cases known to cause issues with non-Latin characters
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed a few parameter validation checks for ``--output-type pdfa-1`` and
|
2019-06-22 17:29:26 -07:00
|
|
|
``pdfa-2``
|
2018-01-10 15:43:59 -08:00
|
|
|
|
2017-11-26 23:08:55 -08:00
|
|
|
v5.4.4
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`181`: fix
|
2019-06-22 17:29:26 -07:00
|
|
|
final merge failure for PDFs with more pages than the system file
|
|
|
|
handle limit (``ulimit -n``)
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`200`: an
|
2019-06-22 17:29:26 -07:00
|
|
|
uncommon syntax for formatting decimal numbers in a PDF would cause
|
|
|
|
qpdf to issue a warning, which ocrmypdf treated as an error. Now this
|
|
|
|
warning is relayed.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed an issue where intermediate PDFs would be created at version 1.3
|
2019-06-22 17:29:26 -07:00
|
|
|
instead of the version of the original file. It's possible but
|
|
|
|
unlikely this had side effects.
|
|
|
|
- A warning is now issued when older versions of qpdf are used since
|
|
|
|
issues like
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`200` cause
|
2019-06-22 17:29:26 -07:00
|
|
|
qpdf to infinite-loop
|
|
|
|
- Address issue
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`140`: if
|
2019-06-22 17:29:26 -07:00
|
|
|
Tesseract outputs invalid UTF-8, escape it and print its message
|
|
|
|
instead of aborting with a Unicode error
|
|
|
|
- Adding previously unlisted setup requirement, pytest-runner
|
|
|
|
- Update documentation: fix an error in the example script for Synology
|
|
|
|
with Docker images, improved security guidance, advised
|
|
|
|
``pip install --user``
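The handling of invalid UTF-8 mentioned in the v5.4.4 notes above can be illustrated with Python's built-in decoding error handlers. This is a sketch of the general technique, not necessarily the approach OCRmyPDF uses; the function name ``safe_decode`` is hypothetical.

```python
def safe_decode(raw: bytes) -> str:
    """Decode bytes as UTF-8, escaping invalid bytes instead of raising."""
    # "backslashreplace" turns each undecodable byte into a \xNN escape.
    return raw.decode("utf-8", errors="backslashreplace")
```

With this handler, a message containing a stray byte can still be printed, and the escaped byte remains visible for debugging.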
|
2017-11-29 14:08:07 -08:00
|
|
|
|
2017-11-17 02:28:02 -08:00
|
|
|
v5.4.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-11-17 02:28:02 -08:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- If a subprocess fails to report its version when queried, exit
|
|
|
|
cleanly with an error instead of throwing an exception
|
|
|
|
- Added test to confirm that the system locale is Unicode-aware and
|
|
|
|
fail early if it's not
|
|
|
|
- Clarified some copyright information
|
|
|
|
- Updated pinned requirements.txt so the homebrew formula captures more
|
|
|
|
recent versions
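A check that the system locale is Unicode-aware, as described in the v5.4.3 notes above, could look something like the following. This is only a sketch under the assumption that "Unicode-aware" means the preferred encoding can represent non-ASCII text; the function name and the sample string are hypothetical.

```python
import locale


def encoding_handles_unicode(encoding: str) -> bool:
    """Return True if the named encoding can represent non-ASCII samples."""
    try:
        "\u00e9\u2713\u6f22".encode(encoding)
    except (UnicodeEncodeError, LookupError):
        return False
    return True


# The encoding the current locale prefers; a caller could fail early
# if encoding_handles_unicode(preferred) is False.
preferred = locale.getpreferredencoding(False)
```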
|
2017-11-17 02:28:02 -08:00
|
|
|
|
2017-10-26 18:15:31 -07:00
|
|
|
v5.4.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-10-26 18:15:31 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed a regression from v5.4.1 that caused sidecar files to be
|
|
|
|
created as empty files
|
2017-10-26 18:15:31 -07:00
|
|
|
|
2017-10-12 14:04:45 -07:00
|
|
|
v5.4.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-10-12 14:04:45 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Add workaround for Tesseract v4.00alpha crash when trying to obtain
|
|
|
|
orientation and the latest language packs are installed
|
2017-10-12 14:04:45 -07:00
|
|
|
|
|
|
|
v5.4
|
2019-06-22 17:29:26 -07:00
|
|
|
====
|
2017-10-08 12:41:03 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Change wording of a deprecation warning to improve clarity
|
|
|
|
- Added option to generate PDF/A-1b output if desired
|
|
|
|
(``--output-type pdfa-1``); default remains PDF/A-2b generation
|
|
|
|
- Update documentation
|
2017-09-01 12:50:45 -07:00
|
|
|
|
2017-09-01 16:17:26 -07:00
|
|
|
v5.3.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-09-01 12:50:45 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed missing error message that should occur when trying to force
|
|
|
|
``--pdf-renderer sandwich`` on old versions of Tesseract
|
|
|
|
- Update copyright information in test files
|
|
|
|
- Set system ``LANG`` to UTF-8 in Dockerfiles to avoid UTF-8 encoding
|
|
|
|
errors
|
2017-09-01 12:50:45 -07:00
|
|
|
|
2017-08-24 13:01:02 -07:00
|
|
|
v5.3.2
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-08-24 13:01:02 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed a broken test case related to language packs
|
2017-08-24 13:01:02 -07:00
|
|
|
|
2017-08-24 01:09:19 -07:00
|
|
|
v5.3.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-08-24 01:09:19 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Fixed wrong return code given for missing Tesseract language packs
|
|
|
|
- Fixed "brew audit" crashing on Travis when trying to auto-brew
|
2017-08-24 01:09:19 -07:00
|
|
|
|
2017-07-27 00:11:12 -07:00
|
|
|
v5.3
|
2019-06-22 17:29:26 -07:00
|
|
|
====
|
|
|
|
|
|
|
|
- Added ``--user-words`` and ``--user-patterns`` arguments which are
|
|
|
|
forwarded to Tesseract OCR as words and regular expressions
|
|
|
|
respectively, to guide OCR. Supplying a list of subject-domain
|
|
|
|
words should assist Tesseract with resolving words.
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`165`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Using a non Latin-1 language with the "hocr" renderer now warns about
|
|
|
|
possible OCR quality and recommends workarounds
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`176`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Output file path added to error message when that location is not
|
|
|
|
writable
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`175`)
|
2019-06-22 17:29:26 -07:00
|
|
|
- Otherwise valid PDFs with leading whitespace at the beginning of the
|
|
|
|
file are now accepted
|
2017-07-27 00:11:12 -07:00
|
|
|
|
2017-06-13 13:09:12 -07:00
|
|
|
v5.2
|
2019-06-22 17:29:26 -07:00
|
|
|
====
|
2017-06-13 13:09:12 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- When using Tesseract 3.05.01 or newer, OCRmyPDF will select the
|
|
|
|
"sandwich" PDF renderer by default, unless another PDF renderer is
|
|
|
|
specified with the ``--pdf-renderer`` argument. The previous behavior
|
|
|
|
was to select ``--pdf-renderer=hocr``.
|
|
|
|
- The "tesseract" PDF renderer is now deprecated, since it can cause
|
|
|
|
problems with Ghostscript on Tesseract 3.05.00
|
|
|
|
- The "tess4" PDF renderer has been renamed to "sandwich". "tess4" is
|
|
|
|
now a deprecated alias for "sandwich".
|
2017-06-13 13:09:12 -07:00
|
|
|
|
2017-05-29 14:36:50 -07:00
|
|
|
v5.1
|
2019-06-22 17:29:26 -07:00
|
|
|
====
|
2017-05-29 14:36:50 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Files with pages larger than 200" (5080 mm) in either dimension are
|
|
|
|
now supported with ``--output-type=pdf`` with the page size preserved
|
|
|
|
(in the PDF specification this feature is called UserUnit scaling).
|
|
|
|
Due to Ghostscript limitations this is not available in conjunction
|
|
|
|
with PDF/A output.
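The arithmetic behind UserUnit scaling can be sketched as follows, assuming the 200-inch limit corresponds to 14,400 default user space units (200 inches at 72 units per inch). The function name ``min_user_unit`` is hypothetical, and this is not OCRmyPDF's implementation.

```python
MAX_UNITS = 14400  # 200 inches at the default 72 units per inch


def min_user_unit(width_in: float, height_in: float) -> float:
    """Smallest UserUnit that keeps both page dimensions within MAX_UNITS."""
    largest_pt = max(width_in, height_in) * 72
    # Pages within the limit keep the default scale of 1.0.
    return max(1.0, largest_pt / MAX_UNITS)
```

For example, a page 400 inches wide would need a scale factor of 2.0, while an ordinary letter-size page keeps the default of 1.0.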
|
2017-05-11 23:11:12 -07:00
|
|
|
|
2017-05-14 23:59:09 -07:00
|
|
|
v5.0.1
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-05-14 23:59:09 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`169`,
|
2019-06-22 17:29:26 -07:00
|
|
|
exception due to failure to create sidecar text files on some
|
|
|
|
versions of Tesseract 3.04, including the jbarlow83/ocrmypdf Docker
|
|
|
|
image
|
2017-05-14 23:59:09 -07:00
|
|
|
|
2017-05-12 14:14:28 -07:00
|
|
|
v5.0
|
2019-06-22 17:29:26 -07:00
|
|
|
====
|
2017-03-29 15:43:54 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Backward incompatible changes
|
2017-05-11 23:11:12 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Support for Python 3.4 dropped. Python 3.5 is now required.
|
|
|
|
- Support for Tesseract 3.02 and 3.03 dropped. Tesseract 3.04 or
|
|
|
|
newer is required. Tesseract 4.00 (alpha) is supported.
|
|
|
|
- The OCRmyPDF.sh script was removed.
|
2017-05-11 23:11:12 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Add a new feature, ``--sidecar``, which allows creating "sidecar"
|
|
|
|
text files which contain the OCR results in plain text. This OCR
|
|
|
|
text is more reliable than extracting text from PDFs. Closes
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`126`.
|
2017-03-29 15:43:54 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- New feature: ``--pdfa-image-compression``, which allows overriding
|
|
|
|
Ghostscript's lossy-or-lossless image encoding heuristic and making
|
|
|
|
all images JPEG encoded or lossless encoded as desired. Fixes
|
2021-06-16 00:39:40 -07:00
|
|
|
:issue:`163`.
|
2019-06-22 17:29:26 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`143`, added
|
2019-06-22 17:29:26 -07:00
|
|
|
``--quiet`` to suppress "INFO" messages
|
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`164`, a typo
|
2017-05-01 15:55:02 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Removed the command line parameters ``-n`` and ``--just-print`` since
|
|
|
|
they have not worked for some time (reported as Ubuntu bug
|
|
|
|
`#1687308 <https://bugs.launchpad.net/ubuntu/+source/ocrmypdf/+bug/1687308>`__)
|
|
|
|
|
|
|
|
v4.5.6
|
|
|
|
======
|
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`156`,
|
2019-06-22 17:29:26 -07:00
|
|
|
'NoneType' object has no attribute 'getObject' on pages with no
|
|
|
|
optional /Contents record. This should resolve all issues related to
|
|
|
|
pages with no /Contents record.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`158`, ocrmypdf
|
2019-06-22 17:29:26 -07:00
|
|
|
now stops and terminates if Ghostscript fails on an intermediate
|
|
|
|
step, as it is not possible to proceed.
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`160`,
|
2019-06-22 17:29:26 -07:00
|
|
|
exception thrown on certain invalid arguments instead of error
|
|
|
|
message
|
2017-05-01 15:55:02 -07:00
|
|
|
|
2017-04-28 15:27:41 -07:00
|
|
|
v4.5.5
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-04-28 15:27:41 -07:00
|
|
|
|
2019-06-22 17:29:26 -07:00
|
|
|
- Automated update of macOS homebrew tap
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed :issue:`154`, KeyError
|
2019-06-22 17:29:26 -07:00
|
|
|
'/Contents' when searching for text on blank pages that have no
|
|
|
|
/Contents record. Note: incomplete fix for this issue.
|
2017-04-28 15:27:41 -07:00
|
|
|
|
2017-04-18 18:09:15 -07:00
|
|
|
v4.5.4
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
2017-04-18 18:09:15 -07:00
|
|
|
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed ``--skip-big`` raising an exception if a page contains no images
|
|
|
|
(:issue:`152`) (thanks
|
2019-06-22 17:29:26 -07:00
|
|
|
to @TomRaz)
|
2021-06-16 00:39:40 -07:00
|
|
|
- Fixed an issue where pages with no images might trigger "cannot write
|
2019-06-22 17:29:26 -07:00
|
|
|
mode P as JPEG"
|
2021-06-16 00:39:40 -07:00
|
|
|
(:issue:`151`)
|
2017-04-18 18:09:15 -07:00
|
|
|
|
2017-03-29 13:19:34 -07:00
|
|
|
v4.5.3
|
2019-06-22 17:29:26 -07:00
|
|
|
======
|
|
|
|
|
|
|
|
- Added a workaround for Ghostscript 9.21 and probably earlier versions
|
|
|
|
that would fail with the error message "VMerror -25", due to a Ghostscript
|
|
|
|
bug in XMP metadata handling
|
|
|
|
- High Unicode characters (U+10000 and up) are no longer accepted for
|
|
|
|
setting metadata on the command line, as Ghostscript may not handle
|
|
|
|
them correctly.
|
|
|
|
- Fixed an issue where the ``tess4`` renderer would duplicate content
|
|
|
|
onto output pages if tesseract failed or timed out
|
|
|
|
- Fixed ``tess4`` renderer not recognized when lossless reconstruction
|
|
|
|
is possible
|
2017-03-29 13:19:34 -07:00
|
|
|
|
2017-03-24 13:23:03 -07:00
|
|
|
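The v4.5.3 restriction on high Unicode characters can be pictured as a check
on code points. The following is a minimal sketch, not OCRmyPDF's actual
implementation; the names ``has_high_unicode`` and ``check_metadata_value``
are hypothetical and chosen only for illustration.

```python
def has_high_unicode(text: str) -> bool:
    """Return True if text contains any code point at U+10000 or above."""
    return any(ord(ch) >= 0x10000 for ch in text)


def check_metadata_value(name: str, value: str) -> str:
    """Hypothetical validator: reject values Ghostscript may mishandle."""
    if has_high_unicode(value):
        raise ValueError(
            f"--{name}: characters at U+10000 and up are not accepted"
        )
    return value
```

Characters inside the Basic Multilingual Plane, such as accented Latin
letters, pass this check; only supplementary-plane characters are rejected.
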
v4.5.2
======

- Fixed :issue:`147`, where ``--pdf-renderer tess4 --clean`` would produce
  an oversized page containing the original image in the bottom left
  corner, due to loss of DPI information.
- Make "using Tesseract 4.0" warning less ominous
- Set up machinery for Homebrew OCRmyPDF tap

v4.5.1
======

- Fixed :issue:`137`, proportions of images with a non-square pixel aspect
  ratio would be distorted in output for ``--force-ocr`` and some other
  combinations of flags

v4.5
====

- PDFs containing "Form XObjects" are now supported (:issue:`134`; PDF
  reference manual 8.10), and images they contain are taken into
  account when determining the resolution for rasterizing
- The Tesseract 4 Docker image no longer includes all languages,
  because it took so long to build that something would tend to fail
- OCRmyPDF now warns about using ``--pdf-renderer tesseract`` with
  Tesseract 3.04 or lower due to issues with Ghostscript corrupting the
  OCR text in these cases

v4.4.2
======

- The Docker images (ocrmypdf, ocrmypdf-polyglot, ocrmypdf-tess4) are
  now based on Ubuntu 16.10 instead of Debian stretch

  - This makes supporting the Tesseract 4 image easier
  - This could be a disruptive change for any Docker users who built
    customized versions of these images with their own changes, and
    made those changes in a way that depends on Debian and not Ubuntu

- OCRmyPDF now prevents running the Tesseract 4 renderer with Tesseract
  3.04, which was permitted in v4.4 and v4.4.1 but will not work

v4.4.1
======

- To prevent a `TIFF output
  error <https://github.com/python-pillow/Pillow/issues/2206>`__ caused
  by img2pdf >= 0.2.1 and Pillow <= 3.4.2, dependencies have been
  tightened
- The Tesseract 4.00 simultaneous process limit was increased from 1 to
  2, since it was observed that 1 lowers performance
- Documentation improvements to describe the ``--tesseract-config``
  feature
- Added test cases and fixed error handling for ``--tesseract-config``
- Tweaks to setup.py to deal with issues in the v4.4 release

v4.4
====

- Tesseract 4.00 is now supported on an experimental basis.

  - A new rendering option ``--pdf-renderer tess4`` exploits Tesseract
    4's new text-only output PDF mode. See the documentation on PDF
    Renderers for details.
  - The ``--tesseract-oem`` argument allows control over the Tesseract
    4 OCR engine mode (tesseract's ``--oem``). Use
    ``--tesseract-oem 2`` to enforce the new LSTM mode.
  - Fixed poor performance with Tesseract 4.00 on Linux

- Fixed an issue that caused corruption of output to stdout in some
  cases
- Removed test for Pillow JPEG and PNG support, as the minimum
  supported version of Pillow now enforces this
- OCRmyPDF now tests that the intended destination file is writable
  before proceeding
- The test suite now requires ``pytest-helpers-namespace`` to run (but
  not install)
- Significant code reorganization to make OCRmyPDF re-entrant and
  improve performance. All changes should be backward compatible for
  the v4.x series.

  - However, OCRmyPDF's dependency "ruffus" is not re-entrant, so no
    Python API is available. Scripts should continue to use the
    command line interface.

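The v4.4 check that the destination file is writable could look roughly like
the sketch below. This is an assumption about one reasonable way to perform
such a check, not the code OCRmyPDF uses; ``destination_is_writable`` is a
hypothetical name.

```python
import os
from pathlib import Path


def destination_is_writable(path: str) -> bool:
    """Best-effort check that `path` can be created or overwritten."""
    target = Path(path)
    if target.is_dir():
        # A directory cannot be used as the output file itself.
        return False
    if target.exists():
        # Existing file: we need write permission on the file.
        return os.access(target, os.W_OK)
    # New file: the parent directory must exist and be writable.
    parent = target.parent
    return parent.is_dir() and os.access(parent, os.W_OK)
```

A check like this is only advisory, since permissions can change between the
test and the actual write, but it lets a long OCR job fail early.
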
v4.3.5
======

- Update documentation to confirm Python 3.6.0 compatibility. No code
  changes were needed, so many earlier OCRmyPDF versions likely work
  with Python 3.6 as well.

v4.3.4
======

- Fixed "decimal.InvalidOperation: quantize result has too many digits"
  for high DPI images

v4.3.3
======

- Fixed PDF/A creation with Ghostscript 9.20 properly
- Fixed an exception on inline stencil masks with a missing optional
  parameter

v4.3.2
======

- Fixed a PDF/A creation issue with Ghostscript 9.20 (note: this fix
  did not actually work)

v4.3.1
======

- Fixed an issue where pages produced by the "hocr" renderer after a
  Tesseract timeout would be rotated incorrectly if the input page was
  rotated with a /Rotate marker
- Fixed a file handle leak in LeptonicaErrorTrap that would cause a
  "too many open files" error for files around a hundred pages long
  when ``--deskew`` or ``--remove-background`` or other Leptonica-based
  image processing features were in use, depending on the system
  value of ``ulimit -n``
- Ability to specify multiple languages for multilingual documents is
  now advertised in documentation
- Reduced the file sizes of some test resources
- Cleaned up debug output
- Tesseract caching in test cases is now more cautious about false
  cache hits and reproducing exact output, not that any problems were
  observed

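The "too many open files" error in v4.3.1 depends on the per-process limit
that ``ulimit -n`` reports. The sketch below reads that limit with Python's
standard ``resource`` module (Unix only). The leak model is an assumption made
for illustration: ``handles_per_page`` and ``reserve`` are hypothetical
parameters, not measured values from OCRmyPDF.

```python
import resource


def open_file_limit() -> int:
    """Return the soft limit on open file descriptors (``ulimit -n``)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def may_exhaust_handles(pages: int, handles_per_page: int = 1,
                        reserve: int = 16) -> bool:
    """Estimate whether leaking handles on every page would hit the limit.

    `reserve` stands in for descriptors the process already holds.
    """
    limit = open_file_limit()
    if limit == resource.RLIM_INFINITY:
        return False
    return pages * handles_per_page + reserve > limit
```

With a low limit, even a modest leak per page is enough to make a document of
about a hundred pages fail, which is consistent with the symptom described.
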
v4.3
====

- New feature ``--remove-background`` to detect and erase the
  background of color and grayscale images
- Better documentation
- Fixed an issue with PDFs that draw images when the raster stack depth
  is zero
- ocrmypdf can now redirect its output to stdout for use in a shell
  pipeline

  - This does not improve performance since temporary files are still
    used for buffering
  - Some output validation is disabled in this mode

v4.2.5
======

- Fixed an issue (:issue:`100`) with PDFs that omit the optional
  /BitsPerComponent parameter on images
- Removed non-free file milk.pdf

v4.2.4
======

- Fixed an error (:issue:`90`) caused by PDFs that use stencil masks
  properly
- Fixed handling of PDFs that try to draw images or stencil masks
  without properly setting up the graphics state (such images are now
  ignored for the purposes of calculating DPI)

v4.2.3
======

- Fixed an issue with PDFs that store page rotation (/Rotate) in an
  indirect object
- Integrated a few fixes to simplify downstream packaging (Debian)

  - The test suite no longer assumes it is installed
  - If running Linux, skip a test that passes Unicode on the command
    line

- Added a test case to check explicit masks and stencil masks
- Added a test case for indirect objects and linearized PDFs
- Deprecated the OCRmyPDF.sh shell script

v4.2.2
======

- Improvements to documentation

v4.2.1
======

- Fixed an issue where PDF pages that contained stencil masks would
  report an incorrect DPI and cause Ghostscript to abort
- Implemented stdin streaming

v4.2
====

- ocrmypdf will now try to convert single image files to PDFs if they
  are provided as input (:issue:`15`)

  - This is a basic convenience feature. It only supports a single
    image and always makes the image fill the whole page.
  - For better control over image to PDF conversion, use ``img2pdf``
    (one of ocrmypdf's dependencies)

- New argument ``--output-type {pdf|pdfa}`` allows disabling
  Ghostscript PDF/A generation

  - ``pdfa`` is the default, consistent with past behavior
  - ``pdf`` provides a workaround for users concerned about the
    increase in file size from Ghostscript forcing JBIG2 images to
    CCITT and transcoding JPEGs
  - ``pdf`` preserves as much as it can about the original file,
    including problems that PDF/A conversion fixes

- PDFs containing images with "non-square" pixel aspect ratios, such as
  200x100 DPI, are now handled and converted properly (fixing a bug
  that caused them to be cropped)
- ``--force-ocr`` rasterizes pages even if they contain no images

  - supports users who want to use OCRmyPDF to reconstruct text
    information in PDFs with damaged Unicode maps (copy and paste text
    does not match displayed text)
  - supports reinterpreting PDFs where text was rendered as curves for
    printing, and text needs to be recovered
  - fixes :issue:`82`

- Fixes an issue where, with certain settings, monochrome images in
  PDFs would be converted to 8-bit grayscale, increasing file size
  (:issue:`79`)
- Support for Ubuntu 12.04 LTS "precise" has been dropped in favor of
  (roughly) Ubuntu 14.04 LTS "trusty"

  - Some Ubuntu "PPAs" (backports) are needed to make it work

- Support for some older dependencies dropped

  - Ghostscript 9.15 or later is now required (available in Ubuntu
    trusty with backports)
  - Tesseract 3.03 or later is now required (available in Ubuntu
    trusty)

- Ghostscript now runs in "safer" mode where possible

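To illustrate why the "non-square" pixel aspect ratios mentioned for v4.2 need
separate handling, the physical page size has to be computed per axis. The
sketch below assumes 72 points per inch (the default PDF user space unit);
``page_size_points`` is a hypothetical helper, not part of OCRmyPDF.

```python
from fractions import Fraction

POINTS_PER_INCH = 72  # default PDF user space unit


def page_size_points(width_px, height_px, xdpi, ydpi):
    """Convert pixel dimensions to PDF points using a separate DPI per axis."""
    width_pt = Fraction(width_px, xdpi) * POINTS_PER_INCH
    height_pt = Fraction(height_px, ydpi) * POINTS_PER_INCH
    return width_pt, height_pt


# A US Letter scan stored at 200x100 DPI: 1700 px wide, 1100 px tall.
correct = page_size_points(1700, 1100, 200, 100)
# Wrongly assuming square pixels (200 DPI on both axes) halves the height.
wrong = page_size_points(1700, 1100, 200, 200)
```

Here ``correct`` is 612 by 792 points (8.5 by 11 inches), while ``wrong`` is
612 by 396 points, which is the kind of mismatch that leads to cropped or
distorted output.
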
v4.1.4
======

- Bug fix: monochrome images with an ICC profile attached were
  incorrectly converted to full color images if lossless reconstruction
  was not possible due to other settings; consequence was increased
  file size for these images

v4.1.3
======

- More helpful error message for PDFs with version 4 security handler
- Update usage instructions for Windows/Docker users
- Fixed order of operations for matrix multiplication (no effect on most
  users)
- Add a few leptonica wrapper functions (no effect on most users)

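The v4.1.3 note about the order of matrix multiplication matters because
affine transforms do not commute. The sketch below is purely illustrative and
is not taken from OCRmyPDF: it assumes the PDF convention of six-number
matrices ``(a, b, c, d, e, f)`` applied to row vectors, and the helper names
``matmul`` and ``transform_point`` are hypothetical.

```python
def matmul(m1, m2):
    """Multiply two PDF-style affine matrices given as (a, b, c, d, e, f).

    The result applies m1 first, then m2 (row-vector convention).
    """
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def transform_point(m, x, y):
    """Apply an affine matrix to the point (x, y)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


scale = (2, 0, 0, 2, 0, 0)         # double both axes
translate = (1, 0, 0, 1, 10, 0)    # shift 10 units along x
scale_then_translate = matmul(scale, translate)
translate_then_scale = matmul(translate, scale)
```

Applied to the point (1, 1), ``scale_then_translate`` gives (12, 2) while
``translate_then_scale`` gives (22, 2), so swapping the operands changes where
content lands on the page.
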
v4.1.2
======

- Replace IEC sRGB ICC profile with Debian's sRGB (from
  icc-profiles-free) which is more compatible with the MIT license
- More helpful error message for an error related to certain types of
  malformed PDFs

v4.1
====

- ``--rotate-pages`` now only rotates pages when reasonably confident
  in the orientation. This behavior can be adjusted with the new
  argument ``--rotate-pages-threshold``
- Fixed problems in error checking if ``unpaper`` is uninstalled or
  missing at run-time
- Fixed problems with "RethrownJobError" errors during error handling
  that suppressed the useful error messages

v4.0.7
======

- Minor correction to Ghostscript output settings

v4.0.6
======

- Update install instructions
- Provide an sRGB profile instead of using Ghostscript's

v4.0.5
======

- Remove some verbose debug messages from v4.0.4
- Fixed a temporary file that wasn't being deleted
- DPI is now calculated correctly for cropped images, along with other
  image transformations
- Inline images are now checked during DPI calculation instead of
  rejecting the image

v4.0.4
======

Released with verbose debug messages turned on. Do not use. Skip to
v4.0.5.

v4.0.3
======

New features

- Page orientations detected are now reported in a summary comment

Fixes

- Show stack trace if unexpected errors occur
- Treat "too few characters" error message from Tesseract as a reason
  to skip that page rather than abort the file
- Docker: fix blank JPEG2000 issue by insisting on Ghostscript versions
  that have this fixed

v4.0.2
======

Fixes

- Fixed compatibility with Tesseract 3.04.01 release, particularly its
  different way of outputting orientation information
- Improved handling of Tesseract errors and crashes
- Fixed use of chmod on Docker that broke most test cases

v4.0.1
======

Fixes

- Fixed a KeyError if tesseract fails to find page orientation
  information

v4.0
====

New features

- Automatic page rotation (``-r``) is now available. It ignores
  any prior rotation information on PDFs and sets rotation based on the
  dominant orientation of detectable text. This feature is fairly
  reliable but some false positives occur especially if there is not
  much text to work with. (:issue:`4`)
- Deskewing is now performed using Leptonica instead of unpaper.
  Leptonica is faster and more reliable at image deskewing than
  unpaper.

Fixes

- Fixed an issue where lossless reconstruction could cause some pages
  to appear incorrectly if the page was rotated by the user in
  Acrobat after being scanned (specifically if it had a /Rotate tag)
- Fixed an issue where lossless reconstruction could misalign the
  graphics layer with respect to text layer if the page had been
  cropped such that its origin is not (0, 0) (:issue:`49`)

Changes

- Logging output is now much easier to read
- ``--deskew`` is now performed by Leptonica instead of unpaper
  (:issue:`25`)
- libffi is now required
- Some changes were made to the Docker and Travis build environments to
  support libffi
- ``--pdf-renderer=tesseract`` now displays a warning if the Tesseract
  version is less than 3.04.01, the planned release that will include
  fixes to an important OCR text rendering bug in Tesseract 3.04.00.
  You can also manually install ./share/sharp2.ttf on top of pdf.ttf in
  your Tesseract tessdata folder to correct the problem.

v3.2.1
======

Changes

- Fixed :issue:`47` "convert() got an unexpected keyword argument 'dpi'" by
  upgrading to img2pdf 0.2
- Tweaked the Dockerfiles

v3.2
====

New features

- Lossless reconstruction: when possible, OCRmyPDF will inject text
  layers without otherwise manipulating the content and layout of a PDF
  page. For example, a PDF containing a mix of vector and raster
  content would see the vector content preserved. Images may still be
  transcoded during PDF/A conversion. (``--deskew`` and
  ``--clean-final`` disable this mode, necessarily.)
- New argument ``--tesseract-pagesegmode`` allows you to pass page
  segmentation arguments to Tesseract OCR. This helps for two column
  text and other situations that confuse Tesseract.
- Added a new "polyglot" version of the Docker image, which builds
  Tesseract with all language packs installed, for the polyglots among
  us. It is much larger.

Changes

- JPEG transcoding quality is now 95 instead of the default 75. Bigger
  file sizes for less degradation.

v3.1.1
======

Changes

- Fixed bug that caused incorrect page size and DPI calculations on
  documents with mixed page sizes

v3.1
====

Changes

- Default output format is now PDF/A-2b instead of PDF/A-1b
- Python 3.5 and macOS El Capitan are now supported platforms - no
  changes were needed to implement support
- Improved some error messages related to missing input files
- Fixed :issue:`20`: uppercase .PDF extension not accepted
- Fixed an issue where OCRmyPDF failed to detect that certain pages
  contained previously OCR'ed text, such as OCR text produced by
  Tesseract 3.04
- Inserts /Creator tag into PDFs so that errors can be traced back to
  this project
- Added new option ``--pdf-renderer=auto``, to let OCRmyPDF pick the
  best PDF renderer. Currently it always chooses the 'hocrtransform'
  renderer but that behavior may change.
- Set up Travis CI automatic integration testing

v3.0
====

New features

- Easier installation with a Docker container or Python's ``pip``
  package manager
- Eliminated many external dependencies, so it's easier to set up
- Now installs ``ocrmypdf`` to ``/usr/local/bin`` or equivalent for
  system-wide access and easier typing
- Improved command line syntax and usage help (``--help``)
- Tesseract 3.03+ PDF page rendering can be used instead for better
  positioning of recognized text (``--pdf-renderer tesseract``)
- PDF metadata (title, author, keywords) is now transferred to the
  output PDF
- PDF metadata can also be set from the command line (``--title``,
  etc.)
- Automatically repairs malformed input PDFs if possible
- Added test cases to confirm everything is working
- Added option to skip extremely large pages that take too long to OCR
  and are often not OCRable (e.g. large scanned maps or diagrams);
  other pages are still processed (``--skip-big``)
- Added option to kill Tesseract OCR process if it seems to be taking
  too long on a page, while still processing other pages
  (``--tesseract-timeout``)
- Less common colorspaces (CMYK, palette) are now supported by
  conversion to RGB
- Multiple images on the same PDF page are now supported

Changes

- New, robust rewrite in Python 3.4+ with
  `ruffus <http://www.ruffus.org.uk/index.html>`__ pipelines
- Now uses Ghostscript 9.14's improved color conversion model to
  preserve PDF colors
- OCR text is now rendered in the PDF as invisible text. Previous
  versions of OCRmyPDF incorrectly rendered visible text with an image
  on top.
- All "tasks" in the pipeline can be executed in parallel on any
  available CPUs, increasing performance
- The ``-o DPI`` argument has been phased out, in favor of
  ``--oversample DPI``, in case we need ``-o OUTPUTFILE`` in the future
- Removed several dependencies, so it's easier to install. We no longer
  use:

  - GNU `parallel <https://www.gnu.org/software/parallel/>`__
  - `ImageMagick <http://www.imagemagick.org/script/index.php>`__
  - Python 2.7
  - Poppler
  - `MuPDF <http://mupdf.com/docs/>`__ tools
  - shell scripts
  - Java and `JHOVE <http://jhove.sourceforge.net/>`__
  - libxml2

- Some new external dependencies are required or optional, compared to
  v2.x:

  - Ghostscript 9.14+
  - `qpdf <http://qpdf.sourceforge.net/>`__ 5.0.0+
  - `Unpaper <https://github.com/Flameeyes/unpaper>`__ 6.1 (optional)
  - some automatically managed Python packages

Release candidates

- rc9:

  - Fix :issue:`118`: report error if ghostscript iccprofiles are missing
  - fixed another issue related to :issue:`111`: PDF rasterized to
    palette file
  - add support for image files with a palette
  - don't try to validate PDF file after an exception occurs

- rc8:

  - Fix :issue:`111`: exception thrown if PDF is missing DocumentInfo
    dictionary

- rc7:

  - fix error when installing direct from pip, "no such file
    'requirements.txt'"

- rc6:

  - dropped libxml2 (Python lxml) since Python 3's internal XML parser
    is sufficient
  - set up Docker container
  - fix Unicode errors if recognized text contains Unicode characters
    and system locale is not UTF-8

- rc5:

  - dropped Java and JHOVE in favour of qpdf
  - improved command line error output
  - additional tests and bug fixes
  - tested on Ubuntu 14.04 LTS

- rc4:

  - dropped MuPDF in favour of qpdf
  - fixed some installer issues and errors in installation
    instructions
  - improve performance: run Ghostscript with multithreaded rendering
  - improve performance: use multiple cores by default
  - bug fix: checking for wrong exception on process timeout

- rc3: skipping version number intentionally to avoid confusion with
  Tesseract
- rc2: first release for public testing to test-PyPI, GitHub
- rc1: testing release process

Compatibility notes
===================

- ``./OCRmyPDF.sh`` script is still available for now
- Stacking the verbosity option like ``-vvv`` is no longer supported
- The configuration file ``config.sh`` has been removed. Instead, you
  can pass a file of arguments for common settings:

::

   ocrmypdf input.pdf output.pdf @settings.txt

where ``settings.txt`` contains *one argument per line*, for example:

::

   -l
   deu
   --author
   A. Merkel
   --pdf-renderer
   tesseract

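The ``@settings.txt`` convention above matches the behavior of Python's
standard ``argparse`` module when ``fromfile_prefix_chars="@"`` is set: each
line of the file becomes exactly one argument, which is why ``A. Merkel`` can
contain a space. The sketch below is a stand-alone illustration under that
assumption; the parser and its ``ocrmypdf-sketch`` program name are
hypothetical and only mirror the options shown in the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small parser that expands @file arguments, one per line."""
    parser = argparse.ArgumentParser(
        prog="ocrmypdf-sketch",        # hypothetical name for this sketch
        fromfile_prefix_chars="@",     # "@settings.txt" is read as arguments
    )
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    parser.add_argument("-l", "--language")
    parser.add_argument("--author")
    parser.add_argument("--pdf-renderer")
    return parser
```

Calling ``build_parser().parse_args(["input.pdf", "output.pdf",
"@settings.txt"])`` with the file shown above would yield the language
``deu``, the author ``A. Merkel``, and the renderer ``tesseract``.
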
Fixes

- Handling of filenames containing spaces: fixed

Notes and known issues

- Some dependencies may work with lower versions than tested, so try
  overriding dependencies if they are "in the way" to see if they work.
- ``--pdf-renderer tesseract`` will output files with an incorrect page
  size in Tesseract 3.03, due to a bug in Tesseract.
- PDF files containing "inline images" are not supported and won't be
  for the 3.0 release. Scanned images almost never contain inline
  images.

v2.2-stable (2014-09-29)
========================

OCRmyPDF versions 1 and 2 were implemented as shell scripts. OCRmyPDF
3.0+ is a fork that gradually replaced all shell scripts with Python
while maintaining the existing command line arguments. No one is
maintaining old versions.

For details on older versions, see the `final version of its release
notes <https://github.com/fritz-hh/OCRmyPDF/blob/7fd3dbdf42ca53a619412ce8add7532c5e81a9d1/RELEASE_NOTES.md>`__.