3117 Commits

Author SHA1 Message Date
James R. Barlow
0323738ada ocrmypdf.fish: fix indents
[ci skip]
2021-12-06 15:38:27 -08:00
FPille
aae5591f7e Update ocrmypdf.bash completion
Squashed commit of the following:

commit 974de2e8ccad7fd34694f2c3a7a17c64bb52cdab
Merge: a8d7f969 ee04aa72
Author: James R. Barlow <james@purplerock.ca>
Date:   Sat Dec 4 20:22:50 2021 -0800

    Merge branch 'update_bash-completion' of git://github.com/FPille/OCRmyPDF into FPille-update_bash-completion

commit ee04aa722504272891d8c74171f1de9bc954ca09
Author: FPille <f.pille@gmail.com>
Date:   Thu Oct 14 11:09:23 2021 +0200

    update

commit 76f64537aa5549278483ce338fe03764d0ce8065
Author: FPille <f.pille@gmail.com>
Date:   Thu Oct 14 11:04:10 2021 +0200

    updated and descriptions for arguments and choices added
    deprecated arguments removed
    bug fix: typo "_init_completion" instead of "_init_completions"

commit de9b93e852b3a6aca29b77ff7bdf433a07b42794
Merge: c23374de 42713b77
Author: Frank <50119297+FPille@users.noreply.github.com>
Date:   Thu Oct 14 08:08:11 2021 +0200

    Merge branch 'jbarlow83:master' into master

commit c23374de818edddb789073251386e5ee1cfaef84
Merge: 40b2ebcb c409fa58
Author: Frank <50119297+FPille@users.noreply.github.com>
Date:   Wed May 26 20:31:00 2021 +0200

    Merge branch 'jbarlow83:master' into master

commit 40b2ebcb37b6a21845e2733d4ad8078c09d08d0a
Merge: 79c84eef 7e388f59
Author: Frank <50119297+FPille@users.noreply.github.com>
Date:   Sat Jun 1 11:09:07 2019 +0200

    Merge pull request #1 from jbarlow83/master

    update master
2021-12-06 15:38:26 -08:00
James R. Barlow
4c1ff1086c tess cache: don't include full platform - could be sensitive 2021-12-06 15:38:26 -08:00
James R. Barlow
f91faf9795 Add new argument --tesseract-thresholding to control tesseract thresholding where available
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
793cc33a90 Whitespace 2021-12-04 16:07:34 -08:00
James R. Barlow
fbd72efd45 build: typo v13.0.0 2021-12-04 01:41:31 -08:00
James R. Barlow
1115923995 build: address checksum error from choco 2021-12-04 01:26:38 -08:00
James R. Barlow
8478d67b28 Merge branch 'release/v13' of github.com:jbarlow83/OCRmyPDF into release/v13 2021-11-15 16:38:11 -08:00
James R. Barlow
c75ff4687a Turning on Ghostscript interpolation changes this test
Seems acceptable. We don't normally use Ghostscript to downsample PDFs
like is happening in this test.
2021-11-15 16:36:24 -08:00
mara004
312c1e51b5 [ci skip] minor corrections to maintainers.rst (#858) 2021-11-15 15:13:12 -08:00
James R. Barlow
cfe2bb25ba Merge commit 'cd49e70154f82f54bf74fc5bb2586fe7e0358971' into release/v13 2021-11-15 00:33:34 -08:00
Tristan Porteries
cd49e70154 ghostscript: force interpolation when rendering (#855)
Specifying the --oversample option tends to introduce upsampling during
rendering, because the page is rasterized at a higher DPI.

This upsampling improves OCR results, but a good choice of interpolation
method can improve OCR quality even further.

Ghostscript seems to use nearest-neighbor interpolation by default for PDFs.
This method doesn't average newly introduced pixels with the original pixels,
resulting in a nearly identical image, just with more pixels.

Providing -dInterpolateControl=-1 forces interpolation on.

In this commit the above option is passed to all Ghostscript rendering
calls.

In testing, rendering a page at the same DPI with interpolation
enabled does not introduce significant time overhead.

time (repeat 40 gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
	-dFirstPage=1 -dLastPage=1 -r100.000000x100.000000 \
	-dInterpolateControl=-1 -o /dev/null -dAutoRotatePages=/None -f pzII.pdf)
7,66s user 0,33s system 99% cpu 8,012 total

time (repeat 40 gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
	-dFirstPage=1 -dLastPage=1 -r100.000000x100.000000 \
	-o /dev/null -dAutoRotatePages=/None -f pzII.pdf)
7,42s user 0,39s system 99% cpu 7,808 total

Ghostscript interpolation control reference:
https://www.ghostscript.com/doc/current/Use.htm
2021-11-15 00:32:58 -08:00
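
The difference between nearest-neighbor and averaging interpolation described in the commit above can be illustrated on a single row of pixels. This is only a minimal sketch: the function names are hypothetical, and it is not Ghostscript's actual resampling code, just a toy model of the two behaviours.

```python
def upsample_nearest(row, factor):
    """Repeat each pixel `factor` times; no new intensity values appear."""
    return [value for value in row for _ in range(factor)]


def upsample_linear(row, factor):
    """Insert linearly interpolated values between neighbouring pixels."""
    out = []
    for i in range(len(row) - 1):
        left, right = row[i], row[i + 1]
        for step in range(factor):
            t = step / factor
            out.append(left + (right - left) * t)
    out.append(row[-1])  # keep the final original pixel
    return out


row = [0, 255]  # a hard black-to-white edge
nearest = upsample_nearest(row, 4)
linear = upsample_linear(row, 4)
print(nearest)
print(linear)
```

With nearest-neighbor the edge stays a hard step made of more pixels, while the linear variant introduces intermediate grey levels, which is the kind of smoothing the commit message credits with better OCR input.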
James R. Barlow
7ce1692eef windows: default version to '0' when looking for Ghostscript
To avoid ValueError: max() arg is an empty sequence

As suggested by @meet1919 in #833.
2021-11-14 23:00:08 -08:00
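
The ValueError mentioned above is what Python's built-in max() raises on an empty iterable, and its `default` keyword is the standard way to avoid it. A minimal sketch follows; the helper name and the version strings are hypothetical and not taken from the OCRmyPDF source.

```python
def highest_version(found_versions):
    """Return the highest dotted version string, or '0' if none were found.

    Hypothetical helper for illustration only.
    """
    return max(
        found_versions,
        key=lambda v: tuple(int(part) for part in v.split('.')),
        default='0',  # returned as-is when the iterable is empty
    )


# Without a default, an empty sequence raises ValueError.
try:
    max([])
except ValueError as exc:
    error_message = str(exc)

print(highest_version([]))
print(highest_version(['9.55', '10.0']))
```

The `key` function is only applied to real candidates, so the `'0'` fallback never needs to be parseable by it.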
James R. Barlow
7959f7628d pyproject: tell black to target py37 2021-11-14 15:49:01 -08:00
James R. Barlow
4634b20de5 Raise max-image-mpixels again
PDFs are quite likely to have a lot of pixels, e.g. large high-resolution scans.
250 MP is a page of A0-sized paper scanned at 400 DPI,
which should be enough in most cases.
2021-11-14 15:47:39 -08:00
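
The 250 MP figure in the commit above can be checked with a little arithmetic. The sketch below assumes the ISO 216 dimensions of A0 (841 mm x 1189 mm); the function name is hypothetical.

```python
MM_PER_INCH = 25.4
A0_MM = (841, 1189)  # ISO 216 A0 sheet, width x height in millimetres


def megapixels(size_mm, dpi):
    """Pixel count, in millions, of a sheet scanned at the given DPI."""
    width_px = size_mm[0] / MM_PER_INCH * dpi
    height_px = size_mm[1] / MM_PER_INCH * dpi
    return width_px * height_px / 1e6


mp = megapixels(A0_MM, 400)
print(f"A0 at 400 DPI: {mp:.1f} MP")  # just under 250 MP
```

An A0 page at 400 DPI is roughly 13,244 x 18,724 pixels, so a 250 MP limit sits just above it.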
James R. Barlow
3810e576ff optimize: fix mypy lint 2021-11-13 14:48:00 -08:00
James R. Barlow
01c7895044 pipeline: tidy 2021-11-13 14:47:49 -08:00
James R. Barlow
fdc6aa03fb docs: new maintainer notes 2021-11-13 14:29:30 -08:00
James R. Barlow
25cc17ee03 v13 release notes (2) v13.0.0rc1 2021-11-13 02:02:04 -08:00
James R. Barlow
e8098a1475 Dockerfile: remove requirements/ 2021-11-13 01:57:17 -08:00
James R. Barlow
6b773883dc build: use latest pip and wheel in all cases 2021-11-13 01:57:03 -08:00
James R. Barlow
4ed9622335 v13 release notes 2021-11-13 01:37:38 -08:00
James R. Barlow
acc9d58c39 Skip no language test for Tess 5 2021-11-13 01:37:27 -08:00
James R. Barlow
659e738f92 Remove some 'liblept' references we no longer need 2021-11-13 01:22:09 -08:00
James R. Barlow
7b3d7ca92a ghostscript: choco doesn't put Ghostscript on PATH anymore
It seems that Chocolatey no longer puts gswin[32,64]c on PATH,
so compensate.
2021-11-13 01:18:12 -08:00
James R. Barlow
e3126d2806 Adjust test to support Tesseract 5 working harder to find its files 2021-11-13 01:16:35 -08:00
James R. Barlow
45020a7fcd build: tweak CI 2021-11-13 00:56:49 -08:00
James R. Barlow
f51164aff8 Upgrade test version of pymupdf 2021-11-13 00:53:41 -08:00
James R. Barlow
6f58a14351 pdfa: remove deprecated pkg_resources based access and tests 2021-11-13 00:52:03 -08:00
James R. Barlow
7ba04267b1 Remove shims to support old versions of pikepdf < 4 2021-11-13 00:43:20 -08:00
James R. Barlow
9749564313 Remove requirements/*.txt - use pip install ocrmypdf[etc] instead 2021-11-13 00:31:42 -08:00
James R. Barlow
698e8791d7 Remove Python 3.6 specific unicode environment checks 2021-11-13 00:28:52 -08:00
James R. Barlow
380b981763 Remove most Python 3.6 special casing 2021-11-13 00:27:48 -08:00
James R. Barlow
5abfb14c2a Remove leptonica and cffi 2021-11-13 00:06:35 -08:00
James R. Barlow
036afc4d88 Update cache, related to previous apparently 2021-11-12 23:57:50 -08:00
James R. Barlow
59642a98b2 Disable --remove-background so we can remove leptonica 2021-11-12 23:56:52 -08:00
James R. Barlow
f8c6be2e26 test_rotation: replace leptonica test with Pillow channel ops
New function is likely not as robust but seems capable of inexact image comparison.
2021-11-12 23:49:38 -08:00
James R. Barlow
42bf5476dd optimize: replace leptonica compdata with direct insert of JPEG
Confirmed that img2pdf just inserts JPEG verbatim. Never had to go through
the trouble we did.
2021-11-12 23:20:49 -08:00
James R. Barlow
30440104ba Remove --threshold argument
Tesseract v5 now includes better thresholding (binarization). Users who have
thresholding issues should try that first. If we find further problems,
this can be brought back as a plugin.
2021-11-12 20:09:55 -08:00
James R. Barlow
b159e02110 Convert deskew to use degrees, since all our other angles are in degrees 2021-11-12 16:40:51 -08:00
James R. Barlow
a55ab05d16 Replace leptonica deskew with tesseract find skew and pillow rotate
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
25d046ae95 Modernize OrientationConfidence definition 2021-11-10 00:55:23 -08:00
James R. Barlow
01b0f76e36 Remove duplicate definition of OrientationConfidence 2021-11-10 00:55:07 -08:00
James R. Barlow
d74d315e8b docs: update some old OS versions 2021-11-10 00:30:48 -08:00
James R. Barlow
8be9a68c5e v12.7.2 release notes v12.7.2 2021-11-04 00:20:25 -07:00
James R. Barlow
6c34d59836 tesseract: yet another version variant 2021-11-04 00:17:18 -07:00
James R. Barlow
386453d178 pdfa: replace read_binary() with files() 2021-10-31 02:01:11 -07:00
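
For context, `importlib.resources.files()` (Python 3.9+) returns a traversable object from which package data can be read, replacing the older `read_binary()` helper. The sketch below builds a throwaway package so it is self-contained; the package name `demo_pkg` and the file `data.bin` are made up for illustration and are unrelated to OCRmyPDF's actual resources.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create a throwaway package containing one binary data file.
    pkg_dir = Path(tmp) / "demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "data.bin").write_bytes(b"\x00\x01icc")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # Older style: importlib.resources.read_binary("demo_pkg", "data.bin")
        # Newer style: navigate from files() and read the bytes directly.
        resource = importlib.resources.files("demo_pkg").joinpath("data.bin")
        data = resource.read_bytes()
    finally:
        # Undo the temporary import state.
        sys.path.remove(tmp)
        sys.modules.pop("demo_pkg", None)

print(data)
```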
James R. Barlow
615a7561b5 tesseract: tidy some uses of str paths instead of Path 2021-10-31 02:01:05 -07:00
James R. Barlow
c4c64c3ea0 pre-commit updates 2021-10-31 01:31:57 -07:00
James R. Barlow
21279f5784 Fix leaked file handle for output_type none 2021-10-28 02:50:17 -07:00