James R. Barlow
0323738ada
ocrmypdf.fish: fix indents
...
[ci skip]
2021-12-06 15:38:27 -08:00
FPille
aae5591f7e
Update ocrmypdf.bash completion
...
Squashed commit of the following:
commit 974de2e8ccad7fd34694f2c3a7a17c64bb52cdab
Merge: a8d7f969 ee04aa72
Author: James R. Barlow <james@purplerock.ca>
Date: Sat Dec 4 20:22:50 2021 -0800
Merge branch 'update_bash-completion' of git://github.com/FPille/OCRmyPDF into FPille-update_bash-completion
commit ee04aa722504272891d8c74171f1de9bc954ca09
Author: FPille <f.pille@gmail.com>
Date: Thu Oct 14 11:09:23 2021 +0200
update
commit 76f64537aa5549278483ce338fe03764d0ce8065
Author: FPille <f.pille@gmail.com>
Date: Thu Oct 14 11:04:10 2021 +0200
updated and descriptions for arguments and choices added
deprecated arguments removed
bug fix: typo "_init_completion" instead of "_init_completions"
commit de9b93e852b3a6aca29b77ff7bdf433a07b42794
Merge: c23374de 42713b77
Author: Frank <50119297+FPille@users.noreply.github.com>
Date: Thu Oct 14 08:08:11 2021 +0200
Merge branch 'jbarlow83:master' into master
commit c23374de818edddb789073251386e5ee1cfaef84
Merge: 40b2ebcb c409fa58
Author: Frank <50119297+FPille@users.noreply.github.com>
Date: Wed May 26 20:31:00 2021 +0200
Merge branch 'jbarlow83:master' into master
commit 40b2ebcb37b6a21845e2733d4ad8078c09d08d0a
Merge: 79c84eef 7e388f59
Author: Frank <50119297+FPille@users.noreply.github.com>
Date: Sat Jun 1 11:09:07 2019 +0200
Merge pull request #1 from jbarlow83/master
update master
2021-12-06 15:38:26 -08:00
James R. Barlow
4c1ff1086c
tess cache: don't include full platform - could be sensitive
2021-12-06 15:38:26 -08:00
James R. Barlow
f91faf9795
Add new argument --tesseract-thresholding to control tesseract thresholding where available
...
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
793cc33a90
Whitespace
2021-12-04 16:07:34 -08:00
James R. Barlow
fbd72efd45
build: typo
v13.0.0
2021-12-04 01:41:31 -08:00
James R. Barlow
1115923995
build: address checksum error from choco
2021-12-04 01:26:38 -08:00
James R. Barlow
8478d67b28
Merge branch 'release/v13' of github.com:jbarlow83/OCRmyPDF into release/v13
2021-11-15 16:38:11 -08:00
James R. Barlow
c75ff4687a
Turning on Ghostscript interpolation changes this test
...
Seems acceptable. We don't normally use Ghostscript to downsample PDFs
as happens in this test.
2021-11-15 16:36:24 -08:00
mara004
312c1e51b5
[ci skip] minor corrections to maintainers.rst (#858)
2021-11-15 15:13:12 -08:00
James R. Barlow
cfe2bb25ba
Merge commit 'cd49e70154f82f54bf74fc5bb2586fe7e0358971' into release/v13
2021-11-15 00:33:34 -08:00
Tristan Porteries
cd49e70154
ghostscript: force interpolation when rendering (#855)
...
Specifying the --oversample option tends to introduce upsampling during rendering
by rasterizing the page at a higher DPI.
This upsampling improves OCR results, but a good choice of interpolation
method can improve OCR quality even further.
Ghostscript seems to use nearest-neighbor interpolation by default for PDFs.
This method does not average newly introduced pixels with the original pixels,
resulting in an almost identical image with more pixels.
Passing -dInterpolateControl=-1 forces interpolation on.
In this commit, that option is passed to all Ghostscript rendering
calls.
In testing, rendering a page at the same DPI with interpolation
enabled does not introduce significant time overhead.
time (repeat 40 gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
-dFirstPage=1 -dLastPage=1 -r100.000000x100.000000 \
-dInterpolateControl=-1 -o /dev/null -dAutoRotatePages=/None -f pzII.pdf)
7,66s user 0,33s system 99% cpu 8,012 total
time (repeat 40 gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
-dFirstPage=1 -dLastPage=1 -r100.000000x100.000000 \
-o /dev/null -dAutoRotatePages=/None -f pzII.pdf)
7,42s user 0,39s system 99% cpu 7,808 total
Ghostscript interpolation control reference:
https://www.ghostscript.com/doc/current/Use.htm
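For illustration, the benchmark command line above could be assembled in Python roughly as follows. This is a minimal sketch, not OCRmyPDF's actual code: the function name `gs_render_args` and its parameters are hypothetical, and only flags that appear in the commit message are used.

```python
from typing import List


def gs_render_args(
    input_pdf: str,
    output: str,
    page: int,
    dpi: float,
    interpolate: bool = True,
    device: str = "png16m",
) -> List[str]:
    """Build a Ghostscript argument list modeled on the benchmark command.

    Hypothetical helper for illustration; it only assembles arguments
    and does not run Ghostscript.
    """
    args = [
        "gs",
        "-dQUIET",
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        f"-sDEVICE={device}",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-r{dpi:f}x{dpi:f}",
    ]
    if interpolate:
        # The option this commit adds to every rendering call.
        args.append("-dInterpolateControl=-1")
    args += ["-o", output, "-dAutoRotatePages=/None", "-f", input_pdf]
    return args


with_interp = gs_render_args("pzII.pdf", "/dev/null", page=1, dpi=100)
without_interp = gs_render_args(
    "pzII.pdf", "/dev/null", page=1, dpi=100, interpolate=False
)
```

The two lists differ only by the `-dInterpolateControl=-1` entry, which mirrors the two timed commands in the message.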
2021-11-15 00:32:58 -08:00
James R. Barlow
7ce1692eef
windows: default version to '0' when looking for Ghostscript
...
To avoid ValueError: max() arg is an empty sequence
As suggested by @meet1919 in #833.
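The error named above is what Python's built-in `max()` raises on an empty iterable unless a `default` is supplied. A small sketch of the pattern, assuming version strings made of dot-separated integers; `version_key` and `newest_version` are hypothetical names, not functions from OCRmyPDF:

```python
from typing import Iterable, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '9.53.3' into (9, 53, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_version(found: Iterable[str]) -> str:
    """Return the highest version found, or '0' when nothing was found.

    Without default=..., max() raises
    'ValueError: max() arg is an empty sequence' on empty input.
    """
    return max(found, key=version_key, default="0")


nothing_installed = newest_version([])
several_installed = newest_version(["9.53", "10.0", "9.6"])
```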
2021-11-14 23:00:08 -08:00
James R. Barlow
7959f7628d
pyproject: tell black to target py37
2021-11-14 15:49:01 -08:00
James R. Barlow
4634b20de5
Raise max-image-mpixels again
...
PDFs are quite likely to have a lot of pixels, e.g. large high resolution scans.
250 MP is a page of A0 sized paper scanned at 400 DPI,
should be enough in most cases.
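The A0 figure can be checked with a short calculation. The sketch below assumes the ISO 216 dimensions for A0 (841 mm by 1189 mm); the helper name `megapixels` is made up for this example.

```python
MM_PER_INCH = 25.4


def megapixels(width_mm: float, height_mm: float, dpi: float) -> float:
    """Pixel count, in megapixels, of a page scanned at the given DPI."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px / 1_000_000


# A0 is 841 mm x 1189 mm (ISO 216).
a0_at_400_dpi = megapixels(841, 1189, 400)
print(f"{a0_at_400_dpi:.1f} MP")
```

This comes to roughly 248 megapixels, just under the 250 MP limit.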
2021-11-14 15:47:39 -08:00
James R. Barlow
3810e576ff
optimize: fix mypy lint
2021-11-13 14:48:00 -08:00
James R. Barlow
01c7895044
pipeline: tidy
2021-11-13 14:47:49 -08:00
James R. Barlow
fdc6aa03fb
docs: new maintainer notes
2021-11-13 14:29:30 -08:00
James R. Barlow
25cc17ee03
v13 release notes (2)
v13.0.0rc1
2021-11-13 02:02:04 -08:00
James R. Barlow
e8098a1475
Dockerfile: remove requirements/
2021-11-13 01:57:17 -08:00
James R. Barlow
6b773883dc
build: use latest pip and wheel in all cases
2021-11-13 01:57:03 -08:00
James R. Barlow
4ed9622335
v13 release notes
2021-11-13 01:37:38 -08:00
James R. Barlow
acc9d58c39
Skip no language test for Tess 5
2021-11-13 01:37:27 -08:00
James R. Barlow
659e738f92
Remove some 'liblept' references we no longer need
2021-11-13 01:22:09 -08:00
James R. Barlow
7b3d7ca92a
ghostscript: choco doesn't put Ghostscript on PATH anymore
...
It seems that Chocolatey doesn't put gswin[32,64]c on PATH anymore,
so compensate.
2021-11-13 01:18:12 -08:00
James R. Barlow
e3126d2806
Adjust test to support Tesseract 5 working harder to find its files
2021-11-13 01:16:35 -08:00
James R. Barlow
45020a7fcd
build: tweak CI
2021-11-13 00:56:49 -08:00
James R. Barlow
f51164aff8
Upgrade test version of pymupdf
2021-11-13 00:53:41 -08:00
James R. Barlow
6f58a14351
pdfa: remove deprecated pkg_resources based access and tests
2021-11-13 00:52:03 -08:00
James R. Barlow
7ba04267b1
Remove shims to support old versions of pikepdf < 4
2021-11-13 00:43:20 -08:00
James R. Barlow
9749564313
Remove requirements/*.txt - use pip install ocrmypdf[etc] instead
2021-11-13 00:31:42 -08:00
James R. Barlow
698e8791d7
Remove Python 3.6 specific unicode environment checks
2021-11-13 00:28:52 -08:00
James R. Barlow
380b981763
Remove most Python 3.6 special casing
2021-11-13 00:27:48 -08:00
James R. Barlow
5abfb14c2a
Remove leptonica and cffi
2021-11-13 00:06:35 -08:00
James R. Barlow
036afc4d88
Update cache, related to previous apparently
2021-11-12 23:57:50 -08:00
James R. Barlow
59642a98b2
Disable --remove-background so we can remove leptonica
2021-11-12 23:56:52 -08:00
James R. Barlow
f8c6be2e26
test_rotation: replace leptonica test with Pillow channel ops
...
New function is likely not as robust but seems capable of inexact image comparison.
2021-11-12 23:49:38 -08:00
James R. Barlow
42bf5476dd
optimize: replace leptonica compdata with direct insert of JPEG
...
Confirmed that img2pdf just inserts JPEG verbatim. Never had to go through
the trouble we did.
2021-11-12 23:20:49 -08:00
James R. Barlow
30440104ba
Remove --threshold argument
...
Tesseract now includes better thresholding (binarization) in v5. Users that have
thresholding issues should try that first. If we find further problems
this can be brought back as a plugin.
2021-11-12 20:09:55 -08:00
James R. Barlow
b159e02110
Convert deskew to use degrees, since all our other angles are in degrees
2021-11-12 16:40:51 -08:00
James R. Barlow
a55ab05d16
Replace leptonica deskew with tesseract find skew and pillow rotate
...
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
25d046ae95
Modernize OrientationConfidence definition
2021-11-10 00:55:23 -08:00
James R. Barlow
01b0f76e36
Remove duplicate definition of OrientationConfidence
2021-11-10 00:55:07 -08:00
James R. Barlow
d74d315e8b
docs: update some old OS versions
2021-11-10 00:30:48 -08:00
James R. Barlow
8be9a68c5e
v12.7.2 release notes
v12.7.2
2021-11-04 00:20:25 -07:00
James R. Barlow
6c34d59836
tesseract: yet another version variant
2021-11-04 00:17:18 -07:00
James R. Barlow
386453d178
pdfa: replace read_binary() with files()
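As a rough illustration of the `files()` style of package resource access, the standard-library form (Python 3.9 and later) looks like the sketch below. The helper name `read_package_data` is hypothetical, and the `json` package is used only as a stand-in for a package that ships data files; which module OCRmyPDF imported `files()` from at the time is not stated in the commit message.

```python
from importlib.resources import files


def read_package_data(package: str, name: str) -> bytes:
    """Read a file shipped inside an importable package as bytes.

    files() returns a Traversable, so the resource is addressed with
    path-like operations, where the older API was
    importlib.resources.read_binary(package, name).
    """
    return files(package).joinpath(name).read_bytes()


# Stand-in example: read a file from the standard library's json package.
data = read_package_data("json", "__init__.py")
```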
2021-10-31 02:01:11 -07:00
James R. Barlow
615a7561b5
tesseract: tidy some uses of str paths instead of Path
2021-10-31 02:01:05 -07:00
James R. Barlow
c4c64c3ea0
pre-commit updates
2021-10-31 01:31:57 -07:00
James R. Barlow
21279f5784
Fix leaked file handle for output_type none
2021-10-28 02:50:17 -07:00