319 Commits

Author SHA1 Message Date
James R. Barlow
46d0978a09
Update version scripts to support Ghostscript 10.0 2022-10-03 21:59:31 -07:00
James R. Barlow
6dbaebdc0c
Merge branch 'master' into feature/drop-3.7 2022-09-15 23:00:27 -07:00
James R. Barlow
f4155dca77
tests: convert all uses of multipage.pdf to fixture 2022-08-11 01:13:10 -07:00
James R. Barlow
79db985181
Improve encryption tests; drop some public domain resources
Generate the encrypted files we need and remove special test files we retained for this.

Replace jbig2.pdf based on congress.jpg with version based on ccitt.pdf.
2022-08-06 14:37:45 -07:00
James R. Barlow
acc70036cc
Set minimum Tesseract to 4.1.1 2022-08-02 15:20:29 -07:00
James R. Barlow
67773da309
Drop support for Ghostscript <9.50 2022-08-02 15:01:10 -07:00
James R. Barlow
80ed2117cc
Change to SPDX license tracking 2022-07-28 01:10:07 -07:00
James R. Barlow
dc6f1a266a
Modernize type annotations 2022-07-23 00:39:24 -07:00
James R. Barlow
5d0cc0a092 tests: Extract some test fixtures for better clarity 2022-05-26 00:57:31 -07:00
James R. Barlow
6c427f82ea Add test case for corrupt ICC profiles 2022-05-26 00:41:19 -07:00
James Barlow
776ada6713 Upgrade pre-commit and associated tools; various lints 2022-04-03 20:53:01 -07:00
James R. Barlow
13af3252ff tests: simplify run_ocrmypdf API 2021-12-06 17:00:25 -08:00
James R. Barlow
6910c48b81 Fix test_outputtype_none on Windows and cleanup docs 2021-12-06 15:38:38 -08:00
James R. Barlow
f91faf9795 Add new argument --tesseract-thresholding to control tesseract thresholding where available
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
e3126d2806 Adjust test to support Tesseract 5 working harder to find its files 2021-11-13 01:16:35 -08:00
James R. Barlow
7ba04267b1 Remove shims that supported old versions of pikepdf < 4 2021-11-13 00:43:20 -08:00
James R. Barlow
380b981763 Remove most Python 3.6 special casing 2021-11-13 00:27:48 -08:00
James R. Barlow
59642a98b2 Disable --remove-background so we can remove leptonica 2021-11-12 23:56:52 -08:00
James R. Barlow
30440104ba Remove --threshold argument
Tesseract now includes better thresholding (binarization) in v5. Users who have
thresholding issues should try that first. If we find further problems,
this can be brought back as a plugin.
2021-11-12 20:09:55 -08:00
James R. Barlow
a55ab05d16 Replace leptonica deskew with tesseract find skew and pillow rotate
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
78f391536b Offer hint to user to use --max-image-mpixels after decompression bomb error
Closes #801
2021-10-06 00:19:11 -07:00
James R. Barlow
790d3022f6 Implement --output-type=none to skip producing the PDF and use only the sidecar
Closes #787
2021-09-26 01:07:34 -07:00
James R. Barlow
4eca0a165b pre-commit: pyupgrade modernizing 2021-08-26 18:04:38 -07:00
James R. Barlow
906d77b389
tests: remove obsolete running_in_travis() 2021-04-07 02:25:10 -07:00
James R. Barlow
aa115a8be3
Remove pytest_helpers_namespace 2021-04-07 01:56:51 -07:00
James R. Barlow
bb258fc99c
pdfinfo: Refactor pageinfo dictionary into a class 2020-12-24 01:47:53 -08:00
James R. Barlow
895fddd85e
Replace most uses of universal_newlines with text
The parameters are equivalent but the latter is better named. Since
Python 3.6 doesn't support text= we use our wrapper to add it in that
place.

This is for subprocess.run.
2020-11-07 00:48:08 -08:00
James R. Barlow
e6a7b58863 Merge branch 'de-gpl' 2020-08-12 12:20:38 -07:00
James R. Barlow
9b641055e1
Fix KeyError: 'dpi' when using --threshold on image to PDF
Fixes #607
2020-08-07 02:21:02 -07:00
James R. Barlow
aa0ec40102
Change license of all GPLv3 files to MPL-2.0
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
86a73191b0
Plugin manager: accept Path(plugin) 2020-06-30 04:17:30 -07:00
James R. Barlow
48e2750551
Fix some tests that were failing in Docker 2020-06-21 01:48:13 -07:00
James R. Barlow
892db88f0e
test_two_languages: use narrower test 2020-06-12 14:33:02 -07:00
James R. Barlow
393c5a9ea4 Fix error on -l lang1+lang2 2020-06-12 12:10:29 -07:00
James R. Barlow
0f942fb714 Rename ocrmypdf.exec -> ocrmypdf._exec 2020-06-09 14:59:09 -07:00
James R. Barlow
be8ca589d4
Move ocrmypdf.exec.run and friends to ocrmypdf.subprocess 2020-06-09 14:53:10 -07:00
James R. Barlow
b109445215
Move Ghostscript rasterize_pdf to plugin 2020-06-08 17:10:27 -07:00
James R. Barlow
a9a473f2e5 Convert all tesseract cache usages to plugin 2020-06-05 17:55:18 -07:00
James R. Barlow
6268e2faff
Begin replacing tests/spoof/tesseract_cache with plugin 2020-06-05 17:27:10 -07:00
James R. Barlow
1b92f447c3
Convert tesseract_crash to plugin 2020-06-02 02:36:41 -07:00
James R. Barlow
4f4ad0fb76
Convert tesseract_big_image_error to plugin 2020-06-02 01:49:47 -07:00
James R. Barlow
1598f2f0e5 Abolish spoof_tesseract_noop 2020-06-01 03:07:53 -07:00
James R. Barlow
2b23f7ec73
tesseract_noop: begin implementing with plugin 2020-06-01 02:45:49 -07:00
James R. Barlow
41eb54cc0a
Standardize tesseract.generate_hocr and _pdf parameters 2020-05-14 03:23:25 -07:00
James R. Barlow
12a2f78c4d
Fix validation of languages not using tesseract_env
And some related issues.
2020-05-14 03:19:22 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path 2020-05-06 02:53:47 -07:00
James R. Barlow
c85278b31d
Delinting 2020-05-03 00:53:29 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf 2020-04-26 05:33:26 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution 2020-04-24 04:12:05 -07:00
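
Note on commit 895fddd85e (2020-11-07): its message says that universal_newlines and text are equivalent parameters of subprocess.run, and that a wrapper supplied text= where Python 3.6 lacked it. The sketch below only illustrates that equivalence. It is not the project's actual wrapper: run_text is a hypothetical name, and the comparison assumes Python 3.7 or later, where subprocess.run accepts both spellings.

```python
import subprocess
import sys


def run_text(args, **kwargs):
    """Hypothetical wrapper: accept text= and forward it as universal_newlines=.

    Older interpreters only understand universal_newlines, so translating the
    keyword lets callers use the clearer name everywhere.
    """
    if "text" in kwargs:
        kwargs["universal_newlines"] = kwargs.pop("text")
    return subprocess.run(args, **kwargs)


# A child process that prints one line; sys.executable is the current Python.
cmd = [sys.executable, "-c", "print('hello')"]

# Both spellings ask subprocess to decode output to str instead of bytes.
old_style = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
new_style = subprocess.run(cmd, stdout=subprocess.PIPE, text=True)

# The wrapper accepts the newer keyword and passes the older one through.
wrapped = run_text(cmd, stdout=subprocess.PIPE, text=True)
```

All three calls return a CompletedProcess whose stdout is a str rather than bytes, which is the behavior the commit message relies on.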