James R. Barlow
46d0978a09
Update version scripts to support Ghostscript 10.0
2022-10-03 21:59:31 -07:00
James R. Barlow
6dbaebdc0c
Merge branch 'master' into feature/drop-3.7
2022-09-15 23:00:27 -07:00
James R. Barlow
f4155dca77
tests: convert all uses of multipage.pdf to fixture
2022-08-11 01:13:10 -07:00
James R. Barlow
79db985181
Improve encryption tests; drop some public domain resources
...
Generate the encrypted files we need and remove special test files we retained for this.
Replace jbig2.pdf based on congress.jpg with version based on ccitt.pdf.
2022-08-06 14:37:45 -07:00
James R. Barlow
acc70036cc
Set minimum Tesseract to 4.1.1
2022-08-02 15:20:29 -07:00
James R. Barlow
67773da309
Drop support for Ghostscript <9.50
2022-08-02 15:01:10 -07:00
James R. Barlow
80ed2117cc
Change to SPDX license tracking
2022-07-28 01:10:07 -07:00
James R. Barlow
dc6f1a266a
Modernize type annotations
2022-07-23 00:39:24 -07:00
James R. Barlow
5d0cc0a092
tests: Extract some test fixtures for better clarity
2022-05-26 00:57:31 -07:00
James R. Barlow
6c427f82ea
Add test case for corrupt ICC profiles
2022-05-26 00:41:19 -07:00
James Barlow
776ada6713
Upgrade pre-commit and associated tools; various lints
2022-04-03 20:53:01 -07:00
James R. Barlow
13af3252ff
tests: simplify run_ocrmypdf API
2021-12-06 17:00:25 -08:00
James R. Barlow
6910c48b81
Fix test_outputtype_none on Windows and cleanup docs
2021-12-06 15:38:38 -08:00
James R. Barlow
f91faf9795
Add new argument --tesseract-thresholding to control tesseract thresholding where available
...
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
e3126d2806
Adjust test to support Tesseract 5 working harder to find its files
2021-11-13 01:16:35 -08:00
James R. Barlow
7ba04267b1
Remove shims to support old versions of pikepdf < 4
2021-11-13 00:43:20 -08:00
James R. Barlow
380b981763
Remove most Python 3.6 special casing
2021-11-13 00:27:48 -08:00
James R. Barlow
59642a98b2
Disable --remove-background so we can remove leptonica
2021-11-12 23:56:52 -08:00
James R. Barlow
30440104ba
Remove --threshold argument
...
Tesseract now includes better thresholding (binarization) in v5. Users that have
thresholding issues should try that first. If we find further problems
this can be brought back as a plugin.
2021-11-12 20:09:55 -08:00
James R. Barlow
a55ab05d16
Replace leptonica deskew with tesseract find skew and pillow rotate
...
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
78f391536b
Offer hint to user to use --max-image-mpixels after decompression bomb error
...
Closes #801
2021-10-06 00:19:11 -07:00
James R. Barlow
790d3022f6
Implement --output-type=none to skip producing the PDF and use only the sidecar
...
Closes #787
2021-09-26 01:07:34 -07:00
James R. Barlow
4eca0a165b
pre-commit: pyupgrade modernizing
2021-08-26 18:04:38 -07:00
James R. Barlow
906d77b389
tests: remove obsolete running_in_travis()
2021-04-07 02:25:10 -07:00
James R. Barlow
aa115a8be3
Remove pytest_helpers_namespace
2021-04-07 01:56:51 -07:00
James R. Barlow
bb258fc99c
pdfinfo: Refactor pageinfo dictionary into a class
2020-12-24 01:47:53 -08:00
James R. Barlow
895fddd85e
Replace most uses of universal_newlines with text
...
The parameters are equivalent but the latter is better named. Since
Python 3.6 doesn't support text= we use our wrapper to add it in that
place.
This is for subprocess.run.
2020-11-07 00:48:08 -08:00
James R. Barlow
e6a7b58863
Merge branch 'de-gpl'
2020-08-12 12:20:38 -07:00
James R. Barlow
9b641055e1
Fix KeyError: 'dpi' when using --threshold on image to PDF
...
Fixes #607
2020-08-07 02:21:02 -07:00
James R. Barlow
aa0ec40102
Change license of all GPLv3 files to MPL-2.0
...
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
86a73191b0
Plugin manager: accept Path(plugin)
2020-06-30 04:17:30 -07:00
James R. Barlow
48e2750551
Fix some tests that were failing in Docker
2020-06-21 01:48:13 -07:00
James R. Barlow
892db88f0e
test_two_languages: use narrower test
2020-06-12 14:33:02 -07:00
James R. Barlow
393c5a9ea4
Fix error on -l lang1+lang2
2020-06-12 12:10:29 -07:00
James R. Barlow
0f942fb714
Rename ocrmypdf.exec -> ocrmypdf._exec
2020-06-09 14:59:09 -07:00
James R. Barlow
be8ca589d4
Move ocrmypdf.exec.run and friends to ocrmypdf.subprocess
2020-06-09 14:53:10 -07:00
James R. Barlow
b109445215
Move Ghostscript rasterize_pdf to plugin
2020-06-08 17:10:27 -07:00
James R. Barlow
a9a473f2e5
Convert all tesseract cache usages to plugin
2020-06-05 17:55:18 -07:00
James R. Barlow
6268e2faff
Begin replacing tests/spoof/tesseract_cache with plugin
2020-06-05 17:27:10 -07:00
James R. Barlow
1b92f447c3
Convert tesseract_crash to plugin
2020-06-02 02:36:41 -07:00
James R. Barlow
4f4ad0fb76
Convert tesseract_big_image_error to plugin
2020-06-02 01:49:47 -07:00
James R. Barlow
1598f2f0e5
Abolish spoof_tesseract_noop
2020-06-01 03:07:53 -07:00
James R. Barlow
2b23f7ec73
tesseract_noop: begin implementing with plugin
2020-06-01 02:45:49 -07:00
James R. Barlow
41eb54cc0a
Standardize tesseract.generate_hocr and _pdf parameters
2020-05-14 03:23:25 -07:00
James R. Barlow
12a2f78c4d
Fix validation of languages not using tesseract_env
...
And some related issues.
2020-05-14 03:19:22 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path
2020-05-06 02:53:47 -07:00
James R. Barlow
c85278b31d
Delinting
2020-05-03 00:53:29 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf
2020-04-26 05:33:26 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
...
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution
2020-04-24 04:12:05 -07:00