2895 Commits

Author SHA1 Message Date
James R. Barlow
871979abd6 Temporarily unbreak without fitz mode 2018-05-11 17:32:15 -07:00
James R. Barlow
efb95722ca Travis: Use declarative APT for Tesseract too 2018-05-11 12:46:10 -07:00
James R. Barlow
d9bbb80a6b Don't try to run jbig2 when not available 2018-05-11 12:42:00 -07:00
James R. Barlow
3254315127 Update test cache 2018-05-11 12:19:50 -07:00
James R. Barlow
ca297fd26b Update tests 2018-05-11 02:33:44 -07:00
James R. Barlow
ac36a43cef Warn about --user-words not having any effect
Might be available in full release of Tess4
2018-05-11 02:31:07 -07:00
James R. Barlow
f00183115d Update our dependencies 2018-05-11 02:11:55 -07:00
James R. Barlow
161b29a899 Check jbig2 when optimizing is requested 2018-05-11 02:11:01 -07:00
James R. Barlow
72253d09fa Add arguments to control optimization 2018-05-10 22:23:24 -07:00
James R. Barlow
40d09ddb23 Fix merge error in Leptonica 2018-05-10 21:17:47 -07:00
James R. Barlow
3026d86a9e Remove jbig2enc.py 2018-05-10 21:15:07 -07:00
James R. Barlow
0661a7edc3 Merge optimize 2018-05-10 21:05:32 -07:00
James R. Barlow
24b0adfacc Merge branch 'master' into develop 2018-05-10 20:54:55 -07:00
James R. Barlow
acc6698ab3 Make XML metadata test actually work 2018-05-10 20:37:10 -07:00
James R. Barlow
606d3e6aa1 Remove tests that exercise obsolete features (tesseract, -g) 2018-05-10 20:33:32 -07:00
James R. Barlow
687a7954d6 test_main: uses leptonica 2018-05-10 19:05:31 -07:00
James R. Barlow
36a53a7b37 Weave: Unconditionally rotate and scale the text layer
This solves two issues. First, the text layer can end up being a
different size, probably if the DPI is not an integer; scaling helps it
fit slightly better. Second, other printable text on the page can end up
horizontally scaled or misaligned if we don't do all of our drawing in a
q/Q pair.
2018-05-10 19:03:31 -07:00
James R. Barlow
0a5982a902 PyMuPDF tweaks: don't clean
In MuPDF 1.13 clean might be unreliable, so explicitly don't do it,
even though it doesn't cause trouble in 1.12.
2018-05-10 18:50:52 -07:00
James R. Barlow
601863f9e9 Return to PyMuPDF 1.12.5 2018-05-10 18:47:10 -07:00
James R. Barlow
c9ce731119 Fix DPI mismatch between OCR page and source page 2018-05-10 17:34:08 -07:00
James R. Barlow
abed8e034e Add metadata preservation test from stash 2018-05-10 16:43:28 -07:00
James R. Barlow
63032d304d Revert "Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work around it"
This reverts commit b0ce7c63dd27257d9c979fde9013243b8ae38c98.
2018-05-10 16:27:17 -07:00
James R. Barlow
a57ecede78 Refactor textareas to remove duplicate code 2018-05-10 16:26:52 -07:00
James R. Barlow
b0ce7c63dd Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work around it 2018-05-10 16:10:24 -07:00
James R. Barlow
d139a11c16 Weave: periodically save to prevent indefinite growth of open file list 2018-05-10 15:08:57 -07:00
James R. Barlow
aef043db0b Revise parameter validation for output-type, pdf-renderer, lang 2018-05-10 14:53:22 -07:00
James R. Barlow
b8f3ead541 Remove tesseract renderer entirely
Grafting lets us work with older Tesseract versions as if they could use
sandwich, so there is no point in keeping it. It's been deprecated for a
long time now anyway.
2018-05-10 14:06:13 -07:00
James R. Barlow
e0bb898f29 Remove hocr debug renderer (-g)
The fact that this produces additional pages makes it a maintenance
burden. hocr can be debugged using hocrtransform.
2018-05-10 13:48:39 -07:00
James R. Barlow
45336c7c28 textareas: filter out images 2018-05-10 01:17:28 -07:00
James R. Barlow
20aabb2e83 When deciding if there is text on a page, ignore the margins
Margins may include watermarks or digital stamps on otherwise
text-free pages.
2018-05-10 01:16:11 -07:00
James R. Barlow
1539e24d61 Ignore masks when deciding what color to rasterize at 2018-05-10 00:49:36 -07:00
Fabian Rodriguez
c7cf041e4a Fixed language option example (French) (#266)
Replace fre with fra.
2018-05-10 00:10:27 -07:00
James R. Barlow
da80d3f354 Add unconditional (for now) whiteout of text areas 2018-05-07 17:37:46 -07:00
James R. Barlow
001c8d7678 Upgrade PyMuPDF version 2018-05-07 16:24:26 -07:00
James R. Barlow
38ab03655b Restore unpaper
It's a suggested/recommended dep not required in Deb/Ubu.
2018-05-06 21:36:12 -07:00
James R. Barlow
9226f8a5d1 Trap PDF/A-3 errors on old Ghostscript v6.2.0 2018-05-04 15:29:43 -07:00
James R. Barlow
5c8a007f3e Fix failure to prevent use of Ghostscript on /UserUnit files 2018-05-04 13:34:34 -07:00
James R. Barlow
b3ad3e297d v6.2.0 fixes 2018-05-03 17:04:23 -07:00
James R. Barlow
d607553e48 v6.2.0 Release notes 2018-05-03 16:47:21 -07:00
James R. Barlow
7cf83c77ca Merge branch 'feature/pdfa3' 2018-05-03 16:45:57 -07:00
James R. Barlow
8a9f174f63 Fix XMP validation issue with /CreationDate
Related to previous validation issue. If the /CreationDate had no
timezone, Ghostscript also creates invalid metadata. Work around this.
Also fix up PDF date decoding, and transcode dates to standardize them.
2018-05-03 16:30:20 -07:00
James R. Barlow
98a0786c32 Add 18.04 update procedure 2018-05-03 13:55:16 -07:00
James R. Barlow
df1129724c Update Dockerfile for Ubuntu 18.04 2018-05-03 01:27:13 -07:00
James R. Barlow
423cef08bf Handle procset properly 2018-05-02 14:48:02 -07:00
James R. Barlow
04580accb4 Document aliasing of tesseract renderer 2018-05-02 14:47:38 -07:00
James R. Barlow
6376f77b8c Refactor, remove trigonometry 2018-05-02 12:30:34 -07:00
James R. Barlow
e27e614ed9 Fixed rotation hard case 2018-05-02 01:32:11 -07:00
James R. Barlow
b0c04704a1 Fixed all but one rotation case 2018-05-02 01:24:21 -07:00
James R. Barlow
6bb6bf8323 Fix correction angle used from wrong page 2018-05-02 01:00:30 -07:00
James R. Barlow
e22fe8aefc Silence debug messages 2018-05-01 23:51:54 -07:00