James R. Barlow
c9ce731119
Fix DPI mismatch between OCR page and source page
2018-05-10 17:34:08 -07:00
James R. Barlow
abed8e034e
Add metadata preservation test from stash
2018-05-10 16:43:28 -07:00
James R. Barlow
63032d304d
Revert "Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work around it"
...
This reverts commit b0ce7c63dd27257d9c979fde9013243b8ae38c98.
2018-05-10 16:27:17 -07:00
James R. Barlow
a57ecede78
Refactor textareas to remove duplicate code
2018-05-10 16:26:52 -07:00
James R. Barlow
b0ce7c63dd
Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work around it
2018-05-10 16:10:24 -07:00
James R. Barlow
d139a11c16
Weave: periodically save to prevent indefinite growth of open file list
2018-05-10 15:08:57 -07:00
James R. Barlow
aef043db0b
Revise parameter validation for output-type, pdf-renderer, lang
2018-05-10 14:53:22 -07:00
James R. Barlow
b8f3ead541
Remove tesseract renderer entirely
...
Grafting lets us work with older Tesseract versions as if they could use
sandwich, so there is no point in keeping it. It's been deprecated for a
long time now anyway.
2018-05-10 14:06:13 -07:00
James R. Barlow
e0bb898f29
Remove hocr debug renderer (-g)
...
The fact that this produces additional pages makes it a maintenance
burden. hocr can be debugged using hocrtransform.
2018-05-10 13:48:39 -07:00
James R. Barlow
45336c7c28
textareas: filter out images
2018-05-10 01:17:28 -07:00
James R. Barlow
20aabb2e83
When deciding if there is text on a page, ignore the margins
...
Margins may include watermarks or digital stamps on otherwise
text-free pages.
2018-05-10 01:16:11 -07:00
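The margin rule in the commit above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `page_has_text`, the `(x0, y0, x1, y1)` box layout, and the 10% default margin are all assumptions made for the example.

```python
from typing import Iterable, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in page units, x0 < x1 and y0 < y1.
Box = Tuple[float, float, float, float]


def page_has_text(
    text_boxes: Iterable[Box],
    page_width: float,
    page_height: float,
    margin_fraction: float = 0.1,  # assumed value, not taken from the source
) -> bool:
    """Return True if any text box reaches into the page interior.

    Boxes that lie entirely inside the margin band (for example a
    watermark or digital stamp) are ignored.
    """
    mx = page_width * margin_fraction
    my = page_height * margin_fraction
    left, bottom = mx, my
    right, top = page_width - mx, page_height - my

    for x0, y0, x1, y1 in text_boxes:
        # Standard rectangle-overlap test against the interior rectangle.
        if x0 < right and x1 > left and y0 < top and y1 > bottom:
            return True
    return False
```

With this rule, a stamp confined to a corner does not cause a page to be treated as already containing text, while a single line of body text does.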
James R. Barlow
1539e24d61
Ignore masks when deciding what color to rasterize at
2018-05-10 00:49:36 -07:00
Fabian Rodriguez
c7cf041e4a
Fixed language option example (French) (#266)
...
Replace fre with fra.
2018-05-10 00:10:27 -07:00
James R. Barlow
da80d3f354
Add unconditional (for now) whiteout of text areas
2018-05-07 17:37:46 -07:00
James R. Barlow
001c8d7678
Upgrade PyMuPDF version
2018-05-07 16:24:26 -07:00
James R. Barlow
38ab03655b
Restore unpaper
...
It's a suggested/recommended dependency, not a required one, on Debian/Ubuntu.
2018-05-06 21:36:12 -07:00
James R. Barlow
9226f8a5d1
Trap PDF/A-3 errors on old Ghostscript
v6.2.0
2018-05-04 15:29:43 -07:00
James R. Barlow
5c8a007f3e
Fix failure to prevent use of Ghostscript on /UserUnit files
2018-05-04 13:34:34 -07:00
James R. Barlow
b3ad3e297d
v6.2.0 fixes
2018-05-03 17:04:23 -07:00
James R. Barlow
d607553e48
v6.2.0 Release notes
2018-05-03 16:47:21 -07:00
James R. Barlow
7cf83c77ca
Merge branch 'feature/pdfa3'
2018-05-03 16:45:57 -07:00
James R. Barlow
8a9f174f63
Fix XMP validation issue with /CreationDate
...
Related to previous validation issue. If the /CreationDate had no
timezone, Ghostscript also creates invalid metadata. Work around this.
Also fix up PDF date decoding, and transcode dates to standardize them.
2018-05-03 16:30:20 -07:00
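To illustrate the date handling described in the commit above, here is a minimal sketch of decoding a PDF date string and re-encoding it with an explicit timezone. It is not the project's implementation: the function names are hypothetical, and treating a missing timezone as UTC is an assumption chosen for the example.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSS followed by an optional
# timezone: "Z", or a sign with HH'mm'. Everything after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def decode_pdf_date(s: str) -> datetime:
    """Parse a PDF date string into a timezone-aware datetime."""
    m = _PDF_DATE.match(s.strip())
    if m is None:
        raise ValueError(f"not a PDF date: {s!r}")
    year = int(m.group(1))
    month = int(m.group(2) or 1)
    day = int(m.group(3) or 1)
    hour = int(m.group(4) or 0)
    minute = int(m.group(5) or 0)
    second = int(m.group(6) or 0)

    if m.group(8):  # explicit +HH'mm' or -HH'mm' offset
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        if m.group(8) == "-":
            offset = -offset
        tz = timezone(offset)
    else:
        # "Z", or no timezone at all. Assumption: treat a missing zone as UTC.
        tz = timezone.utc
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


def encode_pdf_date(dt: datetime) -> str:
    """Format a timezone-aware datetime as a PDF date with explicit offset."""
    minutes = int(dt.utcoffset().total_seconds()) // 60
    sign = "+" if minutes >= 0 else "-"
    hh, mm = divmod(abs(minutes), 60)
    return dt.strftime("D:%Y%m%d%H%M%S") + f"{sign}{hh:02d}'{mm:02d}'"
```

Round-tripping a date through `decode_pdf_date` and `encode_pdf_date` yields a string that always carries a timezone, which is the kind of standardization the commit message refers to.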
James R. Barlow
98a0786c32
Add 18.04 update procedure
2018-05-03 13:55:16 -07:00
James R. Barlow
df1129724c
Update Dockerfile for Ubuntu 18.04
2018-05-03 01:27:13 -07:00
James R. Barlow
423cef08bf
Handle procset properly
2018-05-02 14:48:02 -07:00
James R. Barlow
04580accb4
Document aliasing of tesseract renderer
2018-05-02 14:47:38 -07:00
James R. Barlow
6376f77b8c
Refactor, remove trigonometry
2018-05-02 12:30:34 -07:00
James R. Barlow
e27e614ed9
Fixed rotation hard case
2018-05-02 01:32:11 -07:00
James R. Barlow
b0c04704a1
Fixed all but one rotation case
2018-05-02 01:24:21 -07:00
James R. Barlow
6bb6bf8323
Fix correction angle being taken from the wrong page
2018-05-02 01:00:30 -07:00
James R. Barlow
e22fe8aefc
Silence debug messages
2018-05-01 23:51:54 -07:00
James R. Barlow
76276f61e5
Split out rotation related tests
2018-05-01 23:51:35 -07:00
James R. Barlow
bfd26e6ec6
Tests: confirm OCR layer copied
2018-05-01 23:16:41 -07:00
James R. Barlow
d787e1ea0f
ghostscript.py not saved in last commit
...
Given the importance of the last commit, confirmed that all tests also pass once the file is saved.
The test results are unchanged by this commit.
2018-05-01 22:59:22 -07:00
James R. Barlow
b5d7e9cbb0
Fix all issues with rotations
...
All tests now pass
2018-05-01 22:50:20 -07:00
James R. Barlow
f3b6d9dcdf
Fix a comment about Tesseract behavior in certain versions
2018-05-01 21:31:09 -07:00
James R. Barlow
a9abe13185
Remove the old tesseract pdf_renderer
2018-05-01 17:31:34 -07:00
James R. Barlow
6b315e8315
Add ability to disable cache
2018-05-01 15:52:00 -07:00
James R. Barlow
37677de884
Fix regressions: pdfa.ps not used, PDF/A failures, handling of text layers with no font
2018-05-01 15:51:46 -07:00
James R. Barlow
c7387de325
Fix auto rotate
2018-05-01 15:18:28 -07:00
James R. Barlow
2495b1e038
Refactor find font, get test cases working again
2018-05-01 14:48:41 -07:00
James R. Barlow
073ee52ce7
Use hocr and weave; eliminate old combine layers and merge pages
2018-05-01 14:21:53 -07:00
James R. Barlow
54150a14e9
Further elimination of tesseract renderer special casing
...
We don't need to keep a "skip page" around anymore since
skipping means just not grafting on the text layer.
2018-05-01 13:36:20 -07:00
James R. Barlow
88ff091cce
Unify tesseract and sandwich renderer paths
...
Since the new weaving method copies the font and content
stream from the Tesseract PDF, it doesn't matter if Tesseract
happens to have an image or not.
If Tesseract is text-only capable we use that feature for efficiency,
but ignore the image either way.
2018-05-01 13:24:20 -07:00
James R. Barlow
e87a5776f1
Remove now-unnecessary code to rotate pages
...
Track only the decision to change rotation.
2018-05-01 13:01:25 -07:00
James R. Barlow
0806ce6406
Fix rotation for unsplit (modulo --rotate-pages)
2018-04-30 20:58:42 -07:00
James R. Barlow
6409894a71
feature/unsplit-try-imagerotate
2018-04-30 20:48:59 -07:00
James R. Barlow
e7286f6129
Unsplit now works with multipage, --force-ocr
2018-04-30 14:46:20 -07:00
James R. Barlow
2ab94b3151
unsplit: it's alive
...
First successful file output.
2018-04-28 01:57:41 -07:00
James R. Barlow
7ee90890ec
Add copying of essential information from Tesseract textonly
2018-04-27 23:19:08 -07:00