2676 Commits

Author SHA1 Message Date
James R. Barlow
0e4d978d20 pdfinfo: all -> not any 2018-10-27 23:22:28 -07:00
James R. Barlow
b12c2cfedf Fix handling of Type3 fonts with no ToUnicode mapping 2018-10-27 01:24:48 -07:00
James R. Barlow
58cc70725e Reorganize around getting bboxes for visible/invisible text 2018-10-26 01:07:02 -07:00
James R. Barlow
339afb02aa --redo-ocr now works in the presence of printable text 2018-10-25 16:53:47 -07:00
James R. Barlow
7ba0ff5c36 Fix strip invisible text bug: missing BT operator 2018-10-25 16:52:23 -07:00
James R. Barlow
ff41fbf673 Add pdfminer based layout analysis 2018-10-25 12:42:35 -07:00
James R. Barlow
2435cd23ce Move pdfinfo into a package 2018-10-25 00:37:38 -07:00
James R. Barlow
a063cff720 Rename/expose strip_invisible_text 2018-10-24 21:53:24 -07:00
James R. Barlow
0d396e1ac0 option check: Remove always-True condition
Both renderers are now lossless reconstruction-capable. (Have
been since 7.0)
2018-10-22 22:13:59 -07:00
James R. Barlow
f5807a2053 Require pikepdf 0.3.5 2018-10-21 21:37:15 -07:00
James R. Barlow
eb4938a36f Fix KeyError 'has_vector' 2018-10-20 01:20:22 -07:00
James R. Barlow
c5ad530bbf pdfinfo: reminder about 'INLINE IMAGE' sentinel 2018-10-20 01:17:08 -07:00
James R. Barlow
d11c428407 Redo OCR: disallow in cases that will damage the output PDF 2018-10-20 01:14:33 -07:00
James R. Barlow
6182b1f53e Merge branch 'feature/remove-vectors' into feature/redo-ocr 2018-10-20 01:13:24 -07:00
James R. Barlow
00fc1a12e2 optimize: should remove unreferenced resources too 2018-10-19 00:03:56 -07:00
James R. Barlow
16af753206 Add functional "redo OCR" feature
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.

Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
fa48205bb8 Add feature to remove vector graphics objects 2018-10-18 21:46:08 -07:00
James R. Barlow
f7dbf94071 pipeline: if vector graphic objects exist, ensure the DPI is reasonable 2018-10-18 01:23:31 -07:00
James R. Barlow
b18e66e2ca pdfinfo: learn to detect vector graphic objects 2018-10-18 01:21:51 -07:00
James R. Barlow
7a5504dfa5 pdfinfo: fix terminology (operands, command) -> (operands, operator) 2018-10-18 01:18:30 -07:00
James R. Barlow
d1cad7bc68 Merge branch 'master' of github.com:jbarlow83/OCRmyPDF 2018-10-16 01:28:17 -07:00
Elliott Sales de Andrade
c58d5c097c Add Fedora install instructions. (#304)
* Add Fedora install instructions.

* Fix path to fedora_rawhide badge
2018-10-14 13:28:50 -07:00
James R. Barlow
46157ca94e docs: some redundancies 2018-10-12 21:29:27 -07:00
jbarlow83
dd99511bcc Fix broken badges in README 2018-10-12 21:16:08 -07:00
M.Yasoob Ullah Khalid ☺
5bc2efd3c7 Removed extra word from docs (#303) 2018-10-12 21:02:16 -07:00
James R. Barlow
1b18dbecf5 Fix filename test.txt v7.2.1 2018-10-11 16:03:25 -07:00
James R. Barlow
9f82c0eb6e v7.2.1 release notes 2018-10-11 15:55:01 -07:00
James R. Barlow
68bac1b177 Fix compatibility with pikepdf 0.3.5 API change 2018-10-11 15:51:34 -07:00
James R. Barlow
1495b78330 Remove cruft to support leptonica < 1.72 in test suite 2018-10-11 01:37:32 -07:00
James R. Barlow
6f777d2848 Include Debian copyright file 2018-10-10 23:55:48 -07:00
James R. Barlow
5650eba848 Cleanup MANIFEST.in, reorg requirements/*.txt, fix non-Unicode readme 2018-10-10 23:53:08 -07:00
James R. Barlow
5bc5dc93f3 v7.2.0 release notes update v7.2.0 2018-10-05 01:27:00 -07:00
James R. Barlow
c1e18bb825 optimize: Exclude soft masks (SMasks) from optimization
Soft masks are only allowed to be of colorspace DeviceGray so we
shouldn't use pngquant on them. For now, avoid this exceptional
case by excluding soft masks from optimization.
2018-10-05 01:23:26 -07:00
James R. Barlow
58282ea0fb optimize: more refactoring
Now properly generalized/specialized where it should be
2018-10-04 13:44:51 -07:00
James R. Barlow
891da7834c optimize: refactor image extraction 2018-10-04 12:34:22 -07:00
James R. Barlow
5c229d48d5 optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp
Closes #297
2018-10-04 11:53:11 -07:00
James R. Barlow
53f660cf35 Travis: use newer macos image 2018-10-04 08:59:40 -07:00
James R. Barlow
7b66ca68f2 ...and document lossy JBIG2 2018-10-04 01:31:53 -07:00
James R. Barlow
ba71c3ffbd requirements: request pikepdf 0.3.4 2018-10-04 01:22:03 -07:00
James R. Barlow
6707ad427a v7.2.0 release notes 2018-10-04 01:21:17 -07:00
James R. Barlow
5b84549716 Change JBIG2 lossy mode to require --jbig2-lossy 2018-10-04 01:20:49 -07:00
James R. Barlow
c74f2ee6e8 Refactor the detailed error messages 2018-10-04 00:10:59 -07:00
James R. Barlow
b32dd9f9d3 Fix lossless JBIG2 when there are multiple JBIG2 images on a single page 2018-10-03 17:40:26 -07:00
James R. Barlow
fb8b161f6c Fix suppression of tesseract config error messages 2018-10-03 17:39:50 -07:00
James R. Barlow
baddd6d233 Remove libtiff from Brewfile
For some reason, brew complains about it now.
2018-10-03 16:17:59 -07:00
James R. Barlow
6f554c6ae8 tesseract: account for behavior changes when params are missing
Tesseract 4.0-rc1 now accepts invalid parameters in config and
won't return an error anymore. We prefer to raise an error if this
occurs.

See: 741ea00d70
2018-10-03 15:11:34 -07:00
James R. Barlow
a71e4488b3 test: fix pytest warning about direct use of a fixture 2018-10-03 15:04:46 -07:00
James R. Barlow
72156b5653 Degrade more gracefully when --optimize is set but JBIG2 is not present 2018-10-03 14:24:20 -07:00
James R. Barlow
9fa471e053 Test: send stderr to stderr, why don't we? 2018-10-03 14:23:34 -07:00
James R. Barlow
31ef2fe907 test: this error message changed case in newer Tesseract 2018-10-03 13:58:20 -07:00