2895 Commits

Author SHA1 Message Date
James R. Barlow
600d31a907 Require pikepdf 0.3.7 2018-10-30 16:22:05 -07:00
James R. Barlow
be31cec332 Add corrupt text warning (when using --redo-ocr) 2018-10-30 16:19:58 -07:00
James R. Barlow
22a7cd3421 Add argument checks for --redo-ocr 2018-10-30 16:19:13 -07:00
James R. Barlow
8b61d2d521 pdfminer: If font descent claims to be positive, treat it as negative 2018-10-30 14:40:53 -07:00
James R. Barlow
559e5269d2 Ensure inline image is parsed correctly
Requires pikepdf > 0.3.6
2018-10-29 23:30:53 -07:00
James R. Barlow
ebf6acb318 pdfminer patch: Type3 font height calculation is incorrect
Not sure where it goes wrong or why it needs special treatment, but
this does address it.
2018-10-29 22:27:25 -07:00
James R. Barlow
7acd75f013 pipeline: fix bbox coordinates 2018-10-29 22:26:37 -07:00
James R. Barlow
93623b2226 Refactor TextboxInfo 2018-10-29 14:46:40 -07:00
James R. Barlow
d71fd089cb layout: allow names beginning with /i0123 for now
Showed up in GGastro2.pdf. Need to check if this pattern has valid
Unicode mappings but allow for now.
2018-10-29 14:45:59 -07:00
James R. Barlow
05aa43c856 Require pdfminer 2018-10-29 12:45:15 -07:00
James R. Barlow
de80fb6bc8 Fix some failing tests after --redo-ocr changes 2018-10-29 11:49:38 -07:00
James R. Barlow
8e396f4be2 Document --redo-ocr more accurately 2018-10-29 02:03:58 -07:00
James R. Barlow
efec6da377 Fix error on serializing bad character markers
(Since they held a reference to their font, which in turn, had an
open file handle.)
2018-10-29 02:02:00 -07:00
James R. Barlow
00ef53195e Fix corrupt Unicode mapping detection's false positives 2018-10-29 01:30:19 -07:00
James R. Barlow
f564aaf485 Remove only_ocr_text 2018-10-28 22:41:18 -07:00
James R. Barlow
5ac2d31d0d Redo OCR can now handle visible and invisible text, so adjust accordingly
Still can't filter out corrupt text
2018-10-28 14:06:25 -07:00
James R. Barlow
fda890ab47 pdfinfo: further layout improvements
Rather than grouping visible/invisible in a custom analysis step,
use pdfminer's analysis and iterate.
Make iteration predicate and return more generic.
2018-10-28 14:05:50 -07:00
Stefan Weil
a873278c2a Fix some recommendations from LGTM (#309)
* Fix unreachable code

This fixes an issue reported by LGTM.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Remove unused imports

This fixes several recommendations from LGTM.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-28 13:59:58 -07:00
James R. Barlow
e6d64be890 pdfinfo: formatting 2018-10-27 23:22:44 -07:00
James R. Barlow
0e4d978d20 pdfinfo: all -> not any 2018-10-27 23:22:28 -07:00
James R. Barlow
b12c2cfedf Fix handling of Type3 fonts with no ToUnicode mapping 2018-10-27 01:24:48 -07:00
James R. Barlow
58cc70725e Reorganize around getting bboxes for visible/invisible text 2018-10-26 01:07:02 -07:00
James R. Barlow
339afb02aa --redo-ocr now works in the presence of printable text 2018-10-25 16:53:47 -07:00
James R. Barlow
7ba0ff5c36 Fix strip invisible text bug: missing BT operator 2018-10-25 16:52:23 -07:00
James R. Barlow
ff41fbf673 Add pdfminer based layout analysis 2018-10-25 12:42:35 -07:00
James R. Barlow
2435cd23ce Move pdfinfo into a package 2018-10-25 00:37:38 -07:00
James R. Barlow
a063cff720 Rename/expose strip_invisible_text 2018-10-24 21:53:24 -07:00
James R. Barlow
0d396e1ac0 option check: Remove always-True condition
Both renderers are now lossless reconstruction-capable. (Have
been since 7.0)
2018-10-22 22:13:59 -07:00
James R. Barlow
f5807a2053 Require pikepdf 0.3.5 2018-10-21 21:37:15 -07:00
James R. Barlow
eb4938a36f Fix KeyError 'has_vector' 2018-10-20 01:20:22 -07:00
James R. Barlow
c5ad530bbf pdfinfo: reminder about 'INLINE IMAGE' sentinel 2018-10-20 01:17:08 -07:00
James R. Barlow
d11c428407 Redo OCR: disallow in cases that will damage the output PDF 2018-10-20 01:14:33 -07:00
James R. Barlow
6182b1f53e Merge branch 'feature/remove-vectors' into feature/redo-ocr 2018-10-20 01:13:24 -07:00
James R. Barlow
00fc1a12e2 optimize: should remove unreferenced resources too 2018-10-19 00:03:56 -07:00
James R. Barlow
16af753206 Add functional "redo OCR" feature
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.

Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
fa48205bb8 Add feature to remove vector graphics objects 2018-10-18 21:46:08 -07:00
James R. Barlow
f7dbf94071 pipeline: if vector graphic objects exist, ensure the DPI is reasonable 2018-10-18 01:23:31 -07:00
James R. Barlow
b18e66e2ca pdfinfo: learn to detect vector graphic objects 2018-10-18 01:21:51 -07:00
James R. Barlow
7a5504dfa5 pdfinfo: fix terminology (operands, command) -> (operands, operator) 2018-10-18 01:18:30 -07:00
James R. Barlow
d1cad7bc68 Merge branch 'master' of github.com:jbarlow83/OCRmyPDF 2018-10-16 01:28:17 -07:00
Elliott Sales de Andrade
c58d5c097c Add Fedora install instructions. (#304)
* Add Fedora install instructions.

* Fix path to fedora_rawhide badge
2018-10-14 13:28:50 -07:00
James R. Barlow
46157ca94e docs: some redundancies 2018-10-12 21:29:27 -07:00
jbarlow83
dd99511bcc Fix broken badges in README 2018-10-12 21:16:08 -07:00
M.Yasoob Ullah Khalid ☺
5bc2efd3c7 Removed extra word from docs (#303) 2018-10-12 21:02:16 -07:00
James R. Barlow
1b18dbecf5 Fix filename test.txt v7.2.1 2018-10-11 16:03:25 -07:00
James R. Barlow
9f82c0eb6e v7.2.1 release notes 2018-10-11 15:55:01 -07:00
James R. Barlow
68bac1b177 Fix compatibility with pikepdf 0.3.5 API change 2018-10-11 15:51:34 -07:00
James R. Barlow
1495b78330 Remove cruft to support leptonica < 1.72 in test suite 2018-10-11 01:37:32 -07:00
James R. Barlow
6f777d2848 Include Debian copyright file 2018-10-10 23:55:48 -07:00
James R. Barlow
5650eba848 Cleanup MANIFEST.in, reorg requirements/*.txt, fix non-Unicode readme 2018-10-10 23:53:08 -07:00