James R. Barlow
600d31a907
Require pikepdf 0.3.7
2018-10-30 16:22:05 -07:00
James R. Barlow
be31cec332
Add corrupt text warning (when using --redo-ocr)
2018-10-30 16:19:58 -07:00
James R. Barlow
22a7cd3421
Add argument checks for --redo-ocr
2018-10-30 16:19:13 -07:00
James R. Barlow
8b61d2d521
pdfminer: If font descent claims to be positive, treat it as negative
2018-10-30 14:40:53 -07:00
James R. Barlow
559e5269d2
Ensure inline image is parsed correctly
...
Requires pikepdf > 0.3.6
2018-10-29 23:30:53 -07:00
James R. Barlow
ebf6acb318
pdfminer patch: Type3 font height calculation is incorrect
...
Not sure where it goes wrong or why it needs special treatment, but
this does address it.
2018-10-29 22:27:25 -07:00
James R. Barlow
7acd75f013
pipeline: fix bbox coordinates
2018-10-29 22:26:37 -07:00
James R. Barlow
93623b2226
Refactor TextboxInfo
2018-10-29 14:46:40 -07:00
James R. Barlow
d71fd089cb
layout: allow names beginning with /i0123 for now
...
Showed up in GGastro2.pdf. Need to check if this pattern has valid
Unicode mappings but allow for now.
2018-10-29 14:45:59 -07:00
James R. Barlow
05aa43c856
Require pdfminer
2018-10-29 12:45:15 -07:00
James R. Barlow
de80fb6bc8
Fix some failing tests after --redo-ocr changes
2018-10-29 11:49:38 -07:00
James R. Barlow
8e396f4be2
Document --redo-ocr more accurately
2018-10-29 02:03:58 -07:00
James R. Barlow
efec6da377
Fix error on serializing bad character markers
...
(Since they held a reference to their font, which in turn, had an
open file handle.)
2018-10-29 02:02:00 -07:00
James R. Barlow
00ef53195e
Fix corrupt Unicode mapping detection's false positives
2018-10-29 01:30:19 -07:00
James R. Barlow
f564aaf485
Remove only_ocr_text
2018-10-28 22:41:18 -07:00
James R. Barlow
5ac2d31d0d
Redo OCR can now handle visible and invisible text, so adjust accordingly
...
Still can't filter out corrupt text
2018-10-28 14:06:25 -07:00
James R. Barlow
fda890ab47
pdfinfo: further layout improvements
...
Rather than grouping visible/invisible in a custom analysis step,
use pdfminer's analysis and iterate.
Make iteration predicate and return more generic.
2018-10-28 14:05:50 -07:00
Stefan Weil
a873278c2a
Fix some recommendations from LGTM (#309)
...
* Fix unreachable code
This fixes an issue reported by LGTM.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* Remove unused imports
This fixes several recommendations from LGTM.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-28 13:59:58 -07:00
James R. Barlow
e6d64be890
pdfinfo: formatting
2018-10-27 23:22:44 -07:00
James R. Barlow
0e4d978d20
pdfinfo: all -> not any
2018-10-27 23:22:28 -07:00
James R. Barlow
b12c2cfedf
Fix handling of Type3 fonts with no ToUnicode mapping
2018-10-27 01:24:48 -07:00
James R. Barlow
58cc70725e
Reorganize around getting bboxes for visible/invisible text
2018-10-26 01:07:02 -07:00
James R. Barlow
339afb02aa
--redo-ocr now works in the presence of printable text
2018-10-25 16:53:47 -07:00
James R. Barlow
7ba0ff5c36
Fix strip invisible text bug: missing BT operator
2018-10-25 16:52:23 -07:00
James R. Barlow
ff41fbf673
Add pdfminer based layout analysis
2018-10-25 12:42:35 -07:00
James R. Barlow
2435cd23ce
Move pdfinfo into a package
2018-10-25 00:37:38 -07:00
James R. Barlow
a063cff720
Rename/expose strip_invisible_text
2018-10-24 21:53:24 -07:00
James R. Barlow
0d396e1ac0
option check: Remove always-True condition
...
Both renderers are now lossless reconstruction-capable. (Have
been since 7.0)
2018-10-22 22:13:59 -07:00
James R. Barlow
f5807a2053
Require pikepdf 0.3.5
2018-10-21 21:37:15 -07:00
James R. Barlow
eb4938a36f
Fix KeyError 'has_vector'
2018-10-20 01:20:22 -07:00
James R. Barlow
c5ad530bbf
pdfinfo: reminder about 'INLINE IMAGE' sentinel
2018-10-20 01:17:08 -07:00
James R. Barlow
d11c428407
Redo OCR: disallow in cases that will damage the output PDF
2018-10-20 01:14:33 -07:00
James R. Barlow
6182b1f53e
Merge branch 'feature/remove-vectors' into feature/redo-ocr
2018-10-20 01:13:24 -07:00
James R. Barlow
00fc1a12e2
optimize: should remove unreferenced resources too
2018-10-19 00:03:56 -07:00
James R. Barlow
16af753206
Add functional "redo OCR" feature
...
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.
Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
fa48205bb8
Add feature to remove vector graphics objects
2018-10-18 21:46:08 -07:00
James R. Barlow
f7dbf94071
pipeline: if vector graphic objects exist, ensure the DPI is reasonable
2018-10-18 01:23:31 -07:00
James R. Barlow
b18e66e2ca
pdfinfo: learn to detect vector graphic objects
2018-10-18 01:21:51 -07:00
James R. Barlow
7a5504dfa5
pdfinfo: fix terminology (operands, command) -> (operands, operator)
2018-10-18 01:18:30 -07:00
James R. Barlow
d1cad7bc68
Merge branch 'master' of github.com:jbarlow83/OCRmyPDF
2018-10-16 01:28:17 -07:00
Elliott Sales de Andrade
c58d5c097c
Add Fedora install instructions. (#304)
...
* Add Fedora install instructions.
* Fix path to fedora_rawhide badge
2018-10-14 13:28:50 -07:00
James R. Barlow
46157ca94e
docs: some redundancies
2018-10-12 21:29:27 -07:00
jbarlow83
dd99511bcc
Fix broken badges in README
2018-10-12 21:16:08 -07:00
M.Yasoob Ullah Khalid ☺
5bc2efd3c7
Removed extra word from docs (#303)
2018-10-12 21:02:16 -07:00
James R. Barlow
1b18dbecf5
Fix filename test.txt
v7.2.1
2018-10-11 16:03:25 -07:00
James R. Barlow
9f82c0eb6e
v7.2.1 release notes
2018-10-11 15:55:01 -07:00
James R. Barlow
68bac1b177
Fix compatibility with pikepdf 0.3.5 API change
2018-10-11 15:51:34 -07:00
James R. Barlow
1495b78330
Remove cruft to support leptonica < 1.72 in test suite
2018-10-11 01:37:32 -07:00
James R. Barlow
6f777d2848
Include Debian copyright file
2018-10-10 23:55:48 -07:00
James R. Barlow
5650eba848
Cleanup MANIFEST.in, reorg requirements/*.txt, fix non-Unicode readme
2018-10-10 23:53:08 -07:00