James R. Barlow
0e4d978d20
pdfinfo: all -> not any
2018-10-27 23:22:28 -07:00
James R. Barlow
b12c2cfedf
Fix handling of Type3 fonts with no ToUnicode mapping
2018-10-27 01:24:48 -07:00
James R. Barlow
58cc70725e
Reorganize around getting bboxes for visible/invisible text
2018-10-26 01:07:02 -07:00
James R. Barlow
339afb02aa
--redo-ocr now works in the presence of printable text
2018-10-25 16:53:47 -07:00
James R. Barlow
7ba0ff5c36
Fix strip invisible text bug: missing BT operator
2018-10-25 16:52:23 -07:00
James R. Barlow
ff41fbf673
Add pdfminer based layout analysis
2018-10-25 12:42:35 -07:00
James R. Barlow
2435cd23ce
Move pdfinfo into a package
2018-10-25 00:37:38 -07:00
James R. Barlow
a063cff720
Rename/expose strip_invisible_text
2018-10-24 21:53:24 -07:00
James R. Barlow
0d396e1ac0
option check: Remove always-True condition
Both renderers are now capable of lossless reconstruction (and have
been since 7.0).
2018-10-22 22:13:59 -07:00
James R. Barlow
f5807a2053
Require pikepdf 0.3.5
2018-10-21 21:37:15 -07:00
James R. Barlow
eb4938a36f
Fix KeyError 'has_vector'
2018-10-20 01:20:22 -07:00
James R. Barlow
c5ad530bbf
pdfinfo: reminder about 'INLINE IMAGE' sentinel
2018-10-20 01:17:08 -07:00
James R. Barlow
d11c428407
Redo OCR: disallow in cases that will damage the output PDF
2018-10-20 01:14:33 -07:00
James R. Barlow
6182b1f53e
Merge branch 'feature/remove-vectors' into feature/redo-ocr
2018-10-20 01:13:24 -07:00
James R. Barlow
00fc1a12e2
optimize: should remove unreferenced resources too
2018-10-19 00:03:56 -07:00
James R. Barlow
16af753206
Add functional "redo OCR" feature
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.
Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
fa48205bb8
Add feature to remove vector graphics objects
2018-10-18 21:46:08 -07:00
James R. Barlow
f7dbf94071
pipeline: if vector graphic objects exist, ensure the DPI is reasonable
2018-10-18 01:23:31 -07:00
James R. Barlow
b18e66e2ca
pdfinfo: learn to detect vector graphic objects
2018-10-18 01:21:51 -07:00
James R. Barlow
7a5504dfa5
pdfinfo: fix terminology (operands, command) -> (operands, operator)
2018-10-18 01:18:30 -07:00
James R. Barlow
d1cad7bc68
Merge branch 'master' of github.com:jbarlow83/OCRmyPDF
2018-10-16 01:28:17 -07:00
Elliott Sales de Andrade
c58d5c097c
Add Fedora install instructions. (#304)
* Add Fedora install instructions.
* Fix path to fedora_rawhide badge
2018-10-14 13:28:50 -07:00
James R. Barlow
46157ca94e
docs: some redundancies
2018-10-12 21:29:27 -07:00
jbarlow83
dd99511bcc
Fix broken badges in README
2018-10-12 21:16:08 -07:00
M.Yasoob Ullah Khalid ☺
5bc2efd3c7
Removed extra word from docs (#303)
2018-10-12 21:02:16 -07:00
James R. Barlow
1b18dbecf5
Fix filename test.txt
v7.2.1
2018-10-11 16:03:25 -07:00
James R. Barlow
9f82c0eb6e
v7.2.1 release notes
2018-10-11 15:55:01 -07:00
James R. Barlow
68bac1b177
Fix compatibility with pikepdf 0.3.5 API change
2018-10-11 15:51:34 -07:00
James R. Barlow
1495b78330
Remove cruft to support leptonica < 1.72 in test suite
2018-10-11 01:37:32 -07:00
James R. Barlow
6f777d2848
Include Debian copyright file
2018-10-10 23:55:48 -07:00
James R. Barlow
5650eba848
Cleanup MANIFEST.in, reorg requirements/*.txt, fix non-Unicode readme
2018-10-10 23:53:08 -07:00
James R. Barlow
5bc5dc93f3
v7.2.0 release notes update
v7.2.0
2018-10-05 01:27:00 -07:00
James R. Barlow
c1e18bb825
optimize: Exclude soft masks (SMasks) from optimization
Soft masks are only allowed to be of colorspace DeviceGray so we
shouldn't use pngquant on them. For now, avoid this exceptional
case by excluding soft masks from optimization.
2018-10-05 01:23:26 -07:00
James R. Barlow
58282ea0fb
optimize: more refactoring
Now properly generalized/specialized where it should be
2018-10-04 13:44:51 -07:00
James R. Barlow
891da7834c
optimize: refactor image extraction
2018-10-04 12:34:22 -07:00
James R. Barlow
5c229d48d5
optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp
Closes #297
2018-10-04 11:53:11 -07:00
James R. Barlow
53f660cf35
Travis: use newer macos image
2018-10-04 08:59:40 -07:00
James R. Barlow
7b66ca68f2
...and document lossy JBIG2
2018-10-04 01:31:53 -07:00
James R. Barlow
ba71c3ffbd
requirements: request pikepdf 0.3.4
2018-10-04 01:22:03 -07:00
James R. Barlow
6707ad427a
v7.2.0 release notes
2018-10-04 01:21:17 -07:00
James R. Barlow
5b84549716
Change JBIG2 lossy mode to require --jbig2-lossy
2018-10-04 01:20:49 -07:00
James R. Barlow
c74f2ee6e8
Refactor the detailed error messages
2018-10-04 00:10:59 -07:00
James R. Barlow
b32dd9f9d3
Fix lossless JBIG2 when there are multiple JBIG2 images on a single page
2018-10-03 17:40:26 -07:00
James R. Barlow
fb8b161f6c
Fix suppression of tesseract config error messages
2018-10-03 17:39:50 -07:00
James R. Barlow
baddd6d233
Remove libtiff from Brewfile
For some reason, brew complains about it now.
2018-10-03 16:17:59 -07:00
James R. Barlow
6f554c6ae8
tesseract: account for behavior changes when params are missing
Tesseract 4.0-rc1 now accepts invalid parameters in config and
won't return an error anymore. We prefer to raise an error if this
occurs.
See: 741ea00d70
2018-10-03 15:11:34 -07:00
James R. Barlow
a71e4488b3
test: fix pytest warning about direct use of a fixture
2018-10-03 15:04:46 -07:00
James R. Barlow
72156b5653
Degrade more gracefully when --optimize is set but JBIG2 is not present
2018-10-03 14:24:20 -07:00
James R. Barlow
9fa471e053
Test: send stderr to stderr, why don't we?
2018-10-03 14:23:34 -07:00
James R. Barlow
31ef2fe907
test: this error message changed case in newer Tesseract
2018-10-03 13:58:20 -07:00