James R. Barlow
eacd26a68b
Mention v6.2.5 release
2018-11-10 01:10:45 -08:00
James R. Barlow
0e88b3c38a
Update v7.3.0 release notes
2018-11-10 01:09:19 -08:00
James R. Barlow
a2170ef8d6
test: test version check code
2018-11-10 00:56:22 -08:00
James R. Barlow
eed0424390
Update requirements
2018-11-10 00:56:04 -08:00
James R. Barlow
5ed05e08b1
Fix "no languages" test and misuse of os.environ
2018-11-09 01:57:11 -08:00
James R. Barlow
58b26f6715
Leptonica: learn to despeckle 1bpp images
2018-11-07 01:49:13 -08:00
James R. Barlow
806daf4284
leptonica: reduce boilerplate for PIX (2/2)
2018-11-06 20:33:40 -08:00
James R. Barlow
c64bc9329e
leptonica: reduce boilerplate for wrapper classes (except PIX)
2018-11-06 20:12:09 -08:00
James R. Barlow
dd01745519
Leptonica: add masked threshold fn
2018-11-06 19:31:06 -08:00
James R. Barlow
501ce726e7
Fix two failing tests
2018-11-06 11:16:08 -08:00
James R. Barlow
03076e89ce
Leptonica: reduce verbosity, more error trapping, more garbage collection
2018-11-06 11:10:59 -08:00
James R. Barlow
02f37293ee
Integrate barcode masking
2018-11-05 13:01:13 -08:00
James R. Barlow
590942ad14
Leptonica: Add barcode API
2018-11-05 01:48:38 -08:00
James R. Barlow
2ac028c759
test: Add a basic redo OCR test
2018-11-04 15:54:41 -08:00
James R. Barlow
2125b5bfab
Remove text detection from our parser interpret_contents
...
It's redundant now
2018-11-04 15:47:55 -08:00
James R. Barlow
b96532caa4
Only do detailed page analysis when needed by --redo-ocr
2018-11-04 15:40:49 -08:00
James R. Barlow
995fc58466
Move Ghostscript text analysis into its own module
2018-11-04 14:55:48 -08:00
James R. Barlow
c023cae299
Make pdfminer Type3 patch conditional on PScript5.dll
...
It appears that PDFs created by this software have a bug in their BBox
which will cause us to misjudge the space occupied by the font.
Other programs probably work around this by ignoring BBox and reading
each character procedure.
2018-11-04 01:53:53 -07:00
James R. Barlow
237eaf9130
Exception message not printed in some cases
...
Closes #310
2018-11-03 17:10:24 -07:00
James R. Barlow
8b9ab25125
coverage: test compile leptonica
2018-11-02 01:55:25 -07:00
James R. Barlow
77e87abe8f
coverage: ensure get_orientation is checked
2018-11-02 01:32:20 -07:00
James R. Barlow
3be02e1e8d
coverage: improve leptonica; don't create objects with null pointers
2018-11-02 01:10:10 -07:00
James R. Barlow
64c9ede979
leptonica: barcodes, BOXA
2018-11-02 00:42:01 -07:00
James R. Barlow
5b8d197812
coverage: make it more likely timeout is tested
2018-11-02 00:41:15 -07:00
James R. Barlow
2cba62dc4f
coverage: ensure rotation is actually tested
2018-11-02 00:40:56 -07:00
James R. Barlow
288e28328f
coverage: add qpdf
2018-11-02 00:37:33 -07:00
James R. Barlow
b8214b3c49
coverage: exclude unicodefun.py
2018-11-02 00:33:08 -07:00
James R. Barlow
8681693994
Set up code coverage (it works with multiprocessing now!)
2018-11-02 00:31:50 -07:00
James R. Barlow
1364c63b7c
Fix failure to pickle file with AcroForm
2018-11-01 20:07:53 -07:00
James R. Barlow
4ba9e8fe25
Add AcroForm detection
2018-10-30 22:28:44 -07:00
James R. Barlow
a195713bb4
Throw exception on corrupt text
2018-10-30 16:35:09 -07:00
James R. Barlow
600d31a907
Require pikepdf 0.3.7
2018-10-30 16:22:05 -07:00
James R. Barlow
be31cec332
Add corrupt text warning (when using --redo-ocr)
2018-10-30 16:19:58 -07:00
James R. Barlow
22a7cd3421
Add argument checks for --redo-ocr
2018-10-30 16:19:13 -07:00
James R. Barlow
8b61d2d521
pdfminer: If font descent claims to be positive, treat it as negative
2018-10-30 14:40:53 -07:00
James R. Barlow
559e5269d2
Ensure inline image is parsed correctly
...
Requires pikepdf > 0.3.6
2018-10-29 23:30:53 -07:00
James R. Barlow
ebf6acb318
pdfminer patch: Type3 font height calculation is incorrect
...
Not sure where it goes wrong or why it needs special treatment, but
this does address it.
2018-10-29 22:27:25 -07:00
James R. Barlow
7acd75f013
pipeline: fix bbox coordinates
2018-10-29 22:26:37 -07:00
James R. Barlow
93623b2226
Refactor TextboxInfo
2018-10-29 14:46:40 -07:00
James R. Barlow
d71fd089cb
layout: allow names beginning with /i0123 for now
...
Showed up in GGastro2.pdf. Need to check if this pattern has valid
Unicode mappings but allow for now.
2018-10-29 14:45:59 -07:00
James R. Barlow
05aa43c856
Require pdfminer
2018-10-29 12:45:15 -07:00
James R. Barlow
de80fb6bc8
Fix some failing tests after --redo-ocr changes
2018-10-29 11:49:38 -07:00
James R. Barlow
8e396f4be2
Document --redo-ocr more accurately
2018-10-29 02:03:58 -07:00
James R. Barlow
efec6da377
Fix error on serializing bad character markers
...
(Since they held a reference to their font, which, in turn, had an
open file handle.)
2018-10-29 02:02:00 -07:00
James R. Barlow
00ef53195e
Fix corrupt Unicode mapping detection's false positives
2018-10-29 01:30:19 -07:00
James R. Barlow
f564aaf485
Remove only_ocr_text
2018-10-28 22:41:18 -07:00
James R. Barlow
5ac2d31d0d
Redo OCR can now handle visible and invisible text, so adjust accordingly
...
Still can't filter out corrupt text
2018-10-28 14:06:25 -07:00
James R. Barlow
fda890ab47
pdfinfo: further layout improvements
...
Rather than grouping visible/invisible in a custom analysis step,
use pdfminer's analysis and iterate.
Make iteration predicate and return more generic.
2018-10-28 14:05:50 -07:00
Stefan Weil
a873278c2a
Fix some recommendations from LGTM (#309)
...
* Fix unreachable code
This fixes an issue reported by LGTM.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* Remove unused imports
This fixes several recommendations from LGTM.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-28 13:59:58 -07:00
James R. Barlow
e6d64be890
pdfinfo: formatting
2018-10-27 23:22:44 -07:00