2676 Commits

Author SHA1 Message Date
James R. Barlow
eacd26a68b Mention v6.2.5 release 2018-11-10 01:10:45 -08:00
James R. Barlow
0e88b3c38a Update v7.3.0 release notes 2018-11-10 01:09:19 -08:00
James R. Barlow
a2170ef8d6 test: test version check code 2018-11-10 00:56:22 -08:00
James R. Barlow
eed0424390 Update requirements 2018-11-10 00:56:04 -08:00
James R. Barlow
5ed05e08b1 Fix "no languages" test and misuse of os.environ 2018-11-09 01:57:11 -08:00
James R. Barlow
58b26f6715 Leptonica: learn to despeckle 1bpp images 2018-11-07 01:49:13 -08:00
James R. Barlow
806daf4284 leptonica: reduce boilerplate for PIX (2/2) 2018-11-06 20:33:40 -08:00
James R. Barlow
c64bc9329e leptonica: reduce boilerplate for wrapper classes (except PIX) 2018-11-06 20:12:09 -08:00
James R. Barlow
dd01745519 Leptonica: add masked threshold fn 2018-11-06 19:31:06 -08:00
James R. Barlow
501ce726e7 Fix two failing tests 2018-11-06 11:16:08 -08:00
James R. Barlow
03076e89ce Leptonica: reduce verbosity, more error trapping, more garbage collection 2018-11-06 11:10:59 -08:00
James R. Barlow
02f37293ee Integrate barcode masking 2018-11-05 13:01:13 -08:00
James R. Barlow
590942ad14 Leptonica: Add barcode API 2018-11-05 01:48:38 -08:00
James R. Barlow
2ac028c759 test: Add a basic redo OCR test 2018-11-04 15:54:41 -08:00
James R. Barlow
2125b5bfab Remove text detection from our parser interpret_contents
It's redundant now
2018-11-04 15:47:55 -08:00
James R. Barlow
b96532caa4 Only do detailed page analysis when needed by --redo-ocr 2018-11-04 15:40:49 -08:00
James R. Barlow
995fc58466 Move Ghostscript text analysis into its own module 2018-11-04 14:55:48 -08:00
James R. Barlow
c023cae299 Make pdfminer Type3 patch conditional on PScript5.dll
It appears that PDFs created by this software have a bug in their BBox
which will cause us to misjudge the space occupied by the font.

Other programs probably work around this by ignoring BBox and reading
each character procedure.
2018-11-04 01:53:53 -07:00
James R. Barlow
237eaf9130 Exception message not printed in some cases
Closes #310
2018-11-03 17:10:24 -07:00
James R. Barlow
8b9ab25125 coverage: test compile leptonica 2018-11-02 01:55:25 -07:00
James R. Barlow
77e87abe8f coverage: ensure get_orientation is checked 2018-11-02 01:32:20 -07:00
James R. Barlow
3be02e1e8d coverage: improve leptonica; don't create objects with null pointers 2018-11-02 01:10:10 -07:00
James R. Barlow
64c9ede979 leptonica: barcodes, BOXA 2018-11-02 00:42:01 -07:00
James R. Barlow
5b8d197812 coverage: make it more likely timeout is tested 2018-11-02 00:41:15 -07:00
James R. Barlow
2cba62dc4f coverage: ensure rotation is actually tested 2018-11-02 00:40:56 -07:00
James R. Barlow
288e28328f coverage: add qpdf 2018-11-02 00:37:33 -07:00
James R. Barlow
b8214b3c49 coverage: exclude unicodefun.py 2018-11-02 00:33:08 -07:00
James R. Barlow
8681693994 Set up code coverage (it works with multiprocessing now!) 2018-11-02 00:31:50 -07:00
James R. Barlow
1364c63b7c Fix failure to pickle file with AcroForm 2018-11-01 20:07:53 -07:00
James R. Barlow
4ba9e8fe25 Add AcroForm detection 2018-10-30 22:28:44 -07:00
James R. Barlow
a195713bb4 Throw exception on corrupt text 2018-10-30 16:35:09 -07:00
James R. Barlow
600d31a907 Require pikepdf 0.3.7 2018-10-30 16:22:05 -07:00
James R. Barlow
be31cec332 Add corrupt text warning (when using --redo-ocr) 2018-10-30 16:19:58 -07:00
James R. Barlow
22a7cd3421 Add argument checks for --redo-ocr 2018-10-30 16:19:13 -07:00
James R. Barlow
8b61d2d521 pdfminer: If font descent claims to be positive, treat it as negative 2018-10-30 14:40:53 -07:00
James R. Barlow
559e5269d2 Ensure inline image is parsed correctly
Requires pikepdf > 0.3.6
2018-10-29 23:30:53 -07:00
James R. Barlow
ebf6acb318 pdfminer patch: Type3 font height calculation is incorrect
Not sure where it goes wrong or why it needs special treatment, but
this does address it.
2018-10-29 22:27:25 -07:00
James R. Barlow
7acd75f013 pipeline: fix bbox coordinates 2018-10-29 22:26:37 -07:00
James R. Barlow
93623b2226 Refactor TextboxInfo 2018-10-29 14:46:40 -07:00
James R. Barlow
d71fd089cb layout: allow names beginning with /i0123 for now
Showed up in GGastro2.pdf. Need to check if this pattern has valid
Unicode mappings but allow for now.
2018-10-29 14:45:59 -07:00
James R. Barlow
05aa43c856 Require pdfminer 2018-10-29 12:45:15 -07:00
James R. Barlow
de80fb6bc8 Fix some failing tests after --redo-ocr changes 2018-10-29 11:49:38 -07:00
James R. Barlow
8e396f4be2 Document --redo-ocr more accurately 2018-10-29 02:03:58 -07:00
James R. Barlow
efec6da377 Fix error on serializing bad character markers
(Since they held a reference to their font, which in turn, had an
open file handle.)
2018-10-29 02:02:00 -07:00
James R. Barlow
00ef53195e Fix corrupt Unicode mapping detection's false positives 2018-10-29 01:30:19 -07:00
James R. Barlow
f564aaf485 Remove only_ocr_text 2018-10-28 22:41:18 -07:00
James R. Barlow
5ac2d31d0d Redo OCR can now handle visible and invisible text, so adjust accordingly
Still can't filter out corrupt text
2018-10-28 14:06:25 -07:00
James R. Barlow
fda890ab47 pdfinfo: further layout improvements
Rather than grouping visible/invisible in a custom analysis step,
use pdfminer's analysis and iterate.
Make iteration predicate and return more generic.
2018-10-28 14:05:50 -07:00
Stefan Weil
a873278c2a Fix some recommendations from LGTM (#309)
* Fix unreachable code

This fixes an issue reported by LGTM.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Remove unused imports

This fixes several recommendations from LGTM.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-28 13:59:58 -07:00
James R. Barlow
e6d64be890 pdfinfo: formatting 2018-10-27 23:22:44 -07:00