3712 Commits

Author SHA1 Message Date
James R. Barlow
f6e90a5934
hOCR renderer is now default 2023-12-02 19:58:00 -08:00
James R. Barlow
43618e6b3f
Move canvas API to pikepdf and import it 2023-12-02 19:42:35 -08:00
James R. Barlow
e97f89de3b
Refactor font so glyphless isn't as hard coded 2023-12-02 08:55:01 -08:00
James R. Barlow
11d3e32f1e
Fix hocrtransform CLI 2023-12-02 08:08:29 -08:00
James R. Barlow
aacaba3d26
Ignore pypy for now 2023-11-21 01:05:23 -08:00
James R. Barlow
fec53be841
Remove next major release deprecations 2023-11-21 00:47:51 -08:00
James R. Barlow
3f7b540f76
Drop Python 3.9 support 2023-11-21 00:46:00 -08:00
James R. Barlow
d217856166
Make hocrdebug work, and try to handle CJK spacing better 2023-11-21 00:33:02 -08:00
James R. Barlow
e2be457e9b
Avoid divzero 2023-11-20 23:08:00 -08:00
James R. Barlow
4850f486d2
Make text API more like an accessor 2023-11-20 22:59:50 -08:00
James R. Barlow
729c7febd9
Fix placement of spaces in debug mode 2023-11-20 22:44:12 -08:00
James R. Barlow
6c6aca2f1e
Refactor save_state 2023-11-20 22:29:21 -08:00
James R. Barlow
c69823f496
Refactor; accumulate content stream as bytes rather than discrete pikepdf objects 2023-11-20 22:11:59 -08:00
James R. Barlow
73f8f6aac8
Add RTL output - seems to work, but debug does not 2023-11-20 20:28:07 -08:00
James R. Barlow
d944254e45
hocr: typing cont'd 2023-11-20 17:07:52 -08:00
James R. Barlow
f7ddffe554
hocr: typing 2023-11-20 16:52:55 -08:00
James R. Barlow
8a73ed5d5a
Fix JBIG2 not updating progress bar 2023-11-20 16:25:30 -08:00
James R. Barlow
03669183d7
Rationalize canvas interface 2023-11-20 15:54:13 -08:00
James R. Barlow
74e101a2fa
Improve canvas interface with chaining 2023-11-20 14:42:48 -08:00
James R. Barlow
532cf18ad3
Restructure hocrtransform submodule to avoid having everything in __init__ 2023-11-20 00:57:58 -08:00
James R. Barlow
0b90b697e2
More tidying 2023-11-20 00:43:43 -08:00
James R. Barlow
6be7c5f7c8
Fix colors and space box rendering 2023-11-20 00:30:54 -08:00
James R. Barlow
db2e5132e6
Remove some obsolete parameters 2023-11-20 00:10:55 -08:00
James R. Barlow
b14f6f778a
Tidying new hOCR renderer 2023-11-19 23:51:27 -08:00
James R. Barlow
415de77457
imageops: fix annots since not using singledispatch anymore 2023-11-19 23:51:27 -08:00
James R. Barlow
a9466c4f58
Improve word box positioning 2023-11-19 23:51:27 -08:00
James R. Barlow
d9ae453a63
Significant improvement overall 2023-11-19 23:51:27 -08:00
James R. Barlow
9841e09233
More adjustments 2023-11-19 23:51:27 -08:00
James R. Barlow
0ca314e066
Replace Rect with pikepdf.Rectangle, migrate line matrix to page 2023-11-19 23:51:27 -08:00
James R. Barlow
d7680cae27
Correcting Matrix logic helps
The good: don't have to do inverse and intermediate transforms.

The bad: skew looks bad, partly because the hOCR coordinate system is inconsistent around skew?
2023-11-19 23:51:27 -08:00
James R. Barlow
491b6bdb1f
Remove concept of HOCR_OK_LANGS 2023-11-19 23:51:27 -08:00
James R. Barlow
c591f9601a
Remove Latin hOCR test 2023-11-19 23:51:27 -08:00
James R. Barlow
8d1e75017e
Remove reportlab backend and make reportlab a test-only dependency 2023-11-19 23:51:27 -08:00
James R. Barlow
94615f7ad4
hOCR now works for all languages 2023-11-19 23:51:27 -08:00
James R. Barlow
e5df8e1315
Nearly pixel perfect 2023-11-19 23:51:27 -08:00
James R. Barlow
d739b91aef
Tidy up 2023-11-19 23:51:27 -08:00
James R. Barlow
686cfb2539
Add rendering of space between boxes 2023-11-19 23:51:27 -08:00
James R. Barlow
2633716bb7
Render interword spaces separately and avoid box overlap 2023-11-19 23:51:27 -08:00
James R. Barlow
0a07c0a44e
Fix more things 2023-11-19 23:51:27 -08:00
James R. Barlow
2ca6e110ca
Fix private accessors, rename pdf to canvas 2023-11-19 23:51:27 -08:00
James R. Barlow
334a07c839
Refactor debug printing 2023-11-19 23:51:27 -08:00
James R. Barlow
a57c39358d
Refactor: extract methods 2023-11-19 23:51:27 -08:00
James R. Barlow
30a0c315fb
Further exploratory improvements 2023-11-19 23:51:27 -08:00
James R. Barlow
b860f0d94c
Make coordinate system more consistent 2023-11-19 23:51:27 -08:00
James R. Barlow
14f4c19f5a
WIP improve text positioning (not there yet) 2023-11-19 23:51:27 -08:00
James R. Barlow
7ab5c55d46
More colors 2023-11-19 23:51:27 -08:00
James R. Barlow
8b6ecd5971
Fix line and rect drawing 2023-11-19 23:51:27 -08:00
James R. Barlow
7b0871ae4c
Fix position errors; ignore non-glyphless font 2023-11-19 23:51:27 -08:00
James R. Barlow
b73af7ce10
Fix dashes 2023-11-19 23:51:26 -08:00
James R. Barlow
60645717e2
Test pikepdf canvas - renders... something at this point 2023-11-19 23:51:26 -08:00