James R. Barlow
f6e90a5934
hOCR renderer is now default
2023-12-02 19:58:00 -08:00
James R. Barlow
43618e6b3f
Move canvas API to pikepdf and import it
2023-12-02 19:42:35 -08:00
James R. Barlow
e97f89de3b
Refactor font so glyphless isn't as hard coded
2023-12-02 08:55:01 -08:00
James R. Barlow
11d3e32f1e
Fix hocrtransform CLI
2023-12-02 08:08:29 -08:00
James R. Barlow
aacaba3d26
Ignore pypy for now
2023-11-21 01:05:23 -08:00
James R. Barlow
fec53be841
Remove next major release deprecations
2023-11-21 00:47:51 -08:00
James R. Barlow
3f7b540f76
Drop Python 3.9 support
2023-11-21 00:46:00 -08:00
James R. Barlow
d217856166
Make hocrdebug work, and try to handle CJK spacing better
2023-11-21 00:33:02 -08:00
James R. Barlow
e2be457e9b
Avoid divzero
2023-11-20 23:08:00 -08:00
James R. Barlow
4850f486d2
Make text API more like an accessor
2023-11-20 22:59:50 -08:00
James R. Barlow
729c7febd9
Fix placement of spaces in debug mode
2023-11-20 22:44:12 -08:00
James R. Barlow
6c6aca2f1e
Refactor save_state
2023-11-20 22:29:21 -08:00
James R. Barlow
c69823f496
Refactor; accumulate content stream as bytes rather than discrete pikepdf objects
2023-11-20 22:11:59 -08:00
James R. Barlow
73f8f6aac8
Add RTL output - seems to work, but debug does not
2023-11-20 20:28:07 -08:00
James R. Barlow
d944254e45
hocr: typing cont'd
2023-11-20 17:07:52 -08:00
James R. Barlow
f7ddffe554
hocr: typing
2023-11-20 16:52:55 -08:00
James R. Barlow
8a73ed5d5a
Fix JBIG2 not updating progress bar
2023-11-20 16:25:30 -08:00
James R. Barlow
03669183d7
Rationalize canvas interface
2023-11-20 15:54:13 -08:00
James R. Barlow
74e101a2fa
Improve canvas interface with chaining
2023-11-20 14:42:48 -08:00
James R. Barlow
532cf18ad3
Restructure hocrtransform submodule to avoid having everything in __init__
2023-11-20 00:57:58 -08:00
James R. Barlow
0b90b697e2
More tidying
2023-11-20 00:43:43 -08:00
James R. Barlow
6be7c5f7c8
Fix colors and space box rendering
2023-11-20 00:30:54 -08:00
James R. Barlow
db2e5132e6
Remove some obsolete parameters
2023-11-20 00:10:55 -08:00
James R. Barlow
b14f6f778a
Tidying new hOCR renderer
2023-11-19 23:51:27 -08:00
James R. Barlow
415de77457
imageops: fix annots since not using singledispatch anymore
2023-11-19 23:51:27 -08:00
James R. Barlow
a9466c4f58
Improve word box positioning
2023-11-19 23:51:27 -08:00
James R. Barlow
d9ae453a63
Significant improvement overall
2023-11-19 23:51:27 -08:00
James R. Barlow
9841e09233
More adjustments
2023-11-19 23:51:27 -08:00
James R. Barlow
0ca314e066
Replace Rect with pikepdf.Rectangle, migrate line matrix to page
2023-11-19 23:51:27 -08:00
James R. Barlow
d7680cae27
Correcting Matrix logic helps
The good: don't have to do inverse and intermediate transforms.
The bad: skew looks bad, partly because the hOCR coordinate system is inconsistent around skew?
2023-11-19 23:51:27 -08:00
James R. Barlow
491b6bdb1f
Remove concept of HOCR_OK_LANGS
2023-11-19 23:51:27 -08:00
James R. Barlow
c591f9601a
Remove Latin hOCR test
2023-11-19 23:51:27 -08:00
James R. Barlow
8d1e75017e
Remove reportlab backend and make reportlab a test-only dependency
2023-11-19 23:51:27 -08:00
James R. Barlow
94615f7ad4
hOCR now works for all languages
2023-11-19 23:51:27 -08:00
James R. Barlow
e5df8e1315
Nearly pixel perfect
2023-11-19 23:51:27 -08:00
James R. Barlow
d739b91aef
Tidy up
2023-11-19 23:51:27 -08:00
James R. Barlow
686cfb2539
Add rendering of space between boxes
2023-11-19 23:51:27 -08:00
James R. Barlow
2633716bb7
Render interword spaces separately and avoid box overlap
2023-11-19 23:51:27 -08:00
James R. Barlow
0a07c0a44e
Fix more things
2023-11-19 23:51:27 -08:00
James R. Barlow
2ca6e110ca
Fix private accessors, rename pdf to canvas
2023-11-19 23:51:27 -08:00
James R. Barlow
334a07c839
Refactor debug printing
2023-11-19 23:51:27 -08:00
James R. Barlow
a57c39358d
Refactor: extract methods
2023-11-19 23:51:27 -08:00
James R. Barlow
30a0c315fb
Further exploratory improvements
2023-11-19 23:51:27 -08:00
James R. Barlow
b860f0d94c
Make coordinate system more consistent
2023-11-19 23:51:27 -08:00
James R. Barlow
14f4c19f5a
WIP improve text positioning (not there yet)
2023-11-19 23:51:27 -08:00
James R. Barlow
7ab5c55d46
More colors
2023-11-19 23:51:27 -08:00
James R. Barlow
8b6ecd5971
Fix line and rect drawing
2023-11-19 23:51:27 -08:00
James R. Barlow
7b0871ae4c
Fix position errors; ignore non-glyphless font
2023-11-19 23:51:27 -08:00
James R. Barlow
b73af7ce10
Fix dashes
2023-11-19 23:51:26 -08:00
James R. Barlow
60645717e2
Test pikepdf canvas - renders... something at this point
2023-11-19 23:51:26 -08:00