James R. Barlow
e5e011021b
Remove the OCRMYPDF_program environment variables
...
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:09:08 -07:00
James R. Barlow
11d74dea09
Remove the OCRMYPDF_program environment variables
...
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:07:02 -07:00
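A minimal sketch of why PATH already covers this use case, assuming the external programs are looked up by name. The helper `find_program` and its `extra_dirs` parameter are hypothetical and not taken from the project; only `shutil.which` is a real standard-library call.

```python
import os
import shutil


def find_program(name, extra_dirs=()):
    """Locate an executable by name, the way a shell would via PATH.

    extra_dirs is a hypothetical convenience: directories that are searched
    before the regular PATH entries, much like prepending them to PATH.
    """
    search_path = os.pathsep.join(
        list(extra_dirs) + [os.environ.get("PATH", os.defpath)]
    )
    # shutil.which returns the full path of the first match, or None.
    return shutil.which(name, path=search_path)
```

Pointing at a specific build of a tool then only requires putting its directory first on PATH before launching, with no per-program variable.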
James R. Barlow
cbdf9c88c5
Update requirements
2018-03-24 14:03:34 -07:00
James R. Barlow
46601b1350
setup: skip 1.12.4.1 since it does not provide wheels
2018-03-24 02:59:58 -07:00
James R. Barlow
6f1a40b2ca
v6.0.0 notes, build machinery changes
2018-03-24 02:52:56 -07:00
James R. Barlow
a2b1f54eb2
Update documentation license info
2018-03-24 02:33:24 -07:00
James R. Barlow
6756016572
Add license notice to all files
...
Source files to GPL3
Exceptions:
- tests/spoof/* to MIT
- hocrtransform.py
- _unicodefun.py
Test resources to CC BY-SA 4.0 except when otherwise noted.
Add GPL license.
2018-03-24 02:33:24 -07:00
James R. Barlow
f42123afc3
pipeline: make removal of merge_qpdf more explicit
2018-03-24 02:30:05 -07:00
James R. Barlow
1425ffd274
pipeline: Merge branch 'feature/mumerge' into test
...
Replaces qpdf page merging
2018-03-24 02:26:01 -07:00
James R. Barlow
d700154e0e
Fix regressions after --skip-text improvements
2018-03-24 02:24:45 -07:00
James R. Barlow
efecf42566
Add PyMuPDF and use to detect text on pages
2018-03-24 02:16:53 -07:00
James R. Barlow
74bdfc07fb
mumerge: fix regressions
2018-03-24 01:18:22 -07:00
James R. Barlow
376dfdba1c
Fix text/image files not closed in combine_layers
2018-03-23 13:48:37 -07:00
James R. Barlow
3795d6720f
Try out pymupdf merging
...
With garbage collection it reduces waste on the worst case file.
That's nice. 1 MB -> 105 MB -> 1.5 MB.
Indicates the real problem is using PyPDF2 to watermark.
Currently hacked into --output-type pdf.
2018-03-23 13:39:16 -07:00
James R. Barlow
537aaf56d7
Remove duplication between page merge functions
2018-03-23 13:38:53 -07:00
James R. Barlow
34d51b5d3d
Merge branch 'feature/faster-split'
2018-03-23 13:10:53 -07:00
James R. Barlow
dea8fcfb5b
Optimize page splitting by multiprocessing
...
Previously page splitting occurred in a single process because it was
not believed to affect performance much. It turned out to be an expensive
operation.
It now scales better with large page counts, although this has a negative
effect on small files.
Overall time changes as follows:
7 page file, 9.02s -> 9.56s
731 page file, 213s -> 97s
WITH --tesseract-timeout 0 --output-type pdf --skip-text
i.e. you don't get a 2.2x speed gain when OCR is available.
Squashed a commit to fix test suite failure on --rotate-pages
Squashed a commit to remove debug code
2018-03-23 13:07:51 -07:00
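A rough sketch of the general technique described in this commit, not the project's actual code: page indices are divided into contiguous chunks and each chunk is handed to a worker process. The names `page_chunks`, `split_pages` and `split_chunk` are hypothetical. Process startup cost is one plausible reason a 7 page file gets slightly slower while a 731 page file gets much faster.

```python
from multiprocessing import Pool


def page_chunks(num_pages, num_workers):
    """Divide 0-based page indices into contiguous ranges, one per worker.

    Never creates more chunks than there are pages.
    """
    num_workers = max(1, min(num_workers, num_pages))
    base, extra = divmod(num_pages, num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        # The first `extra` chunks take one additional page each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def split_pages(num_pages, num_workers, split_chunk):
    """Run split_chunk(page_range) for every chunk in a pool of processes.

    split_chunk is a hypothetical module-level function that writes out the
    pages in the given range; it must be picklable to cross process borders.
    """
    with Pool(processes=num_workers) as pool:
        return pool.map(split_chunk, page_chunks(num_pages, num_workers))
```

For example, `page_chunks(7, 4)` yields the ranges 0-1, 2-3, 4-5 and 6, so each of four workers gets at most two pages.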
James R. Barlow
4f1f3b9b51
Move available_cpu_count to helpers
2018-03-23 13:07:51 -07:00
James R. Barlow
dfeb8812ad
Document some instances of 0 vs 1-based page numbering, import cleanup
2018-03-23 13:07:31 -07:00
James R. Barlow
63e2b4273a
Travis: avoid using set -e since it interferes with Travis
...
https://github.com/travis-ci/docs-travis-ci-com/issues/1672
2018-03-23 12:49:12 -07:00
James R. Barlow
5790dbc085
Merge commit '9e2105e08d5fc765dbf636d108809bb66ab562a5'
2018-03-20 18:15:57 -07:00
jbarlow83
9e2105e08d
Update readme shields
...
Drop Docker Hub for now, add homebrew
2018-03-20 17:16:28 -07:00
James R. Barlow
22582bbd1c
Travis: don't trigger Docker Hub anymore
...
Docker Cloud is set up to build on pushes to master and tagged releases.
Hopefully that will work out.
2018-03-19 21:09:31 -07:00
James R. Barlow
e5f27b7a12
Solve text detection issue with PyMuPDF
2018-03-15 22:29:56 -07:00
James R. Barlow
e88ec9822b
Tweak release notes
v5.7.0
2018-03-15 17:09:43 -07:00
James R. Barlow
5ffd2f5c96
Not ending Py3.5 support just yet
2018-03-15 17:06:11 -07:00
James R. Barlow
11fdb4c5d8
Update release notes for v5.7.0
2018-03-15 17:06:04 -07:00
James R. Barlow
319aff6d09
Merge better-hocr
2018-03-15 16:59:59 -07:00
James R. Barlow
a614fa3400
hocr: simplify some math expressions and add comments
2018-03-14 17:05:40 -07:00
endolith
8d691391ac
Fix typos in advanced.rst (#228)
2018-03-14 15:54:55 -04:00
James R. Barlow
0089a84c94
hocr: Make interword spaces default and non-optional for hocr
...
Update documentation to match.
2018-03-13 14:51:47 -07:00
James R. Barlow
90676e1c6a
hocr: Remove baseline dashes
2018-03-13 14:45:31 -07:00
James R. Barlow
062901be43
Some cleanup and variable renaming
2018-03-13 14:29:22 -07:00
James R. Barlow
b195d79b50
Refactoring
2018-03-13 11:04:34 -07:00
James R. Barlow
6d7ee98721
Force Tesseract 4 to be single threaded
...
Gives better performance (throughput basis) than the existing solution
and scales better on powerful boxes.
2018-03-13 08:54:52 -07:00
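The log does not say how single threading is enforced. One way to do it, assuming a Tesseract 4 build that uses OpenMP, is to set the OMP_THREAD_LIMIT environment variable for the child process, so that parallelism comes from running several Tesseract processes side by side. The helper names below are hypothetical.

```python
import os
import subprocess


def single_threaded_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread.

    Assumption: the Tesseract build honours OMP_THREAD_LIMIT.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_tesseract(args):
    """Run the tesseract executable found on PATH with one OpenMP thread."""
    return subprocess.run(
        ["tesseract", *args], env=single_threaded_env(), check=True
    )
```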
James R. Barlow
fc0800ed5d
v5.6.3 notes
v5.6.3
2018-03-12 03:41:12 -07:00
James R. Barlow
f4e3a0e5b2
v5.6.2 notes
2018-03-09 15:37:08 -08:00
James R. Barlow
d631c80024
Suppress debug message when merging large files
2018-03-09 11:10:45 -08:00
James R. Barlow
f1f0033875
Suppress spurious debug message in --output-type pdf
2018-03-09 11:07:28 -08:00
James R. Barlow
84d120e850
v5.6.1 notes
v5.6.1
2018-03-09 08:00:42 -08:00
James R. Barlow
8159cc6b88
Skip one test that fails for qpdf 8.0.[0,1], due to qpdf regression
2018-03-09 07:57:22 -08:00
James R. Barlow
995f8c106b
hocr: account for baseline offset to position text more accurately
2018-03-09 07:45:41 -08:00
Jim Barlow
7cc104b138
hocr: account for skewed baseline
2018-03-05 11:16:40 -05:00
James R. Barlow
b3a7299a62
hocr: refactor/improve PEP8 a bit
2018-03-05 10:47:36 -05:00
James R. Barlow
0e7a4deaec
hocr: add baseline function, hocr doc link
2018-03-05 10:47:36 -05:00
James R. Barlow
b4d66650bd
hocr: adjust text cursor with relative moves
2018-03-05 10:47:36 -05:00
James R. Barlow
4986afca28
hocr: Refactor use of text object
...
We don't need to declare the font on each word.
No improvement for removing trailing space or adding \n
2018-03-05 10:47:36 -05:00
James R. Barlow
2b6004a82b
hocr: Make words on line use the line height
...
Seems to improve the behavior and appearance
of selected text a fair bit.
2018-03-05 10:47:36 -05:00
James R. Barlow
04c54a7c31
Suppress spurious debug message in --output-type pdf
2018-03-05 10:47:36 -05:00
James R. Barlow
7ae6c5ae87
Trial merge interword-spaces
2018-03-02 23:47:06 -08:00