2676 Commits

Author SHA1 Message Date
James R. Barlow
e5e011021b Remove the OCRMYPDF_program environment variables
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:09:08 -07:00
James R. Barlow
11d74dea09 Remove the OCRMYPDF_program environment variables
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:07:02 -07:00
James R. Barlow
cbdf9c88c5 Update requirements 2018-03-24 14:03:34 -07:00
James R. Barlow
46601b1350 setup: skip 1.12.4.1 since it does not provide wheels 2018-03-24 02:59:58 -07:00
James R. Barlow
6f1a40b2ca v6.0.0 notes, build machinery changes 2018-03-24 02:52:56 -07:00
James R. Barlow
a2b1f54eb2 Update documentation license info 2018-03-24 02:33:24 -07:00
James R. Barlow
6756016572 Add license notice to all files
Source files to GPL3

Exceptions:
-tests/spoof/* to MIT
-hocrtransform.py
-_unicodefun.py

Test resources to CC BY-SA 4.0 except when otherwise noted.

Add GPL license.
2018-03-24 02:33:24 -07:00
James R. Barlow
f42123afc3 pipeline: make removal of merge_qpdf more explicit 2018-03-24 02:30:05 -07:00
James R. Barlow
1425ffd274 pipeline: Merge branch 'feature/mumerge' into test
Replaces qpdf page merging
2018-03-24 02:26:01 -07:00
James R. Barlow
d700154e0e Fix regressions after --skip-text improvements 2018-03-24 02:24:45 -07:00
James R. Barlow
efecf42566 Add PyMuPDF and use to detect text on pages 2018-03-24 02:16:53 -07:00
James R. Barlow
74bdfc07fb mumerge: fix regressions 2018-03-24 01:18:22 -07:00
James R. Barlow
376dfdba1c Fix text/image files not closed in combine_layers 2018-03-23 13:48:37 -07:00
James R. Barlow
3795d6720f Try out pymupdf merging
With garbage collection it reduces waste on the worst case file.
That's nice. 1 MB -> 105 MB -> 1.5 MB.

Indicates the real problem is using PyPDF2 to watermark.

Currently hacked into --output-type pdf.
2018-03-23 13:39:16 -07:00
James R. Barlow
537aaf56d7 Remove duplication between page merge functions 2018-03-23 13:38:53 -07:00
James R. Barlow
34d51b5d3d Merge branch 'feature/faster-split' 2018-03-23 13:10:53 -07:00
James R. Barlow
dea8fcfb5b Optimize page splitting by multiprocessing
Previously page splitting occurred in a single process because it was
not believed to affect performance much. It turned out to be an expensive
operation.

It now scales better with large page counts, although this has a negative
effect on small files.

Overall time changes as follows:

7 page file, 9.02s -> 9.56s
731 page file, 213s -> 97s

WITH --tesseract-timeout 0 --output-type pdf --skip-text

i.e. you don't get a 2.2x speed gain when OCR is available.

Squashed a commit to fix test suite failure on --rotate-pages
Squashed a commit to remove debug code
2018-03-23 13:07:51 -07:00
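Illustration for dea8fcfb5b: a minimal sketch of splitting pages across worker processes, not the code from this commit. The names `chunk_pages`, `split_range` and `split_parallel`, the output file names, and the contiguous-range strategy are all assumptions; the commit message only says that splitting moved from a single process to multiprocessing.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def chunk_pages(n_pages, n_workers):
    """Divide 0-based page indices into at most n_workers contiguous ranges."""
    n_workers = max(1, min(n_workers, n_pages))
    base, extra = divmod(n_pages, n_workers)
    ranges = []
    start = 0
    for i in range(n_workers):
        # The first `extra` workers take one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def split_range(page_range):
    """Hypothetical worker: stands in for writing one PDF per page.

    Returns the (made-up) output file names, numbered from 1.
    """
    return [f"page-{n + 1:06d}.pdf" for n in page_range]


def split_parallel(n_pages, n_workers=None):
    """Run split_range over all page ranges in separate processes."""
    n_workers = n_workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(split_range, chunk_pages(n_pages, n_workers))
        return [name for chunk in results for name in chunk]
```

Starting worker processes has a fixed cost, which is consistent with the slight slowdown reported for the 7 page file (9.02s -> 9.56s), while the 731 page file improves by 213 / 97, or about 2.2x.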
James R. Barlow
4f1f3b9b51 Move available_cpu_count to helpers 2018-03-23 13:07:51 -07:00
James R. Barlow
dfeb8812ad Document some instances of 0 vs 1-based page numbering, import cleanup 2018-03-23 13:07:31 -07:00
James R. Barlow
63e2b4273a Travis: avoid using set -e since it interferes with Travis
https://github.com/travis-ci/docs-travis-ci-com/issues/1672
2018-03-23 12:49:12 -07:00
James R. Barlow
5790dbc085 Merge commit '9e2105e08d5fc765dbf636d108809bb66ab562a5' 2018-03-20 18:15:57 -07:00
jbarlow83
9e2105e08d Update readme shields
Drop Docker Hub for now, add homebrew
2018-03-20 17:16:28 -07:00
James R. Barlow
22582bbd1c Travis: don't trigger Docker Hub anymore
Docker Cloud is set up to build on pushes to master and tagged releases.
Hopefully that will work out.
2018-03-19 21:09:31 -07:00
James R. Barlow
e5f27b7a12 Solve text detection issue with PyMuPDF 2018-03-15 22:29:56 -07:00
James R. Barlow
e88ec9822b Tweak release notes v5.7.0 2018-03-15 17:09:43 -07:00
James R. Barlow
5ffd2f5c96 Not ending Py3.5 support just yet 2018-03-15 17:06:11 -07:00
James R. Barlow
11fdb4c5d8 Update release notes for v5.7.0 2018-03-15 17:06:04 -07:00
James R. Barlow
319aff6d09 Merge better-hocr 2018-03-15 16:59:59 -07:00
James R. Barlow
a614fa3400 hocr: simplify some math expressions and add comments 2018-03-14 17:05:40 -07:00
endolith
8d691391ac Fix typos in advanced.rst (#228) 2018-03-14 15:54:55 -04:00
James R. Barlow
0089a84c94 hocr: Make interword spaces default and non-optional for hocr
Update documentation to match.
2018-03-13 14:51:47 -07:00
James R. Barlow
90676e1c6a hocr: Remove baseline dashes 2018-03-13 14:45:31 -07:00
James R. Barlow
062901be43 Some cleanup and variable renaming 2018-03-13 14:29:22 -07:00
James R. Barlow
b195d79b50 Refactoring 2018-03-13 11:04:34 -07:00
James R. Barlow
6d7ee98721 Force Tesseract 4 to be single threaded
Gives better performance (throughput basis) than the existing solution
and scales better on powerful boxes.
2018-03-13 08:54:52 -07:00
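Illustration for 6d7ee98721: a minimal sketch of one way to keep a Tesseract 4 process single threaded, assuming Tesseract was built with OpenMP so that the standard `OMP_THREAD_LIMIT` environment variable applies. The commit message does not say which mechanism was used, and the helper names `tesseract_env` and `run_tesseract` are hypothetical.

```python
import os
import subprocess


def tesseract_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the limit is applied through OpenMP's OMP_THREAD_LIMIT.
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_tesseract(args):
    """Run the tesseract binary found on PATH with the single-thread limit."""
    return subprocess.run(["tesseract", *args], env=tesseract_env(), check=True)
```

The idea is that when one Tesseract process already runs per page in parallel, letting each process start its own threads oversubscribes the CPUs, so limiting each to one thread can improve overall throughput.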
James R. Barlow
fc0800ed5d v5.6.3 notes v5.6.3 2018-03-12 03:41:12 -07:00
James R. Barlow
f4e3a0e5b2 v5.6.2 notes 2018-03-09 15:37:08 -08:00
James R. Barlow
d631c80024 Suppress debug message when merging large files 2018-03-09 11:10:45 -08:00
James R. Barlow
f1f0033875 Suppress spurious debug message in --output-type pdf 2018-03-09 11:07:28 -08:00
James R. Barlow
84d120e850 v5.6.1 notes v5.6.1 2018-03-09 08:00:42 -08:00
James R. Barlow
8159cc6b88 Skip one test that fails for qpdf 8.0.[0,1], due to qpdf regression 2018-03-09 07:57:22 -08:00
James R. Barlow
995f8c106b hocr: account for baseline offset to position text more accurately 2018-03-09 07:45:41 -08:00
Jim Barlow
7cc104b138 hocr: account for skewed baseline 2018-03-05 11:16:40 -05:00
James R. Barlow
b3a7299a62 hocr: refactor/improve PEP8 a bit 2018-03-05 10:47:36 -05:00
James R. Barlow
0e7a4deaec hocr: add baseline function, hocr doc link 2018-03-05 10:47:36 -05:00
James R. Barlow
b4d66650bd hocr: adjust text cursor with relative moves 2018-03-05 10:47:36 -05:00
James R. Barlow
4986afca28 hocr: Refactor use of text object
We don't need to declare the font on each word.

No improvement for removing trailing space or adding \n
2018-03-05 10:47:36 -05:00
James R. Barlow
2b6004a82b hocr: Make words on line use the line height
Seems to improve the behavior and appearance
of selected text a fair bit.
2018-03-05 10:47:36 -05:00
James R. Barlow
04c54a7c31 Suppress spurious debug message in --output-type pdf 2018-03-05 10:47:36 -05:00
James R. Barlow
7ae6c5ae87 Trial merge interword-spaces 2018-03-02 23:47:06 -08:00