2895 Commits

Author SHA1 Message Date
James R. Barlow
f4bca89722 Remove Tesseract 4 message 2018-03-25 12:16:31 -07:00
James R. Barlow
9fbc69df3f v6.0.0 release v6.0.0 2018-03-25 01:34:26 -07:00
James R. Barlow
230d301268 conftest: py3.5 path issue 2018-03-25 00:52:45 -07:00
James R. Barlow
1ce7b02d94 Travis: don't cache tests/cache anymore, you get it with git 2018-03-25 00:52:19 -07:00
James R. Barlow
a2d00f5f1d tess cache: fix tess3 error for -psm instead of --psm 2018-03-25 00:43:02 -07:00
James R. Barlow
f68eaa3b46 Fix PyMuPDF version for Travis 2018-03-25 00:36:26 -07:00
James R. Barlow
0199ab220e Tweak Manifest and .travis once more
Travis "do_not_include" moving around no longer needed, thankfully.
Manifest needed LICENSE.
2018-03-25 00:19:45 -07:00
James R. Barlow
656045610a Update release notes 2018-03-25 00:17:23 -07:00
James R. Barlow
8c1c61f207 test cache: fix Path + str error 2018-03-25 00:02:03 -07:00
James R. Barlow
af085b79dd Move ocrmypdf to src/ocrmypdf 2018-03-24 23:59:08 -07:00
James R. Barlow
77476965ae test cache: use .bin extension, fix .gitignore .gitattributes 2018-03-24 23:54:16 -07:00
James R. Barlow
961c1365f9 Update manifest.in 2018-03-24 23:50:58 -07:00
James R. Barlow
ca51514046 Add test cache 2018-03-24 23:50:41 -07:00
James R. Barlow
8975b72a01 Fix test_testonly_pdf generating an output file in pwd 2018-03-24 22:34:35 -07:00
James R. Barlow
874ec6a87f Add missing fixture to test_unpaper 2018-03-24 22:24:14 -07:00
James R. Barlow
909eaeeead spoof: Allow tesseract cache to share cache
Previous incarnation was only suitable for generating a local cache
where the suite was executed repeatedly. Now the cache ignores
differences, so it can be checked into GitHub and shared.
2018-03-24 22:17:36 -07:00
James R. Barlow
c138161fae Tests: more cleanup 2018-03-24 15:35:57 -07:00
James R. Barlow
e48590d66c Refactor out unpaper-specific tests 2018-03-24 15:21:44 -07:00
James R. Barlow
5b1c8541fc Review some skipped tests to make sure reasons still valid 2018-03-24 15:13:23 -07:00
James R. Barlow
e5e011021b Remove the OCRMYPDF_program environment variables
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:09:08 -07:00
James R. Barlow
11d74dea09 Remove the OCRMYPDF_program environment variables
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:07:02 -07:00
James R. Barlow
cbdf9c88c5 Update requirements 2018-03-24 14:03:34 -07:00
James R. Barlow
46601b1350 setup: skip 1.12.4.1 since it does not provide wheels 2018-03-24 02:59:58 -07:00
James R. Barlow
6f1a40b2ca v6.0.0 notes, build machinery changes 2018-03-24 02:52:56 -07:00
James R. Barlow
a2b1f54eb2 Update documentation license info 2018-03-24 02:33:24 -07:00
James R. Barlow
6756016572 Add license notice to all files
Source files to GPL3

Exceptions:
-tests/spoof/* to MIT
-hocrtransform.py
-_unicodefun.py

Test resources to CC BY-SA 4.0 except when otherwise noted.

Add GPL license.
2018-03-24 02:33:24 -07:00
James R. Barlow
f42123afc3 pipeline: make removal of merge_qpdf more explicit 2018-03-24 02:30:05 -07:00
James R. Barlow
1425ffd274 pipeline: Merge branch 'feature/mumerge' into test
Replaces qpdf page merging
2018-03-24 02:26:01 -07:00
James R. Barlow
d700154e0e Fix regressions after --skip-text improvements 2018-03-24 02:24:45 -07:00
James R. Barlow
efecf42566 Add PyMuPDF and use to detect text on pages 2018-03-24 02:16:53 -07:00
James R. Barlow
74bdfc07fb mumerge: fix regressions 2018-03-24 01:18:22 -07:00
James R. Barlow
376dfdba1c Fix text/image files not closed in combine_layers 2018-03-23 13:48:37 -07:00
James R. Barlow
3795d6720f Try out pymupdf merging
With garbage collection it reduces waste on the worst case file.
That's nice. 1 MB -> 105 MB -> 1.5 MB.

Indicates the real problem is using PyPDF2 to watermark.

Currently hacked into --output-type pdf.
2018-03-23 13:39:16 -07:00
James R. Barlow
537aaf56d7 Remove duplication between page merge functions 2018-03-23 13:38:53 -07:00
James R. Barlow
34d51b5d3d Merge branch 'feature/faster-split' 2018-03-23 13:10:53 -07:00
James R. Barlow
dea8fcfb5b Optimize page splitting by multiprocessing
Previously page splitting occurred in a single process because it was
not believed to affect performance much. It turned out to be an expensive
operation.

It now scales better with large page counts, although this has a negative
effect on small files.

Overall time changes as follows:

7 page file, 9.02s -> 9.56s
731 page file, 213s -> 97s

WITH --tesseract-timeout 0 --output-type pdf --skip-text

i.e. you don't get a 2.2x speed gain when OCR is available.

Squashed a commit to fix test suite failure on --rotate-pages
Squashed a commit to remove debug code
2018-03-23 13:07:51 -07:00
James R. Barlow
4f1f3b9b51 Move available_cpu_count to helpers 2018-03-23 13:07:51 -07:00
James R. Barlow
dfeb8812ad Document some instances of 0 vs 1-based page numbering, import cleanup 2018-03-23 13:07:31 -07:00
James R. Barlow
63e2b4273a Travis: avoid using set -e since it interferes with Travis
https://github.com/travis-ci/docs-travis-ci-com/issues/1672
2018-03-23 12:49:12 -07:00
James R. Barlow
5790dbc085 Merge commit '9e2105e08d5fc765dbf636d108809bb66ab562a5' 2018-03-20 18:15:57 -07:00
jbarlow83
9e2105e08d Update readme shields
Drop Docker Hub for now, add homebrew
2018-03-20 17:16:28 -07:00
James R. Barlow
22582bbd1c Travis: don't trigger Docker Hub anymore
Docker Cloud is set up to build on pushes to master and tagged releases.
Hopefully that will work out.
2018-03-19 21:09:31 -07:00
James R. Barlow
e5f27b7a12 Solve text detection issue with PyMuPDF 2018-03-15 22:29:56 -07:00
James R. Barlow
e88ec9822b Tweak release notes v5.7.0 2018-03-15 17:09:43 -07:00
James R. Barlow
5ffd2f5c96 Not ending Py3.5 support just yet 2018-03-15 17:06:11 -07:00
James R. Barlow
11fdb4c5d8 Update release notes for v5.7.0 2018-03-15 17:06:04 -07:00
James R. Barlow
319aff6d09 Merge better-hocr 2018-03-15 16:59:59 -07:00
James R. Barlow
a614fa3400 hocr: simplify some math expressions and add comments 2018-03-14 17:05:40 -07:00
endolith
8d691391ac Fix typos in advanced.rst (#228) 2018-03-14 15:54:55 -04:00
James R. Barlow
0089a84c94 hocr: Make interword spaces default and non-optional for hocr
Update documentation to match.
2018-03-13 14:51:47 -07:00