James R. Barlow
f4bca89722
Remove Tesseract 4 message
2018-03-25 12:16:31 -07:00
James R. Barlow
9fbc69df3f
v6.0.0 release
v6.0.0
2018-03-25 01:34:26 -07:00
James R. Barlow
230d301268
conftest: py3.5 path issue
2018-03-25 00:52:45 -07:00
James R. Barlow
1ce7b02d94
Travis: don't cache tests/cache anymore, you get it with git
2018-03-25 00:52:19 -07:00
James R. Barlow
a2d00f5f1d
tess cache: fix tess3 error for -psm instead of --psm
2018-03-25 00:43:02 -07:00
James R. Barlow
f68eaa3b46
Fix PyMuPDF version for Travis
2018-03-25 00:36:26 -07:00
James R. Barlow
0199ab220e
Tweak Manifest and .travis once more
...
Travis "do_not_include" moving around no longer needed, thankfully.
Manifest needed LICENSE.
2018-03-25 00:19:45 -07:00
James R. Barlow
656045610a
Update release notes
2018-03-25 00:17:23 -07:00
James R. Barlow
8c1c61f207
test cache: fix Path + str error
2018-03-25 00:02:03 -07:00
James R. Barlow
af085b79dd
Move ocrmypdf to src/ocrmypdf
2018-03-24 23:59:08 -07:00
James R. Barlow
77476965ae
test cache: use .bin extension, fix .gitignore .gitattributes
2018-03-24 23:54:16 -07:00
James R. Barlow
961c1365f9
Update manifest.in
2018-03-24 23:50:58 -07:00
James R. Barlow
ca51514046
Add test cache
2018-03-24 23:50:41 -07:00
James R. Barlow
8975b72a01
Fix test_testonly_pdf generating an output file in pwd
2018-03-24 22:34:35 -07:00
James R. Barlow
874ec6a87f
Add missing fixture to test_unpaper
2018-03-24 22:24:14 -07:00
James R. Barlow
909eaeeead
spoof: Allow tesseract cache to share cache
...
Previous incarnation was only suitable for generating a local cache
where the suite was executed repeatedly. Now the cache ignores
differences, so it can be checked into GitHub and shared.
2018-03-24 22:17:36 -07:00
James R. Barlow
c138161fae
Tests: more cleanup
2018-03-24 15:35:57 -07:00
James R. Barlow
e48590d66c
Refactor out unpaper-specific tests
2018-03-24 15:21:44 -07:00
James R. Barlow
5b1c8541fc
Review some skipped tests to make sure reasons still valid
2018-03-24 15:13:23 -07:00
James R. Barlow
e5e011021b
Remove the OCRMYPDF_program environment variables
...
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:09:08 -07:00
James R. Barlow
11d74dea09
Remove the OCRMYPDF_program environment variables
...
Really, this was just replicating the functionality of the PATH
environment variable, and users probably do that anyway.
2018-03-24 15:07:02 -07:00
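The reasoning in the two commits above (per-program environment variables only duplicate what PATH already does) can be illustrated with a minimal sketch. It assumes nothing about OCRmyPDF's own code: `find_program` is a hypothetical helper, `/opt/custom/bin` is a made-up directory, and the only library call is the standard library's `shutil.which`.

```python
import os
import shutil


def find_program(name, path=None):
    """Return the absolute path of `name` found on the search path, or None.

    Hypothetical helper for illustration, not an OCRmyPDF function.
    `path` overrides the search path; None means use the PATH variable.
    """
    return shutil.which(name, path=path)


if __name__ == "__main__":
    # To select a specific build of a tool, prepend its directory to PATH
    # rather than setting a separate variable for each program.
    # "/opt/custom/bin" is an assumed example directory.
    custom_path = os.pathsep.join(["/opt/custom/bin", os.environ.get("PATH", "")])
    for program in ("tesseract", "gs", "qpdf", "unpaper"):
        print(program, "->", find_program(program, path=custom_path))
```

Running the script prints where each external program would be resolved from, or `None` if it is not installed.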
James R. Barlow
cbdf9c88c5
Update requirements
2018-03-24 14:03:34 -07:00
James R. Barlow
46601b1350
setup: skip 1.12.4.1 since it does not provide wheels
2018-03-24 02:59:58 -07:00
James R. Barlow
6f1a40b2ca
v6.0.0 notes, build machinery changes
2018-03-24 02:52:56 -07:00
James R. Barlow
a2b1f54eb2
Update documentation license info
2018-03-24 02:33:24 -07:00
James R. Barlow
6756016572
Add license notice to all files
...
Source files to GPL3
Exceptions:
-tests/spoof/* to MIT
-hocrtransform.py
-_unicodefun.py
Test resources to CC BY-SA 4.0 except when otherwise noted.
Add GPL license.
2018-03-24 02:33:24 -07:00
James R. Barlow
f42123afc3
pipeline: make removal of merge_qpdf more explicit
2018-03-24 02:30:05 -07:00
James R. Barlow
1425ffd274
pipeline: Merge branch 'feature/mumerge' into test
...
Replaces qpdf page merging
2018-03-24 02:26:01 -07:00
James R. Barlow
d700154e0e
Fix regressions after --skip-text improvements
2018-03-24 02:24:45 -07:00
James R. Barlow
efecf42566
Add PyMuPDF and use to detect text on pages
2018-03-24 02:16:53 -07:00
James R. Barlow
74bdfc07fb
mumerge: fix regressions
2018-03-24 01:18:22 -07:00
James R. Barlow
376dfdba1c
Fix text/image files not closed in combine_layers
2018-03-23 13:48:37 -07:00
James R. Barlow
3795d6720f
Try out pymupdf merging
...
With garbage collection it reduces waste on the worst case file.
That's nice. 1 MB -> 105 MB -> 1.5 MB.
Indicates the real problem is using PyPDF2 to watermark.
Currently hacked into --output-type pdf.
2018-03-23 13:39:16 -07:00
James R. Barlow
537aaf56d7
Remove duplication between page merge functions
2018-03-23 13:38:53 -07:00
James R. Barlow
34d51b5d3d
Merge branch 'feature/faster-split'
2018-03-23 13:10:53 -07:00
James R. Barlow
dea8fcfb5b
Optimize page splitting by multiprocessing
...
Previously page splitting occurred in a single process because it was
not believed to affect performance much. It turned out to be an expensive
operation.
It now scales better with large page counts although this has a negative
effect on small files.
Overall time changes as follows:
7 page file, 9.02s -> 9.56s
731 page file, 213s -> 97s
WITH --tesseract-timeout 0 --output-type pdf --skip-text
i.e. you don't get a 2.2x speed gain when OCR is available.
Squashed a commit to fix test suite failure on --rotate-pages
Squashed a commit to remove debug code
2018-03-23 13:07:51 -07:00
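The idea in the commit above, spreading page splitting over several processes, can be sketched roughly as follows. This is not the code from the commit: `page_chunks`, `split_pages` and `split_document` are hypothetical names, the worker is a placeholder that does no PDF work, and the chunking strategy (contiguous ranges, one per worker) is an assumption. The slowdown on small files is presumably the fixed cost of starting worker processes.

```python
from concurrent.futures import ProcessPoolExecutor


def page_chunks(num_pages, workers):
    """Split 0-based page indices into at most `workers` contiguous ranges."""
    if num_pages <= 0 or workers <= 0:
        return []
    size = -(-num_pages // workers)  # ceiling division
    return [range(start, min(start + size, num_pages))
            for start in range(0, num_pages, size)]


def split_pages(pages):
    """Placeholder worker: a real one would write each page to its own file."""
    return [f"page {index + 1} split" for index in pages]


def split_document(num_pages, workers):
    """Handle each chunk in a separate process; results stay in page order."""
    chunks = page_chunks(num_pages, workers)
    if not chunks:
        return []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(split_pages, chunks)
        return [line for chunk in results for line in chunk]


if __name__ == "__main__":
    # 731 matches the large benchmark file mentioned in the commit message;
    # the worker count of 4 is an arbitrary choice for the example.
    print(len(split_document(731, 4)))
```

Contiguous ranges keep the output order trivial to reconstruct, since `pool.map` returns results in the order the chunks were submitted.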
James R. Barlow
4f1f3b9b51
Move available_cpu_count to helpers
2018-03-23 13:07:51 -07:00
James R. Barlow
dfeb8812ad
Document some instances of 0 vs 1-based page numbering, import cleanup
2018-03-23 13:07:31 -07:00
James R. Barlow
63e2b4273a
Travis: avoid using set -e since it interferes with Travis
...
https://github.com/travis-ci/docs-travis-ci-com/issues/1672
2018-03-23 12:49:12 -07:00
James R. Barlow
5790dbc085
Merge commit '9e2105e08d5fc765dbf636d108809bb66ab562a5'
2018-03-20 18:15:57 -07:00
jbarlow83
9e2105e08d
Update readme shields
...
Drop Docker Hub for now, add Homebrew
2018-03-20 17:16:28 -07:00
James R. Barlow
22582bbd1c
Travis: don't trigger Docker Hub anymore
...
Docker Cloud is set up to build on pushes to master and tagged releases.
Hopefully that will work out.
2018-03-19 21:09:31 -07:00
James R. Barlow
e5f27b7a12
Solve text detection issue with PyMuPDF
2018-03-15 22:29:56 -07:00
James R. Barlow
e88ec9822b
Tweak release notes
v5.7.0
2018-03-15 17:09:43 -07:00
James R. Barlow
5ffd2f5c96
Not ending Py3.5 support just yet
2018-03-15 17:06:11 -07:00
James R. Barlow
11fdb4c5d8
Update release notes for v5.7.0
2018-03-15 17:06:04 -07:00
James R. Barlow
319aff6d09
Merge better-hocr
2018-03-15 16:59:59 -07:00
James R. Barlow
a614fa3400
hocr: simplify some math expressions and add comments
2018-03-14 17:05:40 -07:00
endolith
8d691391ac
Fix typos in advanced.rst (#228)
2018-03-14 15:54:55 -04:00
James R. Barlow
0089a84c94
hocr: Make interword spaces default and non-optional for hocr
...
Update documentation to match.
2018-03-13 14:51:47 -07:00