2676 Commits

Author SHA1 Message Date
James R. Barlow
2d15c09cca Merge branch 'develop' 2016-02-06 18:18:49 -08:00
James R. Barlow
04cb8865b0 Fetch application from PyPI instead of local
setuptools_scm barfs because it can't find the version, because Docker hub
retrieves the application from Github in a way that omits the necessary
details.

I suppose there is a certain logic to Docker only using the tagged
released versions from PyPI, so go with it.  The other attractive option
is to nix setuptools_scm.
2016-02-06 18:18:30 -08:00
James R. Barlow
6fe32bbaf7 v3.2.1 v3.2.1 2016-02-05 16:10:18 -08:00
James R. Barlow
4abb20390d Bump Dockerfile versions 2016-02-05 16:08:26 -08:00
James R. Barlow
daa3916430 Fix img2pdf 0.2 usage
All tests pass when forced to rely on img2pdf, so seems okay
2016-02-05 15:13:26 -08:00
James R. Barlow
e9b87cefcc Try img2pdf 0.2 2016-02-05 14:38:37 -08:00
James R. Barlow
60593b5ad3 Tighten up package requirements to deal with incompatible img2pdf 0.2 release 2016-02-05 14:37:05 -08:00
James R. Barlow
f708b11ea4 Fix Python 2.7 warning 2016-02-05 02:34:49 -08:00
James R. Barlow
7982f58b2e Try tweaking Dockerfile for automated build again v3.2.post2 2016-02-05 01:38:59 -08:00
James R. Barlow
e805c1908a Minor fix for Dockerfile polyglot v3.2.post1 2016-02-05 00:52:27 -08:00
James R. Barlow
cb3ba8e973 Merge branch 'release/v3.2' into develop 2016-02-05 00:10:41 -08:00
James R. Barlow
344fc40cbc Merge branch 'release/v3.2' v3.2 2016-02-05 00:10:41 -08:00
James R. Barlow
7e5c37137b Merge branch 'develop' into release/v3.2 2016-02-04 23:42:06 -08:00
James R. Barlow
1aae11714b Update release notes for v3.2 2016-02-04 23:41:33 -08:00
James R. Barlow
d82f14a7aa Update .gitignore 2016-02-04 18:51:41 -08:00
James R. Barlow
4b65e0b093 Set JPEG output quality to 95 for better transcoding 2016-02-04 18:49:09 -08:00
James R. Barlow
43b0faa830 Bug in tesseract_noop spoof: produced wrong page sizes
Now checks input image to ensure the implied page size of its .hocr file
matches the rest of the PDF.
2016-02-04 18:48:22 -08:00
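
The page-size check described in the commit above can be sketched roughly as follows. This is an illustration only, not the project's code: the names `page_size_points` and `sizes_match` and the tolerance value are assumptions. It relies only on the PDF convention of 72 points per inch.

```python
# Hypothetical sketch: derive the page size implied by an image and compare
# it with the size of an existing PDF page. Not taken from the project.

POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def page_size_points(width_px, height_px, dpi):
    """Return the (width, height) in PDF points implied by a raster image."""
    return (
        width_px / dpi * POINTS_PER_INCH,
        height_px / dpi * POINTS_PER_INCH,
    )


def sizes_match(size_a, size_b, tolerance=0.5):
    """True if two (width, height) pairs agree within `tolerance` points."""
    return all(abs(a - b) <= tolerance for a, b in zip(size_a, size_b))


# A 2550x3300 pixel scan at 300 DPI implies a US Letter page.
letter_scan = page_size_points(2550, 3300, 300)  # (612.0, 792.0)
```

A caller could then reject or rescale a text layer when `sizes_match(letter_scan, pdf_page_size)` is false, where `pdf_page_size` is whatever the rest of the PDF reports.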
James R. Barlow
8674c9fb20 Merge commit 'ccfbb54e8c26784e438ba2fcac2179f21e7d857b' into release/v3.2 2016-02-04 17:39:36 -08:00
jbarlow83
ccfbb54e8c Update release notes for v3.2
Fix the notes
2016-02-04 17:37:30 -08:00
James R. Barlow
9893ebf889 Suppress tesseract argument printout 2016-02-04 17:26:36 -08:00
James R. Barlow
303eb3e93a Merge commit 'ca546d70e5bff9e9b115371f7813f3c326822bd8' into release/v3.2 2016-02-04 17:25:56 -08:00
jbarlow83
ca546d70e5 Merge pull request #45 from spwhitton/hocrtransform-shebang-fix
fix shebang in hocrtransform.py
2016-02-04 17:21:33 -08:00
Sean Whitton
6a5ea2d64a fix shebang in hocrtransform.py 2016-02-03 17:48:35 -07:00
James R. Barlow
ec3d92ad8e Reorg gitignore 2016-01-30 15:28:24 -08:00
James R. Barlow
66a095d7de Improve organization of CFFI setup 2016-01-30 15:19:40 -08:00
James R. Barlow
411981efbc Experiment with CFFI instead of ctypes 2016-01-30 15:06:25 -08:00
James R. Barlow
350ad5210e Leptonica: convert to CFFI 2016-01-20 15:03:07 -08:00
James R. Barlow
f3b588764e Suppress tesseract argument printout 2016-01-20 15:02:48 -08:00
James R. Barlow
b49f5a7d77 Support optionally using leptonica to deskew
unpaper doesn't seem to be good at deskewing. It fails on a test case
with a lot of italics. I think it also struggles on pages with a lot
of whitespace. Leptonica continues to shine here.

However, this is only a first crack at Leptonica. The leptonica module
should be redone to use cffi (more extensible).

Also considering the possibility of making all Lept calls in a forked
process to insulate the calling process from C code crashes and the
messy redirect of stdout/stderr to read Leptonica's errors.

I don't think the redirect is a huge problem as long as multiple processes
rather than multiple threads are used. The ruffus child process that is
handling a page is single threaded and will not be affected by the
redirection. It just feels dirty. The main reason to consider a child
process is crash isolation.
2016-01-19 17:43:40 -08:00
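
The crash-isolation idea discussed in the commit above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not how the project does it: the helper `run_isolated` is hypothetical, and a Python snippet run in a child interpreter stands in for the Leptonica C calls. The child's stdout and stderr are captured through pipes, so the parent never has to redirect its own streams.

```python
# Hypothetical sketch of running risky native code in a child process.
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run a Python snippet in a child interpreter.

    Returns (ok, stdout, stderr). A crash in the child (for example a
    segfault inside a C library) shows up as ok == False instead of
    taking down the calling process.
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,  # collect the child's stdout/stderr via pipes
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0, proc.stdout, proc.stderr


# Normal case: the child's output comes back as a string.
ok, out, err = run_isolated("print('deskewed')")

# Simulated failure: the child writes an error and exits abnormally,
# while the parent keeps running and can inspect what was written.
failed_ok, failed_out, failed_err = run_isolated(
    "import os, sys; sys.stderr.write('lept error'); sys.stderr.flush(); os._exit(3)"
)
```

The trade-off is the cost of starting a process per call, which is why the commit message treats it as something to consider rather than a settled design.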
James R. Barlow
bacbcba58a Merge branch 'release/v3.2-rc1' v3.2rc1 2016-01-19 16:58:37 -08:00
James R. Barlow
52e8aa434f Update release notes for v3.2-rc1 2016-01-19 16:49:49 -08:00
James R. Barlow
37c508f3f8 Better versioning: no silly version files, but wrong ver in development
Small price to pay.
2016-01-19 16:07:52 -08:00
James R. Barlow
26e36422cc More fiddling with version 2016-01-19 15:07:21 -08:00
James R. Barlow
f82cb002bc Try automatic versioning with setuptools_scm 2016-01-19 13:27:18 -08:00
James R. Barlow
c1eb047a4b Fix name of pdfa_def.ps
Used to include a copy of the parent dir's name.
2016-01-19 13:11:03 -08:00
James R. Barlow
626ca18f5c Remove stale comment 2016-01-19 13:02:35 -08:00
James R. Barlow
9058dedfbe New tests for ccitt, jbig2 encodings 2016-01-19 13:01:56 -08:00
James R. Barlow
a0952bfca3 Optimize: use img2pdf stream instead of repeated copies 2016-01-18 20:24:46 -08:00
James R. Barlow
354e61946e Use os.makedirs for test output directories
Broke Travis
2016-01-16 02:47:56 -08:00
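
For reference, a small sketch of creating a nested test output directory with `os.makedirs`. The helper name `ensure_output_dir` is an assumption for illustration, and `exist_ok=True` requires Python 3.2 or later, so it may not match what this commit actually did.

```python
# Hypothetical helper: create an output directory, including any missing
# parents, without failing if it already exists.
import os
import tempfile


def ensure_output_dir(path):
    """Create `path` and its parents if needed; return the path."""
    os.makedirs(path, exist_ok=True)
    return path


# Example: a nested directory under a fresh temporary root.
root = tempfile.mkdtemp()
output_dir = ensure_output_dir(os.path.join(root, "output", "test_case"))
# Calling it again is harmless because of exist_ok=True.
ensure_output_dir(output_dir)
```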
James R. Barlow
fd6d1d748a Merge branch 'feature/pypdf-page-merge' into develop 2016-01-16 02:33:23 -08:00
James R. Barlow
360acd1e2c Adjust test_oversample test case
Add -f to force generation of the background image at the desired
oversample resolution.  Our new behavior is to only send the oversampled
image to Tesseract while leaving the main page intact unless asked to
deskew, clean, etc.
2016-01-15 15:55:23 -08:00
James R. Barlow
fc0479f110 Fix all but test_oversample[hocr] 2016-01-15 15:46:47 -08:00
James R. Barlow
62728205b6 Implement image+text merging in other cases
5 failed, 28 passed

failures:
test_oversample[hocr], test_skip_ocr, test_skip_big, test_maximum_options[hocr],
test_blank_input_pdf
2016-01-15 15:38:08 -08:00
James R. Barlow
dc0fb25e64 Render hocr page: no longer needs an image as input 2016-01-15 15:16:47 -08:00
James R. Barlow
f3e04cce56 Update pipeline.svg 2016-01-15 14:56:16 -08:00
James R. Barlow
7067110308 Add safety check to prevent merge from running when not sensible 2016-01-15 14:54:45 -08:00
James R. Barlow
599d889703 Implement "perfect reconstruction" - transfer page and watermark OCR layer
Works, does not account for changes to clean/deskew, etc.
Surprisingly, it works. PyPDF2 fixes since last attempt?
2016-01-15 14:39:12 -08:00
James R. Barlow
2fa8366632 Merge branch 'feature/test-pageinfo-cleanup' into develop 2016-01-15 14:18:01 -08:00
James R. Barlow
c368c51bad New hocrtransform test 2016-01-15 14:14:08 -08:00
James R. Barlow
7c558b3713 Move pageinfo test into tests folder 2016-01-11 17:40:44 -08:00