2895 Commits

Author SHA1 Message Date
James R. Barlow
16d24f1166 Bump version to -rc4 v3.0-rc4 2015-08-05 23:26:38 -07:00
James R. Barlow
97015ef775 Add a test case to check on the @argumentsfile syntax 2015-08-05 23:17:38 -07:00
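A test for the @argumentsfile syntax could look roughly like the sketch below. It assumes the command line is parsed with Python's argparse and its `fromfile_prefix_chars` feature; the option names and the `parse_with_argfile` and `demo` helpers are hypothetical and stand in for the project's real parser.

```python
import argparse
import os
import tempfile


def parse_with_argfile(argv):
    # Hypothetical parser: the real option set lives in the project itself.
    # fromfile_prefix_chars="@" makes argparse expand "@file" into the
    # arguments listed in that file, one argument per line.
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("--language", default="eng")
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    return parser.parse_args(argv)


def demo():
    # Write a throwaway arguments file and parse it through the "@" syntax.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "args.txt")
        with open(path, "w") as f:
            f.write("--language\ndeu\nin.pdf\nout.pdf\n")
        return parse_with_argfile(["@" + path])
```

Calling `demo()` returns a namespace whose fields come entirely from the file rather than from the command line.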
James R. Barlow
2744dafb74 New test case: ensure metadata is preserved from input to output 2015-08-05 17:09:38 -07:00
James R. Barlow
7b268dbe1a Remove duplication in test case 2015-08-05 16:57:04 -07:00
James R. Barlow
8fcbbcef94 Improve usage text 2015-08-05 16:56:53 -07:00
James R. Barlow
8f93f0a06e Tidy docs 2015-08-05 16:56:30 -07:00
James R. Barlow
387142488c Kill duplicate file 2015-07-31 01:57:16 -07:00
James R. Barlow
6887e232fc Bug fix: exception from process timeout should be TimeoutExpired 2015-07-31 00:06:58 -07:00
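For reference, a minimal sketch of how a process timeout surfaces in Python's standard library: `subprocess.run` raises `subprocess.TimeoutExpired` when the child exceeds its time limit. The `run_with_timeout` helper is hypothetical and is not taken from the project's code.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    # Hypothetical helper: report whether the child finished in time.
    try:
        subprocess.run(args, timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising this exception.
        return "timeout"
    return "ok"


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout=0.5))
```

Catching a different exception type here would let the timeout propagate as an unhandled error, which is the kind of bug the commit message describes.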
James R. Barlow
6ac7ffd77b Merge branch 'feature/drop-mupdf-poppler' into develop 2015-07-30 23:38:27 -07:00
James R. Barlow
b28faa582a Automatically use all available cores unless told not to 2015-07-30 23:20:21 -07:00
James R. Barlow
454ee029c8 Run final ghostscript in multithreaded mode
This step is serialized, so most cores are otherwise idle at this stage.
2015-07-30 23:20:04 -07:00
James R. Barlow
a036de318e Replace mupdf and poppler with qpdf
Drop two dependencies and replace them with one that does the job of
both.  Smells like progress.

mupdf does PDF file repair and rendering
poppler does rendering and page splitting
qpdf does PDF file repair and page splitting
ghostscript does PDF file repair, rendering, and page splitting (sort of)

So we use qpdf.  Ghostscript's page splitting is supposedly less
efficient because it reprints the page (PDF -> Postscript -> PDF) and
possibly loses quality.  qpdf's library could be used to improve
performance.

This causes a slight performance regression:

py.test tests/test_main.py::test_maximum_options went from 187 seconds
up to 192.  This is likely due to O(n) serialized invocations of qpdf
compared to a single serialized call to pdfseparate.  Could improve on
this situation by using the example code in qpdf: pdf-split-pages.cc
or create marker files in split_pages() and then write a new @transform
function that would split pages on each CPU.  Probably not worth it,
overall, unless this causes problems on files with hundreds of pages.
2015-07-30 04:16:35 -07:00
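The O(n) behaviour described above can be illustrated with a small sketch that builds one qpdf command line per page, using qpdf's `--pages` option. The function name and the output file pattern are assumptions made for illustration; the commands are only constructed here, not executed.

```python
def qpdf_split_commands(input_pdf, page_count, output_pattern="{:06d}.page.pdf"):
    # Hypothetical helper: one qpdf invocation per page, hence O(n) processes.
    commands = []
    for page in range(1, page_count + 1):
        commands.append([
            "qpdf", input_pdf,
            "--pages", input_pdf, str(page), "--",
            output_pattern.format(page),
        ])
    return commands


if __name__ == "__main__":
    for cmd in qpdf_split_commands("input.pdf", 3):
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` in turn, or handed to separate workers if the serialized calls ever became a bottleneck.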
James R. Barlow
9918c4020e Use img2pdf in test case because it does a better job 2015-07-30 03:35:56 -07:00
jbarlow83
3d6264e1b8 Fix formatting of 'motivation' 2015-07-28 17:58:26 -07:00
jbarlow83
1c25270503 Improve instructions for users that need sudo or venv 2015-07-28 17:55:56 -07:00
James R. Barlow
47e50f82c4 setup.py: allow mutool 1.7 2015-07-28 13:37:32 -07:00
James R. Barlow
27ecdfbba8 More fixes to error cases in setup.py 2015-07-28 13:05:23 -07:00
James R. Barlow
6901550065 Fix some installer issues 2015-07-28 12:41:24 -07:00
jbarlow83
6e6f918630 Actually link the release notes 2015-07-28 12:21:57 -07:00
jbarlow83
4633812246 Fix git clone command with one I tested ;) 2015-07-28 12:20:09 -07:00
jbarlow83
14bd1555aa Update README with more detailed instructions 2015-07-28 12:15:37 -07:00
James R. Barlow
b9d7687fa0 Fixes: clarify install instructions and reactivate external program checks v3.0-rc2 2015-07-28 05:44:15 -07:00
James R. Barlow
93b36965e2 Merge branch 'develop'
# Conflicts:
#	RELEASE_NOTES.md
#	src/config.sh
#	src/hocrTransform.py
#	src/ocrPage.sh
2015-07-28 04:59:49 -07:00
James R. Barlow
9e0c443c2f -rc2: because pypi won't accept -rc1 2015-07-28 04:55:10 -07:00
James R. Barlow
60832152b1 Don't mess with options 2015-07-28 04:46:21 -07:00
James R. Barlow
6a160d22fe Update release notes, add copyrights 2015-07-28 04:36:58 -07:00
James R. Barlow
e35526192c More test cases 2015-07-28 03:02:35 -07:00
James R. Barlow
bea57bdded More test cases for other parameters 2015-07-28 02:31:18 -07:00
James R. Barlow
2a9da225e4 Minor tweaks to uncommon arguments 2015-07-28 02:25:50 -07:00
James R. Barlow
a3f37de9b5 Test cases for --tesseract-timeout 2015-07-28 01:47:30 -07:00
James R. Barlow
6064160953 Get rid of subprocess call on import of tesseract, unpaper -- bit nasty 2015-07-28 01:00:29 -07:00
James R. Barlow
8508141314 Drop nose, all tests working reasonably again
Although the real issue was that the ruffus pipeline cannot be executed
twice in the same process due to its reliance on global variables.

The new OO pipeline in ruffus 2.6 would be one resolution that would
allow for more comprehensive testing as opposed to farming out the
execution to subprocess and inspecting the results, as is currently
done.
2015-07-28 00:43:22 -07:00
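The approach of farming execution out to a subprocess and inspecting the results can be sketched as follows. The `run_in_fresh_process` helper is hypothetical; it runs a snippet in a new interpreter so that module-level global state cannot leak from one test run into the next.

```python
import subprocess
import sys


def run_in_fresh_process(code, *args):
    # Hypothetical helper: every call gets a brand-new interpreter, so a
    # pipeline that relies on global variables starts from a clean state.
    proc = subprocess.run(
        [sys.executable, "-c", code, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    rc, out, err = run_in_fresh_process("import sys; print(sys.argv[1])", "hello")
    print(rc, out.strip())
```

The trade-off is that a test can only observe the exit code, output streams, and files written, not the program's internal objects.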
James R. Barlow
1c95597882 nose can't really handle external tests so looking into py.test instead
Specifically it trips over the need to reimport ocrmypdf.main.  That in
turn raises questions about whether to make that function into an
external script that imports ocrmypdf... or something else.  Would be
possible with a loop that manipulates sys.argv and then reloads
ocrmypdf.main; might need that anyway.
2015-07-27 22:07:04 -07:00
James R. Barlow
587fa63c8e --oversample: Default to 0 2015-07-27 20:42:16 -07:00
James R. Barlow
b40eec4cb0 Add --oversample test for hocr rendering 2015-07-27 17:18:02 -07:00
James R. Barlow
7bcd48c269 Add test to confirm that metadata is transferred to final PDF/A 2015-07-27 16:11:51 -07:00
James R. Barlow
2e7cd52c0f Improve argument handling, test cases 2015-07-27 15:39:54 -07:00
James R. Barlow
77d4cb367e Put ghostscript in a module 2015-07-27 15:22:00 -07:00
James R. Barlow
2c45c5abc6 Implement tesseract timeout 2015-07-27 04:23:37 -07:00
James R. Barlow
a89afabd79 Implement tesseract PDF rendering as an alternative
It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so it's progress.  Not available for
Tesseract 3.02 obviously, so both modes need to remain available.
2015-07-27 04:20:49 -07:00
James R. Barlow
03f7c9bf07 setup.py: Only do program checks when installing 2015-07-27 02:14:51 -07:00
James R. Barlow
d5f4862749 setup.py: check for third party program requirements 2015-07-27 01:45:17 -07:00
James R. Barlow
8aced0b6d3 More testing: JPEG 2015-07-27 00:25:43 -07:00
James R. Barlow
6b9adef684 Don't create inline images in output PDFs
...except that Ghostscript will sometimes turn out of line images into
inline images on its own, possibly if file size is small.
2015-07-26 21:43:49 -07:00
James R. Barlow
5440d988fc Make this PDF a whole image page
Originally it had a smaller image centred in a page, which is not quite
supported.
2015-07-26 18:32:50 -07:00
James R. Barlow
30da4fc569 pageinfo: drop pdftotext and use PyPDF instead 2015-07-26 18:23:37 -07:00
James R. Barlow
2c1b5e100b Test cases for pageinfo; complain about inline images 2015-07-26 18:18:41 -07:00
James R. Barlow
3684f278ed Add some pageinfo test cases; found problem with inline images 2015-07-26 15:24:42 -07:00
James R. Barlow
6c3cb6acba Remove redundant *res_render 2015-07-26 12:56:10 -07:00
James R. Barlow
b98ba8d174 Replace .md with .rst
Github supports both, and PyPI expects .rst files, so use .rst and make
everyone happy.

Auto-converted using pandoc
find . -name '*.md' | parallel pandoc --from=markdown --to=rst --output='{.}.rst' '{}'
http://bfroehle.com/2013/04/26/converting-md-to-rst/
2015-07-26 03:01:18 -07:00