2676 Commits

Author SHA1 Message Date
jbarlow83
4633812246 Fix git clone command with one I tested ;) 2015-07-28 12:20:09 -07:00
jbarlow83
14bd1555aa Update README with more detailed instructions 2015-07-28 12:15:37 -07:00
James R. Barlow
b9d7687fa0 Fixes: clarify install instructions and reactivate external program checks v3.0-rc2 2015-07-28 05:44:15 -07:00
James R. Barlow
93b36965e2 Merge branch 'develop'
# Conflicts:
#	RELEASE_NOTES.md
#	src/config.sh
#	src/hocrTransform.py
#	src/ocrPage.sh
2015-07-28 04:59:49 -07:00
James R. Barlow
9e0c443c2f -rc2: because pypi won't accept -rc1 2015-07-28 04:55:10 -07:00
James R. Barlow
60832152b1 Don't mess with options 2015-07-28 04:46:21 -07:00
James R. Barlow
6a160d22fe Update release notes, add copyrights 2015-07-28 04:36:58 -07:00
James R. Barlow
e35526192c More test cases 2015-07-28 03:02:35 -07:00
James R. Barlow
bea57bdded More test cases for other parameters 2015-07-28 02:31:18 -07:00
James R. Barlow
2a9da225e4 Minor tweaks to uncommon arguments 2015-07-28 02:25:50 -07:00
James R. Barlow
a3f37de9b5 Test cases for --tesseract-timeout 2015-07-28 01:47:30 -07:00
James R. Barlow
6064160953 Get rid of subprocess call on import of tesseract, unpaper -- bit nasty 2015-07-28 01:00:29 -07:00
James R. Barlow
8508141314 Drop nose, all tests working reasonably again
Although the real issue was that the ruffus pipeline cannot be executed
twice in the same process due to its reliance on global variables.

The new OO pipeline in ruffus 2.6 would be one resolution that would
allow for more comprehensive testing as opposed to farming out the
execution to subprocess and inspecting the results, as is currently
done.
2015-07-28 00:43:22 -07:00
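
The subprocess-based testing approach described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's actual test harness: `run_cli` is a hypothetical helper name, and a child Python interpreter stands in for the real command-line program.

```python
import subprocess
import sys


def run_cli(args, timeout=60):
    """Run a command in a fresh process and return (exit code, stdout, stderr).

    Hypothetical helper: each call gets its own process, so module-level
    global state in the program under test cannot leak between test cases.
    """
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in for the real program: a child Python process that prints a marker.
code, out, err = run_cli([sys.executable, "-c", "print('ok')"])
```

A test would then assert on the exit code and on the files or text the process produced, which is the "inspecting the results" step the commit message mentions.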
James R. Barlow
1c95597882 nose can't really handle external tests so looking into py.test instead
Specifically it trips over the need to reimport ocrmypdf.main.  That in
turn raises questions about whether to make that function into an
external script that imports ocrmypdf... or something else.  Would be
possible with a loop that manipulates sys.argv and then reloads
ocrmypdf.main; might need that anyway.
2015-07-27 22:07:04 -07:00
James R. Barlow
587fa63c8e --oversample: Default to 0 2015-07-27 20:42:16 -07:00
James R. Barlow
b40eec4cb0 Add --oversample test for hocr rendering 2015-07-27 17:18:02 -07:00
James R. Barlow
7bcd48c269 Add test to confirm that metadata is transferred to final PDF/A 2015-07-27 16:11:51 -07:00
James R. Barlow
2e7cd52c0f Improve argument handling, test cases 2015-07-27 15:39:54 -07:00
James R. Barlow
77d4cb367e Put ghostscript in a module 2015-07-27 15:22:00 -07:00
James R. Barlow
2c45c5abc6 Implement tesseract timeout 2015-07-27 04:23:37 -07:00
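
One way a timeout around an external OCR process could look, as a hedged sketch using only the Python standard library. The helper name `run_with_timeout` and the choice to return `None` on timeout are assumptions for illustration; the commit itself does not show how the timeout is implemented.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run an external command; return its stdout, or None if it timed out.

    Hypothetical helper. subprocess.run() kills the child process and
    raises TimeoutExpired once `timeout` seconds have elapsed.
    """
    try:
        proc = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout


# A slow child process stands in for a long-running OCR job.
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5
)
```

Returning a sentinel rather than raising lets the caller decide what to do with a page whose OCR did not finish in time, for example passing it through without a text layer.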
James R. Barlow
a89afabd79 Implement tesseract PDF rendering as an alternative
It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so it's progress.  Not available for
Tesseract 3.02 obviously, so both modes need to remain available.
2015-07-27 04:20:49 -07:00
James R. Barlow
03f7c9bf07 setup.py: Only do program checks when installing 2015-07-27 02:14:51 -07:00
James R. Barlow
d5f4862749 setup.py: check for third party program requirements 2015-07-27 01:45:17 -07:00
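
A check for required third-party programs might be sketched like this. `missing_programs` is a hypothetical name, and the program list in the comment is only an example drawn from tools mentioned in this log; the actual checks in setup.py are not shown in the commit message.

```python
import shutil


def missing_programs(programs):
    """Return the names from `programs` that cannot be found on PATH.

    Hypothetical helper: shutil.which() returns None when no executable
    with the given name is found.
    """
    return [name for name in programs if shutil.which(name) is None]


# Example call; the names are illustrative, not the project's exact list:
# missing_programs(["tesseract", "gs", "unpaper"])
```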
James R. Barlow
8aced0b6d3 More testing: JPEG 2015-07-27 00:25:43 -07:00
James R. Barlow
6b9adef684 Don't create inline images in output PDFs
...except that Ghostscript will sometimes turn out of line images into
inline images on its own, possibly if file size is small.
2015-07-26 21:43:49 -07:00
James R. Barlow
5440d988fc Make this PDF a whole image page
Originally it had a smaller image centred in a page, which is not quite
supported.
2015-07-26 18:32:50 -07:00
James R. Barlow
30da4fc569 pageinfo: drop pdftotext and use PyPDF instead 2015-07-26 18:23:37 -07:00
James R. Barlow
2c1b5e100b Test cases for pageinfo; complain about inline images 2015-07-26 18:18:41 -07:00
James R. Barlow
3684f278ed Add some pageinfo test cases; found problem with inline images 2015-07-26 15:24:42 -07:00
James R. Barlow
6c3cb6acba Remove redundant *res_render 2015-07-26 12:56:10 -07:00
James R. Barlow
b98ba8d174 Replace .md with .rst
Github supports both, and PyPI expects .rst files, so use .rst and make
everyone happy.

Auto-converted using pandoc
find . -name '*.md' | parallel pandoc --from=markdown --to=rst --output='{.}.rst' '{}'
http://bfroehle.com/2013/04/26/converting-md-to-rst/
2015-07-26 03:01:18 -07:00
James R. Barlow
d3088829af More packaging changes: move jhove, fix console script 2015-07-26 01:52:08 -07:00
James R. Barlow
9aaaba1714 Packaging stuff 2015-07-25 23:45:13 -07:00
Jim Barlow
9adb0d696f Prepare for Python packaging - move to ocrmypdf folder 2015-07-25 18:22:04 -07:00
Jim Barlow
c270f1ba5f Update release notes so far 2015-07-25 18:18:37 -07:00
Jim Barlow
7b255b575a Metadata override from command line 2015-07-25 18:12:25 -07:00
Jim Barlow
d7a9f3a2ab Transfer Unicode document information from input PDF to output PDF
What a pain getting Unicode right, but there it is.

I cannot find anything to confirm that it is acceptable to put the PDF/A
definition file at the end of the Ghostscript inputs.  I did this because
Ghostscript seems to copy document info from the last document on the
list so reportlab's information "wins" in normal order, so it fixes that
issue, and reportlab 'helpfully' fills in all of those fields even if it
does not have information.

It could also work to pass document information along to reportlab, and
set it in each output PDF: .debug.pdf, .rendered.pdf, and .page.pdf to
ensure that whatever page is last in the pipeline has the right
information. Or perhaps it's possible to write a Postscript trailer that
overwrites any previous docinfo with no side effects, but I can't find
any information on how to do that.  I don't think it's worth pursuing
unless this arrangement causes some problem with PDF/A generation.

On a minor note, Jhove misreads the way I have encoded the strings in
producing its validation log.  It reads them as UTF-16 little endian, so
will tend to produce a string of Asian characters in place of the real
data.
2015-07-25 18:05:25 -07:00
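
The byte-order problem noted in the commit above can be illustrated with a short sketch. It assumes the document information strings are written as PDF text strings in UTF-16BE with a leading byte order mark, which is the standard Unicode form for PDF text strings; the function names are hypothetical and this is not the project's code.

```python
import codecs


def encode_pdf_text_string(text):
    """Encode text as a Unicode PDF text string: BOM (FE FF) + UTF-16BE."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def decode_pdf_text_string(data):
    """Decode a PDF text string that begins with the UTF-16BE byte order mark."""
    if not data.startswith(codecs.BOM_UTF16_BE):
        raise ValueError("not a BOM-prefixed UTF-16BE text string")
    return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")


encoded = encode_pdf_text_string("Title")

# A reader that wrongly assumes little-endian sees each ASCII letter's
# bytes swapped, which lands in the CJK range instead of Latin letters.
misread = encoded[2:].decode("utf-16-le")
```

Here the byte pair `00 54` for "T" is interpreted as U+5400 when read little-endian, which matches the "string of Asian characters" symptom described in the commit message.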
Jim Barlow
abf2e7e9bb Copy document metadata from source document into output (untested)
This works for ASCII only; will do Unicode version.
2015-07-25 15:31:02 -07:00
Jim Barlow
72e5fa9ba0 Reimplement debug pages 2015-07-25 14:14:02 -07:00
Jim Barlow
32c1078d2c Reimplement skip text pages 2015-07-25 14:13:32 -07:00
Jim Barlow
133f901a69 Change @subdivide to @split
@split is for "1 to many" operations, so it's the right tool for this
case.
2015-07-25 02:58:34 -07:00
Jim Barlow
42cd683ec0 Try to make pdfinfo less obnoxious by not printing too many decimals 2015-07-25 02:47:59 -07:00
Jim Barlow
151eb05377 For now, unpaper is the only deskew provider 2015-07-25 01:46:16 -07:00
Jim Barlow
16177d0a52 Remove ability to override temporary (working) folder
Little point to this feature - on most platforms the environment
variable can be overridden if desired to set a new root location.

At the same time, this change removes the ability to resume a partially
executed pipeline by deleting all of the results on failure.  If -k is
provided then the temporary files will survive but there's no way to
resume from them.  Because resuming doesn't really work anyway and would
only be useful to users experiencing very specific problems, this is
probably not worth it, so no major loss.  The intent of -k is to assist
debugging.
2015-07-25 01:45:26 -07:00
Jim Barlow
5ce544289f Automatically try to use all available CPUs 2015-07-25 01:10:14 -07:00
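
Defaulting the number of worker jobs to the available CPU count could look roughly like this. `default_jobs` and its `requested` parameter are hypothetical names used for illustration; the commit does not show how the job count is actually chosen.

```python
import os


def default_jobs(requested=None):
    """Return the requested job count, or the CPU count if none was given.

    Hypothetical helper: os.cpu_count() may return None when the count
    cannot be determined, so fall back to a single job in that case.
    """
    if requested is not None and requested > 0:
        return requested
    return os.cpu_count() or 1
```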
Jim Barlow
77bd35c3c7 Remove duplicate test folder 2015-07-25 01:00:40 -07:00
Jim Barlow
0c5c208db0 Goodbye, so long, farewell, shell... 2015-07-25 00:57:07 -07:00
Jim Barlow
60eb745331 Split selecting final image and render PDF result into separate tasks
Simplifies the logic - one deals with all images, the other deals
with an image and .hocr. Also add JPEG reconversion.
2015-07-25 00:54:00 -07:00
Jim Barlow
9f90b5cb0a Modularize unpaper; get -d and -c working again 2015-07-25 00:22:56 -07:00
Jim Barlow
5adff94545 Remove more dead/old code 2015-07-24 15:41:24 -07:00