39 Commits

Author SHA1 Message Date
James R. Barlow
acb31abe86 Fix issue #20 - fails on uppercase .PDF 2015-12-04 02:14:09 -08:00
James R. Barlow
281eafada0 bump to v3.0 and move repos 2015-09-05 00:53:14 -07:00
James R. Barlow
c14e10128a Bump version to -rc9 2015-08-29 16:43:22 -07:00
James R. Barlow
c4f134d694 Prevent running validation on missing file after an exception is thrown 2015-08-28 04:48:29 -07:00
James R. Barlow
83f9dfbac4 Use png256 raster device when possible
Someone reported a bug where the .png input to unpaper ended up being
type 'P' (palette) for some reason, which was not supported in unpaper.

Not sure how it happened, but seemed easier to fix by explicitly
supporting. Here we use png256 if it would capture all colors in the
input file. It's up to tesseract/reportlab to make use of the palette
PNG when rendering.
2015-08-28 04:47:57 -07:00
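A rough sketch of the "use png256 if it would capture all colors" decision described above. The function name `choose_png_device` and the pixel-counting approach are assumptions for illustration and are not taken from the commit; only the Ghostscript device names `png256` (8-bit palette) and `png16m` (24-bit RGB) are real.

```python
def choose_png_device(pixels, palette_limit=256):
    """Pick a Ghostscript PNG device name for an iterable of pixel values.

    Hypothetical helper: returns "png256" when every distinct color fits
    in an 8-bit palette, and falls back to "png16m" otherwise.
    """
    seen = set()
    for pixel in pixels:
        seen.add(pixel)
        # Stop early once the palette can no longer hold every color.
        if len(seen) > palette_limit:
            return "png16m"
    return "png256"
```

The early exit keeps the scan cheap for full-color pages, where the limit is usually exceeded within the first few rows.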
James R. Barlow
2ce6834be4 Bump to -rc8 2015-08-24 01:25:01 -07:00
James R. Barlow
b376672dbc Bug fix: exception thrown if input PDF was missing DocumentInfo block 2015-08-24 01:23:30 -07:00
James R. Barlow
aab08bfcc7 Fix requirements.txt problem 2015-08-23 12:30:40 -07:00
James R. Barlow
4f3673d14d Update notes for -rc6 2015-08-22 00:40:07 -07:00
James R. Barlow
cc161780df Replace fileinput with regular open-replace
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ascii encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
2015-08-18 23:27:50 -07:00
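A minimal sketch of the plain open-replace pattern the message above refers to, assuming the file is small enough to read into memory and is UTF-8 encoded. The helper name `replace_in_file` is hypothetical and does not come from the commit.

```python
def replace_in_file(path, old, new, encoding="utf-8"):
    """Replace every occurrence of `old` with `new` in a text file, in place."""
    # Read the whole file with an explicit encoding so non-ASCII text survives.
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    # Write the modified text back to the same path with the same encoding.
    with open(path, "w", encoding=encoding) as f:
        f.write(text.replace(old, new))
```

Passing the encoding explicitly on both the read and the write is the point of the pattern: the result no longer depends on the platform's default encoding.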
James R. Barlow
53c88093ad Bump to -rc5 2015-08-16 02:19:04 -07:00
James R. Barlow
30072e0c70 Pillow sucks
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types.  Reportlab calls for
pillow 2.4.0.  On Ubuntu 14.04 LTS this will trigger an upgrade of
pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution.  Instead, you see nasty
stack traces.

So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
2015-08-16 00:54:03 -07:00
James R. Barlow
adf495e8cc Remove JHOVE
JHOVE is not an effective PDF/A validator, as detailed in this article:
http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/

In short, it's buggy. Out of 670 invalid PDF/A files in a test suite,
it only flagged 5.  It only looks for certain problems that Ghostscript
generated PDFs are unlikely to have.  So use qpdf as a final check for
general ill-formed PDF problems since it is quite reliable.

JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF
support.  I also don't know if it's appropriate to bundle JHOVE, with an
LGPL, under this project and its current license.

Removing a dependency on Java is a huge win.  A world with less Java is
a world with less AbstractFactoryConstructorInterfaces.
2015-08-11 15:31:32 -07:00
James R. Barlow
9247ea00bf Improve ruffus exception handling
ruffus swallows the return code if, in the process of handling an
exception, we hit an error in ruffus' own code, which can happen.  So pick through
its error stack and find out if there's an interesting return code in
there.  Had to use eval() of all things.

Also suppress the stack trace for normal error conditions that don't
need one.
2015-08-11 02:19:46 -07:00
James R. Barlow
1cb5f6a90d Refactor exit codes; test for missing tessdata
Some versions of tesseract installed by homebrew end up without a
functional tessdata folder, and tesseract is not helpful in this
situation, so add a new test to make sure our output is at least
indicative of the problem.

In the process of properly handling return codes I discovered
test_override_metadata triggers an NPE inside JHOVE probably due to the
Unicode character checking.  This could be specific to my JRE (1.6.0_65,
Oracle) but it's probably JHOVE's fault.  A valid PDF/A (per Acrobat)
is still generated.
2015-08-11 00:17:02 -07:00
James R. Barlow
8d848284df Fix code, test case: complain when GS fails to produce PDF/A
Modified pipeline to fix regression and return the proper error code if
we did not produce a PDF/A as expected.  The wrapper forces the output
to be PDF 1.3 which is not PDF/A compliant.

The funny thing is that in some cases JHOVE incorrectly states that a
file is PDF/A-1b compliant, well formed and valid, even when it is not
according to Acrobat XI and is missing the PDF/A metadata marker, as
far as I can tell.  JHOVE may not be as beneficial as hoped.
2015-08-10 16:05:00 -07:00
James R. Barlow
16d24f1166 Bump version to -rc4 2015-08-05 23:26:38 -07:00
James R. Barlow
8fcbbcef94 Improve usage text 2015-08-05 16:56:53 -07:00
James R. Barlow
6887e232fc Bug fix: exception from process timeout should be TimeoutExpired 2015-07-31 00:06:58 -07:00
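For context, Python's standard library signals a process timeout with `subprocess.TimeoutExpired`. The sketch below shows one way to catch it; the wrapper name `run_with_timeout` is an assumption for illustration and is not the project's code.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run a command; return its exit code, or None if it timed out."""
    try:
        completed = subprocess.run(args, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return None
    return completed.returncode


if __name__ == "__main__":
    # A child that sleeps longer than the allowed time is reported as None.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```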
James R. Barlow
6ac7ffd77b Merge branch 'feature/drop-mupdf-poppler' into develop 2015-07-30 23:38:27 -07:00
James R. Barlow
b28faa582a Automatically use all available cores unless told not to 2015-07-30 23:20:21 -07:00
James R. Barlow
a036de318e Replace mupdf and poppler with qpdf
Drop two dependencies and replace them with one that does the job of
both.  Smells like progress.

mupdf does PDF file repair and rendering
poppler does rendering and page splitting
qpdf does PDF file repair and page splitting
ghostscript does PDF file repair, rendering, and page splitting (sort of)

So we use qpdf.  Ghostscript's page splitting is supposedly less
efficient because it reprints the page (PDF -> Postscript -> PDF) and
possibly loses quality.  qpdf's library could be used to improve
performance.

This causes a slight performance regression:

py.test tests/test_main.py::test_maximum_options went from 187 seconds
up to 192.  This is likely due to O(n) serialized invocations of qpdf
compared to a single serialized call to pdfseparate.  Could improve on
this situation by using the example code in qpdf: pdf-split-pages.cc
or create marker files in split_pages() and then write a new @transform
function that would split pages on each CPU.  Probably not worth it,
overall, unless this causes problems on files with hundreds of pages.
2015-07-30 04:16:35 -07:00
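To illustrate the "O(n) serialized invocations" mentioned above, here is a sketch that builds one qpdf command line per page. It only constructs argument lists and does not run qpdf. The function name `qpdf_split_commands` and the output file naming are assumptions; the `--pages ... --` syntax is qpdf's page-selection option.

```python
def qpdf_split_commands(input_pdf, num_pages, output_pattern="{:06d}.page.pdf"):
    """Build one qpdf argument list per page of `input_pdf`.

    Hypothetical helper: each command extracts a single page, so splitting
    an n-page file means n separate qpdf processes.
    """
    return [
        ["qpdf", input_pdf, "--pages", input_pdf, str(page), "--",
         output_pattern.format(page)]
        for page in range(1, num_pages + 1)
    ]
```

Each list could be passed to `subprocess.run`. Running the commands one after another is what makes the cost grow linearly with page count, which matches the small slowdown the message reports.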
James R. Barlow
9e0c443c2f -rc2: because pypi won't accept -rc1 2015-07-28 04:55:10 -07:00
James R. Barlow
60832152b1 Don't mess with options 2015-07-28 04:46:21 -07:00
James R. Barlow
6a160d22fe Update release notes, add copyrights 2015-07-28 04:36:58 -07:00
James R. Barlow
e35526192c More test cases 2015-07-28 03:02:35 -07:00
James R. Barlow
2a9da225e4 Minor tweaks to uncommon arguments 2015-07-28 02:25:50 -07:00
James R. Barlow
a3f37de9b5 Test cases for --tesseract-timeout 2015-07-28 01:47:30 -07:00
James R. Barlow
6064160953 Get rid of subprocess call on import of tesseract, unpaper -- bit nasty 2015-07-28 01:00:29 -07:00
James R. Barlow
587fa63c8e --oversample: Default to 0 2015-07-27 20:42:16 -07:00
James R. Barlow
b40eec4cb0 Add --oversample test for hocr rendering 2015-07-27 17:18:02 -07:00
James R. Barlow
2e7cd52c0f Improve argument handling, test cases 2015-07-27 15:39:54 -07:00
James R. Barlow
77d4cb367e Put ghostscript in a module 2015-07-27 15:22:00 -07:00
James R. Barlow
2c45c5abc6 Implement tesseract timeout 2015-07-27 04:23:37 -07:00
James R. Barlow
a89afabd79 Implement tesseract PDF rendering as an alternative
It's much better at rendering text baselines than hocr and seems to
produce small file sizes, so it's progress.  Not available for
Tesseract 3.02 obviously, so both modes need to remain available.
2015-07-27 04:20:49 -07:00
James R. Barlow
6c3cb6acba Remove redundant *res_render 2015-07-26 12:56:10 -07:00
James R. Barlow
d3088829af More packaging changes: move jhove, fix console script 2015-07-26 01:52:08 -07:00
James R. Barlow
9aaaba1714 Packaging stuff 2015-07-25 23:45:13 -07:00
Jim Barlow
9adb0d696f Prepare for Python packaging - move to ocrmypdf folder 2015-07-25 18:22:04 -07:00