2895 Commits

Author SHA1 Message Date
James R. Barlow
c6d106ec33 Throw exception if iccprofiles not found instead of returning None
So far iccprofiles were only missing for a user who had a custom and
possibly broken ghostscript installation.
2015-08-28 03:59:35 -07:00
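A minimal sketch of the behaviour this commit describes: the lookup raises as soon as it fails instead of handing back None for a later step to trip over. The function name `find_icc_profile_dir` and the idea of passing a list of candidate directories are assumptions made for illustration, not the project's actual code.

```python
import os


def find_icc_profile_dir(candidates):
    """Return the first candidate directory that exists.

    Hypothetical helper: raises FileNotFoundError rather than returning
    None, so a broken Ghostscript installation is reported at the point
    of lookup instead of surfacing later as an unrelated error.
    """
    for path in candidates:
        if os.path.isdir(path):
            return path
    raise FileNotFoundError(
        "iccprofiles directory not found; searched: " + ", ".join(candidates)
    )
```

Callers then either receive a usable path or see an error naming the directories that were searched.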
James R. Barlow
2ce6834be4 Bump to -rc8 v3.0-rc8 2015-08-24 01:25:01 -07:00
James R. Barlow
b376672dbc Bug fix: exception thrown if input PDF was missing DocumentInfo block 2015-08-24 01:23:30 -07:00
James R. Barlow
d07db8547f Merge branch 'master' of https://github.com/fritz-hh/OCRmyPDF v3.0-rc7 2015-08-23 12:30:46 -07:00
James R. Barlow
aab08bfcc7 Fix requirements.txt problem 2015-08-23 12:30:40 -07:00
jbarlow83
e0a25494ee Explain the need for multi core, etc 2015-08-22 13:34:42 -07:00
James R. Barlow
fd876d5e4e Merge branch 'develop' v3.0-rc6 2015-08-22 01:51:44 -07:00
James R. Barlow
ee7f008ff5 Require unpaper 6.1; no messing around with broken versions 2015-08-22 01:51:08 -07:00
jbarlow83
d9161a6ddb Update README: docker run instructions 2015-08-22 01:50:13 -07:00
jbarlow83
f8d66768e3 Update README with docker install instructions 2015-08-22 01:33:12 -07:00
James R. Barlow
4f3673d14d Update notes for -rc6 2015-08-22 00:40:07 -07:00
James R. Barlow
1712fdb74a Merge branch 'feature/docker-debian' 2015-08-22 00:32:27 -07:00
James R. Barlow
3a5ffc79e0 Stock debian unpaper is no good; replace with 6.1 built from source
debian and ubuntu both install unpaper 0.4.2 or so. No .deb packages
available at higher version numbers although ArchLinux had something.
Considered making a separate image to handle building and install but
decided that was a premature optimization at this point, so just build
the unpaper that works. All tests pass.
2015-08-22 00:30:39 -07:00
James R. Barlow
859b063444 Fixup other docker test suite errors
Outstanding failures:
test_pageinfo::test_jpeg
tests involving unpaper due to version <6.1 failures
2015-08-20 02:37:03 -07:00
James R. Barlow
bd61e7c644 dockerignore *.pyc
https://github.com/docker/docker/issues/13113
Docker kinda sucks. No recursive exclusion.
2015-08-20 02:27:07 -07:00
James R. Barlow
c9abf282b5 Set docker locale to utf-8
Shocked, shocked, that there's a Linux distribution out there that isn't
doing the right thing and setting up utf-8 by default. (Many tests failed.)
2015-08-20 01:44:30 -07:00
James R. Barlow
9dad40b5a3 Major overhaul of the Dockerfile
Switched from Ubuntu to debian:stretch because stretch has more recent
versions of our binary packages and starts smaller.  In particular,
stretch has both pillow==2.9.0 and reportlab==3.2.0 available as system
packages, which saves the considerable hassle of installing a toolchain.

Instead, a pyvenv is set up with access to system's site-packages (note:
needs two steps), making the binary-dependent packages available.  Then
the remaining packages are installed into the pyvenv with --no-cache-dir
to avoid saving files. And there we are.

Image is still very large (>500 MB), but programs like reportlab require
font rendering capabilities so they pull in large portions of the Linux
graphics stack. Not much will shrink that.
2015-08-20 01:25:31 -07:00
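The virtual environment with access to the system's site-packages can also be created from Python's standard `venv` module. The sketch below is only an illustration of that one idea; the helper name is hypothetical, and it does not reproduce the Dockerfile's actual commands or the two-step setup the commit mentions.

```python
import os
import venv


def create_env_with_system_packages(env_dir):
    """Create a virtual environment that can import system site-packages.

    Hypothetical helper. with_pip=False keeps the sketch from bootstrapping
    pip; a real image would still need pip to install remaining packages.
    """
    venv.create(env_dir, system_site_packages=True, with_pip=False)
    # venv records the setting in pyvenv.cfg at the root of the environment.
    return os.path.join(env_dir, "pyvenv.cfg")
```

With this setting, binary-dependent packages installed by the distribution stay importable inside the environment, while packages installed into the environment take precedence.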
James R. Barlow
8e2d690cb0 Rework Dockerfile, setup.py to work with wheels for better cache use 2015-08-19 13:43:32 -07:00
James R. Barlow
c132e091e1 Dockerfile: use local copy of application 2015-08-19 13:10:58 -07:00
James R. Barlow
630e6cbf1e pip chokes on Unicode filenames? 2015-08-18 23:56:30 -07:00
James R. Barlow
83ff5760a8 Dockerfile comment cleanup 2015-08-18 23:41:41 -07:00
James R. Barlow
fed0ee638e Fix ruffus writing to RO directory in container 2015-08-18 23:30:06 -07:00
James R. Barlow
cc161780df Replace fileinput with regular open-replace
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ascii encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
2015-08-18 23:27:50 -07:00
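The "regular open-replace" approach can be sketched as follows: read the whole file with an explicit encoding, substitute, and write it back with the same encoding. The function name and parameters are assumptions for illustration and are not taken from the project.

```python
def replace_in_file(path, old, new, encoding="utf-8"):
    """Rewrite a text file in place, replacing every occurrence of old.

    Hypothetical helper: the encoding is passed explicitly on both read
    and write, so non-ASCII text (for example in an HOCR file) survives.
    """
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    with open(path, "w", encoding=encoding) as f:
        f.write(text.replace(old, new))
```

This reads the entire file into memory, which is a reasonable trade-off for small files but not for very large ones.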
James R. Barlow
898b2b000a Works 2015-08-18 05:38:05 -07:00
James R. Barlow
b3ee743ed7 WIP on docker 2015-08-18 04:46:25 -07:00
James R. Barlow
ef17b669fe README needs ghostscript 2015-08-18 03:27:39 -07:00
James R. Barlow
2dff3e07ce Drop libxml2 dependency
It seems that Python's internal XML parser is good enough to do the job.
2015-08-17 15:26:07 -07:00
James R. Barlow
53c88093ad Bump to -rc5 v3.0-rc5 2015-08-16 02:19:04 -07:00
James R. Barlow
0ec13d3a17 Fix test cases: minor issues
-os.environ directly modified when whole suite run, breaking subsequent
tests
-no longer trusting JHOVE for PDF/A validation
2015-08-16 01:57:35 -07:00
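One way to avoid the first problem is to give each child process a private copy of the environment instead of modifying `os.environ`. This is a sketch under that assumption; the helper name is hypothetical and is not how the test suite is known to be written.

```python
import os
import subprocess


def run_with_env(args, overrides):
    """Run a command with extra environment variables.

    Hypothetical helper: works on a copy of os.environ, so the calling
    process (and any test that runs afterwards) is left unchanged.
    """
    env = os.environ.copy()
    env.update(overrides)
    return subprocess.run(
        args, env=env, stdout=subprocess.PIPE, universal_newlines=True
    )
```

Because the overrides live only in the child's environment, the order in which tests run no longer matters.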
jbarlow83
0d5104049a Update README with better install instructions 2015-08-16 01:28:28 -07:00
James R. Barlow
ce8fa69785 Update readme 2015-08-16 00:59:57 -07:00
James R. Barlow
30072e0c70 Pillow sucks
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types.  Reportlab calls for
pillow 2.4.0.  On Ubuntu 14.04 LTS this will trigger an upgrade of
pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution.  Instead, you see nasty
stack traces.

So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
2015-08-16 00:54:03 -07:00
James R. Barlow
eb04a890b2 Relax Pillow requirement for Ubuntu 14.04 LTS 2015-08-15 15:55:56 -07:00
James R. Barlow
0c53adb04f setup: rollback lxml version to 3.3.3 - that's the latest in Ubuntu 14.04 2015-08-15 15:25:58 -07:00
James R. Barlow
ee5a43fd47 setup: suppress jhove errors 2015-08-15 15:25:30 -07:00
James R. Barlow
c43d6c2cbe Merge branch 'develop' of https://github.com/fritz-hh/OCRmyPDF into develop
Conflicts:
	setup.py
2015-08-15 15:18:41 -07:00
James R. Barlow
87aeeacb04 Fix erroneous instruction to "apt-get install tesseract"
Should be tesseract-ocr
2015-08-15 15:17:38 -07:00
James R. Barlow
6b26e9cad6 Fix erroneous instruction to "apt-get install tesseract"
Should be tesseract-ocr
2015-08-15 15:12:05 -07:00
James R. Barlow
85af0f0d03 Add test case for blank PDF page 2015-08-14 00:46:50 -07:00
James R. Barlow
f6f4705ea3 Remove Java from setup.py 2015-08-14 00:44:56 -07:00
James R. Barlow
a4702bff22 Possible fix for issue #111 2015-08-13 23:10:22 -07:00
James R. Barlow
73c5c48f79 Update notes 2015-08-13 23:08:29 -07:00
James R. Barlow
adf495e8cc Remove JHOVE
JHOVE is not an effective PDF/A validator, as detailed in this article:
http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/

In short, it's buggy. Out of 670 invalid PDF/A files in a test suite,
it only flagged 5.  It only looks for certain problems that
Ghostscript-generated PDFs are unlikely to have.  So use qpdf as a final check for
general ill-formed PDF problems since it is quite reliable.

JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF
support.  I also don't know if it's appropriate to bundle JHOVE, with an
LGPL, under this project and its current license.

Removing a dependency on Java is a huge win.  A world with less Java is
a world with less AbstractFactoryConstructorInterfaces.
2015-08-11 15:31:32 -07:00
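A final well-formedness check with qpdf could look roughly like the sketch below, which shells out to `qpdf --check` and treats any nonzero exit status as a failure. The function name and the injectable `run` parameter are assumptions for illustration; treating warnings the same as errors is a choice made here, not something the commit states.

```python
import subprocess


def qpdf_reports_ok(pdf_path, run=subprocess.run):
    """Return True if `qpdf --check` exits with status 0 for pdf_path.

    Hypothetical helper: `run` defaults to subprocess.run and can be
    swapped out in tests so that qpdf does not have to be installed.
    """
    result = run(
        ["qpdf", "--check", pdf_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    return result.returncode == 0
```

This only detects generally ill-formed PDFs; it does not validate PDF/A conformance.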
James R. Barlow
9247ea00bf Improve ruffus exception handling
ruffus swallows the return code if, in the process of handling an exception,
we hit an error in ruffus' own code, which can happen.  So pick through
its error stack and find out if there's an interesting return code in
there.  Had to use eval() of all things.

Also suppress the stack trace for normal error conditions that don't
need one.
2015-08-11 02:19:46 -07:00
James R. Barlow
a1238d7bf9 Document override binary test 2015-08-11 00:44:43 -07:00
James R. Barlow
2d63268f0f Work around JHOVE bug for now, so that the test passes 2015-08-11 00:23:48 -07:00
James R. Barlow
1cb5f6a90d Refactor exit codes; test for missing tessdata
Some versions of tesseract installed by homebrew end up without a
functional tessdata folder, and tesseract is not helpful in this
situation, so add a new test to make sure our output is at least
indicative of the problem.

In the process of properly handling return codes I discovered
test_override_metadata triggers a NPE inside JHOVE probably due to the
Unicode character checking.  This could be specific to my JRE (1.6.0_65,
Oracle) but it's probably JHOVE's fault.  A valid PDF/A (per Acrobat)
is still generated.
2015-08-11 00:17:02 -07:00
James R. Barlow
8d848284df Fix code, test case: complain when GS fails to produce PDF/A
Modified pipeline to fix regression and return the proper error code if
we did not produce a PDF/A as expected.  The wrapper forces the output
to be PDF 1.3 which is not PDF/A compliant.

The funny thing is that in some cases JHOVE incorrectly states that a
file is PDF/A-1b compliant, well formed and valid, even when it is not
according to Acrobat XI and is missing the PDF/A metadata marker, as
far as I can tell.  JHOVE may not be as beneficial as hoped.
2015-08-10 16:05:00 -07:00
James R. Barlow
8fe54d1a5c Add new test case to check invalid PDF/A case
It revealed a regression - return code not the same as v2.x for invalid
PDF/A.  It's also not easy to get the return code out of ruffus.  Will
need to tweak the final step of the pipeline.
2015-08-10 13:57:28 -07:00
James R. Barlow
11dd9f14c3 setup.py: block unsafe 'upload', say to use twine instead 2015-08-09 14:16:30 -07:00
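A guard of this kind can be sketched as a small check on the command-line arguments before `setup()` is called. The function name and the message wording are assumptions for illustration and are not the project's actual setup.py.

```python
def refuse_upload(argv):
    """Abort if the 'upload' command was requested on the command line.

    Hypothetical helper, intended to be called as refuse_upload(sys.argv)
    near the top of setup.py, before setup() runs.
    """
    if "upload" in argv[1:]:
        raise SystemExit(
            "'setup.py upload' is disabled; build the distributions and "
            "run 'twine upload dist/*' instead."
        )
```

Other commands such as `sdist` pass through untouched, so only the upload path is blocked.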