2676 Commits

Author SHA1 Message Date
James R. Barlow
630e6cbf1e pip chokes on Unicode filenames? 2015-08-18 23:56:30 -07:00
James R. Barlow
83ff5760a8 Dockerfile comment cleanup 2015-08-18 23:41:41 -07:00
James R. Barlow
fed0ee638e Fix ruffus writing to RO directory in container 2015-08-18 23:30:06 -07:00
James R. Barlow
cc161780df Replace fileinput with regular open-replace
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ascii encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
2015-08-18 23:27:50 -07:00
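A minimal sketch of the open-replace approach described above, not the code from this commit. The helper name `replace_in_file` and the UTF-8 default are assumptions for illustration.

```python
import os
import tempfile


def replace_in_file(path, old, new, encoding="utf-8"):
    """Rewrite `path`, replacing `old` with `new`, using an explicit encoding.

    Hypothetical helper: reads the whole file, then writes the result to a
    temporary file in the same directory and swaps it into place.
    """
    with open(path, "r", encoding=encoding) as f:
        text = f.read()

    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(text.replace(old, new))
        os.replace(tmp_path, path)  # swap the rewritten file into place
    except BaseException:
        os.unlink(tmp_path)  # do not leave the temporary file behind
        raise
```

Passing the encoding to every `open` call is the point: non-ASCII text picked up by tesseract is read and written back with the same codec.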
James R. Barlow
898b2b000a Works 2015-08-18 05:38:05 -07:00
James R. Barlow
b3ee743ed7 WIP on docker 2015-08-18 04:46:25 -07:00
James R. Barlow
ef17b669fe README needs ghostscript 2015-08-18 03:27:39 -07:00
James R. Barlow
2dff3e07ce Drop libxml2 dependency
It seems that Python's internal XML parser is good enough to do the job.
2015-08-17 15:26:07 -07:00
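A rough sketch of reading hOCR with the standard library's `xml.etree.ElementTree` instead of libxml2. The sample document and the `line_boxes` helper are made up for illustration and are not taken from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment; real tesseract output carries more properties.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 800 600">
   <span class="ocr_line" title="bbox 10 20 300 40">Grüße</span>
  </div>
 </body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"


def line_boxes(hocr_text):
    """Return (x0, y0, x1, y1) tuples for every ocr_line span."""
    root = ET.fromstring(hocr_text)
    boxes = []
    for span in root.iter(XHTML + "span"):
        if span.get("class") != "ocr_line":
            continue
        # title looks like "bbox x0 y0 x1 y1", optionally followed by "; ..."
        fields = span.get("title", "").split(";")[0].split()
        if len(fields) == 5 and fields[0] == "bbox":
            boxes.append(tuple(int(v) for v in fields[1:]))
    return boxes
```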
James R. Barlow
53c88093ad Bump to -rc5 v3.0-rc5 2015-08-16 02:19:04 -07:00
James R. Barlow
0ec13d3a17 Fix test cases: minor issues
-os.environ directly modified when whole suite run, breaking subsequent
tests
-no longer trusting JHOVE for PDF/A validation
2015-08-16 01:57:35 -07:00
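One way to keep a test from leaking changes to os.environ into later tests is to patch it for the duration of a call. This is a sketch using the standard library's `unittest.mock.patch.dict`, not necessarily what this commit does; `run_with_env` is a hypothetical helper.

```python
import os
from unittest import mock


def run_with_env(overrides, func):
    """Call func() with `overrides` applied to os.environ, then restore it."""
    with mock.patch.dict(os.environ, overrides):
        return func()
```

On exit from the `with` block, os.environ is restored to its previous contents, so subsequent tests see the original environment.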
jbarlow83
0d5104049a Update README with better install instructions 2015-08-16 01:28:28 -07:00
James R. Barlow
ce8fa69785 Update readme 2015-08-16 00:59:57 -07:00
James R. Barlow
30072e0c70 Pillow sucks
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types.  Reportlab calls for
pillow 2.4.0.  On Ubuntu 14.04 LTS this will trigger an upgrade of
pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution.  Instead, you see nasty
stack traces.

So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
2015-08-16 00:54:03 -07:00
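A sketch of what such a run-time check could look like, assuming a round trip through each format is an acceptable probe. The function name `missing_pillow_support` is hypothetical and this is not the check added in this commit.

```python
import io


def missing_pillow_support(formats=("JPEG", "PNG")):
    """Return the names of formats Pillow cannot write and read back.

    Returns ["Pillow"] if Pillow itself cannot be imported.
    """
    try:
        from PIL import Image
    except ImportError:
        return ["Pillow"]

    missing = []
    probe = Image.new("RGB", (8, 8))
    for fmt in formats:
        try:
            buf = io.BytesIO()
            probe.save(buf, format=fmt)  # fails if the encoder was not built
            buf.seek(0)
            Image.open(buf).load()       # fails if the decoder was not built
        except (OSError, KeyError):
            missing.append(fmt)
    return missing
```

A caller could run this at startup and exit with a readable message when the list is not empty, instead of surfacing a stack trace later.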
James R. Barlow
eb04a890b2 Relax Pillow requirement for Ubuntu 14.04 LTS 2015-08-15 15:55:56 -07:00
James R. Barlow
0c53adb04f setup: rollback lxml version to 3.3.3 - that's the latest in Ubuntu 14.04 2015-08-15 15:25:58 -07:00
James R. Barlow
ee5a43fd47 setup: suppress jhove errors 2015-08-15 15:25:30 -07:00
James R. Barlow
c43d6c2cbe Merge branch 'develop' of https://github.com/fritz-hh/OCRmyPDF into develop
Conflicts:
	setup.py
2015-08-15 15:18:41 -07:00
James R. Barlow
87aeeacb04 Fix erroneous instruction to "apt-get install tesseract"
Should be tesseract-ocr
2015-08-15 15:17:38 -07:00
James R. Barlow
6b26e9cad6 Fix erroneous instruction to "apt-get install tesseract"
Should be tesseract-ocr
2015-08-15 15:12:05 -07:00
James R. Barlow
85af0f0d03 Add test case for blank PDF page 2015-08-14 00:46:50 -07:00
James R. Barlow
f6f4705ea3 Remove Java from setup.py 2015-08-14 00:44:56 -07:00
James R. Barlow
a4702bff22 Possible fix for issue #111 2015-08-13 23:10:22 -07:00
James R. Barlow
73c5c48f79 Update notes 2015-08-13 23:08:29 -07:00
James R. Barlow
adf495e8cc Remove JHOVE
JHOVE is not an effective PDF/A validator, as detailed in this article:
http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/

In short, it's buggy. Out of 670 invalid PDF/A files in a test suite,
it only flagged 5.  It only looks for certain problems that Ghostscript
generated PDFs are unlikely to have.  So use qpdf as a final check for
general ill-formed PDF problems since it is quite reliable.

JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF
support.  I also don't know if it's appropriate to bundle JHOVE, with an
LGPL, under this project and its current license.

Removing a dependency on Java is a huge win.  A world with less Java is
a world with less AbstractFactoryConstructorInterfaces.
2015-08-11 15:31:32 -07:00
James R. Barlow
9247ea00bf Improve ruffus exception handling
ruffus swallows the return code if, in the process of handling an
exception, we hit an error in ruffus' own code, which can happen.  So pick
through its error stack and find out if there's an interesting return code
in there.  Had to use eval() of all things.

Also suppress the stack trace for normal error conditions that don't
need one.
2015-08-11 02:19:46 -07:00
James R. Barlow
a1238d7bf9 Document override binary test 2015-08-11 00:44:43 -07:00
James R. Barlow
2d63268f0f Work around JHOVE bug for now, so that the test passes 2015-08-11 00:23:48 -07:00
James R. Barlow
1cb5f6a90d Refactor exit codes; test for missing tessdata
Some versions of tesseract installed by homebrew end up without a
functional tessdata folder, and tesseract is not helpful in this
situation, so add a new test to make sure our output is at least
indicative of the problem.

In the process of properly handling return codes I discovered
test_override_metadata triggers a NPE inside JHOVE probably due to the
Unicode character checking.  This could be specific to my JRE (1.6.0_65,
Oracle) but it's probably JHOVE's fault.  A valid PDF/A (per Acrobat)
is still generated.
2015-08-11 00:17:02 -07:00
James R. Barlow
8d848284df Fix code, test case: complain when GS fails to produce PDF/A
Modified pipeline to fix regression and return the proper error code if
we did not produce a PDF/A as expected.  The wrapper forces the output
to be PDF 1.3 which is not PDF/A compliant.

The funny thing is that in some cases JHOVE incorrectly states that a
file is PDF/A-1b compliant, well formed and valid, even when it is not
according to Acrobat XI and is missing the PDF/A metadata marker, as
far as I can tell.  JHOVE may not be as beneficial as hoped.
2015-08-10 16:05:00 -07:00
James R. Barlow
8fe54d1a5c Add new test case to check invalid PDF/A case
It revealed a regression - return code not the same as v2.x for invalid
PDF/A.  It's also not easy to get the return code out of ruffus.  Will
need to tweak the final step of the pipeline.
2015-08-10 13:57:28 -07:00
James R. Barlow
11dd9f14c3 setup.py: block unsafe 'upload', say to use twine instead 2015-08-09 14:16:30 -07:00
James R. Barlow
16d24f1166 Bump version to -rc4 v3.0-rc4 2015-08-05 23:26:38 -07:00
James R. Barlow
97015ef775 Add a test case to check on the @argumentsfile syntax 2015-08-05 23:17:38 -07:00
James R. Barlow
2744dafb74 New test case: ensure metadata is preserved from input to output 2015-08-05 17:09:38 -07:00
James R. Barlow
7b268dbe1a Remove duplication in test case 2015-08-05 16:57:04 -07:00
James R. Barlow
8fcbbcef94 Improve usage text 2015-08-05 16:56:53 -07:00
James R. Barlow
8f93f0a06e Tidy docs 2015-08-05 16:56:30 -07:00
James R. Barlow
387142488c Kill duplicate file 2015-07-31 01:57:16 -07:00
James R. Barlow
6887e232fc Bug fix: exception from process timeout should be TimeoutExpired 2015-07-31 00:06:58 -07:00
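For reference, a sketch of how a process timeout surfaces in the standard library: `subprocess.run` raises `subprocess.TimeoutExpired` when the timeout elapses. The helper name and the choice to return None are assumptions for illustration.

```python
import subprocess


def run_with_timeout(args, timeout):
    """Run a command; return the CompletedProcess, or None on timeout."""
    try:
        return subprocess.run(
            args,
            timeout=timeout,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except subprocess.TimeoutExpired:
        # The child has been killed by subprocess.run at this point.
        return None
```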
James R. Barlow
6ac7ffd77b Merge branch 'feature/drop-mupdf-poppler' into develop 2015-07-30 23:38:27 -07:00
James R. Barlow
b28faa582a Automatically use all available cores unless told not to 2015-07-30 23:20:21 -07:00
James R. Barlow
454ee029c8 Run final ghostscript in multithreaded mode
This step is serialized so all cores are not busy at this stage.
2015-07-30 23:20:04 -07:00
James R. Barlow
a036de318e Replace mupdf and poppler with qpdf
Drop two dependencies and replace them with one that does the job of
both.  Smells like progress.

mupdf does PDF file repair and rendering
poppler does rendering and page splitting
qpdf does PDF file repair and page splitting
ghostscript does PDF file repair, rendering, and page splitting (sort of)

So we use qpdf.  Ghostscript's page splitting is supposed to be less
efficient because it reprints the page (PDF -> Postscript -> PDF) and
possibly loses quality.  qpdf's library could be used to improve
performance.

This causes a slight performance regression:

py.test tests/test_main.py::test_maximum_options went from 187 seconds
up to 192.  This is likely due to O(n) serialized invocations of qpdf
compared to a single serialized call to pdfseparate.  Could improve on
this situation by using the example code in qpdf: pdf-split-pages.cc
or create marker files in split_pages() and then write a new @transform
function that would split pages on each CPU.  Probably not worth it,
overall, unless this causes problems on files with hundreds of pages.
2015-07-30 04:16:35 -07:00
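A sketch of the per-page qpdf invocations discussed above, with an optional parallel runner along the lines of the "split pages on each CPU" idea. The function names and the output filename pattern are hypothetical; only the `qpdf ... --pages ... --` command form comes from qpdf itself.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def qpdf_split_commands(input_pdf, n_pages, output_pattern="{:06d}.page.pdf"):
    """Build one qpdf command line per page (pages are 1-based)."""
    return [
        ["qpdf", input_pdf, "--pages", input_pdf, str(page), "--",
         output_pattern.format(page)]
        for page in range(1, n_pages + 1)
    ]


def split_pages(input_pdf, n_pages, workers=None):
    """Run the per-page commands concurrently instead of one after another."""
    commands = qpdf_split_commands(input_pdf, n_pages)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads suffice here: each worker only waits on a qpdf process.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
```

Whether this recovers the few seconds lost depends on how many pages there are and how much of the time is process startup, so it is probably only worth trying on long documents.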
James R. Barlow
9918c4020e Use img2pdf in test case because it does a better job 2015-07-30 03:35:56 -07:00
jbarlow83
3d6264e1b8 Fix formatting of 'motivation' 2015-07-28 17:58:26 -07:00
jbarlow83
1c25270503 Improve instructions for users that need sudo or venv 2015-07-28 17:55:56 -07:00
James R. Barlow
47e50f82c4 setup.py: allow mutool 1.7 2015-07-28 13:37:32 -07:00
James R. Barlow
27ecdfbba8 More fixes to error cases in setup.py 2015-07-28 13:05:23 -07:00
James R. Barlow
6901550065 Fix some installer issues 2015-07-28 12:41:24 -07:00
jbarlow83
6e6f918630 Actually link the release notes 2015-07-28 12:21:57 -07:00