71 Commits

Author SHA1 Message Date
James R. Barlow
c1eb047a4b Fix name of pdfa_def.ps
Used to include a copy of the parent dir's name.
2016-01-19 13:11:03 -08:00
James R. Barlow
626ca18f5c Remove stale comment 2016-01-19 13:02:35 -08:00
James R. Barlow
a0952bfca3 Optimize: use img2pdf stream instead of repeated copies 2016-01-18 20:24:46 -08:00
James R. Barlow
fc0479f110 Fix all but test_oversample[hocr] 2016-01-15 15:46:47 -08:00
James R. Barlow
62728205b6 Implement image+text merging in other cases
5 failed, 28 passed

failures:
test_oversample[hocr], test_skip_ocr, test_skip_big, test_maximum_options[hocr],
test_blank_input_pdf,
2016-01-15 15:38:08 -08:00
James R. Barlow
dc0fb25e64 Render hocr page: no longer needs an image as input 2016-01-15 15:16:47 -08:00
James R. Barlow
7067110308 Add safety check to prevent merge from running when not sensible 2016-01-15 14:54:45 -08:00
James R. Barlow
599d889703 Implement "perfect reconstruction" - transfer page and watermark OCR layer
Works, does not account for changes to clean/deskew, etc.
Surprisingly, it works. PyPDF2 fixes since last attempt?
2016-01-15 14:39:12 -08:00
James R. Barlow
074c1d71b4 Activate --tesseract-pagesegmode 2016-01-11 17:19:32 -08:00
James R. Barlow
1fca9a004d Adjust command line parameters
Was splitting each argument to --tesseract-config into a list of
single-character strings
2016-01-11 16:57:19 -08:00
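The commit does not say how the splitting happened. A minimal sketch of one common way this kind of bug arises in Python, assuming the values were accumulated with `list.extend`; the function names below are hypothetical and not taken from the project:

```python
def collect_buggy(values):
    """Accumulate option values the wrong way (hypothetical reproduction)."""
    out = []
    for value in values:
        # extend() iterates over its argument; a str yields single characters.
        out.extend(value)
    return out


def collect_fixed(values):
    """Accumulate option values, keeping each one as a whole string."""
    out = []
    for value in values:
        out.append(value)
    return out
```

With the buggy version, a single value such as "hocr" comes back as four one-character strings; the fixed version keeps it intact.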
James R. Barlow
b485a1ef78 Override ruffus' handling of --jobs
Ruffus treats omitted parameter as -j1. For our purposes it makes more
sense for omitting the parameter to mean "use all CPUs". As such we
must be able to distinguish -j1 from the parameter -j being omitted.

Telling ruffus to ignore the argument actually just makes it not auto
generate the argument. We can add an argument back with the same name.
2016-01-09 19:07:48 -08:00
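A rough sketch of the idea in this commit message, using plain argparse only. It does not touch ruffus, and the helper name `parse_jobs` is hypothetical; the point is that a default of `None` makes an omitted `-j` distinguishable from `-j 1`:

```python
import argparse
import os


def parse_jobs(argv):
    """Return the worker count, treating an omitted -j as "use all CPUs"."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "-j 1" apart from "-j not given at all".
    parser.add_argument("-j", "--jobs", type=int, default=None)
    args = parser.parse_args(argv)
    if args.jobs is None:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return args.jobs
```

Here `parse_jobs(["-j", "1"])` requests exactly one job, while `parse_jobs([])` falls through to the CPU count.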
James R. Barlow
326ef7a3ac Merge branch 'hotfix/v3.1.1' into develop
# Conflicts:
#	RELEASE_NOTES.rst
2016-01-09 18:55:04 -08:00
James R. Barlow
6af0815681 Bump version 2016-01-09 18:45:06 -08:00
James R. Barlow
61b3ccb57c Place ruffus database in temporary folder
Because we don't really use the ruffus checkpoint feature, putting the
database in a permanent location does not help anything, but does cause
large database files, and problems when the location chosen for
.ruffus_history.sqlite is not writable.
2016-01-04 13:23:47 -08:00
James R. Barlow
133357779a All subprocess invocations refactored out of main.py 2015-12-17 08:31:18 -08:00
James R. Barlow
5d8167b232 Move PDF validation check to qpdf.py 2015-12-17 08:28:00 -08:00
James R. Barlow
e76ae8c46c Move more qpdf calls into qpdf.py 2015-12-17 08:24:48 -08:00
James R. Barlow
53a7c0e668 Refactor qpdf subprocess calls into module 2015-12-17 08:19:53 -08:00
James R. Barlow
4ca243e490 Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars 2015-12-17 07:27:26 -08:00
Shem Pasamba
d7c7559b05 Use boolean instead of integers 2015-12-17 11:23:27 +08:00
Shem Pasamba
b2b66d1344 Don't exit when qpdf repair was successful 2015-12-17 11:20:20 +08:00
James R. Barlow
5d111a3c04 Refactor tesseract --pdfrenderer calls to tesseract.py 2015-12-16 17:48:26 -08:00
James R. Barlow
10416f847f Migrate tesseract-hocr code to tesseract module, because modularity 2015-12-16 17:36:11 -08:00
James R. Barlow
79b3472b26 All tests passed, bump version 2015-12-04 04:31:01 -08:00
James R. Barlow
f1b2f1ae08 Merge branch 'feature/pdfa-2' into develop 2015-12-04 04:04:08 -08:00
James R. Barlow
ee7d97ae8c Trivial 2015-12-04 04:03:38 -08:00
James R. Barlow
7d9f473bb1 Remove eval() call by introspecting ExitCode 2015-12-04 03:34:53 -08:00
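The commit does not show how the introspection was done. One way to map a name to an exit code without eval(), sketched under the assumption that ExitCode can be modeled as an enum; the member names, values and the helper `exit_code_from_name` are all hypothetical:

```python
from enum import IntEnum


class ExitCode(IntEnum):
    # Hypothetical members and values, for illustration only.
    ok = 0
    bad_args = 1
    input_file = 2
    other_error = 15


def exit_code_from_name(name, default=ExitCode.other_error):
    """Look up an exit code by member name; nothing is ever evaluated."""
    return ExitCode.__members__.get(name, default)
```

Because the lookup is a plain dictionary access, an unexpected string simply yields the default instead of being executed.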
James R. Barlow
e77a5e5e75 We don't want threads. Really. Do. Not. Want. 2015-12-04 03:11:38 -08:00
James R. Barlow
6ab19af122 Comments 2015-12-04 03:09:39 -08:00
James R. Barlow
276fe49867 Better error messages for input file not found or invalid
Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.
2015-12-04 03:07:53 -08:00
James R. Barlow
acb31abe86 Fix issue #20 - fails on uppercase .PDF 2015-12-04 02:14:09 -08:00
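The fix itself is not shown in the log. A minimal sketch of a case-insensitive extension check, assuming the failure came from comparing the suffix against a lowercase ".pdf"; the helper name is hypothetical:

```python
import os.path


def has_pdf_extension(path):
    """Return True for .pdf, .PDF, .Pdf and other case variants."""
    _, ext = os.path.splitext(path)
    # Normalize case before comparing so "SCAN.PDF" is accepted.
    return ext.lower() == ".pdf"
```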
James R. Barlow
4f964a3c8a Introduce --pdf-renderer auto
Tess 3.03 has various quality problems like wrong DPI that are fixed
in Tess 3.04. Idea here is to introduce an option to let OCRmyPDF
select the rendering backend based on the options and system.

However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract does not pass all test cases, mainly
the one where --tesseract-timeout is triggered, and some others.
2015-12-02 23:20:31 -08:00
James R. Barlow
80d89b5420 Set /Creator metadata to OCRmyPDF
with reference to Tess version and settings
2015-12-02 02:19:39 -08:00
James R. Barlow
281eafada0 bump to v3.0 and move repos 2015-09-05 00:53:14 -07:00
James R. Barlow
c14e10128a Bump version to -rc9 2015-08-29 16:43:22 -07:00
James R. Barlow
c4f134d694 Prevent running validation on missing file after an exception is thrown 2015-08-28 04:48:29 -07:00
James R. Barlow
83f9dfbac4 Use png256 raster device when possible
Someone reported a bug where the .png input to unpaper ended up being
type 'P' (palette) for some reason, which was not supported in unpaper.

Not sure how it happened, but seemed easier to fix by explicitly
supporting. Here we use png256 if it would capture all colors in the
input file. It's up to tesseract/reportlab to make use of the palette
PNG when rendering.
2015-08-28 04:47:57 -07:00
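How the commit decides whether png256 "would capture all colors" is not stated. A sketch of one possible test, assuming the pixels are available as an iterable of color tuples and that png16m is the full-color fallback; the function name and the fallback choice are assumptions:

```python
def choose_png_device(pixels, palette_limit=256):
    """Pick a Ghostscript PNG device name based on the distinct color count."""
    seen = set()
    for pixel in pixels:
        seen.add(pixel)
        # Stop early once the palette cannot hold every color.
        if len(seen) > palette_limit:
            return "png16m"
    return "png256"
```

An image with at most 256 distinct colors maps to the palette device; anything richer falls back to 24-bit output.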
James R. Barlow
2ce6834be4 Bump to -rc8 2015-08-24 01:25:01 -07:00
James R. Barlow
b376672dbc Bug fix: exception thrown if input PDF was missing DocumentInfo block 2015-08-24 01:23:30 -07:00
James R. Barlow
aab08bfcc7 Fix requirements.txt problem 2015-08-23 12:30:40 -07:00
James R. Barlow
4f3673d14d Update notes for -rc6 2015-08-22 00:40:07 -07:00
James R. Barlow
cc161780df Replace fileinput with regular open-replace
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ascii encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
2015-08-18 23:27:50 -07:00
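A minimal sketch of the "regular open-replace" approach named in this commit, assuming UTF-8 files small enough to read into memory; the helper name `replace_in_file` is hypothetical and the project's actual code may differ:

```python
def replace_in_file(path, old, new, encoding="utf-8"):
    """Rewrite a text file in place, replacing old with new."""
    # Read the whole file with an explicit encoding so non-ASCII text survives.
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    # Write it back with the same encoding after the substitution.
    with open(path, "w", encoding=encoding) as f:
        f.write(text.replace(old, new))
```

Passing the encoding explicitly on both the read and the write is what keeps characters outside ASCII, such as those in an hOCR file, from being mangled.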
James R. Barlow
53c88093ad Bump to -rc5 2015-08-16 02:19:04 -07:00
James R. Barlow
30072e0c70 Pillow sucks
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types.  Reportlab calls for
pillow 2.4.0.  On Ubuntu 14.04 LTS this will trigger an upgrade of
pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution.  Instead, you see nasty
stack traces.

So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
2015-08-16 00:54:03 -07:00
James R. Barlow
adf495e8cc Remove JHOVE
JHOVE is not an effective PDF/A validator, as detailed in this article:
http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/

In short, it's buggy. Out of 670 invalid PDF/A files in a test suite,
it only flagged 5.  It only looks for certain problems that Ghostscript
generated PDFs are unlikely to have.  So use qpdf as a final check for
general ill-formed PDF problems since it is quite reliable.

JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF
support.  I also don't know if it's appropriate to bundle JHOVE, with an
LGPL, under this project and its current license.

Removing a dependency on Java is a huge win.  A world with less Java is
a world with less AbstractFactoryConstructorInterfaces.
2015-08-11 15:31:32 -07:00
James R. Barlow
9247ea00bf Improve ruffus exception handling
ruffus swallows the return code if, in the process of handling an exception,
we hit an error in ruffus' own code, which can happen.  So pick through
its error stack and find out if there's an interesting return code in
there.  Had to use eval() of all things.

Also suppress the stack trace for normal error conditions that don't
need one.
2015-08-11 02:19:46 -07:00
James R. Barlow
1cb5f6a90d Refactor exit codes; test for missing tessdata
Some versions of tesseract installed by homebrew end up without a
functional tessdata folder, and tesseract is not helpful in this
situation, so add a new test to make sure our output is at least
indicative of the problem.

In the process of properly handling return codes I discovered
test_override_metadata triggers a NPE inside JHOVE probably due to the
Unicode character checking.  This could be specific to my JRE (1.6.0_65,
Oracle) but it's probably JHOVE's fault.  A valid PDF/A (per Acrobat)
is still generated.
2015-08-11 00:17:02 -07:00
James R. Barlow
8d848284df Fix code, test case: complain when GS fails to produce PDF/A
Modified pipeline to fix regression and return the proper error code if
we did not produce a PDF/A as expected.  The wrapper forces the output
to be PDF 1.3 which is not PDF/A compliant.

The funny thing is that in some cases JHOVE incorrectly states that a
file is PDF/A-1b compliant, well formed and valid, even when it is not
according to Acrobat XI and is missing the PDF/A metadata marker, as
far as I can tell.  JHOVE may not be as beneficial as hoped.
2015-08-10 16:05:00 -07:00
James R. Barlow
16d24f1166 Bump version to -rc4 2015-08-05 23:26:38 -07:00
James R. Barlow
8fcbbcef94 Improve usage text 2015-08-05 16:56:53 -07:00