77 Commits

Author SHA1 Message Date
James R. Barlow
4b51b521e2 Implement autorotate (provided lossless reconstruction is disabled)
Works for a single page file, probably

Although arguably rotation is not quite lossless, and the two could be
mutually exclusive anyway, so maybe this is it. Did not check in some
debugging changes (lossless=False, text debugging=True)

PyPDF seems to get merging wrong when one of the pages is rotated.
2016-02-07 03:27:33 -08:00
James R. Barlow
daa3916430 Fix img2pdf 0.2 usage
All tests pass when forced to rely on img2pdf, so seems okay
2016-02-05 15:13:26 -08:00
James R. Barlow
e9b87cefcc Try img2pdf 0.2 2016-02-05 14:38:37 -08:00
James R. Barlow
37c508f3f8 Better versioning: no silly version files, but wrong ver in development
Small price to pay.
2016-01-19 16:07:52 -08:00
James R. Barlow
26e36422cc More fiddling with version 2016-01-19 15:07:21 -08:00
James R. Barlow
f82cb002bc Try automatic versioning with setuptools_scm 2016-01-19 13:27:18 -08:00
James R. Barlow
c1eb047a4b Fix name of pdfa_def.ps
Used to include a copy of the parent dir's name.
2016-01-19 13:11:03 -08:00
James R. Barlow
626ca18f5c Remove stale comment 2016-01-19 13:02:35 -08:00
James R. Barlow
a0952bfca3 Optimize: use img2pdf stream instead of repeated copies 2016-01-18 20:24:46 -08:00
James R. Barlow
fc0479f110 Fix all but test_oversample[hocr] 2016-01-15 15:46:47 -08:00
James R. Barlow
62728205b6 Implement image+text merging in other cases
5 failed, 28 passed

failures:
test_oversample[hocr], test_skip_ocr, test_skip_big, test_maximum_options[hocr],
test_blank_input_pdf,
2016-01-15 15:38:08 -08:00
James R. Barlow
dc0fb25e64 Render hocr page: no longer needs an image as input 2016-01-15 15:16:47 -08:00
James R. Barlow
7067110308 Add safety check to prevent merge from running when not sensible 2016-01-15 14:54:45 -08:00
James R. Barlow
599d889703 Implement "perfect reconstruction" - transfer page and watermark OCR layer
Works, does not account for changes to clean/deskew, etc.
Surprisingly, it works. PyPDF2 fixes since last attempt?
2016-01-15 14:39:12 -08:00
James R. Barlow
074c1d71b4 Activate --tesseract-pagesegmode 2016-01-11 17:19:32 -08:00
James R. Barlow
1fca9a004d Adjust command line parameters
Was splitting each argument to --tesseract-config into a list of single
character strings
2016-01-11 16:57:19 -08:00
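A minimal sketch of how this kind of bug can arise, assuming the option is collected with argparse's `append` action and the values are then merged with `list.extend`. The parser setup and the names `buggy` and `fixed` are illustrative assumptions, not the project's actual code.

```python
import argparse

# Hypothetical parser: each --tesseract-config occurrence appends one string.
parser = argparse.ArgumentParser()
parser.add_argument('--tesseract-config', action='append')

opts = parser.parse_args(
    ['--tesseract-config', 'hocr', '--tesseract-config', 'pdf'])
values = opts.tesseract_config or []

# Buggy merge: extend() iterates over each string, yielding single characters.
buggy = []
for cfg in values:
    buggy.extend(cfg)

# Fixed merge: append() keeps each argument as one whole string.
fixed = []
for cfg in values:
    fixed.append(cfg)
```

A Python string is itself an iterable of characters, so any call that expects a list of strings but receives a bare string will split it this way without raising an error.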
James R. Barlow
b485a1ef78 Override ruffus' handling of --jobs
Ruffus treats omitted parameter as -j1. For our purposes it makes more
sense for omitting the parameter to mean "use all CPUs". As such we
must be able to distinguish -j1 from the parameter -j being omitted.

Telling ruffus to ignore the argument actually just makes it not auto
generate the argument. We can add an argument back with the same name.
2016-01-09 19:07:48 -08:00
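The distinction described above can be sketched with plain argparse, assuming a default of `None` marks the omitted case. This sketch leaves ruffus out entirely, and `resolve_jobs` is a hypothetical helper name.

```python
import argparse
import os


def resolve_jobs(argv):
    """Return the worker count: all CPUs when -j is omitted, else the value given."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "omitted" apart from an explicit "-j 1".
    parser.add_argument('-j', '--jobs', type=int, default=None)
    opts = parser.parse_args(argv)
    if opts.jobs is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        return os.cpu_count() or 1
    return opts.jobs
```

With an integer default such as 1, the parsed result would be identical for `-j1` and for no flag at all, which is why a sentinel default is used here.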
James R. Barlow
326ef7a3ac Merge branch 'hotfix/v3.1.1' into develop
# Conflicts:
#	RELEASE_NOTES.rst
2016-01-09 18:55:04 -08:00
James R. Barlow
6af0815681 Bump version 2016-01-09 18:45:06 -08:00
James R. Barlow
61b3ccb57c Place ruffus database in temporary folder
Because we don't really use ruffus' checkpoint feature, putting the
database in a permanent location does not help anything, but it does cause
large database files, and problems because .ruffus_history.sqlite needs
to be in a writable location.
2016-01-04 13:23:47 -08:00
James R. Barlow
133357779a All subprocess invocations refactored out of main.py 2015-12-17 08:31:18 -08:00
James R. Barlow
5d8167b232 Move PDF validation check to qpdf.py 2015-12-17 08:28:00 -08:00
James R. Barlow
e76ae8c46c Move more qpdf calls into qpdf.py 2015-12-17 08:24:48 -08:00
James R. Barlow
53a7c0e668 Refactor qpdf subprocess calls into module 2015-12-17 08:19:53 -08:00
James R. Barlow
4ca243e490 Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars 2015-12-17 07:27:26 -08:00
Shem Pasamba
d7c7559b05 Use boolean instead of integers 2015-12-17 11:23:27 +08:00
Shem Pasamba
b2b66d1344 Don't exit when qpdf repair was successful 2015-12-17 11:20:20 +08:00
James R. Barlow
5d111a3c04 Refactor tesseract --pdfrenderer calls to tesseract.py 2015-12-16 17:48:26 -08:00
James R. Barlow
10416f847f Migrate tesseract-hocr code to tesseract module, because modularity 2015-12-16 17:36:11 -08:00
James R. Barlow
79b3472b26 All tests passed, bump version 2015-12-04 04:31:01 -08:00
James R. Barlow
f1b2f1ae08 Merge branch 'feature/pdfa-2' into develop 2015-12-04 04:04:08 -08:00
James R. Barlow
ee7d97ae8c Trivial 2015-12-04 04:03:38 -08:00
James R. Barlow
7d9f473bb1 Remove eval() call by introspecting ExitCode 2015-12-04 03:34:53 -08:00
James R. Barlow
e77a5e5e75 We don't want threads. Really. Do. Not. Want. 2015-12-04 03:11:38 -08:00
James R. Barlow
6ab19af122 Comments 2015-12-04 03:09:39 -08:00
James R. Barlow
276fe49867 Better error messages for input file not found or invalid
Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.
2015-12-04 03:07:53 -08:00
James R. Barlow
acb31abe86 Fix issue #20 - fails on uppercase .PDF 2015-12-04 02:14:09 -08:00
James R. Barlow
4f964a3c8a Introduce --pdf-renderer auto
Tess 3.03 has various quality problems, like wrong DPI, that are fixed
in Tess 3.04. Idea here is to introduce an option to let OCRmyPDF
select the rendering backend based on the options and system.

However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract does not pass all test cases, mainly
the one where --tesseract-timeout is triggered, and some others.
2015-12-02 23:20:31 -08:00
James R. Barlow
80d89b5420 Set /Creator metadata to OCRmyPDF
with reference to Tess version and settings
2015-12-02 02:19:39 -08:00
James R. Barlow
281eafada0 bump to v3.0 and move repos 2015-09-05 00:53:14 -07:00
James R. Barlow
c14e10128a Bump version to -rc9 2015-08-29 16:43:22 -07:00
James R. Barlow
c4f134d694 Prevent running validation on missing file after an exception is thrown 2015-08-28 04:48:29 -07:00
James R. Barlow
83f9dfbac4 Use png256 raster device when possible
Someone reported a bug where the .png input to unpaper ended up being
type 'P' (palette) for some reason, which was not supported in unpaper.

Not sure how it happened, but it seemed easier to fix by explicitly
supporting it. Here we use png256 if it would capture all colors in the
input file. It's up to tesseract/reportlab to make use of the palette
PNG when rendering.
2015-08-28 04:47:57 -07:00
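One way to picture the "would capture all colors" test is to count distinct pixel values and fall back to a 24-bit device once the palette limit is exceeded. This is only a sketch: the commit does not say how the check was done, and the function name and the pixel-iterable input are assumptions. `png256` and `png16m` are Ghostscript raster device names.

```python
def choose_png_device(pixels, palette_size=256):
    """Pick 'png256' if every distinct color fits in the palette, else 'png16m'."""
    distinct = set()
    for pixel in pixels:
        distinct.add(pixel)
        if len(distinct) > palette_size:
            # Too many colors for a palette image; stop scanning early.
            return 'png16m'
    return 'png256'
```

Stopping as soon as the limit is exceeded keeps the scan cheap for full-color pages.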
James R. Barlow
2ce6834be4 Bump to -rc8 2015-08-24 01:25:01 -07:00
James R. Barlow
b376672dbc Bug fix: exception thrown if input PDF was missing DocumentInfo block 2015-08-24 01:23:30 -07:00
James R. Barlow
aab08bfcc7 Fix requirements.txt problem 2015-08-23 12:30:40 -07:00
James R. Barlow
4f3673d14d Update notes for -rc6 2015-08-22 00:40:07 -07:00
James R. Barlow
cc161780df Replace fileinput with regular open-replace
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ascii encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
2015-08-18 23:27:50 -07:00
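A minimal sketch of the "regular open-replace" approach, assuming the file is small enough to read whole and is encoded as UTF-8. `replace_in_file` is a hypothetical name, not the function used in the project.

```python
def replace_in_file(path, old, new, encoding='utf-8'):
    """Rewrite a text file in place, replacing every occurrence of old with new."""
    # Read the whole file with an explicit encoding so non-ASCII text survives.
    with open(path, 'r', encoding=encoding) as f:
        text = f.read()
    # Write it back with the same encoding.
    with open(path, 'w', encoding=encoding) as f:
        f.write(text.replace(old, new))
```

Passing the encoding explicitly on both the read and the write avoids depending on the platform's default locale encoding.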
James R. Barlow
53c88093ad Bump to -rc5 2015-08-16 02:19:04 -07:00
James R. Barlow
30072e0c70 Pillow sucks
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types.  Reportlab calls for
pillow 2.4.0.  On Ubuntu 14.04 LTS this will trigger an upgrade of
pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution.  Instead, you see nasty
stack traces.

So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
2015-08-16 00:54:03 -07:00
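A run-time check along these lines could be sketched with Pillow's `PIL.features` module, which newer Pillow releases provide. This is an assumption about approach, since the commit does not show how the check was written, and `missing_pillow_codecs` is a hypothetical helper. If Pillow or its `features` module cannot be imported, the sketch treats every required codec as missing.

```python
def missing_pillow_codecs(required=('jpg', 'zlib')):
    """Return the required codecs that this Pillow build cannot handle."""
    try:
        from PIL import features
    except ImportError:
        # No Pillow (or a release without PIL.features): nothing is usable.
        return list(required)
    # 'jpg' covers JPEG support; 'zlib' is what PNG compression depends on.
    return [codec for codec in required if not features.check_codec(codec)]
```

A caller could refuse to start when the returned list is not empty and tell the user which libraries to install before rebuilding Pillow, instead of letting a stack trace surface later.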