139 Commits

Author SHA1 Message Date
James R. Barlow
bd534c3313 main.py -> __main__.py
Executing a package with python -m packagename will check for
__main__.py inside the package.  In other words main.py should have
always been named __main__.py.

In the unlikely event that someone depends on "import ocrmypdf.main"
being meaningful, main.py continues to exist and replicates the
behavior of __main__.  (It's unlikely because import ocrmypdf.main does
unpythonic ruffus-related things at import time, essentially
configuring itself to work with sys.argv.  To fix another day.)

This should solve the problem of Debian needing to run test suites
before installation and afterwards for continuous integration without
having to patch either file, as python -m ocrmypdf will follow import
order.  That is, if the current directory contains "ocrmypdf/" (e.g.
staging a new version) then that will be tested, else sys.path will
be checked.
2016-08-31 17:01:42 -07:00
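
The import-order behaviour described in the bd534c3313 message can be checked with a throwaway package. This is a minimal sketch, not code from the repository: `demo_pkg` and `run_staged_package` are hypothetical names, and the generated temporary directory merely stands in for a staging directory that contains "ocrmypdf/".

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_staged_package(package_name="demo_pkg"):
    """Create a throwaway package and execute it with ``python -m``."""
    with tempfile.TemporaryDirectory() as staging:
        pkg = Path(staging) / package_name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        # ``python -m demo_pkg`` looks for demo_pkg/__main__.py.
        (pkg / "__main__.py").write_text('print("staged copy ran")\n')
        # With cwd=staging, the current directory is searched before the
        # rest of sys.path, so the staged copy is the one that runs.
        result = subprocess.run(
            [sys.executable, "-m", package_name],
            cwd=staging,
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(run_staged_package())
```

Running the same command from a directory that does not contain the package would fall back to whatever copy is found on sys.path, which is the post-installation case the message mentions.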
James R. Barlow
71b54035ba Bug fix issue #89: trying to perform arithmetic on IndirectObject
TypeError: bad operand type for unary -: 'IndirectObject'
2016-08-31 10:25:58 -07:00
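
The TypeError above is what Python raises when arithmetic is attempted on a reference object instead of the number it points to. The sketch below reproduces that class of error and one possible guard. It is not the fix from 71b54035ba: `FakeIndirectObject`, `resolve`, and `negate` are hypothetical names, and the only assumption about PyPDF2 is that its objects expose a `getObject()` method, as the 1.x series does.

```python
class FakeIndirectObject:
    """Hypothetical stand-in for a PDF indirect reference."""

    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target


def resolve(value):
    """Follow references until a plain value is reached."""
    while hasattr(value, "getObject"):
        target = value.getObject()
        if target is value:
            # The object resolves to itself, so there is nothing to follow.
            break
        value = target
    return value


def negate(value):
    """Negate a number that may be wrapped in one or more references."""
    return -resolve(value)


if __name__ == "__main__":
    wrapped = FakeIndirectObject(90)
    try:
        -wrapped
    except TypeError as exc:
        print("unresolved:", exc)
    print("resolved:", negate(wrapped))
```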
James R. Barlow
bc11454e1c Help text: example of shell pipeline with img2pdf 2016-08-25 14:58:25 -07:00
James R. Barlow
38fe14b108 Make final PDF/A output message less obtuse 2016-08-25 14:46:40 -07:00
James R. Barlow
1b7b2f3695 v4.2.2 release notes, documentation improvements 2016-08-25 14:46:09 -07:00
James R. Barlow
27a3813207 Recover input filename from symlink on error message
The recent commit to accept files from stdin broke the feature of
returning the input filename on an error, returning the temp filename
instead, which is confusing.
2016-08-23 17:38:28 -07:00
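
One way to keep the user-facing name available, as the 27a3813207 title suggests, is to stage regular files as symlinks and read the link target back when reporting an error. The following is a sketch under assumptions rather than the project's code: `stage_input`, `name_for_messages`, the staged file name "origin", and the "stdin" label are all hypothetical.

```python
import io
import os
import shutil
import tempfile


def stage_input(input_name, work_dir, stdin=None):
    """Place the input at a fixed temporary path inside work_dir."""
    staged = os.path.join(work_dir, "origin")
    if input_name == "-":
        # No real file exists, so copy the stream into the work directory.
        with open(staged, "wb") as out:
            shutil.copyfileobj(stdin, out)
    else:
        # A symlink keeps a record of where the input came from.
        os.symlink(os.path.abspath(input_name), staged)
    return staged


def name_for_messages(staged):
    """Return the name to show the user instead of the temporary path."""
    if os.path.islink(staged):
        return os.readlink(staged)
    return "stdin"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        staged = stage_input("-", work_dir, stdin=io.BytesIO(b"%PDF-1.4\n"))
        print(name_for_messages(staged))
```

In a real command-line tool the `stdin` argument would typically be `sys.stdin.buffer`, so that the PDF bytes are read without text decoding.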
James R. Barlow
e08c42fd3d Tweak pipeline again 2016-08-09 22:40:29 -07:00
James R. Barlow
16901f7134 Accept input from stdin if input filename is '-' 2016-08-09 15:46:24 -07:00
James R. Barlow
35addb8a33 Complain if Chinese is requested with settings known to not work
Should extend test for other Asian languages
2016-08-03 01:29:12 -07:00
James R. Barlow
d32ea8d0dd Remove dead code from qpdf merge + PyPDF2 metadata patching
I tried "qpdf merge + PyPDF2 metadata patching" first. The problem is
that PyPDF2 produces a PDF 1.3 by default, and generally I have less
confidence in it.

New approach is to stuff the Document Info metadata in the first page
with PyPDF2, cross fingers and use qpdf to merge. It's not quite as
clean and might harm the first page, but it's better than shipping
files produced by PyPDF2.
2016-08-03 01:28:27 -07:00
James R. Barlow
12575d594a Improve PDF/A validity checking at end 2016-08-03 01:26:16 -07:00
James R. Barlow
0746083301 Fix failing test case - unbound local variable in finally block 2016-08-03 01:00:38 -07:00
James R. Barlow
5c99acf6d1 Experimental change to use qpdf to merge files (disables Ghostscript)
All tests pass except one: test_input_file_not_a_pdf

Not sure if PyPDF2 metadata generation will mangle the first page.
2016-08-03 00:56:44 -07:00
James R. Barlow
ebe68de4ff Functional qpdfmerge with PyPDF2 for DocumentInfo block
Tests mostly passing. For the moment this is the new default.

However, PyPDF2 produces a PDF-1.3, which will be wrong for some contents
and possibly should be repaired with qpdf. Again.

Looks like it could work better to merge with PyPDF2 and fix everything
with qpdf.
2016-08-02 16:48:13 -07:00
James R. Barlow
b17c6a146d Experimental qpdf merging
Does not copy /Catalog metadata, but otherwise functional
2016-08-02 02:19:02 -07:00
James R. Barlow
0b24f971cd ocrmyimage: complain about ICC profiles being presumed 2016-08-02 01:22:36 -07:00
James R. Barlow
bc5d3824bd Don't overload --oversample, use --image-dpi instead for images 2016-07-31 02:09:30 -07:00
James R. Barlow
4356983707 Suppress overly long stack traces on traverse_ruffus_exception 2016-07-31 02:06:44 -07:00
James R. Barlow
2414b79ee6 More cleanup of exception related errors 2016-07-31 01:48:13 -07:00
James R. Barlow
968e1546f0 Refactor image file triage 2016-07-31 01:47:57 -07:00
James R. Barlow
f385772d21 Refactor "is this an iterable that's not a string?" test 2016-07-29 15:25:02 -07:00
James R. Barlow
d257c83520 Most tests were failing at split_pages()
It seems that ruffus sometimes decides to send a ['inputfile.pdf']
instead of a bare string.
2016-07-29 14:59:17 -07:00
James R. Barlow
7b72ffec4f ocrmyimage: better handling of missing/invalid DPI 2016-07-29 14:38:07 -07:00
James R. Barlow
757f6826dc ocrmyimage - Attempt conversion to PDF if input file is not a PDF
First cut.

May have broken ruffus errors again too.
2016-07-29 14:03:19 -07:00
James R. Barlow
d70e3d3753 ruffus exceptions: for clarity only, don't iterate strings
It's a good habit to ensure any iterator test is explicit about
allowing or disallowing strings.
2016-07-29 13:31:24 -07:00
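
The habit described in d70e3d3753 matters because a string is itself iterable, and each of its characters is again a string. A minimal sketch of an explicit test follows; `is_nonstring_iterable` and `flatten_messages` are hypothetical names chosen for illustration, not the helpers used in the repository.

```python
from collections.abc import Iterable


def is_nonstring_iterable(thing):
    """True for lists, tuples, generators, etc., but not for str or bytes."""
    return isinstance(thing, Iterable) and not isinstance(thing, (str, bytes))


def flatten_messages(item):
    """Yield leaf values from nested containers without splitting strings."""
    if is_nonstring_iterable(item):
        for child in item:
            yield from flatten_messages(child)
    else:
        yield item


if __name__ == "__main__":
    nested = ["job failed", ("detail", ["traceback text"])]
    print(list(flatten_messages(nested)))
    # A bare string and a one-element list both yield a single item.
    print(list(flatten_messages("inputfile.pdf")))
    print(list(flatten_messages(["inputfile.pdf"])))
```

Without the string exclusion, the recursive walk would descend into each string one character at a time and never terminate, since a one-character string iterates to itself.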
James R. Barlow
fef35e4eb2 Fix handling of DPI for rare case of JPEG recompression after deskew/clean
This test is exercised by page 4 of multipage.pdf. If all images are
JPEGs, and one of deskew/clean removes DPI information, make sure that
we can get the right information back and that the DPI stays square.
2016-07-29 01:34:52 -07:00
James R. Barlow
8f77576dc4 Fix non-square image resolution for "hocr" case; use img2pdf 0.2.1
Tesseract renderer not immediately fixable.
2016-07-28 16:43:51 -07:00
James R. Barlow
b3fcf24a26 Refactor DPI: fix regressions in test suite
Some called functions are particular about the data format of DPI and
don't like to deal with the Decimal() returned by PyPDF2. Convert to
float and int where needed.
2016-07-28 00:19:32 -07:00
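
The kind of conversion described in b3fcf24a26 can be sketched with the standard library alone. The message does not say which functions rejected Decimal values, so `dpi_as_float` and `dpi_as_int` are hypothetical helpers, and the mixed Decimal/float multiplication is only one example of how a Decimal can trip up downstream code.

```python
from decimal import Decimal


def dpi_as_float(dpi):
    """Convert a Decimal (or any numeric) DPI to a plain float."""
    return float(dpi)


def dpi_as_int(dpi):
    """Convert a DPI value to the nearest whole number."""
    return int(round(float(dpi)))


if __name__ == "__main__":
    dpi = Decimal("299.6")
    try:
        dpi * 1.5  # Decimal and float do not mix in arithmetic
    except TypeError as exc:
        print("Decimal rejected:", exc)
    print(dpi_as_float(dpi) * 1.5)
    print(dpi_as_int(dpi))
```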
James R. Barlow
16e4d342d2 Bug fix: --force-ocr should still run on pages with no images
Useful for people who want to reprocess text.

This also requires --oversample because DPI is undefined. To be fixed
in next commit.
2016-07-27 15:06:49 -07:00
James R. Barlow
bbd02926e1 Add helpful error message for PDFs that use algorithm 4 2016-06-23 13:13:17 -07:00
James R. Barlow
349ec5c81f Provide more helpful error message if pypdf can't merge pages 2016-04-28 14:02:12 -07:00
James R. Barlow
fe14cb57c0 Fix ruffus exception output
I found this issue in ruffus 2.6.3
https://github.com/bunbun/ruffus/issues/65
also discussed here
https://github.com/bunbun/ruffus/pull/67

The ruffus 2.6.3 RethrownJobError doesn't follow the normal conventions,
so its exceptions cause problems when they cross process boundaries.
This change examines the various forms of ruffus exception objects
that can appear in 2.6.3 and parses them more carefully. It
also removes any direct posting of the exception to the logger because
this triggers another serializing of the exception object, mutating it
further.
2016-04-28 00:38:50 -07:00
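
For readers unfamiliar with why an exception can break when it crosses a process boundary, the sketch below shows one common way this happens in Python: the exception is pickled, and unpickling re-creates it by calling the class with `args`. It is an illustration of that general convention only, not a claim about exactly how RethrownJobError fails; `JobError`, `PortableJobError`, and `survives_pickling` are hypothetical names.

```python
import pickle


class JobError(Exception):
    """Stores a formatted message in args, unlike its constructor signature."""

    def __init__(self, job, detail):
        super().__init__("%s failed: %s" % (job, detail))
        self.job = job
        self.detail = detail


class PortableJobError(Exception):
    """Passes the constructor arguments through to args unchanged."""

    def __init__(self, job, detail):
        super().__init__(job, detail)
        self.job = job
        self.detail = detail

    def __str__(self):
        return "%s failed: %s" % (self.job, self.detail)


def survives_pickling(exc):
    """Return True if exc can be serialized and rebuilt."""
    try:
        pickle.loads(pickle.dumps(exc))
    except TypeError:
        # Rebuilding calls type(exc)(*exc.args), which fails when args
        # does not match the constructor signature.
        return False
    return True


if __name__ == "__main__":
    print(survives_pickling(JobError("split_pages", "bad input")))
    print(survives_pickling(PortableJobError("split_pages", "bad input")))
```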
James R. Barlow
e877d37ac8 --rotate-pages: Only apply rotation if we're reasonably confident
Take the threshold from tesseract's default value for -psm 1.
2016-04-14 13:49:44 -07:00
James R. Barlow
322085933b unpaper: fix check for missing and old versions, add test case 2016-03-10 15:37:09 -08:00
James R. Barlow
6a380ee99c Fix temporary file placed in wrong folder 2016-02-27 00:51:47 -08:00
James R. Barlow
dad2198394 Log information about detected page orientations in a summary line 2016-02-26 01:07:59 -08:00
James R. Barlow
e40fdc502d Always dump stack trace for unexpected errors 2016-02-26 01:06:59 -08:00
James R. Barlow
71fbda8bf6 Adjust page orientation parsing to deal with change in Tess 3.04.01 2016-02-20 01:32:56 -08:00
James R. Barlow
f3b0434a87 Improve ability to capture error messages from tesseract on a crash 2016-02-19 03:48:49 -08:00
James R. Barlow
d4ef3411e0 Suppress --pdf-renderer tesseract warning in Docker image
Since the corrected font is provided in the Docker image, there's no
reason to show the warning.
2016-02-17 01:03:20 -08:00
James R. Barlow
60b2eb1455 Fix JPEG DPI: Pillow expects dpi=(x,y) 2016-02-16 07:29:20 -08:00
James R. Barlow
71e493a810 Fix case of JPEG missing DPI field 2016-02-16 05:29:32 -08:00
James R. Barlow
c50e3f1329 Complain about older tesseracts that don't have sharp2.ttf installed 2016-02-15 16:43:41 -08:00
James R. Barlow
7c691c21ab Fix image layer rotation for pages with nonzero crop boxes 2016-02-10 17:48:33 -08:00
James R. Barlow
4ec51729d8 Partial fix for images not anchored to (0, 0) 2016-02-10 17:14:48 -08:00
James R. Barlow
6510bcad19 DPI information not transferred automatically from PNG to JPEG 2016-02-09 02:18:54 -08:00
James R. Barlow
1928a64cae Better logging output for autorotation 2016-02-08 23:42:25 -08:00
James R. Barlow
1ba8b1aa4b unpaper is lousy at deskewing, so let leptonica do it 2016-02-08 15:26:33 -08:00
James R. Barlow
2752bda80b Merge branch 'feature/leptdeskew' into feature/logging
Need leptonica for testing now, I think
# Conflicts:
#	ocrmypdf/tesseract.py
#	requirements.txt
#	setup.py
2016-02-08 12:34:48 -08:00
James R. Barlow
d30a879e2d Fix test suite by running select_image_for_pdf unconditionally
The change that caused the problem was a minor optimization for the
tesseract renderer path: it pulled an image from select_image_for_pdf
so that it could use a JPEG instead of a PNG, rather than taking it
from preprocess_clean, where it would only get a PNG and make large
files.
2016-02-08 02:33:03 -08:00