2676 Commits

Author SHA1 Message Date
James R. Barlow
44f47fba21 PDF/A: handle case of no XMP metadata gracefully 2016-08-03 02:57:25 -07:00
James R. Barlow
02584094a1 Suppress NUL bytes in metadata from input files 2016-08-03 02:47:44 -07:00
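
A minimal sketch of what suppressing NUL bytes in metadata could look like. The function names and the dict-shaped metadata are assumptions for illustration, not the code from this commit:

```python
def strip_nul(value: str) -> str:
    """Return the string with every NUL character removed."""
    return value.replace("\x00", "")


def clean_metadata(info: dict) -> dict:
    """Strip NUL characters from string values; leave other values untouched.

    `info` stands in for a Document Info dictionary (hypothetical shape).
    """
    return {
        key: strip_nul(val) if isinstance(val, str) else val
        for key, val in info.items()
    }
```

For example, `clean_metadata({"/Title": "Scan\x00ned"})` would give a title with the NUL byte dropped.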
James R. Barlow
91d715ac93 Add test cases for --output-type 2016-08-03 02:47:18 -07:00
James R. Barlow
35addb8a33 Complain if Chinese is requested with settings known to not work
Should extend test for other Asian languages
2016-08-03 01:29:12 -07:00
James R. Barlow
d32ea8d0dd Remove dead code from qpdf merge + PyPDF2 metadata patching
I tried "qpdf merge + PyPDF2 metadata patching" first. The problem is
that PyPDF2 produces a PDF 1.3 file by default, and generally I have less
confidence in it.

New approach is to stuff the Document Info metadata in the first page
with PyPDF2, cross fingers and use qpdf to merge. It's not quite as
clean and might harm the first page, but it's better than shipping
files produced by PyPDF2.
2016-08-03 01:28:27 -07:00
James R. Barlow
12575d594a Improve PDF/A validity checking at end 2016-08-03 01:26:16 -07:00
James R. Barlow
0746083301 Fix failing test case - unbound local variable in finally block 2016-08-03 01:00:38 -07:00
James R. Barlow
5c99acf6d1 Experimental change to use qpdf to merge files (disables Ghostscript)
All tests pass but one: test_input_file_not_a_pdf

Not sure if PyPDF2 metadata generation will mangle the first page.
2016-08-03 00:56:44 -07:00
James R. Barlow
2b10df7b74 leptonica: note about when it may be safe to drop <1.72 workaround 2016-08-03 00:54:37 -07:00
James R. Barlow
ebe68de4ff Functional qpdfmerge with PyPDF2 for DocumentInfo block
Tests mostly passing. For the moment this is the new default.

However, PyPDF2 produces a PDF-1.3 file, which will be wrong for some contents
and possibly should be repaired with qpdf. Again.

Looks like it could work better to merge PyPDF2 and fix everything
with qpdf.
2016-08-02 16:48:13 -07:00
James R. Barlow
b17c6a146d Experimental qpdf merging
Does not copy /Catalog metadata, but otherwise functional
2016-08-02 02:19:02 -07:00
James R. Barlow
46d837c866 Clarify trusty/precise stuff 2016-08-02 01:29:33 -07:00
James R. Barlow
24856b61e4 Fix typo in readme 2016-08-02 01:29:22 -07:00
James R. Barlow
8d0c6ff616 pyvenv -> python3 -m venv
Sadly the Python developers are removing this script
2016-08-02 01:27:50 -07:00
James R. Barlow
0b24f971cd ocrmyimage: complain about ICC profiles being presumed 2016-08-02 01:22:36 -07:00
James R. Barlow
bc5d3824bd Don't overload --oversample, use --image-dpi instead for images 2016-07-31 02:09:30 -07:00
James R. Barlow
4356983707 Suppress overly long stack traces on traverse_ruffus_exception 2016-07-31 02:06:44 -07:00
James R. Barlow
2414b79ee6 More cleanup of exception related errors 2016-07-31 01:48:13 -07:00
James R. Barlow
968e1546f0 Refactor image file triage 2016-07-31 01:47:57 -07:00
James R. Barlow
48213c9c3f Update release notes and readme 2016-07-29 15:25:16 -07:00
James R. Barlow
f385772d21 Refactor "is this an iterable that's not a string?" test 2016-07-29 15:25:02 -07:00
James R. Barlow
d257c83520 Most tests were failing at split_pages()
It seems that ruffus sometimes decides to send a ['inputfile.pdf']
instead of a bare string.
2016-07-29 14:59:17 -07:00
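
The situation described above, where a task receives `['inputfile.pdf']` instead of a bare string, can be handled by normalizing the argument up front. This is a sketch under assumptions: `unwrap_single` is a hypothetical helper, not the fix applied to split_pages() in this commit.

```python
def unwrap_single(input_file):
    """Accept either 'file.pdf' or ['file.pdf'] and return the bare string.

    Hypothetical helper: a one-element list or tuple is unwrapped, a bare
    string is passed through, anything else is rejected.
    """
    if isinstance(input_file, (list, tuple)):
        if len(input_file) != 1:
            raise ValueError("expected exactly one input file")
        return input_file[0]
    return input_file
```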
James R. Barlow
7b72ffec4f ocrmyimage: better handling of missing/invalid DPI 2016-07-29 14:38:07 -07:00
James R. Barlow
757f6826dc ocrmyimage - Attempt conversion to PDF if input file is not a PDF
First cut.

May have broken ruffus errors again too.
2016-07-29 14:03:19 -07:00
James R. Barlow
5df83a0d30 Travis: use Python 3.5 too 2016-07-29 13:31:40 -07:00
James R. Barlow
d70e3d3753 ruffus exceptions: for clarity only, don't iterate strings
It's a good habit to ensure any iterator test is explicit about
allowing or disallowing strings.
2016-07-29 13:31:24 -07:00
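
The habit described above matters because a `str` is itself iterable, so a naive "is this iterable?" check will happily loop over its characters. One way to make the test explicit is sketched below; the function name is an assumption for illustration.

```python
from collections.abc import Iterable


def is_iterable_not_str(obj) -> bool:
    """True for lists, tuples, generators, etc., but False for str and bytes."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))
```

With this check, `"abc"` is treated as a single value while `["abc"]` is treated as a collection.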
James R. Barlow
0dfceedcfb Remove old OCRmyPDF 2.x from release notes; update 4.2 notes 2016-07-29 03:08:59 -07:00
James R. Barlow
2c30f4bfc5 Travis: build partly working on trusty; tweak requirements again
The build is #122
https://travis-ci.org/jbarlow83/OCRmyPDF/builds/148255615

Errors seem to be related to either Ghostscript or leptonica? Maybe
-dSAFER?
2016-07-29 03:08:01 -07:00
James R. Barlow
9e7fb52b47 Travis: add PPA to support unpaper 2016-07-29 01:57:12 -07:00
James R. Barlow
bb5fd38e38 Remove additional PPAs and try again 2016-07-29 01:47:56 -07:00
James R. Barlow
7c8cf5cfa2 Try travis-trusty
This removes some backports for packages that Ubuntu trusty offers but
for which Ubuntu precise needed help.
2016-07-29 01:44:57 -07:00
James R. Barlow
fef35e4eb2 Fix handling of DPI for rare case of JPEG recompression after deskew/clean
This test is exercised by page 4 of multipage.pdf. If all images are
JPEGs, and one of deskew/clean removes DPI information, make sure that
we can get the right information back and that the DPI stays square.
2016-07-29 01:34:52 -07:00
James R. Barlow
8f77576dc4 Fix non-square image resolution for "hocr" case; use img2pdf 0.2.1
Tesseract renderer not immediately fixable.
2016-07-28 16:43:51 -07:00
James R. Barlow
b3fcf24a26 Refactor DPI: fix regressions in test suite
Some called functions are particular about the data format of DPI and
don't like to deal with the Decimal() returned by PyPDF2. Convert to
float and int where needed.
2016-07-28 00:19:32 -07:00
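
A sketch of the kind of conversion described above, assuming DPI is carried as an (x, y) pair of `Decimal` values. The helper names and the pair representation are assumptions, not the project's actual code.

```python
from decimal import Decimal


def dpi_as_float(dpi):
    """Convert an (x, y) DPI pair of Decimal values to floats."""
    return tuple(float(v) for v in dpi)


def dpi_as_int(dpi):
    """Convert an (x, y) DPI pair to integers, rounding to the nearest whole DPI."""
    return tuple(int(round(float(v))) for v in dpi)
```

Callers that need a particular numeric type can then convert at the boundary instead of passing `Decimal` through.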
James R. Barlow
16e4d342d2 Bug fix: --force-ocr should still run on pages with no images
Useful for people who want to reprocess text.

This also requires --oversample because DPI is undefined. To be fixed
in next commit.
2016-07-27 15:06:49 -07:00
James R. Barlow
8458a51860 Tighten requirements and dependencies 2016-07-27 14:47:59 -07:00
James R. Barlow
636d1903b3 Ghostscript: do raster output with -dSAFER
-dSAFER does not work when rendering PDF/A, because that needs to load
the ICC file, and -dSAFER prevents access to external files.
2016-07-27 00:54:40 -07:00
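
The distinction above can be sketched as a command builder that adds `-dSAFER` only for raster output. The function name and parameters are hypothetical; only common Ghostscript switches (`-dSAFER`, `-dBATCH`, `-dNOPAUSE`, `-sDEVICE`, `-sOutputFile`) are used, and any PDF/A-specific options are left out of this sketch.

```python
def ghostscript_args(input_pdf, output_file, device, safer):
    """Build a Ghostscript argument list (illustrative only).

    safer=True is intended for raster output; PDF/A rendering would pass
    safer=False, since per the commit message it must read an external
    ICC file.
    """
    args = ["gs"]
    if safer:
        args.append("-dSAFER")
    args += ["-dBATCH", "-dNOPAUSE", f"-sDEVICE={device}"]
    args += [f"-sOutputFile={output_file}", input_pdf]
    return args
```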
jbarlow83
514efa36fc Readme: Add table of contents, brew install tesseract --with-language packs v4.1.4 2016-07-24 11:21:46 -07:00
James R. Barlow
bd48f40d3d v4.1.4 release notes v4.1.4rc1 2016-07-17 00:35:06 -07:00
James R. Barlow
c02dbc809a Merge commit '68cf9cbd87c188823027f9d1bfe9029017e7281f' into develop 2016-07-17 00:29:48 -07:00
James R. Barlow
410111d6fb Bug fix: Monochrome images with ICC treated as full color images
Issue #79.
User submitted PDF with ICC profile attached to the monochrome image
in the input file, which is not common but useful for PDFs that want to
define how light the paper is or how dark the black is. The code was
written to assume unusual images are full color unless it can prove
otherwise. Handle this simple case. Other ICC cases should be tested.
2016-07-17 00:29:32 -07:00
jbarlow83
68cf9cbd87 .rst: add code-block markup 2016-07-05 14:03:55 -07:00
jbarlow83
c9b2540d9d Fix some .rst formatting errors 2016-07-05 13:48:19 -07:00
jbarlow83
1bacf35a2c Update license information for encrypted_algo4.pdf 2016-06-24 14:25:15 -07:00
jbarlow83
8aef0d9277 Merge pull request #76 from Jmuccigr/patch-2
Adding explicit reference to help
2016-06-24 14:21:23 -07:00
John Muccigrosso
b2fa8645ba Adding explicit reference to help 2016-06-24 13:44:12 -05:00
James R. Barlow
c96823a648 v4.1.3 release notes v4.1.3 v4.1.3rc1 2016-06-23 13:47:56 -07:00
James R. Barlow
3807b7d655 Merge branch 'feature/leptfun' into develop 2016-06-23 13:45:35 -07:00
James R. Barlow
a45505cf1d Fix order of operations in matrix multiplication
Issue #73. The order of operations happens to not matter for scaling
but does matter for translation. We only need scaling to find the DPI,
so the error was not noticed. Mainly useful to other uses of this
library.
2016-06-23 13:36:23 -07:00
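
To illustrate why the bug went unnoticed, the sketch below multiplies a scaling matrix and a translation matrix in both orders. It assumes the PDF-style row-vector convention, where a transform is `[[a, b, 0], [c, d, 0], [e, f, 1]]`. The helper and the sample values are assumptions, not the library's code.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


scale = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]        # scale by 2 in x and y
translate = [[1, 0, 0], [0, 1, 0], [10, 5, 1]]   # move by (10, 5)

scale_then_translate = matmul(scale, translate)
translate_then_scale = matmul(translate, scale)
```

The upper-left 2x2 block (the scaling part, which is all that is needed to find DPI) is identical in both products, but the bottom row (the translation) differs.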
James R. Barlow
b4a734fc0d Test case for "algorithm 4" test
Algorithm 4 -> PDF version 1.6
2016-06-23 13:21:26 -07:00