80 Commits

Author SHA1 Message Date
James R. Barlow
4a27124eab Simplify metadata for invalid xml in output
Removes possibly non-free resource enron1.pdf.
2020-02-12 00:07:18 -08:00
James R. Barlow
0c0d53b10f tests: AcroForm test case did not work correctly; fixed 2019-12-30 17:50:32 -08:00
James R. Barlow
c5571388e2 Improve test coverage of _sync.py 2019-12-10 01:06:27 -08:00
James R. Barlow
5e2a7f8a56 tests: speed up several slow tests 2019-12-09 16:17:57 -08:00
James R. Barlow
0a72c12ff0 weave: add new test for link consistency 2019-05-12 03:36:33 -07:00
James R. Barlow
f34b3015b2 Prevent Ghostscript from generating invalid XMP metadata
If DocumentInfo contains NULs, Ghostscript will generate XMP with
NULs, which is not allowed. Repair DocumentInfo before Ghostscript sees it.
2019-01-04 13:20:41 -08:00
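The repair described in this commit can be pictured with a minimal sketch. It assumes DocumentInfo is modelled as a plain dict of strings; the function name strip_nuls and the sample keys are hypothetical and are not taken from the project's code.

```python
def strip_nuls(docinfo: dict) -> dict:
    """Return a copy of docinfo with NUL characters removed from string values."""
    cleaned = {}
    for key, value in docinfo.items():
        if isinstance(value, str):
            # NUL is not allowed in XML, so drop it before XMP is generated
            value = value.replace("\x00", "")
        cleaned[key] = value
    return cleaned


# Hypothetical sample metadata; non-string values pass through unchanged
info = {"/Title": "Annual\x00 Report", "/Author": "J. Doe", "/Pages": 3}
print(strip_nuls(info))
```

Returning a cleaned copy rather than mutating the input keeps the original metadata available for comparison or logging.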
James R. Barlow
9e6b54c7ed Add test case for Type3 fonts with no Unicode mapping 2018-11-15 21:54:26 -08:00
James R. Barlow
d3b334c10f Test case: true type font without Unicode mapping 2018-11-15 16:22:53 -08:00
James R. Barlow
686207ab7f Check for and reject Adobe LiveCycle Designer PDFs
These are the ones that display a "Please wait..." message.

Closes #296
2018-09-13 21:50:51 -07:00
James R. Barlow
795019b0c1 Work around invalid TOC entries
Kodak Capture Desktop and probably other software creates a
/Outlines entry with /First being set to an invalid indirect reference to
an object that hasn't been created. This is legal in the PDF spec but
problematic for qpdf. The objgen will be (max valid object ID + 1, 0).
Because we create new objects in _weave, some TOC entries will end
up assigned to new objects we create. Typically /ProcSet.

We solve the issue by refactoring page traversal and then doing it
twice, once to resolve all references (eliminating the null
reference problem) and a second pass to make our changes.
2018-09-11 14:44:16 -07:00
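A toy model of the two-pass idea, not the actual page traversal in _weave: objects live in a dict keyed by object number, and Ref, resolve_dangling and add_object are hypothetical names introduced only for this sketch. It shows why resolving dangling references first keeps a newly created object from being adopted by an old TOC entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference to object number `num`."""
    num: int


def resolve_dangling(objects: dict) -> None:
    """Pass 1: replace references to objects that do not exist with None."""
    for obj in objects.values():
        for key, value in list(obj.items()):
            if isinstance(value, Ref) and value.num not in objects:
                obj[key] = None


def add_object(objects: dict, body: dict) -> Ref:
    """Pass 2 helper: store a new object under the next free object number."""
    num = max(objects) + 1
    objects[num] = body
    return Ref(num)


# Object 2 is an outline whose /First points at object 4, which was never
# created: exactly (max valid object ID + 1).
objects = {
    1: {"/Outlines": Ref(2), "/Pages": Ref(3)},
    2: {"/First": Ref(4)},
    3: {"/Kids": None},
}

resolve_dangling(objects)                            # pass 1
procset = add_object(objects, {"/ProcSet": "PDF"})   # pass 2
print(objects[2], procset)
```

Without the first pass, the new object would receive number 4 and the outline's /First would silently point at it.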
James R. Barlow
c171cb7286 Merge img2pdf 0.3.0 fix from v6.2.3 2018-08-01 15:17:33 -07:00
James R. Barlow
1d09061130 Revert previous commit and reject input images with alpha channel
Decided on this for simplicity of old release branch.

Modifies baiona.png by stripping alpha, adds baiona_alpha which
includes the alpha.
2018-07-31 23:45:28 -07:00
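One way to detect an alpha channel in a PNG input, sketched with the standard library only. It reads the color type byte from the IHDR chunk, where types 4 and 6 carry an alpha channel. This is an illustration, not the check the project uses; png_has_alpha_channel and make_png_header are hypothetical names, and transparency supplied through a tRNS chunk is deliberately not covered.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
ALPHA_COLOR_TYPES = {4, 6}  # grayscale + alpha, truecolor + alpha


def png_has_alpha_channel(data: bytes) -> bool:
    """Return True if the PNG's IHDR color type includes an alpha channel."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Layout: signature (8), chunk length (4), "IHDR" (4), width (4),
    # height (4), bit depth (1), color type (1) -> color type is byte 25
    return data[25] in ALPHA_COLOR_TYPES


def make_png_header(color_type: int) -> bytes:
    """Build signature + IHDR for a 1x1, 8-bit image (header only, for the demo)."""
    body = struct.pack(">IIBBBBB", 1, 1, 8, color_type, 0, 0, 0)
    chunk = b"IHDR" + body
    crc = struct.pack(">I", zlib.crc32(chunk))
    return PNG_SIGNATURE + struct.pack(">I", len(body)) + chunk + crc


for color_type in (0, 2, 4, 6):
    header = make_png_header(color_type)
    print(color_type, png_has_alpha_channel(header))
```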
James R. Barlow
ed8ff79e10 Optimize some of our bigger test files
Only partially optimize multipage.pdf so that it hopefully
improves speed of test suite without being useless as an
optimization test.
2018-06-29 00:35:49 -07:00
James R. Barlow
9637696a54 Fix test resources naming inconsistency 2018-06-28 23:37:14 -07:00
James R. Barlow
02b3ca6862 Compress test images more heavily 2018-06-28 21:40:12 -07:00
James R. Barlow
2131ad4670 Fix --remove-background error on PDFs with colormapped images
It's unclear how exactly a colormapped image gets to this spot given
the tendency of other image processing tools to flatten such images,
but someone made it happen, so now we make sure the image is okay.

Closes #262
2018-04-27 17:21:01 -07:00
James R. Barlow
7368399f8b Clarify license of two test files - https://github.com/jbarlow83/OCRmyPDF/issues/254 2018-04-17 11:56:36 -07:00
James R. Barlow
34c78a892a Fix list table for tests/resources
[ci skip]
2018-04-15 23:52:19 -07:00
James R. Barlow
4f6bffb477 Update copyrights 2018-03-31 11:54:38 -07:00
James R. Barlow
45dbff6401 Fix table of contents not preserved in PDF/A 2018-03-26 02:23:19 -07:00
James R. Barlow
6756016572 Add license notice to all files
Source files to GPL3

Exceptions:
-tests/spoof/* to MIT
-hocrtransform.py
-_unicodefun.py

Test resources to CC BY-SA 4.0 except when otherwise noted.

Add GPL license.
2018-03-24 02:33:24 -07:00
James R. Barlow
74ca736333 Issue #223: improve text of encrypted PDF error message 2018-02-27 15:08:22 -08:00
James R. Barlow
a9da839c39 Add vector-only PDF test case 2018-02-08 00:17:35 -08:00
James R. Barlow
3a167af2c4 Nearly smallest possible PDF-1.3 with all required fields 2017-11-26 23:32:21 -08:00
James R. Barlow
965de3a235 Test case for issue #200 2017-11-26 22:52:53 -08:00
James R. Barlow
34fc1f5fd7 Add reminder that blank.pdf is not trivial 2017-09-13 01:19:18 -07:00
James R. Barlow
d04e43d46d Update copyright info for test files
[ci skip]
2017-09-01 01:00:32 -07:00
James R. Barlow
52483072dc Add a differential test that checks tesseract uses supplied word list 2017-07-21 16:40:20 -07:00
James R. Barlow
4b5cd420e1 Add new test file 2017-05-29 12:16:08 -07:00
James R. Barlow
21982cf1cb baiona_gray remove alpha channel 2017-05-11 23:23:37 -07:00
James R. Barlow
edc01408da Update the .png files, again, hopefully without corruption 2017-05-11 23:20:50 -07:00
James R. Barlow
bf04f03c4c Fix corrupt test file “typewriter.png”
This file is not currently used in any tests, but could be, so replace
corrupt version with a useful one.
2017-05-06 22:28:34 -07:00
James R. Barlow
93e802f473 Fix issue #163, color and grayscale images JPEG compressed when not needed 2017-05-06 22:27:25 -07:00
James R. Barlow
aa859a4139 Fix #156 - NoneType has no ‘getObject’ for pages with no /Contents 2017-05-01 15:46:15 -07:00
James R. Barlow
d1a0065ef8 Create test case for Form XObjects 2017-02-14 12:51:15 -08:00
James R. Barlow
1976dc6f30 Fix issue #121 “pop from empty list” (content stream parsing error) 2017-01-26 17:24:40 -08:00
James R. Barlow
097a69d07f pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:04:14 -08:00
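For context, the error named in this commit is raised by Python's decimal module when a quantized result would need more digits than the context precision allows (28 by default). The sketch below only reproduces that condition; quantize_or_none is a hypothetical helper and is not the fix applied in pageinfo.

```python
from decimal import Decimal, InvalidOperation


def quantize_or_none(value: Decimal, exp: Decimal = Decimal("0.000001")):
    """Quantize value to six decimal places, or return None if it cannot fit."""
    try:
        return value.quantize(exp)
    except InvalidOperation:
        # The result would need more digits than the context precision
        return None


print(quantize_or_none(Decimal("72.1234567")))  # fits within 28 digits
print(quantize_or_none(Decimal("1E+30")))       # would need 37 digits
```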
James R. Barlow
949d2ff1c2 v4.3.1 release notes 2016-11-07 14:36:08 -08:00
James R. Barlow
cc9c0d819e Add test case for documents that get rotated incorrectly after deskew 2016-11-07 14:15:03 -08:00
James R. Barlow
fdd9b8b8ce Optimize some of the test resources to reduce file sizes
Mostly by reducing RGB -> monochrome and applying JBIG2 compression
2016-11-07 14:01:23 -08:00
James R. Barlow
a86805f0d9 Remove possibly non-free page from "multipage.pdf" 2016-10-27 15:56:43 -07:00
James R. Barlow
013c5a369f Replace redacted file with an OCR-able file 2016-10-07 12:45:22 -07:00
James R. Barlow
6baf8668a6 Replace non-free file milk.pdf with free equivalent 2016-10-06 13:10:28 -07:00
James R. Barlow
4ba2962c56 Comment on non-free files 2016-10-05 16:48:16 -07:00
James R. Barlow
4dad09cc91 resources/README: replace the other large table with a list table 2016-10-05 16:38:51 -07:00
James R. Barlow
825c0f8b2a Note that milk.pdf is non-free, start using list-tables 2016-09-10 14:44:00 -07:00
James R. Barlow
9ca29c787b Update description of masks.pdf to reflect what it actually tests 2016-09-01 21:21:14 -07:00
James R. Barlow
bf89e38c69 Add milk.pdf test case 2016-08-31 11:42:21 -07:00
James R. Barlow
d25397e2b0 Add test case for PDFs with masks and stencil masks 2016-08-26 15:03:27 -07:00
James R. Barlow
fef35e4eb2 Fix handling of DPI for rare case of JPEG recompression after deskew/clean
This test is exercised by page 4 of multipage.pdf. If all images are
JPEGs, and one of deskew/clean removes DPI information, make sure that
we can get the right information back and that the DPI stays square.
2016-07-29 01:34:52 -07:00
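A minimal sketch of the DPI recovery described in this commit, assuming DPI is carried as an (x, y) tuple and that a processing step signals lost DPI information with None. The function name restore_square_dpi and the choice of the larger component as the square value are assumptions made for illustration.

```python
def restore_square_dpi(processed_dpi, original_dpi):
    """Return an (x, y) DPI pair with equal components.

    processed_dpi is the DPI reported after deskew/clean, or None if that
    step dropped it; original_dpi is the DPI recorded before processing.
    """
    xres, yres = processed_dpi if processed_dpi else original_dpi
    # Assumption: use the larger component so no resolution is discarded
    square = max(xres, yres)
    return (square, square)


print(restore_square_dpi(None, (300, 300)))
print(restore_square_dpi((200, 300), (300, 300)))
```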