
These test files are used in OCRmyPDF's test suite. They may not produce any OCR results
and are not meant as examples of OCR output. Some are deliberately invalid PDFs that might
crash certain PDF viewers.


Files derived from free sources
===============================

These test resources come from free sources, under either public domain or Creative Commons licenses.
In some cases they were converted from one image format to another without other changes.

.. list-table:: 
    :widths: 20 50 30
    :header-rows: 1

    *   - File
        - Source
        - License
    *   - c02-22.pdf
        - `Project Gutenberg`_, Adventures of Huckleberry Finn, page 22
        - Public Domain
    *   - congress.jpg
        - `US Congressional Records`_
        - Public Domain
    *   - graph.pdf
        - `Wikimedia: Pandas text analysis.png`_
        - Public Domain
    *   - lichtenstein.pdf
        - `Wikimedia: JPEG2000 Lichtenstein`_
        - Creative Commons BY-SA 3.0
    *   - LinnSequencer.jpg, linn.pdf, linn.txt
        - `Wikimedia: LinnSequencer`_
        - Creative Commons BY-SA 3.0
    *   - typewriter.png, 2400dpi.pdf
        - `Wikimedia: Triumph typewrtier text Linzensoep`_
        - Creative Commons BY-SA 2.5
    *   - baiona.png
        - `Wikimedia: Baionako udalerri mugakideak`_
        - Creative Commons BY-SA 4.0


Files generated for this project
================================

The following test resources were crafted specifically for this project. Each is
licensed as listed in the table.

.. list-table:: 
    :widths: 20 40 15 15 10
    :header-rows: 1

    *   - File
        - Purpose
        - Contributor
        - Copyright Holder
        - License
    *   - aspect.pdf
        - test image with 200 x 100 DPI resolution
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - blank.pdf
        - blank PDF generated by Adobe Illustrator CC 17, containing a lot of application-specific metadata/bloat
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - cmyk.pdf
        - a CMYK image created in Photoshop
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0        
    *   - crom.png
        - test for non-dictionary words
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0        
    *   - enormous.pdf
        - very large PDF page
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - epson.pdf
        - a linearized PDF containing some unusual indirect objects, created by an Epson printer; printout of a Wikipedia article (CC-BY-SA)
        - @lowesjam
        - Wikipedia authors
        - CC-BY-SA 3.0
    *   - formxobject.pdf
        - hand-crafted PDF containing an image inside a Form XObject
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - francais.pdf
        - a page containing French accents (diacritics)  
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - hugemono.pdf
        - large monochrome 35000x35000 image in JBIG2 encoding 
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - invalid.pdf
        - a PDF file header followed by an EOF marker
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - masks.pdf
        - file containing explicit masks and a stencil mask drawn without a proper transformation matrix; printout of a German Wikipedia article (CC-BY-SA)
        - @supergrobi
        - Wikipedia authors
        - CC-BY-SA 3.0
    *   - missing_docinfo.pdf
        - PDF file with no /DocumentInfo section
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - overlay.pdf
        - PDF file generated by PDFPen pro that triggered content stream parse errors
        - @maxandersen
        - @maxandersen
        - CC-BY-SA 4.0
    *   - negzero.pdf
        - copy of formxobject.pdf with a token that qpdf doesn't like
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - no_contents.pdf
        - synthetic PDF with a blank page that has no /Contents entry
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - trivial.pdf
        - smallest possible valid PDF-1.3 with all required fields
        - @jbarlow83
        - @jbarlow83
        - CC-BY-SA 4.0
    *   - vector.pdf
        - a PDF with vector art and text rendered as curves with no fonts
        - @Catscratch
        - @Catscratch
        - CC-BY-SA 4.0
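
As a rough illustration of the invalid.pdf entry above, a file consisting of only a PDF header and an EOF marker could be produced along these lines. This is a sketch under assumptions: the function name, the version string, and the exact byte layout are hypothetical, and the real test file may differ.

.. code-block:: python

    def build_header_and_eof(version: str = "1.3") -> bytes:
        """Return bytes holding only a PDF header line and an EOF marker.

        No objects, cross-reference table, or trailer are written, so a
        conforming reader should reject the result as an invalid PDF.
        """
        header = f"%PDF-{version}\n".encode("ascii")  # PDF header comment
        eof = b"%%EOF\n"  # end-of-file marker
        return header + eof


    data = build_header_and_eof()

Writing ``data`` to disk in binary mode would give a similar file for experimenting with error handling.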


Assemblies
==========

These test resources were assembled or derived from the files listed above and are released under the same license terms as their input files.

- baiona_gray.png (from baiona.png)
- baiona_colormapped.png (from baiona.png)
- cardinal.pdf (four cardinal directions, baked-in rotated copies of LinnSequencer.jpg)
- ccitt.pdf (LinnSequencer.jpg, converted to CCITT encoding)
- encrypted_algo4.pdf (congress.jpg, encrypted with algorithm 4 - not supported by PyPDF2)
- graph_ocred.pdf (from graph.pdf)
- jbig2.pdf (congress.jpg, converted to JBIG2 encoding)
- multipage.pdf (from several other files)
- palette.pdf (congress.jpg, converted to a 256-color palette)
- poster.pdf (from LinnSequencer.jpg)
- rotated_skew.pdf (a /Rotate'd and skewed document from LinnSequencer.jpg)
- skew-encrypted.pdf (skew.pdf with encryption - access supported by PyPDF2, password is "password")
- skew.pdf (from LinnSequencer.jpg, skew simulated by adjusting the transformation matrix)
- toc.pdf (from formxobject.pdf, trivial.pdf)
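
The skew.pdf entry mentions simulating skew by adjusting the transformation matrix. As a hedged sketch of that idea, a small rotation can be written as a PDF ``cm`` operator placed before the image is drawn. The function name, the angle, and the six-decimal formatting are assumptions for illustration; this is not necessarily how skew.pdf was built.

.. code-block:: python

    import math


    def skew_cm_operator(angle_degrees: float) -> str:
        """Format a counterclockwise rotation as a PDF ``cm`` operator."""
        theta = math.radians(angle_degrees)
        # PDF rotation matrix: [cos t, sin t, -sin t, cos t, 0, 0]
        values = (
            math.cos(theta),
            math.sin(theta),
            -math.sin(theta),
            math.cos(theta),
        )
        # Round first, then add 0.0 so that -0.0 becomes 0.0 and the
        # output never contains a negative zero token.
        cleaned = [round(v, 6) + 0.0 for v in values]
        numbers = " ".join(f"{v:.6f}" for v in cleaned)
        return f"{numbers} 0 0 cm"


    operator_line = skew_cm_operator(2.0)

The resulting line would go into a page content stream ahead of the operator that paints the image, so that the image is drawn slightly rotated.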


.. _`Wikimedia: LinnSequencer`: https://upload.wikimedia.org/wikipedia/en/b/b7/LinnSequencer_hardware_MIDI_sequencer_brochure_page_2_300dpi.jpg

.. _`Project Gutenberg`: https://www.gutenberg.org/files/76/76-h/76-h.htm#c2

.. _`US Congressional Records`: http://www.baxleystamps.com/litho/meiji/courts_1871.jpg

.. _`Wikimedia: Pandas text analysis.png`: https://en.wikipedia.org/wiki/File:Pandas_text_analysis.png

.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png

.. _`Linux (Wikipedia Article)`: https://de.wikipedia.org/wiki/Linux 

.. _`Wikimedia: Triumph typewrtier text Linzensoep`: https://commons.wikimedia.org/wiki/File:Triumph.typewriter_text_Linzensoep.gif

.. _`Wikimedia: Baionako udalerri mugakideak`: https://commons.wikimedia.org/wiki/File:Baionako_udalerri_mugakideak.png