These test files are used in OCRmyPDF's test suite. They do not necessarily
produce any OCR results and are not meant as examples of OCR output. Some are
deliberately invalid PDFs that might crash certain PDF viewers.

Files derived from free sources
===============================

These test resources come from free sources, under either public domain or
Creative Commons licenses. In some cases they were converted from one image
format to another without other changes.

.. list-table::
    :widths: 20 50 30
    :header-rows: 1

    * - File
      - Source
      - License
    * - c02-22.pdf
      - `Project Gutenberg`_, Adventures of Huckleberry Finn, page 22
      - Public Domain
    * - congress.jpg
      - `US Congressional Records`_
      - Public Domain
    * - graph.pdf
      - `Wikimedia: Pandas text analysis.png`_
      - Public Domain
    * - lichtenstein.pdf
      - `Wikimedia: JPEG2000 Lichtenstein`_
      - Creative Commons BY-SA 3.0
    * - linn.png, linn.pdf, linn.txt
      - `Wikimedia: LinnSequencer`_
      - Creative Commons BY-SA 3.0
    * - typewriter.png, 2400dpi.pdf
      - `Wikimedia: Triumph typewriter text Linzensoep`_
      - Creative Commons BY-SA 2.5
    * - baiona.png
      - `Wikimedia: Baionako udalerri mugakideak`_
      - Creative Commons BY-SA 4.0
    * - enron1.pdf
      - EnronData.org
      - Creative Commons BY 3.0

Files generated for this project
================================

The following test resources were crafted specifically for this project and
are released under the license listed for each file.

.. list-table::
    :widths: 20 40 15 15 10
    :header-rows: 1

    * - File
      - Purpose
      - Contributor
      - Copyright Holder
      - License
    * - aspect.pdf
      - test image with 200 x 100 DPI resolution
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - blank.pdf
      - blank PDF generated by Adobe Illustrator CC 17, containing a lot of application-specific metadata/bloat
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - cmyk.pdf
      - a CMYK image created in Photoshop
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - crom.png
      - test for non-dictionary words
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - enormous.pdf
      - very large PDF page
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - epson.pdf
      - a linearized PDF containing some unusual indirect objects, created by an Epson printer; printout of a Wikipedia article (CC-BY-SA)
      - @lowesjam
      - Wikipedia authors
      - CC-BY-SA 3.0
    * - formxobject.pdf
      - hand-crafted PDF containing an image inside a Form XObject
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - francais.pdf
      - a page containing French accents (diacritics)
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - hugemono.pdf
      - large monochrome 35000x35000 image in JBIG2 encoding
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - invalid.pdf
      - a PDF file header followed by an EOF marker
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - kcs.pdf
      - PDF file generated by Kodak Capture Desktop Software 1.2; has an invalid table of contents
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - livecycle.pdf
      - a minimal PDF that claims to use dynamic XFA forms
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - masks.pdf
      - file containing explicit masks and a stencil mask drawn without a proper transformation matrix; printout of a German Wikipedia article (CC-BY-SA)
      - @supergrobi
      - Wikipedia authors
      - CC-BY-SA 3.0
    * - missing_docinfo.pdf
      - PDF file with no /DocumentInfo section
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - overlay.pdf
      - PDF file generated by PDFPen pro that triggered content stream parse errors
      - @maxandersen
      - @maxandersen
      - CC-BY-SA 4.0
    * - negzero.pdf
      - copy of formxobject.pdf with a token that qpdf doesn't like
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - no_contents.pdf
      - synthetic PDF with a blank page that has no /Contents entry
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - truetype_font_nomapping.pdf
      - example of a PDF with an embedded subsetted TrueType font with no Unicode mapping
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - trivial.pdf
      - smallest possible valid PDF-1.3 with all required fields
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - type3_font_nomapping.pdf
      - example of a PDF with an embedded Type 3 font with no Unicode mapping
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - vector.pdf
      - a PDF with vector art and text rendered as curves with no fonts
      - @Catscratch
      - @Catscratch
      - CC-BY-SA 4.0

Assemblies
==========

These test resources are assemblies or derivatives of other previously
mentioned files, released under the same license terms as their input files.

- baiona_gray.png (from baiona.png, grayscale version)
- baiona_colormapped.png (from baiona.png, palette version)
- baiona_alpha.png (from baiona.png, RGB+A version)
- cardinal.pdf (four cardinal directions, baked-in rotated copies of linn.png)
- ccitt.pdf (linn.png, converted to CCITT encoding)
- encrypted_algo4.pdf (congress.jpg, encrypted with algorithm 4; not supported by PyPDF2)
- graph_ocred.pdf (from graph.pdf)
- jbig2.pdf (congress.jpg, converted to JBIG2 encoding)
- multipage.pdf (from several other files)
- palette.pdf (congress.jpg, converted to a 256-color palette)
- poster.pdf (from linn.png)
- rotated_skew.pdf (a /Rotate'd and skewed document from linn.png)
- skew-encrypted.pdf (skew.pdf with encryption; access supported by PyPDF2, password is "password")
- skew.pdf (from linn.png, skew simulated by adjusting the transformation matrix)
- toc.pdf (from formxobject.pdf, trivial.pdf)

.. _`Wikimedia: LinnSequencer`: https://upload.wikimedia.org/wikipedia/en/b/b7/LinnSequencer_hardware_MIDI_sequencer_brochure_page_2_300dpi.jpg

.. _`Project Gutenberg`: https://www.gutenberg.org/files/76/76-h/76-h.htm#c2

.. _`US Congressional Records`: http://www.baxleystamps.com/litho/meiji/courts_1871.jpg

.. _`Wikimedia: Pandas text analysis.png`: https://en.wikipedia.org/wiki/File:Pandas_text_analysis.png

.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png

.. _`Linux (Wikipedia Article)`: https://de.wikipedia.org/wiki/Linux

.. _`Wikimedia: Triumph typewriter text Linzensoep`: https://commons.wikimedia.org/wiki/File:Triumph.typewriter_text_Linzensoep.gif

.. _`Wikimedia: Baionako udalerri mugakideak`: https://commons.wikimedia.org/wiki/File:Baionako_udalerri_mugakideak.png