.. SPDX-FileCopyrightText: 2022 James R. Barlow
.. SPDX-License-Identifier: CC-BY-SA-4.0

These test files are used in OCRmyPDF's test suite. They do not necessarily
produce OCR results at all, and they are not necessarily meant as examples of
OCR output. Some are even invalid PDFs that might crash certain PDF viewers.

Some of these images were obtained from the public domain. Others are
copyrighted and may have licenses associated. Refer to the ``.reuse/dep5``
file in OCRmyPDF's Git repository for information on the copyright holder(s)
and license(s) applicable to these resources.

.. list-table::
    :widths: 15 35 50
    :header-rows: 1

    * - File
      - Source
      - Purpose
    * - c02-22.pdf
      - `Project Gutenberg`_, Adventures of Huckleberry Finn, page 22
      - difficult OCR image (obscure fonts and illustrations)
    * - graph.pdf
      - `Wikimedia: Simple_line_graph_of_ACE_2012_results_by_candidate_sj01.png`_
      - image with slanted text
    * - lichtenstein.pdf
      - `Wikimedia: JPEG2000 Lichtenstein`_
      - JPEG2000 image
    * - linn.png, linn.pdf, linn.txt
      - `Wikimedia: LinnSequencer`_
      - image with two columns
    * - typewriter.png, 2400dpi.pdf
      - `Wikimedia: Triumph typewrtier text Linzensoep`_
      - simple text
    * - baiona.png
      - `Wikimedia: Baionako udalerri mugakideak`_
      - multilingual text and images
    * - aspect.pdf
      - synthetic
      - test image with 200 x 100 DPI resolution
    * - blank.pdf
      - synthetic
      - blank PDF generated by Adobe Illustrator CC 17, containing a lot of application-specific metadata/bloat
    * - cmyk.pdf
      - synthetic
      - a CMYK image created in Photoshop
    * - crom.png
      - synthetic
      - test for non-dictionary words
    * - enormous.pdf
      - synthetic
      - very large PDF page
    * - epson.pdf
      - synthetic
      - a linearized PDF containing some unusual indirect objects, created by an Epson printer; printout of a Wikipedia article (CC-BY-SA)
    * - formxobject.pdf
      - synthetic
      - hand-crafted PDF containing an image inside a Form XObject
    * - francais.pdf
      - synthetic
      - a page containing French accents (diacritics)
    * - hugemono.pdf
      - synthetic
      - large monochrome 35000x35000 image in JBIG2 encoding
    * - invalid.pdf
      - synthetic
      - a PDF file header followed by an EOF marker
    * - kcs.pdf
      - synthetic
      - PDF file generated by Kodak Capture Desktop Software 1.2; has invalid table of contents
    * - livecycle.pdf
      - synthetic
      - a minimal PDF that claims to use dynamic XFA forms
    * - masks.pdf
      - synthetic
      - file containing explicit masks and a stencil mask drawn without a proper transformation matrix; printout of a German Wikipedia article (CC-BY-SA)
    * - missing_docinfo.pdf
      - synthetic
      - PDF file with no /DocumentInfo section
    * - overlay.pdf
      - synthetic
      - PDF file generated by PDFPen pro that triggered content stream parse errors
    * - negzero.pdf
      - synthetic
      - copy of formxobject.pdf with a token that qpdf doesn't like
    * - no_contents.pdf
      - synthetic
      - PDF with a blank page that has no /Contents entry
    * - truetype_font_nomapping.pdf
      - synthetic
      - example of a PDF with an embedded subsetted TrueType font with no Unicode mapping
    * - trivial.pdf
      - synthetic
      - smallest possible valid PDF-1.3 with all required fields
    * - type3_font_nomapping.pdf
      - synthetic
      - example of a PDF with an embedded Type 3 font with no Unicode mapping
    * - vector.pdf
      - synthetic
      - a PDF with vector art and text rendered as curves with no fonts

Assemblies
==========

These test resources are assemblies or derivatives of other previously
mentioned files, released under the same license terms as their input files.


- baiona_gray.png (from baiona.png, grayscale version)
- baiona_colormapped.png (from baiona.png, palette version)
- baiona_alpha.png (from baiona.png, RGB+A version)
- cardinal.pdf (four cardinal directions, baked-in rotated copies of linn.png)
- ccitt.pdf (linn.png, converted to CCITT encoding)
- graph_ocred.pdf (from graph.pdf)
- jbig2.pdf (from linn.png)
- multipage.pdf (from several other files)
- palette.pdf (from baiona_colormapped.png)
- poster.pdf (from linn.png)
- rotated_skew.pdf (a /Rotate'd and skewed document from linn.png)
- skew.pdf (from linn.png, skew simulated by adjusting the transformation matrix)
- toc.pdf (from formxobject.pdf, trivial.pdf)

.. _`Wikimedia: LinnSequencer`: https://upload.wikimedia.org/wikipedia/en/b/b7/LinnSequencer_hardware_MIDI_sequencer_brochure_page_2_300dpi.jpg

.. _`Project Gutenberg`: https://www.gutenberg.org/files/76/76-h/76-h.htm#c2

.. _`Wikimedia: Simple_line_graph_of_ACE_2012_results_by_candidate_sj01.png`: https://en.wikipedia.org/wiki/File:Simple_line_graph_of_ACE_2012_results_by_candidate_sj01.png

.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png

.. _`Linux (Wikipedia Article)`: https://de.wikipedia.org/wiki/Linux

.. _`Wikimedia: Triumph typewrtier text Linzensoep`: https://commons.wikimedia.org/wiki/File:Triumph.typewriter_text_Linzensoep.gif

.. _`Wikimedia: Baionako udalerri mugakideak`: https://commons.wikimedia.org/wiki/File:Baionako_udalerri_mugakideak.png
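
The note on skew.pdf says its skew is simulated by adjusting the
transformation matrix. As a rough illustration only (this is not necessarily
how skew.pdf was produced), the sketch below builds the PDF ``cm`` operator
for a small counterclockwise rotation. The function name
``skew_cm_operator`` and the 2 degree angle are assumptions made for this
example; they do not come from OCRmyPDF.

.. code-block:: python

    import math


    def skew_cm_operator(degrees: float) -> str:
        """Return a PDF ``cm`` operator string for a rotation by ``degrees``.

        Hypothetical helper for illustration; not part of OCRmyPDF.
        """
        theta = math.radians(degrees)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # A PDF matrix is written as "a b c d e f". A counterclockwise
        # rotation about the origin is [cos sin -sin cos 0 0].
        values = (cos_t, sin_t, -sin_t, cos_t, 0.0, 0.0)
        return " ".join(f"{v:.6f}" for v in values) + " cm"


    # Assumed example angle: a 2 degree skew, similar to a crooked scan.
    print(skew_cm_operator(2.0))

Placing such an operator in a page's content stream before the image is drawn
would tilt the image without resampling its pixels, which is presumably why a
matrix-based skew is convenient for test files.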