These test files are used in OCRmyPDF's test suite. They do not necessarily produce OCR results
at all and are not meant as examples of OCR output. Some are even invalid PDFs that might
crash certain PDF viewers.

Files derived from free sources
===============================

These test resources come from free sources, under either public domain or Creative Commons licenses.
In some cases they were converted from one image format to another without other changes.

.. list-table::
    :widths: 20 50 30
    :header-rows: 1

    * - File
      - Source
      - License
    * - c02-22.pdf
      - `Project Gutenberg`_, Adventures of Huckleberry Finn, page 22
      - Public Domain
    * - congress.jpg
      - `US Congressional Records`_
      - Public Domain
    * - graph.pdf
      - `Wikimedia: Pandas text analysis.png`_
      - Public Domain
    * - lichtenstein.pdf
      - `Wikimedia: JPEG2000 Lichtenstein`_
      - Creative Commons BY-SA 3.0
    * - linn.png, linn.pdf, linn.txt
      - `Wikimedia: LinnSequencer`_
      - Creative Commons BY-SA 3.0
    * - typewriter.png, 2400dpi.pdf
      - `Wikimedia: Triumph typewrtier text Linzensoep`_
      - Creative Commons BY-SA 2.5
    * - baiona.png
      - `Wikimedia: Baionako udalerri mugakideak`_
      - Creative Commons BY-SA 4.0
    * - enron1.pdf
      - EnronData.org
      - Creative Commons BY 3.0

Files generated for this project
================================
The following test resources were crafted specifically for this project and are
licensed under the license specified for each file.

.. list-table::
    :widths: 20 40 15 15 10
    :header-rows: 1

    * - File
      - Purpose
      - Contributor
      - Copyright Holder
      - License
    * - aspect.pdf
      - test image with 200 x 100 DPI resolution
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - blank.pdf
      - blank PDF generated by Adobe Illustrator CC 17, containing a lot of application-specific metadata/bloat
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - cmyk.pdf
      - a CMYK image created in Photoshop
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - crom.png
      - test for non-dictionary words
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - enormous.pdf
      - very large PDF page
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - epson.pdf
      - a linearized PDF containing some unusual indirect objects, created by an Epson printer; printout of a Wikipedia article (CC-BY-SA)
      - @lowesjam
      - Wikipedia authors
      - CC-BY-SA 3.0
    * - formxobject.pdf
      - hand-crafted PDF containing an image inside a Form XObject
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - francais.pdf
      - a page containing French accents (diacritics)
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - hugemono.pdf
      - large monochrome 35000x35000 image in JBIG2 encoding
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - invalid.pdf
      - a PDF file header followed by an EOF marker
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - kcs.pdf
      - PDF file generated by Kodak Capture Desktop Software 1.2; has an invalid table of contents
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - livecycle.pdf
      - a minimal PDF that claims to use dynamic XFA forms
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - masks.pdf
      - file containing explicit masks and a stencil mask drawn without a proper transformation matrix; printout of a German Wikipedia article (CC-BY-SA)
      - @supergrobi
      - Wikipedia authors
      - CC-BY-SA 3.0
    * - missing_docinfo.pdf
      - PDF file with no /DocumentInfo section
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - overlay.pdf
      - PDF file generated by PDFPen pro that triggered content stream parse errors
      - @maxandersen
      - @maxandersen
      - CC-BY-SA 4.0
    * - negzero.pdf
      - copy of formxobject.pdf with a token that qpdf doesn't like
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - no_contents.pdf
      - synthetic PDF with a blank page that has no /Contents entry
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - truetype_font_nomapping.pdf
      - example of a PDF with an embedded subsetted TrueType font with no Unicode mapping
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - trivial.pdf
      - smallest possible valid PDF-1.3 with all required fields
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - type3_font_nomapping.pdf
      - example of a PDF with an embedded Type 3 font with no Unicode mapping
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - vector.pdf
      - a PDF with vector art and text rendered as curves with no fonts
      - @Catscratch
      - @Catscratch
      - CC-BY-SA 4.0

Assemblies
==========
These test resources are assemblies or derivatives from other previously mentioned files, released under the same license terms as their input files.

- baiona_gray.png (from baiona.png, grayscale version)
- baiona_colormapped.png (from baiona.png, palette version)
- baiona_alpha.png (from baiona.png, RGB+A version)
- cardinal.pdf (four cardinal directions, baked-in rotated copies of linn.png)
- ccitt.pdf (linn.png, converted to CCITT encoding)
- encrypted_algo4.pdf (congress.jpg, encrypted with algorithm 4 - not supported by PyPDF2)
- graph_ocred.pdf (from graph.pdf)
- jbig2.pdf (congress.jpg, converted to JBIG2 encoding)
- multipage.pdf (from several other files)
- palette.pdf (congress.jpg, converted to a 256-color palette)
- poster.pdf (from linn.png)
- rotated_skew.pdf (a /Rotate'd and skewed document from linn.png)
- skew-encrypted.pdf (skew.pdf with encryption - access supported by PyPDF2, password is "password")
- skew.pdf (from linn.png, skew simulated by adjusting the transformation matrix)
- toc.pdf (from formxobject.pdf, trivial.pdf)

.. _`Wikimedia: LinnSequencer`: https://upload.wikimedia.org/wikipedia/en/b/b7/LinnSequencer_hardware_MIDI_sequencer_brochure_page_2_300dpi.jpg
.. _`Project Gutenberg`: https://www.gutenberg.org/files/76/76-h/76-h.htm#c2
.. _`US Congressional Records`: http://www.baxleystamps.com/litho/meiji/courts_1871.jpg
.. _`Wikimedia: Pandas text analysis.png`: https://en.wikipedia.org/wiki/File:Pandas_text_analysis.png
.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png
.. _`Linux (Wikipedia Article)`: https://de.wikipedia.org/wiki/Linux
.. _`Wikimedia: Triumph typewrtier text Linzensoep`: https://commons.wikimedia.org/wiki/File:Triumph.typewriter_text_Linzensoep.gif
.. _`Wikimedia: Baionako udalerri mugakideak`: https://commons.wikimedia.org/wiki/File:Baionako_udalerri_mugakideak.png