These test files are used in OCRmyPDF's test suite. They do not necessarily produce OCR results
at all and are not meant as examples of OCR output. Some are even invalid PDFs that might
crash certain PDF viewers.

Files derived from free sources
===============================

These test resources come from free sources and are either in the public domain or under Creative Commons licenses.
In some cases they were converted from one image format to another with no other changes.

.. list-table::
    :widths: 20 50 30
    :header-rows: 1

    * - File
      - Source
      - License
    * - c02-22.pdf
      - `Project Gutenberg`_, Adventures of Huckleberry Finn, page 22
      - Public Domain
    * - congress.jpg
      - `US Congressional Records`_
      - Public Domain
    * - graph.pdf
      - `Wikimedia: Pandas text analysis.png`_
      - Public Domain
    * - lichtenstein.pdf
      - `Wikimedia: JPEG2000 Lichtenstein`_
      - Creative Commons BY-SA 3.0
    * - linn.png, linn.pdf, linn.txt
      - `Wikimedia: LinnSequencer`_
      - Creative Commons BY-SA 3.0
    * - typewriter.png, 2400dpi.pdf
      - `Wikimedia: Triumph typewriter text Linzensoep <Wikimedia: Triumph typewrtier text Linzensoep_>`_
      - Creative Commons BY-SA 2.5
    * - baiona.png
      - `Wikimedia: Baionako udalerri mugakideak`_
      - Creative Commons BY-SA 4.0
    * - enron1.pdf
      - EnronData.org
      - Creative Commons BY 3.0

Files generated for this project
================================

The following test resources were created specifically for this project. Each is
licensed under the license listed in its row.

.. list-table::
    :widths: 20 40 15 15 10
    :header-rows: 1

    * - File
      - Purpose
      - Contributor
      - Copyright Holder
      - License
    * - aspect.pdf
      - test image with 200 x 100 DPI resolution
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - blank.pdf
      - blank PDF generated by Adobe Illustrator CC 17, containing a lot of application-specific metadata/bloat
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - cmyk.pdf
      - a CMYK image created in Photoshop
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - crom.png
      - test for non-dictionary words
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - enormous.pdf
      - very large PDF page
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - epson.pdf
      - a linearized PDF containing some unusual indirect objects, created by an Epson printer; printout of a Wikipedia article (CC-BY-SA)
      - @lowesjam
      - Wikipedia authors
      - CC-BY-SA 3.0
    * - formxobject.pdf
      - hand-crafted PDF containing an image inside a Form XObject
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - francais.pdf
      - a page containing French accents (diacritics)
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - hugemono.pdf
      - large monochrome 35000x35000 image in JBIG2 encoding
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - invalid.pdf
      - a PDF file header followed by an EOF marker
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - kcs.pdf
      - PDF file generated by Kodak Capture Desktop Software 1.2; has an invalid table of contents
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - livecycle.pdf
      - a minimal PDF that claims to use dynamic XFA forms
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - masks.pdf
      - file containing explicit masks and a stencil mask drawn without a proper transformation matrix; printout of a German Wikipedia article (CC-BY-SA)
      - @supergrobi
      - Wikipedia authors
      - CC-BY-SA 3.0
    * - missing_docinfo.pdf
      - PDF file with no /DocumentInfo section
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - overlay.pdf
      - PDF file generated by PDFPen pro that triggered content stream parse errors
      - @maxandersen
      - @maxandersen
      - CC-BY-SA 4.0
    * - negzero.pdf
      - copy of formxobject.pdf with a token that qpdf doesn't like
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - no_contents.pdf
      - synthetic PDF with a blank page that has no /Contents entry
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - truetype_font_nomapping.pdf
      - example of a PDF with an embedded subsetted TrueType font with no Unicode mapping
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - trivial.pdf
      - smallest possible valid PDF-1.3 with all required fields
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - type3_font_nomapping.pdf
      - example of a PDF with an embedded Type 3 font with no Unicode mapping
      - @jbarlow83
      - @jbarlow83
      - CC-BY-SA 4.0
    * - vector.pdf
      - a PDF with vector art and text rendered as curves with no fonts
      - @Catscratch
      - @Catscratch
      - CC-BY-SA 4.0

Assemblies
==========

These test resources are assemblies or derivatives of the files listed above, released under the same license terms as their input files.

- baiona_gray.png (from baiona.png, grayscale version)
- baiona_colormapped.png (from baiona.png, palette version)
- baiona_alpha.png (from baiona.png, RGB+A version)
- cardinal.pdf (four cardinal directions, baked-in rotated copies of linn.png)
- ccitt.pdf (linn.png, converted to CCITT encoding)
- encrypted_algo4.pdf (congress.jpg, encrypted with algorithm 4, which PyPDF2 does not support)
- graph_ocred.pdf (from graph.pdf)
- jbig2.pdf (congress.jpg, converted to JBIG2 encoding)
- multipage.pdf (from several other files)
- palette.pdf (congress.jpg, converted to a 256-color palette)
- poster.pdf (from linn.png)
- rotated_skew.pdf (a /Rotate'd and skewed document from linn.png)
- skew-encrypted.pdf (skew.pdf with encryption that PyPDF2 can open; the password is "password")
- skew.pdf (from linn.png, skew simulated by adjusting the transformation matrix)
- toc.pdf (from formxobject.pdf, trivial.pdf)

.. _`Wikimedia: LinnSequencer`: https://upload.wikimedia.org/wikipedia/en/b/b7/LinnSequencer_hardware_MIDI_sequencer_brochure_page_2_300dpi.jpg
.. _`Project Gutenberg`: https://www.gutenberg.org/files/76/76-h/76-h.htm#c2
.. _`US Congressional Records`: http://www.baxleystamps.com/litho/meiji/courts_1871.jpg
.. _`Wikimedia: Pandas text analysis.png`: https://en.wikipedia.org/wiki/File:Pandas_text_analysis.png
.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png
.. _`Linux (Wikipedia Article)`: https://de.wikipedia.org/wiki/Linux
.. _`Wikimedia: Triumph typewrtier text Linzensoep`: https://commons.wikimedia.org/wiki/File:Triumph.typewriter_text_Linzensoep.gif
.. _`Wikimedia: Baionako udalerri mugakideak`: https://commons.wikimedia.org/wiki/File:Baionako_udalerri_mugakideak.png