141 Commits

Author SHA1 Message Date
James R. Barlow
9a15a4db10 Ensure specified destination is writable before starting pipeline process 2017-01-26 22:08:24 -08:00
James R. Barlow
55aeaec293 Autorotation check: Replace duplicated tests with parameterized test 2017-01-26 18:07:59 -08:00
James R. Barlow
f6df1fb40c Fix test suite regression: output files dumped in tests/resources 2017-01-26 18:07:09 -08:00
James R. Barlow
b889a89c36 Fix remaining 3.4/3.5 regressions 2017-01-26 17:53:27 -08:00
James R. Barlow
1976dc6f30 Fix issue #121 “pop from empty list” (content stream parsing error) 2017-01-26 17:24:40 -08:00
James R. Barlow
e864c65d26 (Hopefully) Fix Path <-> py.path conversion on Py3.4/3.5 2017-01-26 17:19:15 -08:00
James R. Barlow
02fba02d31 Refactor test suite to use fixtures to manage paths 2017-01-26 16:38:59 -08:00
James R. Barlow
fb9e7c82f6 Move duplicate test code into common namespace 2017-01-26 13:36:52 -08:00
James R. Barlow
bad67c6dc5 Rename ‘tesstop’ to ‘tess4’
There’s no reason text-only PDF shouldn’t become the default for
tesseract 4.
2017-01-26 12:28:51 -08:00
James R. Barlow
b8767e5ba9 Rename exe -> exec, more Unix-y and suggestive 2016-12-10 15:34:00 -08:00
James R. Barlow
d33a50660d Replace most sys.exit() with raising exceptions
Because ruffus doesn’t handle exceptions well I tended to call sys.exit
to make sure we got out of dodge when needed.  However, sys.exit is not
ideal for the Python API this is moving towards, so this introduces
proper exceptions for the various cases that retain suggested error
codes. Only __main__.py should call sys.exit now, everyone else has to
throw an exception.

For now the worker raising a fatal exception is logging messages rather
than passing an exception object with the fatal error message, mainly
because ruffus doesn’t properly marshal the exception object, so we
just check “what is the name of the exception class that caused ruffus
to throw a RethrownJobError”?

Also fixed along the way was the wrong return code being shown for
encrypted PDF checking, and incorrect use of str.find (e.output.find)
in boolean logic (str.find returns -1 on failure to find, which is truthy).
2016-12-10 15:24:24 -08:00
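The two ideas in the commit message above (exceptions that retain a suggested exit code, and the str.find pitfall) can be sketched as follows. This is a minimal illustration, not the project's actual code: the names ExitCodeException, EncryptedPdfError, exit_code_for, mentions_buggy and mentions_fixed, and the exit code values, are all hypothetical.

```python
class ExitCodeException(Exception):
    """Hypothetical base class: an exception that carries a suggested exit code."""
    exit_code = 1


class EncryptedPdfError(ExitCodeException):
    """Hypothetical subclass; the value 8 is illustrative only."""
    exit_code = 8


def exit_code_for(exc):
    # Only the top-level entry point would pass this value to sys.exit().
    return getattr(exc, "exit_code", 1)


def mentions_buggy(output, needle):
    # Bug: str.find returns -1 when the needle is absent, and -1 is truthy.
    # It also returns 0 (falsy) when the needle sits at the very start.
    return bool(output.find(needle))


def mentions_fixed(output, needle):
    # The `in` operator expresses the intended membership test directly.
    return needle in output
```

With this shape, library callers can catch the exception and inspect exit_code, while a command-line entry point can translate it into a process exit status.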
James R. Barlow
4ee9658e97 Move external program wrappers to ocrmypdf.exe package 2016-12-09 16:54:24 -08:00
James R. Barlow
adc1580742 Help py.test collect output in more cases 2016-12-08 16:21:07 -08:00
James R. Barlow
e57aa0eee2 pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”
And add new test case for this.
2016-12-08 16:06:53 -08:00
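For context, this class of error can be reproduced with the standard decimal module alone: under the default context precision of 28 digits, quantize raises InvalidOperation when the rounded result would need more digits than that. The commit does not say how pageinfo was changed, so the sketch below only shows the failure mode; try_quantize is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation


def try_quantize(value, places=Decimal("0.01")):
    """Round value to `places`, or return None when the result would not fit
    in the current context precision (hypothetical helper)."""
    try:
        return value.quantize(places)
    except InvalidOperation:
        return None
```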
James R. Barlow
731e6792c7 Add test cases for Ghostscript PDF/A warnings 2016-12-03 00:32:09 -08:00
James R. Barlow
949d2ff1c2 v4.3.1 release notes 2016-11-07 14:36:08 -08:00
James R. Barlow
1c8b763d53 test_pageinfo: Remove bits per component test
The behavior of this test will ultimately depend on what version of
img2pdf is installed, since after my patch it will be able to produce
1bpp images.
2016-11-07 14:35:54 -08:00
James R. Barlow
bb91393b85 Fix “deskew-rotate” bug.
Turns out this occurred in any case where pdf-renderer hocr was used
and a tesseract timeout or error occurred. We created a replacement
page based on the unrotated page dimensions instead of the input image’s
dimensions.
2016-11-07 14:17:31 -08:00
James R. Barlow
cc9c0d819e Add test case for documents that get rotated incorrectly after deskew 2016-11-07 14:15:03 -08:00
James R. Barlow
fdd9b8b8ce Optimize some of the test resources to reduce file sizes
Mostly by reducing RGB -> monochrome and applying JBIG2 compression
2016-11-07 14:01:23 -08:00
James R. Barlow
a4f07756a5 tesseract caching: don't transcode tesseract's output, hash source file
For sanity's sake, deal with tesseract streams in binary without
transcoding (via universal_newlines, etc.). The only differences are
printing messages regarding spoofing.

Also hash the source file so that changes to the cache mechanism
invalidate old cache automatically. That is probably too aggressive,
but simple and safer than the previous approach.
2016-10-28 16:44:12 -07:00
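One way to read "hash the source file" is to fold a digest of the caching code itself into every cache key, so that any edit to that code orphans old entries. The sketch below assumes that reading; source_fingerprint and cache_key are hypothetical names, and SHA-1 is an arbitrary choice of digest.

```python
import hashlib
from pathlib import Path


def source_fingerprint(source_path):
    """Hash the bytes of the file that implements the cache."""
    return hashlib.sha1(Path(source_path).read_bytes()).hexdigest()


def cache_key(source_path, args):
    """Combine the source fingerprint with the command arguments.

    Any edit to the caching code changes the fingerprint, so keys written by
    an older version of the code no longer match.
    """
    h = hashlib.sha1()
    h.update(source_fingerprint(source_path).encode("ascii"))
    for arg in args:
        h.update(b"\0")  # separator so ["ab", "c"] differs from ["a", "bc"]
        h.update(arg.encode("utf-8"))
    return h.hexdigest()
```

As the commit message notes, this invalidates more than strictly necessary (even a comment change resets the cache), in exchange for never serving entries produced by different logic.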
James R. Barlow
2e4431cc63 Allow piping output to stdout 2016-10-27 16:14:42 -07:00
James R. Barlow
f7387b0859 test_stdin: simplify this test
No need to involve 'cat', just hook the file up to stdin.
2016-10-27 16:01:07 -07:00
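Hooking a file up to stdin without an intermediate 'cat' process might look like the sketch below. It is not the project's test code: run_with_file_as_stdin is a hypothetical helper, and ECHO_STDIN is a stand-in child command used only to make the example self-contained.

```python
import subprocess
import sys

# Stand-in for the program under test: copies stdin to stdout unchanged.
ECHO_STDIN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]


def run_with_file_as_stdin(cmd, path):
    """Run cmd with the file at `path` connected directly to its stdin."""
    with open(path, "rb") as f:
        return subprocess.run(cmd, stdin=f, stdout=subprocess.PIPE, check=True)
```

Passing the open file object as stdin lets the child read the file directly, so no second process or shell pipeline is needed.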
James R. Barlow
a09f6b8977 Test cases: check that stdout is clear of output
To ensure piping to stdout is possible.
2016-10-27 15:58:24 -07:00
James R. Barlow
a86805f0d9 Remove possibly non-free page from "multipage.pdf" 2016-10-27 15:56:43 -07:00
James R. Barlow
7eca8508fd Implement new preprocessing feature, background removal 2016-10-14 17:23:34 -07:00
James R. Barlow
cf4b04f92d The main 'quick' test should be a file that OCRs to recognizable text 2016-10-07 16:25:34 -07:00
James R. Barlow
013c5a369f Replace redacted file with an OCR-able file 2016-10-07 12:45:22 -07:00
James R. Barlow
6baf8668a6 Replace non-free file milk.pdf with free equivalent 2016-10-06 13:10:28 -07:00
James R. Barlow
4ba2962c56 Comment on non-free files 2016-10-05 16:48:16 -07:00
James R. Barlow
7ad92f5db4 Merge branch 'master' of https://github.com/jbarlow83/OCRmyPDF 2016-10-05 16:39:00 -07:00
James R. Barlow
4dad09cc91 resources/README: replace the other large table with a list table 2016-10-05 16:38:51 -07:00
Sean Whitton
7f08f15fc9 pytest skipif for milk.pdf test (#95)
Skip the test if the fair use restricted milk.pdf is not present.
2016-09-15 08:55:31 -07:00
James R. Barlow
825c0f8b2a Note that milk.pdf is non-free, start using list-tables 2016-09-10 14:44:00 -07:00
James R. Barlow
9ca29c787b Update description of masks.pdf to reflect what it actually tests 2016-09-01 21:21:14 -07:00
James R. Barlow
bd534c3313 main.py -> __main__.py
Executing a package with python -m packagename will check for
__main__.py inside the package.  In other words main.py should have
always been named __main__.py.

In the unlikely event that someone depends on "import ocrmypdf.main"
being meaningful, main.py continues to exist and replicates the
behavior of __main__.  (It's unlikely because import ocrmypdf.main does
unpythonic ruffus-related things at import time, essentially
configuring itself to work with sys.argv.  To fix another day.)

This should solve the problem of Debian needing to run test suites
before installation and afterwards for continuous integration without
having to patch either file, as python -m ocrmypdf will follow import
order.  That is, if the current directory contains "ocrmypdf/" (e.g.
staging a new version) then that will be tested, else sys.path will
be checked.
2016-08-31 17:01:42 -07:00
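The behavior described in the commit message above can be demonstrated with a throwaway package. This is a minimal sketch, assuming a default interpreter configuration in which `python -m` searches the current directory first; run_package_as_module and demo_pkg are hypothetical names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_package_as_module(package_name="demo_pkg"):
    """Create a throwaway package with __main__.py and run it via `python -m`."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / package_name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "__main__.py").write_text("print('ran __main__')\n")
        # Running with cwd=tmp puts the throwaway package ahead of any
        # installed copy on the import path, as the commit message describes.
        result = subprocess.run(
            [sys.executable, "-m", package_name],
            cwd=tmp,
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout.decode().strip()
```

The same mechanism is what allows a staged source tree to be tested before installation and an installed copy to be tested afterwards with one command line.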
James R. Barlow
bf89e38c69 Add milk.pdf test case 2016-08-31 11:42:21 -07:00
James R. Barlow
325cc0beca Allow test cases to run without installing first
As @spwhitton found:

The test suite needs to call "python3 -m ocrmypdf.main" instead of
just "ocrmypdf" because this /usr/bin/ocrmypdf script has not yet been
generated when dh runs the test suite.

---

Seems reasonable to perform in-place testing independent of installation.

Source:
https://sources.debian.net/src/ocrmypdf/4.2.1%2Bgit.20160824.1.5d67cc7-1/debian/patches/0001-patch-test-suite-executable.patch/
2016-08-26 15:23:26 -07:00
James R. Barlow
1a9f09c4d5 Remove OCRmyPDF.sh and its usage in all test cases 2016-08-26 15:18:38 -07:00
James R. Barlow
4fed4e2af3 tests: don't try to pass Unicode arguments on command line on Linux
Depends on locale being configured properly, and it's not necessary
to be able to do this.
2016-08-26 15:08:56 -07:00
James R. Barlow
cc7e328358 Improve some documentation for tests 2016-08-26 15:04:08 -07:00
James R. Barlow
d25397e2b0 Add test case for PDFs with masks and stencil masks 2016-08-26 15:03:27 -07:00
James R. Barlow
2025a096c3 Test case for stdin streaming 2016-08-25 14:46:54 -07:00
James R. Barlow
e5541e435c New test to confirm we can emit JBIG2 with appropriate settings 2016-08-03 11:35:48 -07:00
James R. Barlow
e70387b1af Add a simple test for image to PDF 2016-08-03 03:35:30 -07:00
James R. Barlow
91d715ac93 Add test cases for --output-type 2016-08-03 02:47:18 -07:00
James R. Barlow
fef35e4eb2 Fix handling of DPI for rare case of JPEG recompression after deskew/clean
This test is exercised by page 4 of multipage.pdf. If all images are
JPEGs, and one of deskew/clean removes DPI information, make sure that
we can get the right information back and that the DPI stays square.
2016-07-29 01:34:52 -07:00
James R. Barlow
8f77576dc4 Fix non-square image resolution for "hocr" case; use img2pdf 0.2.1
Tesseract renderer not immediately fixable.
2016-07-28 16:43:51 -07:00
James R. Barlow
16e4d342d2 Bug fix: --force-ocr should still run on pages with no images
Useful for people who want to reprocess text.

This also requires --oversample because DPI is undefined. To be fixed
in next commit.
2016-07-27 15:06:49 -07:00
jbarlow83
1bacf35a2c Update license information for encrypted_algo4.pdf 2016-06-24 14:25:15 -07:00