OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-02-06 23:27:29 +00:00

Author	SHA1	Message	Date
James R. Barlow	9a15a4db10	Ensure specified destination is writable before starting pipeline process	2017-01-26 22:08:24 -08:00
James R. Barlow	55aeaec293	Autorotation check: Replace duplicated tests with parameterized test	2017-01-26 18:07:59 -08:00
James R. Barlow	f6df1fb40c	Fix test suite regression: output files dumped in tests/resources	2017-01-26 18:07:09 -08:00
James R. Barlow	b889a89c36	Fix remaining 3.4/3.5 regressions	2017-01-26 17:53:27 -08:00
James R. Barlow	1976dc6f30	Fix issue #121 “pop from empty list” (content stream parsing error)	2017-01-26 17:24:40 -08:00
James R. Barlow	e864c65d26	(Hopefully) Fix Path <-> py.path conversion on Py3.4/3.5	2017-01-26 17:19:15 -08:00
James R. Barlow	02fba02d31	Refactor test suite to use fixtures to manage paths	2017-01-26 16:38:59 -08:00
James R. Barlow	fb9e7c82f6	Move duplicate test code into common namespace	2017-01-26 13:36:52 -08:00
James R. Barlow	bad67c6dc5	Rename ‘tesstop’ to ‘tess4’. There’s no reason text-only PDF shouldn’t become the default for tesseract 4.	2017-01-26 12:28:51 -08:00
James R. Barlow	b8767e5ba9	Rename exe -> exec, more Unix-y and suggestive	2016-12-10 15:34:00 -08:00
James R. Barlow	d33a50660d	Replace most sys.exit() with raising exceptions. Because ruffus doesn’t handle exceptions well, I tended to call sys.exit to make sure we got out of dodge when needed. However, sys.exit is not ideal for the Python API this is moving towards, so this introduces proper exceptions for the various cases that retain suggested error codes. Only __main__.py should call sys.exit now; everyone else has to throw an exception. For now the worker raising a fatal exception logs messages rather than passing an exception object with the fatal error message, mainly because ruffus doesn’t properly marshal the exception object, so we just check the name of the exception class that caused ruffus to throw a RethrownJobError. Also fixed along the way: the wrong return code being shown for encrypted PDF checking, and incorrect use of str.find (e.output.find) in boolean logic (str.find returns -1 on failure to find, which is truthy).	2016-12-10 15:24:24 -08:00
James R. Barlow	4ee9658e97	Move external program wrappers to ocrmypdf.exe package	2016-12-09 16:54:24 -08:00
James R. Barlow	adc1580742	Help py.test collect output in more cases	2016-12-08 16:21:07 -08:00
James R. Barlow	e57aa0eee2	pageinfo: fix “decimal.InvalidOperation: quantize result has too many digits”. Also add a new test case for this.	2016-12-08 16:06:53 -08:00
James R. Barlow	731e6792c7	Add test cases for Ghostscript PDF/A warnings	2016-12-03 00:32:09 -08:00
James R. Barlow	949d2ff1c2	v4.3.1 release notes	2016-11-07 14:36:08 -08:00
James R. Barlow	1c8b763d53	test_pageinfo: Remove bits per component test. The behavior of this test will ultimately depend on what version of img2pdf is installed, since after my patch it will be able to produce 1bpp images.	2016-11-07 14:35:54 -08:00
James R. Barlow	bb91393b85	Fix “deskew-rotate” bug. Turns out this occurred in any case where pdf-renderer hocr was used and a tesseract timeout or error occurred. We created a replacement page based on the unrotated page dimensions instead of the input image’s dimensions.	2016-11-07 14:17:31 -08:00
James R. Barlow	cc9c0d819e	Add test case for documents that get rotated incorrectly after deskew	2016-11-07 14:15:03 -08:00
James R. Barlow	fdd9b8b8ce	Optimize some of the test resources to reduce file sizes. Mostly by reducing RGB -> monochrome and applying JBIG2 compression.	2016-11-07 14:01:23 -08:00
James R. Barlow	a4f07756a5	tesseract caching: don't transcode tesseract's output, hash source file. For sanity's sake, deal with tesseract streams in binary without transcoding (via universal_newlines, etc.). The only differences are printing messages regarding spoofing. Also hash the source file so that changes to the cache mechanism invalidate old cache automatically. That is probably too aggressive, but simple and safer than the previous approach.	2016-10-28 16:44:12 -07:00
James R. Barlow	2e4431cc63	Allow piping output to stdout	2016-10-27 16:14:42 -07:00
James R. Barlow	f7387b0859	test_stdin: simplify this test. No need to involve 'cat', just hook the file up to stdin.	2016-10-27 16:01:07 -07:00
James R. Barlow	a09f6b8977	Test cases: check that stdout is clear of output, to ensure piping to stdout is possible.	2016-10-27 15:58:24 -07:00
James R. Barlow	a86805f0d9	Remove possibly non-free page from "multipage.pdf"	2016-10-27 15:56:43 -07:00
James R. Barlow	7eca8508fd	Implement new preprocessing feature, background removal	2016-10-14 17:23:34 -07:00
James R. Barlow	cf4b04f92d	The main 'quick' test should be a file that OCRs to recognizable text	2016-10-07 16:25:34 -07:00
James R. Barlow	013c5a369f	Replace redacted file with an OCR-able file	2016-10-07 12:45:22 -07:00
James R. Barlow	6baf8668a6	Replace non-free file milk.pdf with free equivalent	2016-10-06 13:10:28 -07:00
James R. Barlow	4ba2962c56	Comment on non-free files	2016-10-05 16:48:16 -07:00
James R. Barlow	7ad92f5db4	Merge branch 'master' of https://github.com/jbarlow83/OCRmyPDF	2016-10-05 16:39:00 -07:00
James R. Barlow	4dad09cc91	resources/README: replace the other large table with a list table	2016-10-05 16:38:51 -07:00
Sean Whitton	7f08f15fc9	pytest skipif for milk.pdf test (#95). Skip the test if the fair-use-restricted milk.pdf is not present.	2016-09-15 08:55:31 -07:00
James R. Barlow	825c0f8b2a	Note that milk.pdf is non-free, start using list-tables	2016-09-10 14:44:00 -07:00
James R. Barlow	9ca29c787b	Update description of masks.pdf to reflect what it actually tests	2016-09-01 21:21:14 -07:00
James R. Barlow	bd534c3313	main.py -> __main__.py. Executing a package with python -m packagename will check for __main__.py inside the package. In other words, main.py should always have been named __main__.py. In the unlikely event that someone depends on "import ocrmypdf.main" being meaningful, main.py continues to exist and replicates the behavior of __main__. (It's unlikely because import ocrmypdf.main does unpythonic ruffus-related things at import time, essentially configuring itself to work with sys.argv. To fix another day.) This should solve the problem of Debian needing to run test suites before installation and afterwards for continuous integration without having to patch either file, as python -m ocrmypdf will follow import order. That is, if the current directory contains "ocrmypdf/" (e.g. staging a new version) then that will be tested, else sys.path will be checked.	2016-08-31 17:01:42 -07:00
James R. Barlow	bf89e38c69	Add milk.pdf test case	2016-08-31 11:42:21 -07:00
James R. Barlow	325cc0beca	Allow test cases to run without installing first. As @spwhitton found: The test suite needs to call "python3 -m ocrmypdf.main" instead of just "ocrmypdf" because this /usr/bin/ocrmypdf script has not yet been generated when dh runs the test suite. --- Seems reasonable to perform in-place testing independent of installation. Source: https://sources.debian.net/src/ocrmypdf/4.2.1%2Bgit.20160824.1.5d67cc7-1/debian/patches/0001-patch-test-suite-executable.patch/	2016-08-26 15:23:26 -07:00
James R. Barlow	1a9f09c4d5	Remove OCRmyPDF.sh and its usage in all test cases	2016-08-26 15:18:38 -07:00
James R. Barlow	4fed4e2af3	tests: don't try to pass Unicode arguments on command line on Linux. Doing so depends on the locale being configured properly, and it's not necessary to be able to do this.	2016-08-26 15:08:56 -07:00
James R. Barlow	cc7e328358	Improve some documentation for tests	2016-08-26 15:04:08 -07:00
James R. Barlow	d25397e2b0	Add test case for PDFs with masks and stencil masks	2016-08-26 15:03:27 -07:00
James R. Barlow	2025a096c3	Test case for stdin streaming	2016-08-25 14:46:54 -07:00
James R. Barlow	e5541e435c	New test to confirm we can emit JBIG2 with appropriate settings	2016-08-03 11:35:48 -07:00
James R. Barlow	e70387b1af	Add a simple test for image to PDF	2016-08-03 03:35:30 -07:00
James R. Barlow	91d715ac93	Add test cases for --output-type	2016-08-03 02:47:18 -07:00
James R. Barlow	fef35e4eb2	Fix handling of DPI for rare case of JPEG recompression after deskew/clean. This case is exercised by page 4 of multipage.pdf. If all images are JPEGs, and one of deskew/clean removes DPI information, make sure that we can get the right information back and that the DPI stays square.	2016-07-29 01:34:52 -07:00
James R. Barlow	8f77576dc4	Fix non-square image resolution for "hocr" case; use img2pdf 0.2.1. Tesseract renderer not immediately fixable.	2016-07-28 16:43:51 -07:00
James R. Barlow	16e4d342d2	Bug fix: --force-ocr should still run on pages with no images. Useful for people who want to reprocess text. This also requires --oversample because DPI is undefined. To be fixed in next commit.	2016-07-27 15:06:49 -07:00
jbarlow83	1bacf35a2c	Update license information for encrypted_algo4.pdf	2016-06-24 14:25:15 -07:00
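
Commit d33a50660d above mentions a misuse of str.find in boolean logic. The pitfall can be reproduced with a minimal sketch. The function names and sample strings below are hypothetical illustrations, not code from OCRmyPDF.

```python
def contains_naive(output: str, needle: str) -> bool:
    # Buggy pattern: str.find returns -1 when the needle is missing,
    # and -1 is truthy; it returns 0 for a match at the start, which is falsy.
    return bool(output.find(needle))


def contains(output: str, needle: str) -> bool:
    # The membership operator expresses the intended substring test directly.
    return needle in output


if __name__ == "__main__":
    # Hypothetical tool output, used only for illustration.
    print(contains_naive("PDF is encrypted", "password"))   # wrongly reports a match
    print(contains_naive("password required", "password"))  # wrongly reports no match
    print(contains("password required", "password"))
```

Comparing the result explicitly, as in `output.find(needle) != -1`, would also work. The `in` operator is usually preferred when the index itself is not needed.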

141 Commits