OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-10-17 19:10:06 +00:00

Author	SHA1	Message	Date
James R. Barlow	adf495e8cc	Remove JHOVE. JHOVE is not an effective PDF/A validator, as detailed in this article: http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/ In short, it's buggy. Out of 670 invalid PDF/A files in a test suite, it only flagged 5. It only looks for certain problems that Ghostscript-generated PDFs are unlikely to have. So use qpdf as a final check for general ill-formed PDF problems since it is quite reliable. JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF support. I also don't know if it's appropriate to bundle JHOVE, with an LGPL, under this project and its current license. Removing a dependency on Java is a huge win. A world with less Java is a world with less AbstractFactoryConstructorInterfaces.	2015-08-11 15:31:32 -07:00
James R. Barlow	9247ea00bf	Improve ruffus exception handling. ruffus swallows the return code if, in the process of handling an exception, we hit an error in ruffus' own code, which can happen. So pick through its error stack and find out if there's an interesting return code in there. Had to use eval() of all things. Also suppress the stack trace for normal error conditions that don't need one.	2015-08-11 02:19:46 -07:00
James R. Barlow	1cb5f6a90d	Refactor exit codes; test for missing tessdata. Some versions of tesseract installed by homebrew end up without a functional tessdata folder, and tesseract is not helpful in this situation, so add a new test to make sure our output is at least indicative of the problem. In the process of properly handling return codes I discovered test_override_metadata triggers a NPE inside JHOVE, probably due to the Unicode character checking. This could be specific to my JRE (1.6.0_65, Oracle) but it's probably JHOVE's fault. A valid PDF/A (per Acrobat) is still generated.	2015-08-11 00:17:02 -07:00
James R. Barlow	8d848284df	Fix code, test case: complain when GS fails to produce PDF/A. Modified pipeline to fix regression and return the proper error code if we did not produce a PDF/A as expected. The wrapper forces the output to be PDF 1.3 which is not PDF/A compliant. The funny thing is that in some cases JHOVE incorrectly states that a file is PDF/A-1b compliant, well formed and valid, even when it is not according to Acrobat XI and is missing the PDF/A metadata marker, as far as I can tell. JHOVE may not be as beneficial as hoped.	2015-08-10 16:05:00 -07:00
James R. Barlow	16d24f1166	Bump version to -rc4	2015-08-05 23:26:38 -07:00
James R. Barlow	8fcbbcef94	Improve usage text	2015-08-05 16:56:53 -07:00
James R. Barlow	6887e232fc	Bug fix: exception from process timeout should be TimeoutExpired	2015-07-31 00:06:58 -07:00
James R. Barlow	6ac7ffd77b	Merge branch 'feature/drop-mupdf-poppler' into develop	2015-07-30 23:38:27 -07:00
James R. Barlow	b28faa582a	Automatically use all available cores unless told not to	2015-07-30 23:20:21 -07:00
James R. Barlow	a036de318e	Replace mupdf and poppler with qpdf. Drop two dependencies and replace them with one that does the job of both. Smells like progress. mupdf does PDF file repair and rendering; poppler does rendering and page splitting; qpdf does PDF file repair and page splitting; ghostscript does PDF file repair, rendering, and page splitting (sort of). So we use qpdf. Ghostscript's page splitting is supposed to be less efficient because it reprints the page (PDF -> Postscript -> PDF) and possibly loses quality. qpdf's library could be used to improve performance. This causes a slight performance regression: py.test tests/test_main.py::test_maximum_options went from 187 seconds up to 192. This is likely due to O(n) serialized invocations of qpdf compared to a single serialized call to pdfseparate. Could improve on this situation by using the example code in qpdf (pdf-split-pages.cc), or by creating marker files in split_pages() and then writing a new @transform function that would split pages on each CPU. Probably not worth it, overall, unless this causes problems on files with hundreds of pages.	2015-07-30 04:16:35 -07:00
James R. Barlow	9e0c443c2f	-rc2: because pypi won't accept -rc1	2015-07-28 04:55:10 -07:00
James R. Barlow	60832152b1	Don't mess with options	2015-07-28 04:46:21 -07:00
James R. Barlow	6a160d22fe	Update release notes, add copyrights	2015-07-28 04:36:58 -07:00
James R. Barlow	e35526192c	More test cases	2015-07-28 03:02:35 -07:00
James R. Barlow	2a9da225e4	Minor tweaks to uncommon arguments	2015-07-28 02:25:50 -07:00
James R. Barlow	a3f37de9b5	Test cases for --tesseract-timeout	2015-07-28 01:47:30 -07:00
James R. Barlow	6064160953	Get rid of subprocess call on import of tesseract, unpaper -- bit nasty	2015-07-28 01:00:29 -07:00
James R. Barlow	587fa63c8e	--oversample: Default to 0	2015-07-27 20:42:16 -07:00
James R. Barlow	b40eec4cb0	Add --oversample test for hocr rendering	2015-07-27 17:18:02 -07:00
James R. Barlow	2e7cd52c0f	Improve argument handling, test cases	2015-07-27 15:39:54 -07:00
James R. Barlow	77d4cb367e	Put ghostscript in a module	2015-07-27 15:22:00 -07:00
James R. Barlow	2c45c5abc6	Implement tesseract timeout	2015-07-27 04:23:37 -07:00
James R. Barlow	a89afabd79	Implement tesseract PDF rendering as an alternative. It's much better at rendering text baselines than hocr and seems to produce small file sizes, so it's progress. Not available for Tesseract 3.02 obviously, so both modes need to remain available.	2015-07-27 04:20:49 -07:00
James R. Barlow	6c3cb6acba	Remove redundant *res_render	2015-07-26 12:56:10 -07:00
James R. Barlow	d3088829af	More packaging changes: move jhove, fix console script	2015-07-26 01:52:08 -07:00
James R. Barlow	9aaaba1714	Packaging stuff	2015-07-25 23:45:13 -07:00
Jim Barlow	9adb0d696f	Prepare for Python packaging - move to ocrmypdf folder	2015-07-25 18:22:04 -07:00
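
Commit a036de318e describes page splitting as O(n) serialized invocations of qpdf, one per page. A minimal sketch of what building those invocations might look like follows. It is not taken from the repository: the function name `qpdf_split_commands` and the output file naming are hypothetical, and only the qpdf `--pages ... --` page-selection syntax is relied on.

```python
from pathlib import Path


def qpdf_split_commands(input_pdf, num_pages, output_dir):
    """Build one qpdf command per page, i.e. O(n) invocations.

    The naming scheme (000001.page.pdf, ...) is an assumption made
    for illustration only.
    """
    commands = []
    for page in range(1, num_pages + 1):  # qpdf page numbers are 1-based
        output_pdf = Path(output_dir) / f"{page:06d}.page.pdf"
        commands.append([
            "qpdf", str(input_pdf),
            "--pages", str(input_pdf), str(page), "--",
            str(output_pdf),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so qpdf is not required.
    for command in qpdf_split_commands("input.pdf", 3, "pages"):
        print(" ".join(command))
```

Each command list could be passed to `subprocess.check_call` in turn. Running them one after another is the serialized pattern the commit message blames for the small slowdown.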

27 Commits
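
Commits 2c45c5abc6 and 6887e232fc concern a timeout on the tesseract subprocess and the `TimeoutExpired` exception it should raise. The sketch below shows the general standard-library pattern, assuming Python 3.5 or later for `subprocess.run`. The helper name `run_with_timeout` is hypothetical and this is not the project's actual code. The demo uses the Python interpreter as a stand-in for a slow tesseract call.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_seconds):
    """Run a command; return its exit code, or None if it timed out."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when the timeout elapses.
        completed = subprocess.run(args, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # A child process that sleeps longer than the allowed time.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, 0.2))
```

Returning `None` on timeout lets the caller decide whether a page that took too long should be skipped or treated as an error.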