OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-10-29 17:01:27 +00:00

Author	SHA1	Message	Date
James R. Barlow	e76ae8c46c	Move more qpdf calls into qpdf.py	2015-12-17 08:24:48 -08:00
James R. Barlow	53a7c0e668	Refactor qpdf subprocess calls into module	2015-12-17 08:19:53 -08:00
James R. Barlow	4ca243e490	Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars	2015-12-17 07:27:26 -08:00
Shem Pasamba	d7c7559b05	Use boolean instead of integers	2015-12-17 11:23:27 +08:00
Shem Pasamba	b2b66d1344	Don't exit when qpdf repair was successful	2015-12-17 11:20:20 +08:00
James R. Barlow	5d111a3c04	Refactor tesseract --pdfrenderer calls to tesseract.py	2015-12-16 17:48:26 -08:00
James R. Barlow	10416f847f	Migrate tesseract-hocr code to tesseract module, because modularity	2015-12-16 17:36:11 -08:00
James R. Barlow	79b3472b26	All tests passed, bump version	2015-12-04 04:31:01 -08:00
James R. Barlow	f1b2f1ae08	Merge branch 'feature/pdfa-2' into develop	2015-12-04 04:04:08 -08:00
James R. Barlow	ee7d97ae8c	Trivial	2015-12-04 04:03:38 -08:00
James R. Barlow	7d9f473bb1	Remove eval() call by introspecting ExitCode	2015-12-04 03:34:53 -08:00
James R. Barlow	e77a5e5e75	We don't want threads. Really. Do. Not. Want.	2015-12-04 03:11:38 -08:00
James R. Barlow	6ab19af122	Comments	2015-12-04 03:09:39 -08:00
James R. Barlow	276fe49867	Better error messages for input file not found or invalid. Not as good as finding a general way to deal with ruffus exceptions, but better than nil.	2015-12-04 03:07:53 -08:00
James R. Barlow	acb31abe86	Fix issue #20 - fails on uppercase .PDF	2015-12-04 02:14:09 -08:00
James R. Barlow	4f964a3c8a	Introduce --pdf-renderer auto. Tess 3.03 has various quality problems, like wrong DPI, that are fixed in Tess 3.04. The idea here is to introduce an option to let OCRmyPDF select the rendering backend based on the options and system. However, we're not ready for tesseract as the main renderer. Setting pdf-renderer to tesseract does not pass all test cases, mainly the one where --tesseract-timeout is triggered, and some others.	2015-12-02 23:20:31 -08:00
James R. Barlow	80d89b5420	Set /Creator metadata to OCRmyPDF with reference to Tess version and settings	2015-12-02 02:19:39 -08:00
James R. Barlow	281eafada0	bump to v3.0 and move repos	2015-09-05 00:53:14 -07:00
James R. Barlow	c14e10128a	Bump version to -rc9	2015-08-29 16:43:22 -07:00
James R. Barlow	c4f134d694	Prevent running validation on missing file after an exception is thrown	2015-08-28 04:48:29 -07:00
James R. Barlow	83f9dfbac4	Use png256 raster device when possible. Someone reported a bug where the .png input to unpaper ended up being type 'P' (palette) for some reason, which was not supported in unpaper. Not sure how it happened, but it seemed easier to fix by explicitly supporting it. Here we use png256 if it would capture all colors in the input file. It's up to tesseract/reportlab to make use of the palette PNG when rendering.	2015-08-28 04:47:57 -07:00
James R. Barlow	2ce6834be4	Bump to -rc8	2015-08-24 01:25:01 -07:00
James R. Barlow	b376672dbc	Bug fix: exception thrown if input PDF was missing DocumentInfo block	2015-08-24 01:23:30 -07:00
James R. Barlow	aab08bfcc7	Fix requirements.txt problem	2015-08-23 12:30:40 -07:00
James R. Barlow	4f3673d14d	Update notes for -rc6	2015-08-22 00:40:07 -07:00
James R. Barlow	cc161780df	Replace fileinput with regular open-replace. fileinput is supposed to save time in these cases, but it's not capable of doing both in-place rewrites and working with a non-ASCII encoding. This was not noticed until characters outside of ASCII were picked up by tesseract and saved in a HOCR file. Rework some surrounding code as well and add multilingual test cases.	2015-08-18 23:27:50 -07:00
James R. Barlow	53c88093ad	Bump to -rc5	2015-08-16 02:19:04 -07:00
James R. Barlow	30072e0c70	Pillow sucks. Far from being fluffy or friendly, Pillow silently allows installation of itself without support for major image types. Reportlab calls for pillow 2.4.0. On Ubuntu 14.04 LTS this will trigger an upgrade of pillow that will be built without JPEG or ZLIB so it is effectively neutered, and unfortunately Pillow will not detect this situation at install time and guide users to a resolution. Instead, you see nasty stack traces. So add a run-time check to ensure that Pillow is sane and capable of JPEG and PNG support since both may be used internally.	2015-08-16 00:54:03 -07:00
James R. Barlow	adf495e8cc	Remove JHOVE. JHOVE is not an effective PDF/A validator, as detailed in this article: http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/ In short, it's buggy. Out of 670 invalid PDF/A files in a test suite, it only flagged 5. It only looks for certain problems that Ghostscript generated PDFs are unlikely to have. So use qpdf as a final check for general ill-formed PDF problems since it is quite reliable. JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF support. I also don't know if it's appropriate to bundle JHOVE, with an LGPL, under this project and its current license. Removing a dependency on Java is a huge win. A world with less Java is a world with less AbstractFactoryConstructorInterfaces.	2015-08-11 15:31:32 -07:00
James R. Barlow	9247ea00bf	Improve ruffus exception handling. ruffus swallows the return code if, in the process of handling an exception, we hit an error in ruffus' own code, which can happen. So pick through its error stack and find out if there's an interesting return code in there. Had to use eval() of all things. Also suppress the stack trace for normal error conditions that don't need one.	2015-08-11 02:19:46 -07:00
James R. Barlow	1cb5f6a90d	Refactor exit codes; test for missing tessdata. Some versions of tesseract installed by homebrew end up without a functional tessdata folder, and tesseract is not helpful in this situation, so add a new test to make sure our output is at least indicative of the problem. In the process of properly handling return codes I discovered test_override_metadata triggers an NPE inside JHOVE, probably due to the Unicode character checking. This could be specific to my JRE (1.6.0_65, Oracle) but it's probably JHOVE's fault. A valid PDF/A (per Acrobat) is still generated.	2015-08-11 00:17:02 -07:00
James R. Barlow	8d848284df	Fix code, test case: complain when GS fails to produce PDF/A. Modified pipeline to fix regression and return the proper error code if we did not produce a PDF/A as expected. The wrapper forces the output to be PDF 1.3 which is not PDF/A compliant. The funny thing is that in some cases JHOVE incorrectly states that a file is PDF/A-1b compliant, well formed and valid, even when it is not according to Acrobat XI and is missing the PDF/A metadata marker, as far as I can tell. JHOVE may not be as beneficial as hoped.	2015-08-10 16:05:00 -07:00
James R. Barlow	16d24f1166	Bump version to -rc4	2015-08-05 23:26:38 -07:00
James R. Barlow	8fcbbcef94	Improve usage text	2015-08-05 16:56:53 -07:00
James R. Barlow	6887e232fc	Bug fix: exception from process timeout should be TimeoutExpired	2015-07-31 00:06:58 -07:00
James R. Barlow	6ac7ffd77b	Merge branch 'feature/drop-mupdf-poppler' into develop	2015-07-30 23:38:27 -07:00
James R. Barlow	b28faa582a	Automatically use all available cores unless told not to	2015-07-30 23:20:21 -07:00
James R. Barlow	a036de318e	Replace mupdf and poppler with qpdf. Drop two dependencies and replace them with one that does the job of both. Smells like progress. mupdf does PDF file repair and rendering; poppler does rendering and page splitting; qpdf does PDF file repair and page splitting; ghostscript does PDF file repair, rendering, and page splitting (sort of). So we use qpdf. Ghostscript's page splitting is supposedly less efficient because it reprints the page (PDF -> Postscript -> PDF) and possibly loses quality. qpdf's library could be used to improve performance. This causes a slight performance regression: py.test tests/test_main.py::test_maximum_options went from 187 seconds up to 192. This is likely due to O(n) serialized invocations of qpdf compared to a single serialized call to pdfseparate. Could improve on this situation by using the example code in qpdf: pdf-split-pages.cc, or create marker files in split_pages() and then write a new @transform function that would split pages on each CPU. Probably not worth it, overall, unless this causes problems on files with hundreds of pages.	2015-07-30 04:16:35 -07:00
James R. Barlow	9e0c443c2f	-rc2: because pypi won't accept -rc1	2015-07-28 04:55:10 -07:00
James R. Barlow	60832152b1	Don't mess with options	2015-07-28 04:46:21 -07:00
James R. Barlow	6a160d22fe	Update release notes, add copyrights	2015-07-28 04:36:58 -07:00
James R. Barlow	e35526192c	More test cases	2015-07-28 03:02:35 -07:00
James R. Barlow	2a9da225e4	Minor tweaks to uncommon arguments	2015-07-28 02:25:50 -07:00
James R. Barlow	a3f37de9b5	Test cases for --tesseract-timeout	2015-07-28 01:47:30 -07:00
James R. Barlow	6064160953	Get rid of subprocess call on import of tesseract, unpaper -- bit nasty	2015-07-28 01:00:29 -07:00
James R. Barlow	587fa63c8e	--oversample: Default to 0	2015-07-27 20:42:16 -07:00
James R. Barlow	b40eec4cb0	Add --oversample test for hocr rendering	2015-07-27 17:18:02 -07:00
James R. Barlow	2e7cd52c0f	Improve argument handling, test cases	2015-07-27 15:39:54 -07:00
James R. Barlow	77d4cb367e	Put ghostscript in a module	2015-07-27 15:22:00 -07:00
James R. Barlow	2c45c5abc6	Implement tesseract timeout	2015-07-27 04:23:37 -07:00
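
The timeout commits in this log (2c45c5abc6 "Implement tesseract timeout" and 6887e232fc, which names TimeoutExpired as the expected exception) describe a common pattern that can be sketched roughly as follows. This is not OCRmyPDF's actual code: the helper name run_with_timeout and the sleeping child process that stands in for a slow OCR job are hypothetical, and the sketch assumes Python 3.5+ so that subprocess.run is available.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_seconds):
    """Run a child process; return its exit code, or None if it timed out.

    Hypothetical helper for illustration only, not part of OCRmyPDF.
    """
    try:
        completed = subprocess.run(
            args,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no orphaned process is left behind here.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a slow OCR job: a Python child that just sleeps.
    slow_job = [sys.executable, "-c", "import time; time.sleep(30)"]
    quick_job = [sys.executable, "-c", "pass"]
    print(run_with_timeout(slow_job, 0.5))
    print(run_with_timeout(quick_job, 30))
```

Returning None on timeout lets a caller treat "gave up on this page" differently from "the tool failed", which is one plausible way to keep a pipeline moving when a single page takes too long.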


55 Commits