fileinput is supposed to save time in these cases, but it's not capable
of doing both in-place rewrites and working with a non-ASCII encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
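The replacement looks something like this (a sketch, not the actual
code; rewrite_inplace is a made-up name, and fileinput.input(...,
inplace=True) takes no encoding argument):

    import io
    import os

    def rewrite_inplace(path, transform, encoding='utf-8'):
        # Rewrite a file in place with an explicit encoding, which
        # fileinput's inplace mode cannot do.
        tmp = path + '.tmp'
        with io.open(path, 'r', encoding=encoding) as src, \
                io.open(tmp, 'w', encoding=encoding) as dst:
            for line in src:
                dst.write(transform(line))
        os.replace(tmp, path)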
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types. Reportlab calls for
Pillow 2.4.0. On Ubuntu 14.04 LTS this will trigger an upgrade of
Pillow that will be built without JPEG or ZLIB, so it is effectively
neutered. Unfortunately Pillow will not detect this situation at
install time and guide users to a resolution; instead, you see nasty
stack traces.
So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
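The check can be as simple as round-tripping a 1x1 image in memory
(a sketch; check_pillow_codecs is a made-up name, and the assumption is
that a Pillow built without libjpeg or zlib raises IOError when the
codec is first needed):

    from io import BytesIO
    from PIL import Image

    def check_pillow_codecs():
        for fmt in ('JPEG', 'PNG'):
            buf = BytesIO()
            try:
                Image.new('RGB', (1, 1)).save(buf, format=fmt)
                buf.seek(0)
                Image.open(buf).load()
            except IOError:
                raise SystemExit(
                    "Pillow was built without %s support; reinstall "
                    "Pillow after installing the missing system "
                    "libraries" % fmt)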
JHOVE is not an effective PDF/A validator, as detailed in this article:
http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/
In short, it's buggy. Out of 670 invalid PDF/A files in a test suite,
it flagged only 5. It only looks for certain problems that
Ghostscript-generated PDFs are unlikely to have. So use qpdf as a final
check for general ill-formed PDF problems, since it is quite reliable.
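Roughly (a sketch; qpdf --check exits nonzero when it finds problems,
and its report is worth capturing for the log):

    import subprocess

    def qpdf_check(pdf_path):
        proc = subprocess.Popen(
            ['qpdf', '--check', pdf_path],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        output, _ = proc.communicate()
        return proc.returncode == 0, output.decode('utf-8', 'replace')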
JHOVE 1 is no longer maintained. There is a JHOVE 2, but it has no PDF
support. I also don't know if it's appropriate to bundle JHOVE, which
is LGPL-licensed, with this project under its current license.
Removing a dependency on Java is a huge win. A world with less Java is
a world with fewer AbstractFactoryConstructorInterfaces.
ruffus swallows the return code if, in the process of handling an
exception, we hit an error in ruffus' own code, which can happen. So
pick through its error stack and find out if there's an interesting
return code in there. Had to use eval() of all things.
Also suppress the stack trace for normal error conditions that don't
need one.
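The recovery looks roughly like this; the tuple layout and the string
form of exc_value are assumptions about ruffus internals:

    import sys
    import ruffus
    from ruffus import ruffus_exceptions

    def exit_code_from(err):
        # Each entry in RethrownJobError.args describes one failed job,
        # with the original exception flattened to strings, hence the
        # eval() to turn the stringified SystemExit code back into an
        # int.
        for task, job, exc_name, exc_value, exc_stack in err.args:
            if exc_name.endswith('SystemExit'):
                return eval(exc_value)
        return 1  # a job failed but no explicit exit code survived

    try:
        ruffus.pipeline_run()
    except ruffus_exceptions.RethrownJobError as e:
        sys.exit(exit_code_from(e))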
Some versions of tesseract installed by Homebrew end up without a
functional tessdata folder, and tesseract is not helpful in this
situation, so add a new test to make sure our output is at least
indicative of the problem.
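Something along these lines (a sketch; the CLI name and the use of
TESSDATA_PREFIX to simulate the broken install are assumptions):

    import os
    import subprocess

    def test_tessdata_missing(tmpdir):
        # Point TESSDATA_PREFIX at an empty directory to mimic a broken
        # Homebrew install, and check that our error message names
        # tessdata instead of dying with an opaque traceback.
        env = os.environ.copy()
        env['TESSDATA_PREFIX'] = str(tmpdir)
        proc = subprocess.Popen(
            ['ocrmypdf', 'input.pdf', 'output.pdf'],
            env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        _, err = proc.communicate()
        assert proc.returncode != 0
        assert b'tessdata' in err.lower()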
In the process of properly handling return codes I discovered that
test_override_metadata triggers an NPE inside JHOVE, probably due to
the Unicode character checking. This could be specific to my JRE
(1.6.0_65, Oracle) but it's probably JHOVE's fault. A valid PDF/A (per
Acrobat) is still generated.
Modified the pipeline to fix the regression and return the proper error
code if we did not produce a PDF/A as expected. The wrapper forces the
output to be PDF 1.3, which is not PDF/A compliant.
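A cheap guard for that particular failure, as a sketch (header sniffing
only; PDF/A-1 is based on PDF 1.4, so a 1.3 header is an automatic
fail):

    def pdf_version(path):
        # Sniff the header line, e.g. b'%PDF-1.3'.  This ignores any
        # /Version override in the catalog, but it is enough to catch
        # the wrapper forcing PDF 1.3.
        with open(path, 'rb') as f:
            header = f.read(8)
        if not header.startswith(b'%PDF-'):
            return None
        return header[5:8].decode('ascii', 'replace')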
The funny thing is that in some cases JHOVE incorrectly states that a
file is PDF/A-1b compliant, well-formed and valid, even when it is not
according to Acrobat XI and is missing the PDF/A metadata marker, as
far as I can tell. JHOVE may not be as beneficial as hoped.
It revealed a regression: the return code is not the same as v2.x for
an invalid PDF/A. It's also not easy to get the return code out of
ruffus. Will need to tweak the final step of the pipeline.
Drop two dependencies and replace them with one that does the job of
both. Smells like progress.
mupdf does PDF file repair and rendering
poppler does rendering and page splitting
qpdf does PDF file repair and page splitting
ghostscript does PDF file repair, rendering, and page splitting (sort of)
So we use qpdf. Ghostscript's page splitting is supposedly less
efficient because it reprints the page (PDF -> Postscript -> PDF) and
possibly loses quality. qpdf's library could be used to improve
performance.
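For reference, single-page extraction with qpdf is just its --pages
option (a sketch; filenames are illustrative):

    import subprocess

    def split_page(input_pdf, pageno, output_pdf):
        # qpdf in.pdf --pages in.pdf N -- out.pdf
        subprocess.check_call([
            'qpdf', input_pdf,
            '--pages', input_pdf, str(pageno), '--',
            output_pdf])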
This causes a slight performance regression:
py.test tests/test_main.py::test_maximum_options went from 187 seconds
up to 192. This is likely due to O(n) serialized invocations of qpdf
compared to a single serialized call to pdfseparate. Could improve on
this situation by using the example code in qpdf (pdf-split-pages.cc),
or by creating marker files in split_pages() and then writing a new
@transform function that would split pages on each CPU, as sketched
below. Probably not worth it overall, unless this causes problems on
files with hundreds of pages.
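The marker-file idea would look roughly like this (purely a sketch;
split_pages, split_page, and the filename scheme are hypothetical):

    import re
    from ruffus import transform, suffix

    INPUT_PDF = 'input.pdf'  # placeholder for the pipeline's real input

    @transform(split_pages, suffix('.marker'), '.page.pdf')
    def split_page_task(input_marker, output_pdf):
        # split_pages() would touch one empty marker file per page,
        # e.g. '0003.marker'; ruffus then runs these jobs in parallel,
        # one qpdf invocation per page, using the split_page() helper
        # above.
        pageno = int(re.search(r'(\d+)\.marker$', input_marker).group(1))
        split_page(INPUT_PDF, pageno, output_pdf)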