OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-10-24 06:20:17 +00:00

Author	SHA1	Message	Date
James R. Barlow	c6d106ec33	Throw exception if iccprofiles not found instead of returning None. So far iccprofiles were only missing for a user who had a custom and possibly broken ghostscript installation.	2015-08-28 03:59:35 -07:00
James R. Barlow	2ce6834be4	Bump to -rc8 (tag: v3.0-rc8)	2015-08-24 01:25:01 -07:00
James R. Barlow	b376672dbc	Bug fix: exception thrown if input PDF was missing DocumentInfo block	2015-08-24 01:23:30 -07:00
James R. Barlow	d07db8547f	Merge branch 'master' of https://github.com/fritz-hh/OCRmyPDF (tag: v3.0-rc7)	2015-08-23 12:30:46 -07:00
James R. Barlow	aab08bfcc7	Fix requirements.txt problem	2015-08-23 12:30:40 -07:00
jbarlow83	e0a25494ee	Explain the need for multi core, etc	2015-08-22 13:34:42 -07:00
James R. Barlow	fd876d5e4e	Merge branch 'develop' (tag: v3.0-rc6)	2015-08-22 01:51:44 -07:00
James R. Barlow	ee7f008ff5	Require unpaper 6.1; no messing around with broken versions	2015-08-22 01:51:08 -07:00
jbarlow83	d9161a6ddb	Update README: docker run instructions	2015-08-22 01:50:13 -07:00
jbarlow83	f8d66768e3	Update README with docker install instructions	2015-08-22 01:33:12 -07:00
James R. Barlow	4f3673d14d	Update notes for -rc6	2015-08-22 00:40:07 -07:00
James R. Barlow	1712fdb74a	Merge branch 'feature/docker-debian'	2015-08-22 00:32:27 -07:00
James R. Barlow	3a5ffc79e0	Stock debian unpaper is no good; replace with 6.1 built from source. Debian and Ubuntu both install unpaper 0.4.2 or so. No .deb packages are available at higher version numbers, although ArchLinux had something. Considered making a separate image to handle building and installing, but decided that was a premature optimization at this point, so just build the unpaper that works. All tests pass.	2015-08-22 00:30:39 -07:00
James R. Barlow	859b063444	Fixup other docker test suite errors. Outstanding failures: test_pageinfo::test_jpeg; tests involving unpaper, due to version <6.1 failures.	2015-08-20 02:37:03 -07:00
James R. Barlow	bd61e7c644	dockerignore *.pyc (https://github.com/docker/docker/issues/13113). Docker kinda sucks. No recursive exclusion.	2015-08-20 02:27:07 -07:00
James R. Barlow	c9abf282b5	Set docker locale to utf-8. Shocked, shocked, that there's a Linux distribution out there that isn't doing the right thing and setting up utf-8 by default. (Many tests failed)	2015-08-20 01:44:30 -07:00
James R. Barlow	9dad40b5a3	Major overhaul of the Dockerfile. Switched from Ubuntu to debian:stretch because stretch has more recent versions of our binary packages and starts smaller. In particular, stretch has both pillow==2.9.0 and reportlab==3.2.0 available as system packages, which saves the considerable hassle of installing a toolchain. Instead, a pyvenv is set up with access to the system's site-packages (note: needs two steps), making the binary-dependent packages available. Then the remaining packages are installed into the pyvenv with --no-cache-dir to avoid saving files. And there we are. Image is still very large (>500 MB), but programs like reportlab require font rendering capabilities so they pull in large portions of the Linux graphics stack. Not much will shrink that.	2015-08-20 01:25:31 -07:00
James R. Barlow	8e2d690cb0	Rework Dockerfile, setup.py to work with wheels for better cache use	2015-08-19 13:43:32 -07:00
James R. Barlow	c132e091e1	Dockerfile: use local copy of application	2015-08-19 13:10:58 -07:00
James R. Barlow	630e6cbf1e	pip chokes on Unicode filenames?	2015-08-18 23:56:30 -07:00
James R. Barlow	83ff5760a8	Dockerfile comment cleanup	2015-08-18 23:41:41 -07:00
James R. Barlow	fed0ee638e	Fix ruffus writing to RO directory in container	2015-08-18 23:30:06 -07:00
James R. Barlow	cc161780df	Replace fileinput with regular open-replace. fileinput is supposed to save time in these cases but it's not capable of doing both in-place rewrites and working with a non-ascii encoding. This was not noticed until characters outside of ASCII were picked up by tesseract and saved in a HOCR file. Rework some surrounding code as well and add multilingual test cases.	2015-08-18 23:27:50 -07:00
James R. Barlow	898b2b000a	Works	2015-08-18 05:38:05 -07:00
James R. Barlow	b3ee743ed7	WIP on docker	2015-08-18 04:46:25 -07:00
James R. Barlow	ef17b669fe	README needs ghostscript	2015-08-18 03:27:39 -07:00
James R. Barlow	2dff3e07ce	Drop libxml2 dependency. It seems that Python's internal XML parser is good enough to do the job.	2015-08-17 15:26:07 -07:00
James R. Barlow	53c88093ad	Bump to -rc5 (tag: v3.0-rc5)	2015-08-16 02:19:04 -07:00
James R. Barlow	0ec13d3a17	Fix test cases: minor issues. os.environ directly modified when whole suite run, breaking subsequent tests; no longer trusting JHOVE for PDF/A validation.	2015-08-16 01:57:35 -07:00
jbarlow83	0d5104049a	Update README with better install instructions	2015-08-16 01:28:28 -07:00
James R. Barlow	ce8fa69785	Update readme	2015-08-16 00:59:57 -07:00
James R. Barlow	30072e0c70	Pillow sucks. Far from being fluffy or friendly, Pillow silently allows installation of itself without support for major image types. Reportlab calls for pillow 2.4.0. On Ubuntu 14.04 LTS this will trigger an upgrade of pillow that will be built without JPEG or ZLIB, so it is effectively neutered, and unfortunately Pillow will not detect this situation at install time and guide users to a resolution. Instead, you see nasty stack traces. So add a run-time check to ensure that Pillow is sane and capable of JPEG and PNG support, since both may be used internally.	2015-08-16 00:54:03 -07:00
James R. Barlow	eb04a890b2	Relax Pillow requirement for Ubuntu 14.04 LTS	2015-08-15 15:55:56 -07:00
James R. Barlow	0c53adb04f	setup: rollback lxml version to 3.3.3 - that's the latest in Ubuntu 14.04	2015-08-15 15:25:58 -07:00
James R. Barlow	ee5a43fd47	setup: suppress jhove errors	2015-08-15 15:25:30 -07:00
James R. Barlow	c43d6c2cbe	Merge branch 'develop' of https://github.com/fritz-hh/OCRmyPDF into develop. Conflicts: setup.py	2015-08-15 15:18:41 -07:00
James R. Barlow	87aeeacb04	Fix erroneous instruction to "apt-get install tesseract". Should be tesseract-ocr.	2015-08-15 15:17:38 -07:00
James R. Barlow	6b26e9cad6	Fix erroneous instruction to "apt-get install tesseract". Should be tesseract-ocr.	2015-08-15 15:12:05 -07:00
James R. Barlow	85af0f0d03	Add test case for blank PDF page	2015-08-14 00:46:50 -07:00
James R. Barlow	f6f4705ea3	Remove Java from setup.py	2015-08-14 00:44:56 -07:00
James R. Barlow	a4702bff22	Possible fix for issue #111	2015-08-13 23:10:22 -07:00
James R. Barlow	73c5c48f79	Update notes	2015-08-13 23:08:29 -07:00
James R. Barlow	adf495e8cc	Remove JHOVE. JHOVE is not an effective PDF/A validator, as detailed in this article: http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/ In short, it's buggy. Out of 670 invalid PDF/A files in a test suite, it only flagged 5. It only looks for certain problems that Ghostscript-generated PDFs are unlikely to have. So use qpdf as a final check for general ill-formed PDF problems since it is quite reliable. JHOVE 1 is no longer maintained. There's a JHOVE 2 but it has no PDF support. I also don't know if it's appropriate to bundle JHOVE, with an LGPL, under this project and its current license. Removing a dependency on Java is a huge win. A world with less Java is a world with less AbstractFactoryConstructorInterfaces.	2015-08-11 15:31:32 -07:00
James R. Barlow	9247ea00bf	Improve ruffus exception handling. ruffus swallows the return code if, in the process of handling an exception, we hit an error in ruffus' own code, which can happen. So pick through its error stack and find out if there's an interesting return code in there. Had to use eval() of all things. Also suppress the stack trace for normal error conditions that don't need one.	2015-08-11 02:19:46 -07:00
James R. Barlow	a1238d7bf9	Document override binary test	2015-08-11 00:44:43 -07:00
James R. Barlow	2d63268f0f	Work around JHOVE bug for now, so that the test passes	2015-08-11 00:23:48 -07:00
James R. Barlow	1cb5f6a90d	Refactor exit codes; test for missing tessdata. Some versions of tesseract installed by homebrew end up without a functional tessdata folder, and tesseract is not helpful in this situation, so add a new test to make sure our output is at least indicative of the problem. In the process of properly handling return codes I discovered test_override_metadata triggers a NPE inside JHOVE, probably due to the Unicode character checking. This could be specific to my JRE (1.6.0_65, Oracle) but it's probably JHOVE's fault. A valid PDF/A (per Acrobat) is still generated.	2015-08-11 00:17:02 -07:00
James R. Barlow	8d848284df	Fix code, test case: complain when GS fails to produce PDF/A. Modified pipeline to fix regression and return the proper error code if we did not produce a PDF/A as expected. The wrapper forces the output to be PDF 1.3, which is not PDF/A compliant. The funny thing is that in some cases JHOVE incorrectly states that a file is PDF/A-1b compliant, well formed and valid, even when it is not according to Acrobat XI and is missing the PDF/A metadata marker, as far as I can tell. JHOVE may not be as beneficial as hoped.	2015-08-10 16:05:00 -07:00
James R. Barlow	8fe54d1a5c	Add new test case to check invalid PDF/A case. It revealed a regression: return code not the same as v2.x for invalid PDF/A. It's also not easy to get the return code out of ruffus. Will need to tweak the final step of the pipeline.	2015-08-10 13:57:28 -07:00
James R. Barlow	11dd9f14c3	setup.py: block unsafe 'upload', say to use twine instead	2015-08-09 14:16:30 -07:00

2895 Commits