OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-11-17 02:25:51 +00:00

Author	SHA1	Message	Date
jbarlow83	4633812246	Fix git clone command with one I tested ;)	2015-07-28 12:20:09 -07:00
jbarlow83	14bd1555aa	Update README with more detailed instructions	2015-07-28 12:15:37 -07:00
James R. Barlow	b9d7687fa0	Fixes: clarify install instructions and reactivate external program checks v3.0-rc2	2015-07-28 05:44:15 -07:00
James R. Barlow	93b36965e2	Merge branch 'develop' # Conflicts: # RELEASE_NOTES.md # src/config.sh # src/hocrTransform.py # src/ocrPage.sh	2015-07-28 04:59:49 -07:00
James R. Barlow	9e0c443c2f	-rc2: because pypi won't accept -rc1	2015-07-28 04:55:10 -07:00
James R. Barlow	60832152b1	Don't mess with options	2015-07-28 04:46:21 -07:00
James R. Barlow	6a160d22fe	Update release notes, add copyrights	2015-07-28 04:36:58 -07:00
James R. Barlow	e35526192c	More test cases	2015-07-28 03:02:35 -07:00
James R. Barlow	bea57bdded	More test cases for other parameters	2015-07-28 02:31:18 -07:00
James R. Barlow	2a9da225e4	Minor tweaks to uncommon arguments	2015-07-28 02:25:50 -07:00
James R. Barlow	a3f37de9b5	Test cases for --tesseract-timeout	2015-07-28 01:47:30 -07:00
James R. Barlow	6064160953	Get rid of subprocess call on import of tesseract, unpaper -- bit nasty	2015-07-28 01:00:29 -07:00
James R. Barlow	8508141314	Drop nose, all tests working reasonably again Although the real issue was that the ruffus pipeline cannot be executed twice in the same process due to its reliance on global variables. The new OO pipeline in ruffus 2.6 would be one resolution that would allow for more comprehensive testing as opposed to farming out the execution to subprocess and inspecting the results, as is currently done.	2015-07-28 00:43:22 -07:00
James R. Barlow	1c95597882	nose can't really handle external tests so looking into py.test instead Specifically it trips over the need to reimport ocrmypdf.main. That in turn raises questions about whether to make that function into an external script that imports ocrmypdf... or something else. Would be possible with a loop that manipulates sys_argv and then reloads ocrmypdf.main; might need that anyway.	2015-07-27 22:07:04 -07:00
James R. Barlow	587fa63c8e	--oversample: Default to 0	2015-07-27 20:42:16 -07:00
James R. Barlow	b40eec4cb0	Add --oversample test for hocr rendering	2015-07-27 17:18:02 -07:00
James R. Barlow	7bcd48c269	Add test to confirm that metadata is transferred to final PDF/A	2015-07-27 16:11:51 -07:00
James R. Barlow	2e7cd52c0f	Improve argument handling, test cases	2015-07-27 15:39:54 -07:00
James R. Barlow	77d4cb367e	Put ghostscript in a module	2015-07-27 15:22:00 -07:00
James R. Barlow	2c45c5abc6	Implement tesseract timeout	2015-07-27 04:23:37 -07:00
James R. Barlow	a89afabd79	Implement tesseract PDF rendering as an alternative It's much better at rendering text baselines than hocr and seems to produce small file sizes, so it's progress. Not available for Tesseract 3.02 obviously, so both modes need to remain available.	2015-07-27 04:20:49 -07:00
James R. Barlow	03f7c9bf07	setup.py: Only do program checks when installing	2015-07-27 02:14:51 -07:00
James R. Barlow	d5f4862749	setup.py: check for third party program requirements	2015-07-27 01:45:17 -07:00
James R. Barlow	8aced0b6d3	More testing: JPEG	2015-07-27 00:25:43 -07:00
James R. Barlow	6b9adef684	Don't create inline images in output PDFs ...except that Ghostscript will sometimes turn out of line images into inline images on its own, possibly if file size is small.	2015-07-26 21:43:49 -07:00
James R. Barlow	5440d988fc	Make this PDF a whole image page Originally it had a smaller image centred in a page, which is not quite supported.	2015-07-26 18:32:50 -07:00
James R. Barlow	30da4fc569	pageinfo: drop pdftotext and use PyPDF instead	2015-07-26 18:23:37 -07:00
James R. Barlow	2c1b5e100b	Test cases for pageinfo; complain about inline images	2015-07-26 18:18:41 -07:00
James R. Barlow	3684f278ed	Add some pageinfo test cases; found problem with inline images	2015-07-26 15:24:42 -07:00
James R. Barlow	6c3cb6acba	Remove redundant *res_render	2015-07-26 12:56:10 -07:00
James R. Barlow	b98ba8d174	Replace .md with .rst Github supports both, and PyPI expects .rst files, so use .rst and make everyone happy. Auto-converted using pandoc: find . -name '*.md' | parallel pandoc --from=markdown --to=rst --output='{.}.rst' '{}' http://bfroehle.com/2013/04/26/converting-md-to-rst/	2015-07-26 03:01:18 -07:00
James R. Barlow	d3088829af	More packaging changes: move jhove, fix console script	2015-07-26 01:52:08 -07:00
James R. Barlow	9aaaba1714	Packaging stuff	2015-07-25 23:45:13 -07:00
Jim Barlow	9adb0d696f	Prepare for Python packaging - move to ocrmypdf folder	2015-07-25 18:22:04 -07:00
Jim Barlow	c270f1ba5f	Update release notes so far	2015-07-25 18:18:37 -07:00
Jim Barlow	7b255b575a	Metadata override from command line	2015-07-25 18:12:25 -07:00
Jim Barlow	d7a9f3a2ab	Transfer Unicode document information from input PDF to output PDF What a pain getting Unicode right, but there it is. I cannot find anything to confirm that it is acceptable to put the PDF/A definition file at the end of the Ghostscript inputs. I did this because Ghostscript seems to copy document info from the last document on the list so reportlab's information "wins" in normal order, so it fixes that issue, and reportlab 'helpfully' fills in all of those fields even if it does not have information. It could also work to pass document information along to reportlab, and set it in each output PDF: .debug.pdf, .rendered.pdf, and .page.pdf to ensure that whatever page is last in the pipeline has the right information. Or perhaps it's possible to write a Postscript trailer that overwrites any previous docinfo with no side effects, but I can't find any information on how to do that. I don't think it's worth pursuing unless this arrangement causes some problem with PDF/A generation. On a minor note, Jhove misreads the way I have encoded the strings in producing its validation log. It reads them as UTF-16 little endian, so will tend to produce a string of Asian characters in place of the real data.	2015-07-25 18:05:25 -07:00
Jim Barlow	abf2e7e9bb	Copy document metadata from source document into output (untested) This works for ASCII only; will do Unicode version.	2015-07-25 15:31:02 -07:00
Jim Barlow	72e5fa9ba0	Reimplement debug pages	2015-07-25 14:14:02 -07:00
Jim Barlow	32c1078d2c	Reimplement skip text pages	2015-07-25 14:13:32 -07:00
Jim Barlow	133f901a69	Change @subdivide to @split @split is for "1 to many" operations, so it's the right tool for this case.	2015-07-25 02:58:34 -07:00
Jim Barlow	42cd683ec0	Try to make pdfinfo less obnoxious by not printing too many decimals	2015-07-25 02:47:59 -07:00
Jim Barlow	151eb05377	For now, unpaper is the only deskew provider	2015-07-25 01:46:16 -07:00
Jim Barlow	16177d0a52	Remove ability to override temporary (working) folder Little point to this feature - on most platforms the environment variable can be overridden if desired to set a new root location. At the same time, this change removes the ability to resume a partially executed pipeline by deleting all of the results on failure. If -k is provided then the temporary files will survive but there's no way to resume from them. Because resuming doesn't really work anyway and would only be useful to users experiencing very specific problems, this is probably not worth it, so no major loss. The intent of -k is to assist debugging.	2015-07-25 01:45:26 -07:00
Jim Barlow	5ce544289f	Automatically try to use all available CPUs	2015-07-25 01:10:14 -07:00
Jim Barlow	77bd35c3c7	Remove duplicate test folder	2015-07-25 01:00:40 -07:00
Jim Barlow	0c5c208db0	Goodbye, so long, farewell, shell...	2015-07-25 00:57:07 -07:00
Jim Barlow	60eb745331	Split selecting final image and render PDF result into separate tasks Simplifies the logic - one deals with all images, the other deals with an image and .hocr. Also add JPEG reconversion.	2015-07-25 00:54:00 -07:00
Jim Barlow	9f90b5cb0a	Modularize unpaper; get -d and -c working again	2015-07-25 00:22:56 -07:00
Jim Barlow	5adff94545	Remove more dead/old code	2015-07-24 15:41:24 -07:00

2676 Commits