OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-10-23 22:09:37 +00:00

Author	SHA1	Message	Date
James R. Barlow	16d24f1166	Bump version to -rc4 (tag: v3.0-rc4)	2015-08-05 23:26:38 -07:00
James R. Barlow	97015ef775	Add a test case to check on the @argumentsfile syntax	2015-08-05 23:17:38 -07:00
James R. Barlow	2744dafb74	New test case: ensure metadata is preserved from input to output	2015-08-05 17:09:38 -07:00
James R. Barlow	7b268dbe1a	Remove duplication in test case	2015-08-05 16:57:04 -07:00
James R. Barlow	8fcbbcef94	Improve usage text	2015-08-05 16:56:53 -07:00
James R. Barlow	8f93f0a06e	Tidy docs	2015-08-05 16:56:30 -07:00
James R. Barlow	387142488c	Kill duplicate file	2015-07-31 01:57:16 -07:00
James R. Barlow	6887e232fc	Bug fix: exception from process timeout should be TimeoutExpired	2015-07-31 00:06:58 -07:00
James R. Barlow	6ac7ffd77b	Merge branch 'feature/drop-mupdf-poppler' into develop	2015-07-30 23:38:27 -07:00
James R. Barlow	b28faa582a	Automatically use all available cores unless told not to	2015-07-30 23:20:21 -07:00
James R. Barlow	454ee029c8	Run final ghostscript in multithreaded mode. This step is serialized, so not all cores are busy at this stage.	2015-07-30 23:20:04 -07:00
James R. Barlow	a036de318e	Replace mupdf and poppler with qpdf. Drop two dependencies and replace them with one that does the job of both. Smells like progress. mupdf does PDF file repair and rendering; poppler does rendering and page splitting; qpdf does PDF file repair and page splitting; ghostscript does PDF file repair, rendering, and page splitting (sort of). So we use qpdf. Ghostscript's page splitting is supposedly less efficient because it reprints the page (PDF -> Postscript -> PDF) and possibly loses quality. qpdf's library could be used to improve performance. This causes a slight performance regression: py.test tests/test_main.py::test_maximum_options went from 187 seconds up to 192. This is likely due to O(n) serialized invocations of qpdf compared to a single serialized call to pdfseparate. Could improve on this situation by using the example code in qpdf (pdf-split-pages.cc), or by creating marker files in split_pages() and then writing a new @transform function that would split pages on each CPU. Probably not worth it, overall, unless this causes problems on files with hundreds of pages.	2015-07-30 04:16:35 -07:00
James R. Barlow	9918c4020e	Use img2pdf in test case because it does a better job	2015-07-30 03:35:56 -07:00
jbarlow83	3d6264e1b8	Fix formatting of 'motivation'	2015-07-28 17:58:26 -07:00
jbarlow83	1c25270503	Improve instructions for users that need sudo or venv	2015-07-28 17:55:56 -07:00
James R. Barlow	47e50f82c4	setup.py: allow mutool 1.7	2015-07-28 13:37:32 -07:00
James R. Barlow	27ecdfbba8	More fixes to error cases in setup.py	2015-07-28 13:05:23 -07:00
James R. Barlow	6901550065	Fix some installer issues	2015-07-28 12:41:24 -07:00
jbarlow83	6e6f918630	Actually link the release notes	2015-07-28 12:21:57 -07:00
jbarlow83	4633812246	Fix git clone command with one I tested ;)	2015-07-28 12:20:09 -07:00
jbarlow83	14bd1555aa	Update README with more detailed instructions	2015-07-28 12:15:37 -07:00
James R. Barlow	b9d7687fa0	Fixes: clarify install instructions and reactivate external program checks (tag: v3.0-rc2)	2015-07-28 05:44:15 -07:00
James R. Barlow	93b36965e2	Merge branch 'develop'. Conflicts: RELEASE_NOTES.md, src/config.sh, src/hocrTransform.py, src/ocrPage.sh	2015-07-28 04:59:49 -07:00
James R. Barlow	9e0c443c2f	-rc2: because pypi won't accept -rc1	2015-07-28 04:55:10 -07:00
James R. Barlow	60832152b1	Don't mess with options	2015-07-28 04:46:21 -07:00
James R. Barlow	6a160d22fe	Update release notes, add copyrights	2015-07-28 04:36:58 -07:00
James R. Barlow	e35526192c	More test cases	2015-07-28 03:02:35 -07:00
James R. Barlow	bea57bdded	More test cases for other parameters	2015-07-28 02:31:18 -07:00
James R. Barlow	2a9da225e4	Minor tweaks to uncommon arguments	2015-07-28 02:25:50 -07:00
James R. Barlow	a3f37de9b5	Test cases for --tesseract-timeout	2015-07-28 01:47:30 -07:00
James R. Barlow	6064160953	Get rid of subprocess call on import of tesseract, unpaper -- bit nasty	2015-07-28 01:00:29 -07:00
James R. Barlow	8508141314	Drop nose; all tests working reasonably again. The real issue, though, was that the ruffus pipeline cannot be executed twice in the same process due to its reliance on global variables. The new OO pipeline in ruffus 2.6 would be one resolution that would allow for more comprehensive testing, as opposed to farming out the execution to subprocess and inspecting the results, as is currently done.	2015-07-28 00:43:22 -07:00
James R. Barlow	1c95597882	nose can't really handle external tests, so looking into py.test instead. Specifically, it trips over the need to reimport ocrmypdf.main. That in turn raises questions about whether to make that function into an external script that imports ocrmypdf... or something else. Would be possible with a loop that manipulates sys_argv and then reloads ocrmypdf.main; might need that anyway.	2015-07-27 22:07:04 -07:00
James R. Barlow	587fa63c8e	--oversample: Default to 0	2015-07-27 20:42:16 -07:00
James R. Barlow	b40eec4cb0	Add --oversample test for hocr rendering	2015-07-27 17:18:02 -07:00
James R. Barlow	7bcd48c269	Add test to confirm that metadata is transferred to final PDF/A	2015-07-27 16:11:51 -07:00
James R. Barlow	2e7cd52c0f	Improve argument handling, test cases	2015-07-27 15:39:54 -07:00
James R. Barlow	77d4cb367e	Put ghostscript in a module	2015-07-27 15:22:00 -07:00
James R. Barlow	2c45c5abc6	Implement tesseract timeout	2015-07-27 04:23:37 -07:00
James R. Barlow	a89afabd79	Implement tesseract PDF rendering as an alternative. It's much better at rendering text baselines than hocr and seems to produce small file sizes, so it's progress. Not available for Tesseract 3.02, obviously, so both modes need to remain available.	2015-07-27 04:20:49 -07:00
James R. Barlow	03f7c9bf07	setup.py: Only do program checks when installing	2015-07-27 02:14:51 -07:00
James R. Barlow	d5f4862749	setup.py: check for third party program requirements	2015-07-27 01:45:17 -07:00
James R. Barlow	8aced0b6d3	More testing: JPEG	2015-07-27 00:25:43 -07:00
James R. Barlow	6b9adef684	Don't create inline images in output PDFs... except that Ghostscript will sometimes turn out-of-line images into inline images on its own, possibly if file size is small.	2015-07-26 21:43:49 -07:00
James R. Barlow	5440d988fc	Make this PDF a whole image page. Originally it had a smaller image centred in a page, which is not quite supported.	2015-07-26 18:32:50 -07:00
James R. Barlow	30da4fc569	pageinfo: drop pdftotext and use PyPDF instead	2015-07-26 18:23:37 -07:00
James R. Barlow	2c1b5e100b	Test cases for pageinfo; complain about inline images	2015-07-26 18:18:41 -07:00
James R. Barlow	3684f278ed	Add some pageinfo test cases; found problem with inline images	2015-07-26 15:24:42 -07:00
James R. Barlow	6c3cb6acba	Remove redundant *res_render	2015-07-26 12:56:10 -07:00
James R. Barlow	b98ba8d174	Replace .md with .rst. GitHub supports both, and PyPI expects .rst files, so use .rst and make everyone happy. Auto-converted using pandoc: find . -name '*.md' | parallel pandoc --from=markdown --to=rst --output='{.}.rst' '{}' (see http://bfroehle.com/2013/04/26/converting-md-to-rst/)	2015-07-26 03:01:18 -07:00


2895 Commits