OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-01-08 13:11:17 +00:00

Author	SHA1	Message	Date
James R. Barlow	bd534c3313	main.py -> __main__.py. Executing a package with python -m packagename will check for __main__.py inside the package. In other words, main.py should always have been named __main__.py. In the unlikely event that someone depends on "import ocrmypdf.main" being meaningful, main.py continues to exist and replicates the behavior of __main__. (It's unlikely because import ocrmypdf.main does unpythonic ruffus-related things at import time, essentially configuring itself to work with sys.argv. To fix another day.) This should solve the problem of Debian needing to run test suites before installation and afterwards for continuous integration without having to patch either file, as python -m ocrmypdf will follow import order. That is, if the current directory contains "ocrmypdf/" (e.g. staging a new version) then that will be tested, else sys.path will be checked.	2016-08-31 17:01:42 -07:00
James R. Barlow	71b54035ba	Bug fix issue #89: trying to perform arithmetic on IndirectObject. TypeError: bad operand type for unary -: 'IndirectObject'	2016-08-31 10:25:58 -07:00
James R. Barlow	bc11454e1c	Help text: example of shell pipeline with img2pdf	2016-08-25 14:58:25 -07:00
James R. Barlow	38fe14b108	Make final PDF/A output message less obtuse	2016-08-25 14:46:40 -07:00
James R. Barlow	1b7b2f3695	v4.2.2 release notes, documentation improvements	2016-08-25 14:46:09 -07:00
James R. Barlow	27a3813207	Recover input filename from symlink on error message. The recent commit to accept files from stdin broke the feature of returning the input filename on an error, returning the temp filename instead, which is confusing.	2016-08-23 17:38:28 -07:00
James R. Barlow	e08c42fd3d	Tweak pipeline again	2016-08-09 22:40:29 -07:00
James R. Barlow	16901f7134	Accept input from stdin if input filename is '-'	2016-08-09 15:46:24 -07:00
James R. Barlow	35addb8a33	Complain if Chinese is requested with settings known to not work. Should extend test for other Asian languages.	2016-08-03 01:29:12 -07:00
James R. Barlow	d32ea8d0dd	Remove dead code from qpdf merge + PyPDF2 metadata patching. I tried "qpdf merge + PyPDF2 metadata patching" first. The problem is that PyPDF2 produces a PDF 1.3 file by default and generally I have less confidence in it. New approach is to stuff the Document Info metadata in the first page with PyPDF2, cross fingers and use qpdf to merge. It's not quite as clean and might harm the first page, but it's better than shipping files produced by PyPDF2.	2016-08-03 01:28:27 -07:00
James R. Barlow	12575d594a	Improve PDF/A validity checking at end	2016-08-03 01:26:16 -07:00
James R. Barlow	0746083301	Fix failing test case - unbound local variable in finally block	2016-08-03 01:00:38 -07:00
James R. Barlow	5c99acf6d1	Experimental change to use qpdf to merge files (disables Ghostscript). All tests but one pass; the exception is test_input_file_not_a_pdf. Not sure if PyPDF2 metadata generation will mangle the first page.	2016-08-03 00:56:44 -07:00
James R. Barlow	ebe68de4ff	Functional qpdfmerge with PyPDF2 for DocumentInfo block. Tests mostly passing. For the moment this is the new default, although PyPDF2 produces a PDF-1.3, which will be wrong for some contents and possibly should be repaired with qpdf. Again. Looks like it could work better to merge with PyPDF2 and fix everything with qpdf.	2016-08-02 16:48:13 -07:00
James R. Barlow	b17c6a146d	Experimental qpdf merging. Does not copy /Catalog metadata, but otherwise functional.	2016-08-02 02:19:02 -07:00
James R. Barlow	0b24f971cd	ocrmyimage: complain about ICC profiles being presumed	2016-08-02 01:22:36 -07:00
James R. Barlow	bc5d3824bd	Don't overload --oversample, use --image-dpi instead for images	2016-07-31 02:09:30 -07:00
James R. Barlow	4356983707	Suppress overly long stack traces on traverse_ruffus_exception	2016-07-31 02:06:44 -07:00
James R. Barlow	2414b79ee6	More cleanup of exception related errors	2016-07-31 01:48:13 -07:00
James R. Barlow	968e1546f0	Refactor image file triage	2016-07-31 01:47:57 -07:00
James R. Barlow	f385772d21	Refactor "is this an iterable that's not a string?" test	2016-07-29 15:25:02 -07:00
James R. Barlow	d257c83520	Most tests were failing at split_pages(). It seems that ruffus sometimes decides to send a ['inputfile.pdf'] instead of a bare string.	2016-07-29 14:59:17 -07:00
James R. Barlow	7b72ffec4f	ocrmyimage: better handling of missing/invalid DPI	2016-07-29 14:38:07 -07:00
James R. Barlow	757f6826dc	ocrmyimage - Attempt conversion to PDF if input file is not a PDF. First cut. May have broken ruffus errors again too.	2016-07-29 14:03:19 -07:00
James R. Barlow	d70e3d3753	ruffus exceptions: for clarity only, don't iterate strings. It's a good habit to ensure any iterator test is explicit about allowing or disallowing strings.	2016-07-29 13:31:24 -07:00
James R. Barlow	fef35e4eb2	Fix handling of DPI for rare case of JPEG recompression after deskew/clean. This test is exercised by page 4 of multipage.pdf. If all images are JPEGs, and one of deskew/clean removes DPI information, make sure that we can get the right information back and that the DPI stays square.	2016-07-29 01:34:52 -07:00
James R. Barlow	8f77576dc4	Fix non-square image resolution for "hocr" case; use img2pdf 0.2.1. Tesseract renderer not immediately fixable.	2016-07-28 16:43:51 -07:00
James R. Barlow	b3fcf24a26	Refactor DPI: fix regressions in test suite. Some called functions are particular about the data format of DPI and don't like to deal with the Decimal() returned by PyPDF2. Convert to float and int where needed.	2016-07-28 00:19:32 -07:00
James R. Barlow	16e4d342d2	Bug fix: --force-ocr should still run on pages with no images. Useful for people who want to reprocess text. This also requires --oversample because DPI is undefined. To be fixed in next commit.	2016-07-27 15:06:49 -07:00
James R. Barlow	bbd02926e1	Add helpful error message for PDFs that use algorithm 4	2016-06-23 13:13:17 -07:00
James R. Barlow	349ec5c81f	Provide more helpful error message if pypdf can't merge pages	2016-04-28 14:02:12 -07:00
James R. Barlow	fe14cb57c0	Fix ruffus exception output. I found this issue in ruffus 2.6.3 (https://github.com/bunbun/ruffus/issues/65), also discussed in https://github.com/bunbun/ruffus/pull/67. In ruffus 2.6.3, RethrownJobError doesn't follow the normal exception conventions, so its exceptions cause problems when they cross process boundaries. This change carefully examines the various forms of ruffus exception objects that can appear in 2.6.3 and parses them more carefully. It also removes any direct posting of the exception to the logger because this triggers another serializing of the exception object, mutating it further.	2016-04-28 00:38:50 -07:00
James R. Barlow	e877d37ac8	--rotate-pages: Only apply rotation if we're reasonably confident. Take the threshold from tesseract's default value for -psm 1.	2016-04-14 13:49:44 -07:00
James R. Barlow	322085933b	unpaper: fix check for missing and old versions, add test case	2016-03-10 15:37:09 -08:00
James R. Barlow	6a380ee99c	Fix temporary file placed in wrong folder	2016-02-27 00:51:47 -08:00
James R. Barlow	dad2198394	Log information about detected page orientations in a summary line	2016-02-26 01:07:59 -08:00
James R. Barlow	e40fdc502d	Always dump stack trace for unexpected errors	2016-02-26 01:06:59 -08:00
James R. Barlow	71fbda8bf6	Adjust page orientation parsing to deal with change in Tess 3.04.01	2016-02-20 01:32:56 -08:00
James R. Barlow	f3b0434a87	Improve ability to capture error messages from tesseract on a crash	2016-02-19 03:48:49 -08:00
James R. Barlow	d4ef3411e0	Suppress --pdf-renderer tesseract warning in Docker image. Since the corrected font is provided in the Docker image, there's no reason to show the warning.	2016-02-17 01:03:20 -08:00
James R. Barlow	60b2eb1455	Fix JPEG DPI: Pillow expects dpi=(x,y)	2016-02-16 07:29:20 -08:00
James R. Barlow	71e493a810	Fix case of JPEG missing DPI field	2016-02-16 05:29:32 -08:00
James R. Barlow	c50e3f1329	Complain about older tesseracts that don't have sharp2.ttf installed	2016-02-15 16:43:41 -08:00
James R. Barlow	7c691c21ab	Fix image layer rotation for pages with nonzero crop boxes	2016-02-10 17:48:33 -08:00
James R. Barlow	4ec51729d8	Partial fix for images not anchored to (0, 0)	2016-02-10 17:14:48 -08:00
James R. Barlow	6510bcad19	DPI information not transferred automatically from PNG to JPEG	2016-02-09 02:18:54 -08:00
James R. Barlow	1928a64cae	Better logging output for autorotation	2016-02-08 23:42:25 -08:00
James R. Barlow	1ba8b1aa4b	unpaper is lousy at deskewing, so let leptonica do it	2016-02-08 15:26:33 -08:00
James R. Barlow	2752bda80b	Merge branch 'feature/leptdeskew' into feature/logging. Need leptonica for testing now, I think. # Conflicts: # ocrmypdf/tesseract.py # requirements.txt # setup.py	2016-02-08 12:34:48 -08:00
James R. Barlow	d30a879e2d	Fix test suite by running select_image_for_pdf unconditionally. The change that caused the problem was a minor optimization for the tesseract renderer path: it pulled an image from select_image_for_pdf so that it could use a JPEG instead of a PNG, rather than taking it from preprocess_clean, where it would only get a PNG and make large files.	2016-02-08 02:33:03 -08:00


139 Commits
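
Several of the 2016-07-29 commits above (d70e3d3753, f385772d21, d257c83520) revolve around one idea: testing for "an iterable that is not a string", because ruffus sometimes passes ['inputfile.pdf'] where a bare string is expected. The following is a minimal sketch of that kind of check, not the project's actual code; the function names is_iterable_notstr and unwrap_single_filename are illustrative assumptions and may not match anything in the repository.

```python
from collections.abc import Iterable


def is_iterable_notstr(obj):
    """Return True for iterables other than str and bytes.

    Strings are iterable, so a plain isinstance(obj, Iterable) test
    would treat 'inputfile.pdf' as a sequence of characters.
    """
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


def unwrap_single_filename(arg):
    """Hypothetical helper: accept 'x.pdf' or ['x.pdf'], return 'x.pdf'."""
    if is_iterable_notstr(arg):
        items = list(arg)
        if len(items) != 1:
            raise ValueError(f"expected exactly one filename, got {len(items)}")
        return items[0]
    return arg
```

With a helper like this, a pipeline step can normalize its input once, at the top, and treat it as a single filename from then on, whichever form the caller sent.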