OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-08-16 04:31:45 +00:00

Author	SHA1	Message	Date
James R. Barlow	7d330afd81	Delinting	2019-01-02 13:34:45 -08:00
James R. Barlow	c771938907	Convert to f-strings where it makes sense	2018-12-31 15:01:19 -08:00
James R. Barlow	8c0009c5c8	Make pdfminer.six optional Mainly since the current release of pdfminer.six lacks a sdist, blocking homebrew packaging. Also in case other distros don't accept pdfminer.six.	2018-12-31 01:08:43 -08:00
James R. Barlow	cfc5cdf47d	pdfa: remove a pile of deprecated code It's now handled in pikepdf.	2018-12-31 00:05:13 -08:00
James R. Barlow	0880b16491	Sort imports with isort	2018-12-30 01:28:15 -08:00
James R. Barlow	06308a22ce	Reformat with black	2018-12-30 01:27:49 -08:00
James R. Barlow	80bd7de580	Generate test cache	2018-12-30 01:02:37 -08:00
James R. Barlow	8b90c45437	Drop support for Tesseract 3	2018-12-30 00:47:12 -08:00
James R. Barlow	72b920eb16	Drop support for Python 3.5	2018-12-30 00:23:26 -08:00
James R. Barlow	b4a51907d6	Detect when metadata is dropped during PDF/A conversion	2018-12-30 00:13:25 -08:00
James R. Barlow	13d20bd993	pdfinfo: tolerate PDFs that overflow and underflow the graphics stack	2018-12-15 15:10:29 -08:00
James R. Barlow	ed9bb985e2	Fix pikepdf 0.9.0	2018-12-14 23:21:13 -08:00
James R. Barlow	632dab2cc0	Replace Ghostscript DOCINFO and fix 9.25 metadata date regression We no longer use Ghostscript to manage PDF metadata, instead omitting the DOCINFO segment from the pdfmark file we generate. Instead all of the relevant metadata code has been migrated to pikepdf, and we use that API. This should be more consistent and fixes the Ghostscript version-dependent quirks. Also removes our python-xmp-toolkit dependency, except for testing.	2018-12-13 18:13:30 -08:00
James R. Barlow	414407fbd6	Deprecate encode/decode_pdf_date and remap to pikepdf version	2018-12-12 22:01:21 -08:00
James R. Barlow	9e6b54c7ed	Add test case for Type3 fonts with no Unicode mapping	2018-11-15 21:54:26 -08:00
James R. Barlow	d3b334c10f	Test case: true type font without Unicode mapping	2018-11-15 16:22:53 -08:00
James R. Barlow	cc7f2a3f02	Fix Python 3.5 pathlib regressions	2018-11-10 02:11:23 -08:00
James R. Barlow	a2170ef8d6	test: test version check code	2018-11-10 00:56:22 -08:00
James R. Barlow	5ed05e08b1	Fix "no languages" test and misuse of os.environ	2018-11-09 01:57:11 -08:00
James R. Barlow	501ce726e7	Fix two failing tests	2018-11-06 11:16:08 -08:00
James R. Barlow	2ac028c759	test: Add a basic redo OCR test	2018-11-04 15:54:41 -08:00
James R. Barlow	8b9ab25125	coverage: test compile leptonica	2018-11-02 01:55:25 -07:00
James R. Barlow	77e87abe8f	coverage: ensure get_orientation is checked	2018-11-02 01:32:20 -07:00
James R. Barlow	3be02e1e8d	coverage: improve leptonica; don't create objects with null pointers	2018-11-02 01:10:10 -07:00
James R. Barlow	5b8d197812	coverage: make it more likely timeout is tested	2018-11-02 00:41:15 -07:00
James R. Barlow	2cba62dc4f	coverage: ensure rotation is actually tested	2018-11-02 00:40:56 -07:00
James R. Barlow	288e28328f	coverage: add qpdf	2018-11-02 00:37:33 -07:00
James R. Barlow	8681693994	Set up code coverage (it works with multiprocessing now!)	2018-11-02 00:31:50 -07:00
James R. Barlow	de80fb6bc8	Fix some failing tests after --redo-ocr changes	2018-10-29 11:49:38 -07:00
James R. Barlow	f564aaf485	Remove only_ocr_text	2018-10-28 22:41:18 -07:00
James R. Barlow	58cc70725e	Reorganize around getting bboxes for visible/invisible text	2018-10-26 01:07:02 -07:00
James R. Barlow	16af753206	Add functional "redo OCR" feature Needs argument validation and some other changes. Needs testing with mixed-content PDFs. Only really works for pure invisible text at the moment.	2018-10-19 00:02:19 -07:00
James R. Barlow	b18e66e2ca	pdfinfo: learn to detect vector graphic objects	2018-10-18 01:21:51 -07:00
James R. Barlow	1495b78330	Remove cruft to support leptonica < 1.72 in test suite	2018-10-11 01:37:32 -07:00
James R. Barlow	5c229d48d5	optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp Closes #297	2018-10-04 11:53:11 -07:00
James R. Barlow	5b84549716	Change JBIG2 lossy mode to require --jbig2-lossy	2018-10-04 01:20:49 -07:00
James R. Barlow	a71e4488b3	test: fix pytest warning about direct use of a fixture	2018-10-03 15:04:46 -07:00
James R. Barlow	9fa471e053	Test: send stderr to stderr, why don't we?	2018-10-03 14:23:34 -07:00
James R. Barlow	31ef2fe907	test: this error message changed case in newer Tesseract	2018-10-03 13:58:20 -07:00
James R. Barlow	9a8ec4b210	optimize: only enable lossy JBIG2 for -O3	2018-10-03 00:38:58 -07:00
James R. Barlow	17a3fa671c	ghostscript: API docs update	2018-09-14 23:51:52 -07:00
James R. Barlow	686207ab7f	Check for and reject Adobe LiveCycle Designer PDFs These are the ones that display a "Please wait..." message. Closes #296	2018-09-13 21:50:51 -07:00
James R. Barlow	517b385fe5	Work around loss of Unicode DOCINFO in Ghostscript 9.24+ Ghostscript no longer supports UTF-16-BE-hex strings as a way of supplying Unicode data in pdfmark so we have lost this functionality too: http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=e997c6836d243ab37fe3a5f0d57974af95eb5eac For users this means setting --title, --author, etc. will not work if gs 9.24 is installed, but if the file has existing metadata it might work. For now we enforce police-state-strict ASCII, until there's time to implement proper metadata editing. Relevant tests set to xfail.	2018-09-13 21:33:39 -07:00
James R. Barlow	795019b0c1	Work around invalid TOC entries Kodak Capture Desktop and probably other software creates a /Outlines entry with /First being set to an invalid indirect reference to an object that hasn't been created. This is legal in the PDF spec but problematic for qpdf. The objgen will be (max valid object ID + 1, 0). Because we create new objects in _weave, some TOC entries will end up assigned to new objects we create. Typically /ProcSet. We solve the issue by refactoring page traversal and then doing it twice, once to resolve all references (eliminating the null reference problem) and a second pass to make our changes.	2018-09-11 14:44:16 -07:00
James R. Barlow	3aac3a98ca	tests: Migrate metadata tests to pikepdf For some reason PyPDF2 has begun to trigger internal errors in pytest on macOS alone. Not sure why, but nothing is wrong that I can see. Seemed like an opportune time to switch to pikepdf; found some new issues in the process anyway.	2018-09-10 16:06:01 -07:00
James R. Barlow	7aa4e60af2	Explain pytest --runslow	2018-08-03 00:57:59 -07:00
James R. Barlow	55eb481f30	Add intensive (optional) rotation test	2018-08-03 00:42:59 -07:00
James R. Barlow	c171cb7286	Merge img2pdf 0.3.0 fix from v6.2.3	2018-08-01 15:17:33 -07:00
James R. Barlow	1d09061130	Revert previous commit and reject input images with alpha channel Decided on this for simplicity of old release branch. Modifies baiona.png by stripping alpha, adds baiona_alpha which includes the alpha.	2018-07-31 23:45:28 -07:00
James R. Barlow	a2203b2447	Discard alpha channel when triaging images	2018-07-25 22:23:41 -04:00

579 Commits