OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-08-15 12:12:05 +00:00

Author	SHA1	Message	Date
James R. Barlow	aa0ec40102	Change license of all GPLv3 files to MPL-2.0 https://github.com/jbarlow83/OCRmyPDF/issues/600	2020-08-05 00:44:42 -07:00
James R. Barlow	ebfe4f0d29	Fix issue #582 - PDF/A acquires title "Untitled" after conversion	2020-06-20 02:01:16 -07:00
James R. Barlow	7b9025f397	Convert generate_pdfa to plugin	2020-06-08 22:28:38 -07:00
James R. Barlow	b109445215	Move Ghostscript rasterize_pdf to plugin	2020-06-08 17:10:27 -07:00
James R. Barlow	1598f2f0e5	Abolish spoof_tesseract_noop	2020-06-01 03:07:53 -07:00
James R. Barlow	9af94ac9b7	pipeline: use OCR engine abstraction instead of Tesseract	2020-05-16 01:28:56 -07:00
James R. Barlow	977665d2b6	Delint some tests	2020-05-08 03:49:33 -07:00
James R. Barlow	c85278b31d	Delinting	2020-05-03 00:53:29 -07:00
James R. Barlow	5dbc080fa0	Rename PDFContext->PdfContext	2020-05-02 04:32:46 -07:00
James R. Barlow	e02f6c1e97	Support plugin invocation with API	2020-05-02 03:34:31 -07:00
James R. Barlow	b3b61c152c	Handle malformed DocumentInfo (#497) User submitted a PDF in which /Trailer /Info pointed to the XMP metadata block instead of a DocumentInfo dictionary. Fix and add test.	2020-03-03 03:27:01 -08:00
James R. Barlow	4a27124eab	Simplify metadata for invalid xml in output Removes possibly non-free resource enron1.pdf.	2020-02-12 00:07:18 -08:00
James R. Barlow	c5edff2c2f	Sort imports	2019-12-19 15:31:18 -08:00
James R. Barlow	a3726e4ce3	Fix test_metadata: use mmap in a Windows and POSIX compatible way	2019-12-04 17:13:52 -08:00
James R. Barlow	6fbeb6347d	Merge api (without plugins)	2019-07-27 02:04:01 -07:00
James R. Barlow	12769b96e5	Drop support for omitting pdfminer.six	2019-07-10 13:37:01 -07:00
James R. Barlow	fb933edc0f	Use newer pytest tmp_path API	2019-06-01 01:55:51 -07:00
James R. Barlow	ef1ef1cdf0	Fix test invalidated by Python 3.6 logging fixes	2019-05-17 15:20:07 -07:00
James R. Barlow	c904b430b6	Merge master into api branch; all test pass	2019-05-14 16:33:02 -07:00
James R. Barlow	482cb788ed	Don't use MagicMock() as a dummy logger in pytest	2019-05-11 12:44:17 -07:00
mawi	c92ccc6134	fix: tests	2019-04-08 14:57:42 +02:00
mawi	783a128bd1	feat: move to sync (none ETL) implementation - remove ruffus	2019-04-04 21:02:38 +02:00
James R. Barlow	3f1d9ef99c	Fix tests for move to Alpine dockerfile	2019-02-26 12:30:21 -08:00
James R. Barlow	f34b3015b2	Prevent Ghostscript from generating invalid XMP metadata If DocumentInfo contains NULs Ghostscript will generate XMP with NULs which is not allowed. Repair DocumentInfo before Ghostscript sees it.	2019-01-04 13:20:41 -08:00
James R. Barlow	7d330afd81	Delinting	2019-01-02 13:34:45 -08:00
James R. Barlow	c771938907	Convert to f-strings where it makes sense	2018-12-31 15:01:19 -08:00
James R. Barlow	cfc5cdf47d	pdfa: remove a pile of deprecated code It's now handled in pikepdf.	2018-12-31 00:05:13 -08:00
James R. Barlow	0880b16491	Sort imports with isort	2018-12-30 01:28:15 -08:00
James R. Barlow	06308a22ce	Reformat with black	2018-12-30 01:27:49 -08:00
James R. Barlow	72b920eb16	Drop support for Python 3.5	2018-12-30 00:23:26 -08:00
James R. Barlow	b4a51907d6	Detect when metadata is dropped during PDF/A conversion	2018-12-30 00:13:25 -08:00
James R. Barlow	ed9bb985e2	Fix pikepdf 0.9.0	2018-12-14 23:21:13 -08:00
James R. Barlow	632dab2cc0	Replace Ghostscript DOCINFO and fix 9.25 metadata date regression We no longer use Ghostscript to manage PDF metadata, instead omitting the DOCINFO segment from the pdfmark file we generate. Instead all of the relevant metadata code has been migrated to pikepdf, and we use that API. This should be more consistent and fixes the Ghostscript version-dependent quirks. Also removes our python-xmp-toolkit dependency, except for testing.	2018-12-13 18:13:30 -08:00
James R. Barlow	414407fbd6	Deprecate encode/decode_pdf_date and remap to pikepdf version	2018-12-12 22:01:21 -08:00
James R. Barlow	517b385fe5	Work around loss of Unicode DOCINFO in Ghostscript 9.24+ Ghostscript no longer supports UTF-16-BE-hex strings as a way of supplying Unicode data in pdfmark so we have lost this functionality too: http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=e997c6836d243ab37fe3a5f0d57974af95eb5eac For users this means setting --title, --author, etc. will not work if gs 9.24 is installed, but if the file has existing metadata it might work. For now we enforce police-state-strict ASCII, until there's time to implement proper metadata editing. Relevant tests set to xfail.	2018-09-13 21:33:39 -07:00
James R. Barlow	795019b0c1	Work around invalid TOC entries Kodak Capture Desktop and probably other software creates a /Outlines entry with /First being set to an invalid indirect reference to an object that hasn't been created. This is legal in the PDF spec but problematic for qpdf. The objgen will be (max valid object ID + 1, 0). Because we create new objects in _weave, some TOC entries will end up assigned to new objects we create. Typically /ProcSet. We solve the issue by refactoring page traversal and then doing it twice, once to resolve all references (eliminating the null reference problem) and a second pass to make our changes.	2018-09-11 14:44:16 -07:00
James R. Barlow	3aac3a98ca	tests: Migrate metadata tests to pikepdf For some reason PyPDF2 has begun to trigger internal errors in pytest on macOS alone. Not sure why, but nothing is wrong that I can see. Seemed like an opportune time to switch to pikepdf; found some new issues in the process anyway.	2018-09-10 16:06:01 -07:00
James R. Barlow	1cc9d2d3d1	Fix path error on Py3.5	2018-07-08 01:01:06 -07:00
James R. Barlow	58642aa98b	Fix issue #275: doesn't work when installed in non-Unicode path Closes #275	2018-07-07 01:35:05 -07:00
James R. Barlow	45cb4525cf	Remove other references to PyMuPDF	2018-06-13 01:02:53 -07:00
James R. Barlow	3b820ffa7b	test_metadata: change from xfail to skipif without fitz	2018-05-17 00:14:57 -07:00
James R. Barlow	5e20d1d554	metadata: Fix failing test on __getitem__['/CreationDate']	2018-05-16 13:46:07 -07:00
James R. Barlow	24b0adfacc	Merge branch 'master' into develop	2018-05-10 20:54:55 -07:00
James R. Barlow	acc6698ab3	Make XML metadata test actually work	2018-05-10 20:37:10 -07:00
James R. Barlow	abed8e034e	Add metadata preservation test from stash	2018-05-10 16:43:28 -07:00
James R. Barlow	8a9f174f63	Fix XMP validation issue with /CreationDate Related to previous validation issue. If the /CreationDate had no timezone, Ghostscript also creates invalid metadata. Work around this. Also fix up PDF date decoding, and transcode dates to standardize them.	2018-05-03 16:30:20 -07:00
James R. Barlow	1a516b2af9	Fix regression: time stamp test suite failures	2018-04-17 16:59:21 -07:00
James R. Barlow	7a1cd39b21	Fix creation date metadata lost from input Closes #247	2018-04-02 17:53:39 -07:00
James R. Barlow	8d9be43c60	test_bookmarks_preserved won't raise ImportError any more Due to trapping this in ocrmypdf.lib	2018-03-28 23:22:55 -07:00
James R. Barlow	5becfcf8ea	Refactor fitz ImportError trap	2018-03-27 21:38:02 -07:00

53 Commits