62 Commits

Author SHA1 Message Date
James R. Barlow
f51164aff8 Upgrade test version of pymupdf 2021-11-13 00:53:41 -08:00
James R. Barlow
6f58a14351 pdfa: remove deprecated pkg_resources based access and tests 2021-11-13 00:52:03 -08:00
James R. Barlow
7ba04267b1 Remove shims to support for old versions of pikepdf < 4 2021-11-13 00:43:20 -08:00
James R. Barlow
c725bf79da flake8 delinting 2021-09-21 16:37:03 -07:00
James R. Barlow
e788dde607
tests: eliminate unnecessary mmap 2021-04-07 02:11:31 -07:00
James R. Barlow
aa115a8be3
Remove pytest_helpers_namespace 2021-04-07 01:56:51 -07:00
James R. Barlow
72fa347c38
tests: skip metadata test for two pikepdf versions that warn incorrectly 2020-12-29 01:47:52 -08:00
James R. Barlow
babc76fa74 tests: assert that most patched functions are called
We were not actually checking if functions we patched we called when
expected.
2020-12-28 23:58:33 -08:00
James R. Barlow
3707af3b74
Change pdf.root to pdf.Root 2020-11-03 01:30:31 -08:00
James R. Barlow
aa0ec40102
Change license of all GPLv3 files to MPL-2.0
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
ebfe4f0d29
Fix issue #582 - PDF/A acquires title "Untitled" after conversion 2020-06-20 02:01:16 -07:00
James R. Barlow
7b9025f397
Convert generate_pdfa to plugin 2020-06-08 22:28:38 -07:00
James R. Barlow
b109445215
Move Ghostscript rasterize_pdf to plugin 2020-06-08 17:10:27 -07:00
James R. Barlow
1598f2f0e5 Abolish spoof_tesseract_noop 2020-06-01 03:07:53 -07:00
James R. Barlow
9af94ac9b7 pipeline: use OCR engine abstraction instead of Tesseract 2020-05-16 01:28:56 -07:00
James R. Barlow
977665d2b6
Delint some tests 2020-05-08 03:49:33 -07:00
James R. Barlow
c85278b31d
Delinting 2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext 2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API 2020-05-02 03:34:31 -07:00
James R. Barlow
b3b61c152c Handle malformed DocumentInfo (#497)
User submitted a PDF in which /Trailer /Info pointed to the XMP metadata
block instead of a DocumentInfo dictionary. Fix and add test.
2020-03-03 03:27:01 -08:00
James R. Barlow
4a27124eab Simplify metadata for invalid xml in output
Removes possibly non-free resource enron1.pdf.
2020-02-12 00:07:18 -08:00
James R. Barlow
c5edff2c2f Sort imports 2019-12-19 15:31:18 -08:00
James R. Barlow
a3726e4ce3 Fix test_metadata: use mmap in a Windows and POSIX compatible way 2019-12-04 17:13:52 -08:00
James R. Barlow
6fbeb6347d Merge api (without plugins) 2019-07-27 02:04:01 -07:00
James R. Barlow
12769b96e5 Drop support for omitting pdfminer.six 2019-07-10 13:37:01 -07:00
James R. Barlow
fb933edc0f Use newer pytest tmp_path API 2019-06-01 01:55:51 -07:00
James R. Barlow
ef1ef1cdf0 Fix test invalidated by Python 3.6 logging fixes 2019-05-17 15:20:07 -07:00
James R. Barlow
c904b430b6 Merge master into api branch; all test pass 2019-05-14 16:33:02 -07:00
James R. Barlow
482cb788ed Don't use MagicMock() as a dummy logger in pytest 2019-05-11 12:44:17 -07:00
mawi
c92ccc6134 fix: tests 2019-04-08 14:57:42 +02:00
mawi
783a128bd1 feat: move to sync (none ETL) implementation - remove ruffus 2019-04-04 21:02:38 +02:00
James R. Barlow
3f1d9ef99c Fix tests for move to Alpine dockerfile 2019-02-26 12:30:21 -08:00
James R. Barlow
f34b3015b2 Prevent Ghostscript from generating invalid XMP metadata
If DocumentInfo contains NULs Ghostscript will generate XMP with
NULs which is not allowed. Repair DocumentInfo before Ghostscript sees it.
2019-01-04 13:20:41 -08:00
James R. Barlow
7d330afd81 Delinting 2019-01-02 13:34:45 -08:00
James R. Barlow
c771938907 Convert to f-strings where it makes sense 2018-12-31 15:01:19 -08:00
James R. Barlow
cfc5cdf47d pdfa: remove a pile of deprecated code
It's now handled in pikepdf.
2018-12-31 00:05:13 -08:00
James R. Barlow
0880b16491 Sort imports with isort 2018-12-30 01:28:15 -08:00
James R. Barlow
06308a22ce Reformat with black 2018-12-30 01:27:49 -08:00
James R. Barlow
72b920eb16 Drop support for Python 3.5 2018-12-30 00:23:26 -08:00
James R. Barlow
b4a51907d6 Detect when metadata is dropped during PDF/A conversion 2018-12-30 00:13:25 -08:00
James R. Barlow
ed9bb985e2 Fix pikepdf 0.9.0 2018-12-14 23:21:13 -08:00
James R. Barlow
632dab2cc0 Replace Ghostscript DOCINFO and fix 9.25 metadata date regression
We no longer use Ghostscript to manage PDF metadata, instead
omitting the DOCINFO segment from the pdfmark file we generate.

Instead all of the relevant metadata code has been migrated to pikepdf,
and we use that API. This should be more consistent and fixes the
Ghostscript version-depedent quirks.

Also removes our python-xmp-toolkit dependency, except for
testing.
2018-12-13 18:13:30 -08:00
James R. Barlow
414407fbd6 Deprecate encode/decode_pdf_date and remap to pikepdf version 2018-12-12 22:01:21 -08:00
James R. Barlow
517b385fe5 Work around loss of Unicode DOCINFO in Ghostscript 9.24+
Ghostscript no longer supports UTF-16-BE-hex strings as a way of
supplying Unicode data in pdfmark so we have lost this functionality too:
http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=e997c6836d243ab37fe3a5f0d57974af95eb5eac

For users this means setting --title, --author, etc. will not work if gs
9.24 is installed, but if the file has existing metadata it might work.

For now we enforce police-state-strict ASCII, until there's time to
implement proper metadata editing. Relevant tests set to xfail.
2018-09-13 21:33:39 -07:00
James R. Barlow
795019b0c1 Work around invalid TOC entries
Kodak Capture Desktop and probably other software creates a
/Outlines entry with /First being set to an invalid indirect reference to
an object that hasn't been created. This is legal in the PDF spec but
problematic for qpdf. The objgen will be (max valid object ID + 1, 0).
Because we create new objects in _weave, some TOC entries will end
up assigned to new objects we create. Typically /ProcSet.

We solve the issue by refactoring page traversal and then doing it
twice, once to resolve all references (eliminating the null
reference problem) and a second pass to make our changes.
2018-09-11 14:44:16 -07:00
James R. Barlow
3aac3a98ca tests: Migrate metadata tests to pikepdf
For some reason PyPDF2 has begun to trigger internal errors in
pytest on macOS alone. Not sure why, but nothing is wrong that I can
see. Seemed like an opportune time to switch to pikepdf; found some
new issues in the process anyway.
2018-09-10 16:06:01 -07:00
James R. Barlow
1cc9d2d3d1 Fix path error on Py3.5 2018-07-08 01:01:06 -07:00
James R. Barlow
58642aa98b Fix issue #275: doesn't work when installed in non-Unicode path
Closes #275
2018-07-07 01:35:05 -07:00
James R. Barlow
45cb4525cf Remove other references to PyMuPDF 2018-06-13 01:02:53 -07:00
James R. Barlow
3b820ffa7b test_metadata: change from xfail to skipif without fitz 2018-05-17 00:14:57 -07:00