579 Commits

Author SHA1 Message Date
James R. Barlow
7d330afd81 Delinting 2019-01-02 13:34:45 -08:00
James R. Barlow
c771938907 Convert to f-strings where it makes sense 2018-12-31 15:01:19 -08:00
James R. Barlow
8c0009c5c8 Make pdfminer.six optional
Mainly since the current release of pdfminer.six lacks a sdist, blocking
homebrew packaging. Also in case other distros don't accept pdfminer.six.
2018-12-31 01:08:43 -08:00
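The "optional dependency" idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `module_available`, `HAVE_PDFMINER`, and `describe_text_backend` are hypothetical names introduced here, and the fallback behaviour is an assumption.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical flag; the real project may detect the dependency differently.
HAVE_PDFMINER = module_available("pdfminer")


def describe_text_backend() -> str:
    """Choose a code path depending on whether the optional package is installed."""
    if HAVE_PDFMINER:
        return "pdfminer.six available: detailed text analysis enabled"
    return "pdfminer.six missing: falling back to basic behaviour"
```

Checking once at import time keeps the rest of the code free of repeated try/except blocks around the optional import.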
James R. Barlow
cfc5cdf47d pdfa: remove a pile of deprecated code
It's now handled in pikepdf.
2018-12-31 00:05:13 -08:00
James R. Barlow
0880b16491 Sort imports with isort 2018-12-30 01:28:15 -08:00
James R. Barlow
06308a22ce Reformat with black 2018-12-30 01:27:49 -08:00
James R. Barlow
80bd7de580 Generate test cache 2018-12-30 01:02:37 -08:00
James R. Barlow
8b90c45437 Drop support for Tesseract 3 2018-12-30 00:47:12 -08:00
James R. Barlow
72b920eb16 Drop support for Python 3.5 2018-12-30 00:23:26 -08:00
James R. Barlow
b4a51907d6 Detect when metadata is dropped during PDF/A conversion 2018-12-30 00:13:25 -08:00
James R. Barlow
13d20bd993 pdfinfo: tolerate PDFs that overflow and underflow the graphics stack 2018-12-15 15:10:29 -08:00
James R. Barlow
ed9bb985e2 Fix pikepdf 0.9.0 2018-12-14 23:21:13 -08:00
James R. Barlow
632dab2cc0 Replace Ghostscript DOCINFO and fix 9.25 metadata date regression
We no longer use Ghostscript to manage PDF metadata, instead
omitting the DOCINFO segment from the pdfmark file we generate.

Instead all of the relevant metadata code has been migrated to pikepdf,
and we use that API. This should be more consistent and fixes the
Ghostscript version-dependent quirks.

Also removes our python-xmp-toolkit dependency, except for
testing.
2018-12-13 18:13:30 -08:00
James R. Barlow
414407fbd6 Deprecate encode/decode_pdf_date and remap to pikepdf version 2018-12-12 22:01:21 -08:00
James R. Barlow
9e6b54c7ed Add test case for Type3 fonts with no Unicode mapping 2018-11-15 21:54:26 -08:00
James R. Barlow
d3b334c10f Test case: TrueType font without Unicode mapping 2018-11-15 16:22:53 -08:00
James R. Barlow
cc7f2a3f02 Fix Python 3.5 pathlib regressions 2018-11-10 02:11:23 -08:00
James R. Barlow
a2170ef8d6 test: test version check code 2018-11-10 00:56:22 -08:00
James R. Barlow
5ed05e08b1 Fix "no languages" test and misuse of os.environ 2018-11-09 01:57:11 -08:00
James R. Barlow
501ce726e7 Fix two failing tests 2018-11-06 11:16:08 -08:00
James R. Barlow
2ac028c759 test: Add a basic redo OCR test 2018-11-04 15:54:41 -08:00
James R. Barlow
8b9ab25125 coverage: test compile leptonica 2018-11-02 01:55:25 -07:00
James R. Barlow
77e87abe8f coverage: ensure get_orientation is checked 2018-11-02 01:32:20 -07:00
James R. Barlow
3be02e1e8d coverage: improve leptonica; don't create objects with null pointers 2018-11-02 01:10:10 -07:00
James R. Barlow
5b8d197812 coverage: make it more likely timeout is tested 2018-11-02 00:41:15 -07:00
James R. Barlow
2cba62dc4f coverage: ensure rotation is actually tested 2018-11-02 00:40:56 -07:00
James R. Barlow
288e28328f coverage: add qpdf 2018-11-02 00:37:33 -07:00
James R. Barlow
8681693994 Set up code coverage (it works with multiprocessing now!) 2018-11-02 00:31:50 -07:00
James R. Barlow
de80fb6bc8 Fix some failing tests after --redo-ocr changes 2018-10-29 11:49:38 -07:00
James R. Barlow
f564aaf485 Remove only_ocr_text 2018-10-28 22:41:18 -07:00
James R. Barlow
58cc70725e Reorganize around getting bboxes for visible/invisible text 2018-10-26 01:07:02 -07:00
James R. Barlow
16af753206 Add functional "redo OCR" feature
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.

Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
b18e66e2ca pdfinfo: learn to detect vector graphic objects 2018-10-18 01:21:51 -07:00
James R. Barlow
1495b78330 Remove cruft to support leptonica < 1.72 in test suite 2018-10-11 01:37:32 -07:00
James R. Barlow
5c229d48d5 optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp
Closes #297
2018-10-04 11:53:11 -07:00
James R. Barlow
5b84549716 Change JBIG2 lossy mode to require --jbig2-lossy 2018-10-04 01:20:49 -07:00
James R. Barlow
a71e4488b3 test: fix pytest warning about direct use of a fixture 2018-10-03 15:04:46 -07:00
James R. Barlow
9fa471e053 Test: send stderr to stderr, why don't we? 2018-10-03 14:23:34 -07:00
James R. Barlow
31ef2fe907 test: this error message changed case in newer Tesseract 2018-10-03 13:58:20 -07:00
James R. Barlow
9a8ec4b210 optimize: only enable lossy JBIG2 for -O3 2018-10-03 00:38:58 -07:00
James R. Barlow
17a3fa671c ghostscript: API docs update 2018-09-14 23:51:52 -07:00
James R. Barlow
686207ab7f Check for and reject Adobe LiveCycle Designer PDFs
These are the ones that display a "Please wait..." message.

Closes #296
2018-09-13 21:50:51 -07:00
James R. Barlow
517b385fe5 Work around loss of Unicode DOCINFO in Ghostscript 9.24+
Ghostscript no longer supports UTF-16-BE-hex strings as a way of
supplying Unicode data in pdfmark so we have lost this functionality too:
http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=e997c6836d243ab37fe3a5f0d57974af95eb5eac

For users this means setting --title, --author, etc. will not work if gs
9.24 is installed, but if the file has existing metadata it might work.

For now we enforce police-state-strict ASCII, until there's time to
implement proper metadata editing. Relevant tests set to xfail.
2018-09-13 21:33:39 -07:00
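The strict-ASCII restriction described in the commit above could look something like this. It is a sketch under assumptions: `require_ascii` is a hypothetical helper, and the error type and message are illustrative rather than what the project actually emits.

```python
def require_ascii(field: str, value: str) -> str:
    """Return value unchanged if it is pure ASCII, otherwise raise ValueError."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError as exc:
        # Hypothetical message; field would be something like "--title".
        raise ValueError(
            f"{field} must contain only ASCII characters"
        ) from exc
    return value
```

Rejecting the value up front is simpler than trying to transliterate it, at the cost of refusing legitimate non-ASCII titles and author names.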
James R. Barlow
795019b0c1 Work around invalid TOC entries
Kodak Capture Desktop and probably other software creates a
/Outlines entry with /First being set to an invalid indirect reference to
an object that hasn't been created. This is legal in the PDF spec but
problematic for qpdf. The objgen will be (max valid object ID + 1, 0).
Because we create new objects in _weave, some TOC entries will end
up assigned to new objects we create. Typically /ProcSet.

We solve the issue by refactoring page traversal and then doing it
twice, once to resolve all references (eliminating the null
reference problem) and a second pass to make our changes.
2018-09-11 14:44:16 -07:00
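The hazard and the two-pass fix described in the commit above can be shown with a toy model. This is not the pikepdf or qpdf API: `ToyPdf`, `make_indirect`, and `resolve_references` are hypothetical names, and objects are modelled as a plain dict keyed by (object ID, generation).

```python
from typing import Any, Dict, Optional, Tuple

ObjGen = Tuple[int, int]


class ToyPdf:
    """Toy object table: maps (object ID, generation) to a value."""

    def __init__(self, objects: Dict[ObjGen, Any]) -> None:
        self.objects = dict(objects)

    def make_indirect(self, value: Any) -> ObjGen:
        """Store value under the next free object ID, as a PDF writer would."""
        next_id = max((objid for objid, _ in self.objects), default=0) + 1
        self.objects[(next_id, 0)] = value
        return (next_id, 0)


def resolve_references(
    pdf: ToyPdf, refs: Dict[str, ObjGen]
) -> Dict[str, Optional[ObjGen]]:
    """Pass 1: replace every reference that points at no object with None."""
    return {key: (ref if ref in pdf.objects else None) for key, ref in refs.items()}


# An outline whose /First points at (max valid object ID + 1, 0).
dangling = {"/First": (3, 0)}

# Single pass: the new object silently takes the ID the dangling reference names.
one_pass = ToyPdf({(1, 0): "catalog", (2, 0): "page"})
one_pass.make_indirect("/ProcSet")
captured = dangling["/First"] in one_pass.objects

# Two passes: resolve first, then modify, so the dangling reference stays null.
two_pass = ToyPdf({(1, 0): "catalog", (2, 0): "page"})
resolved = resolve_references(two_pass, dangling)
two_pass.make_indirect("/ProcSet")
```

In the single-pass case `captured` is true, which corresponds to a TOC entry ending up attached to a newly created object; in the two-pass case `resolved["/First"]` is already `None` before any new object exists.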
James R. Barlow
3aac3a98ca tests: Migrate metadata tests to pikepdf
For some reason PyPDF2 has begun to trigger internal errors in
pytest on macOS alone. Not sure why, but nothing is wrong that I can
see. Seemed like an opportune time to switch to pikepdf; found some
new issues in the process anyway.
2018-09-10 16:06:01 -07:00
James R. Barlow
7aa4e60af2 Explain pytest --runslow 2018-08-03 00:57:59 -07:00
James R. Barlow
55eb481f30 Add intensive (optional) rotation test 2018-08-03 00:42:59 -07:00
James R. Barlow
c171cb7286 Merge img2pdf 0.3.0 fix from v6.2.3 2018-08-01 15:17:33 -07:00
James R. Barlow
1d09061130 Revert previous commit and reject input images with alpha channel
Decided on this for simplicity of old release branch.

Modifies baiona.png by stripping alpha, adds baiona_alpha which
includes the alpha.
2018-07-31 23:45:28 -07:00
James R. Barlow
a2203b2447 Discard alpha channel when triaging images 2018-07-25 22:23:41 -04:00