James R. Barlow
7d330afd81
Delinting
2019-01-02 13:34:45 -08:00
James R. Barlow
c771938907
Convert to f-strings where it makes sense
2018-12-31 15:01:19 -08:00
James R. Barlow
8c0009c5c8
Make pdfminer.six optional
...
Mainly since the current release of pdfminer.six lacks a sdist, blocking
homebrew packaging. Also in case other distros don't accept pdfminer.six.
2018-12-31 01:08:43 -08:00
James R. Barlow
cfc5cdf47d
pdfa: remove a pile of deprecated code
...
It's now handled in pikepdf.
2018-12-31 00:05:13 -08:00
James R. Barlow
0880b16491
Sort imports with isort
2018-12-30 01:28:15 -08:00
James R. Barlow
06308a22ce
Reformat with black
2018-12-30 01:27:49 -08:00
James R. Barlow
80bd7de580
Generate test cache
2018-12-30 01:02:37 -08:00
James R. Barlow
8b90c45437
Drop support for Tesseract 3
2018-12-30 00:47:12 -08:00
James R. Barlow
72b920eb16
Drop support for Python 3.5
2018-12-30 00:23:26 -08:00
James R. Barlow
b4a51907d6
Detect when metadata is dropped during PDF/A conversion
2018-12-30 00:13:25 -08:00
James R. Barlow
13d20bd993
pdfinfo: tolerate PDFs that overflow and underflow the graphics stack
2018-12-15 15:10:29 -08:00
James R. Barlow
ed9bb985e2
Fix pikepdf 0.9.0
2018-12-14 23:21:13 -08:00
James R. Barlow
632dab2cc0
Replace Ghostscript DOCINFO and fix 9.25 metadata date regression
...
We no longer use Ghostscript to manage PDF metadata, instead
omitting the DOCINFO segment from the pdfmark file we generate.
All of the relevant metadata code has been migrated to pikepdf,
and we use that API. This should be more consistent and fixes the
Ghostscript version-dependent quirks.
Also removes our python-xmp-toolkit dependency, except for
testing.
2018-12-13 18:13:30 -08:00
James R. Barlow
414407fbd6
Deprecate encode/decode_pdf_date and remap to pikepdf version
2018-12-12 22:01:21 -08:00
James R. Barlow
9e6b54c7ed
Add test case for Type3 fonts with no Unicode mapping
2018-11-15 21:54:26 -08:00
James R. Barlow
d3b334c10f
Test case: TrueType font without Unicode mapping
2018-11-15 16:22:53 -08:00
James R. Barlow
cc7f2a3f02
Fix Python 3.5 pathlib regressions
2018-11-10 02:11:23 -08:00
James R. Barlow
a2170ef8d6
test: test version check code
2018-11-10 00:56:22 -08:00
James R. Barlow
5ed05e08b1
Fix "no languages" test and misuse of os.environ
2018-11-09 01:57:11 -08:00
James R. Barlow
501ce726e7
Fix two failing tests
2018-11-06 11:16:08 -08:00
James R. Barlow
2ac028c759
test: Add a basic redo OCR test
2018-11-04 15:54:41 -08:00
James R. Barlow
8b9ab25125
coverage: test compile leptonica
2018-11-02 01:55:25 -07:00
James R. Barlow
77e87abe8f
coverage: ensure get_orientation is checked
2018-11-02 01:32:20 -07:00
James R. Barlow
3be02e1e8d
coverage: improve leptonica; don't create objects with null pointers
2018-11-02 01:10:10 -07:00
James R. Barlow
5b8d197812
coverage: make it more likely timeout is tested
2018-11-02 00:41:15 -07:00
James R. Barlow
2cba62dc4f
coverage: ensure rotation is actually tested
2018-11-02 00:40:56 -07:00
James R. Barlow
288e28328f
coverage: add qpdf
2018-11-02 00:37:33 -07:00
James R. Barlow
8681693994
Set up code coverage (it works with multiprocessing now!)
2018-11-02 00:31:50 -07:00
James R. Barlow
de80fb6bc8
Fix some failing tests after --redo-ocr changes
2018-10-29 11:49:38 -07:00
James R. Barlow
f564aaf485
Remove only_ocr_text
2018-10-28 22:41:18 -07:00
James R. Barlow
58cc70725e
Reorganize around getting bboxes for visible/invisible text
2018-10-26 01:07:02 -07:00
James R. Barlow
16af753206
Add functional "redo OCR" feature
...
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.
Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
b18e66e2ca
pdfinfo: learn to detect vector graphic objects
2018-10-18 01:21:51 -07:00
James R. Barlow
1495b78330
Remove cruft to support leptonica < 1.72 in test suite
2018-10-11 01:37:32 -07:00
James R. Barlow
5c229d48d5
optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp
...
Closes #297
2018-10-04 11:53:11 -07:00
James R. Barlow
5b84549716
Change JBIG2 lossy mode to require --jbig2-lossy
2018-10-04 01:20:49 -07:00
James R. Barlow
a71e4488b3
test: fix pytest warning about direct use of a fixture
2018-10-03 15:04:46 -07:00
James R. Barlow
9fa471e053
Test: send stderr to stderr, why don't we?
2018-10-03 14:23:34 -07:00
James R. Barlow
31ef2fe907
test: this error message changed case in newer Tesseract
2018-10-03 13:58:20 -07:00
James R. Barlow
9a8ec4b210
optimize: only enable lossy JBIG2 for -O3
2018-10-03 00:38:58 -07:00
James R. Barlow
17a3fa671c
ghostscript: API docs update
2018-09-14 23:51:52 -07:00
James R. Barlow
686207ab7f
Check for and reject Adobe LiveCycle Designer PDFs
...
These are the ones that display a "Please wait..." message.
Closes #296
2018-09-13 21:50:51 -07:00
James R. Barlow
517b385fe5
Work around loss of Unicode DOCINFO in Ghostscript 9.24+
...
Ghostscript no longer supports UTF-16-BE-hex strings as a way of
supplying Unicode data in pdfmark so we have lost this functionality too:
http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=e997c6836d243ab37fe3a5f0d57974af95eb5eac
For users this means setting --title, --author, etc. will not work if gs
9.24 is installed, but if the file has existing metadata it might work.
For now we enforce police-state-strict ASCII, until there's time to
implement proper metadata editing. Relevant tests set to xfail.
2018-09-13 21:33:39 -07:00
James R. Barlow
795019b0c1
Work around invalid TOC entries
...
Kodak Capture Desktop and probably other software creates a
/Outlines entry with /First being set to an invalid indirect reference to
an object that hasn't been created. This is legal in the PDF spec but
problematic for qpdf. The objgen will be (max valid object ID + 1, 0).
Because we create new objects in _weave, some TOC entries will end
up assigned to new objects we create. Typically /ProcSet.
We solve the issue by refactoring page traversal and then doing it
twice, once to resolve all references (eliminating the null
reference problem) and a second pass to make our changes.
2018-09-11 14:44:16 -07:00
James R. Barlow
3aac3a98ca
tests: Migrate metadata tests to pikepdf
...
For some reason PyPDF2 has begun to trigger internal errors in
pytest on macOS alone. Not sure why, but nothing is wrong that I can
see. Seemed like an opportune time to switch to pikepdf; found some
new issues in the process anyway.
2018-09-10 16:06:01 -07:00
James R. Barlow
7aa4e60af2
Explain pytest --runslow
2018-08-03 00:57:59 -07:00
James R. Barlow
55eb481f30
Add intensive (optional) rotation test
2018-08-03 00:42:59 -07:00
James R. Barlow
c171cb7286
Merge img2pdf 0.3.0 fix from v6.2.3
2018-08-01 15:17:33 -07:00
James R. Barlow
1d09061130
Revert previous commit and reject input images with alpha channel
...
Decided on this for simplicity of old release branch.
Modifies baiona.png by stripping alpha, adds baiona_alpha which
includes the alpha.
2018-07-31 23:45:28 -07:00
James R. Barlow
a2203b2447
Discard alpha channel when triaging images
2018-07-25 22:23:41 -04:00