2895 Commits

Author SHA1 Message Date
James R. Barlow
25a1dde57c Fix recent versions of tesseract not registering as textonly_pdf
This change happened sometime after the 4.0.0-beta1 release in
Ubuntu 18.04
2018-06-23 02:59:22 -07:00
James R. Barlow
bf96171b65 Ignore whether or not textonly_pdf was used in cache
The difference doesn't matter in 7.0.0 anymore.
2018-06-23 02:58:26 -07:00
James R. Barlow
b7ff821fa3 Fix recent versions of tesseract not registering as textonly_pdf
This change happened sometime after the 4.0.0-beta1 release in
Ubuntu 18.04
2018-06-23 02:55:58 -07:00
James R. Barlow
b81daf71d1 Regenerate test cache 2018-06-23 02:02:58 -07:00
James R. Barlow
faad1fc58a Reactivate two tests that weren't using their fixtures properly 2018-06-23 01:54:09 -07:00
James R. Barlow
6f48181a56 Disable a pylint 2018-06-23 01:53:04 -07:00
James R. Barlow
f1305e5a37 pdfa: fix function using closure when it shouldn't 2018-06-23 01:52:36 -07:00
James R. Barlow
f0e0f92776 leptonica: fix variables defined on class outside __init__ 2018-06-23 01:51:55 -07:00
James R. Barlow
807c8b0726 Trailing whitespace 2018-06-23 01:51:19 -07:00
James R. Barlow
6333ec928c Cleanup some cases where log wasn't lazy and should be 2018-06-23 01:50:27 -07:00
James R. Barlow
cd220d9ed9 pipeline: search_window variable not actually used 2018-06-23 01:48:57 -07:00
James R. Barlow
76532649b8 tesseract.get_orientation: removed unused language parameter 2018-06-23 01:48:24 -07:00
James R. Barlow
b0dbaeafc5 Cleanup unused imports 2018-06-23 01:47:53 -07:00
James R. Barlow
2530d1791b Fix several pylint errors and warnings 2018-06-23 00:54:22 -07:00
James R. Barlow
94150f414a Remove qpdf.merge
We no longer need to merge pages this way. Much of the functionality
was there to implement page splitting without hitting ulimit which
will be fixed in qpdf > 8.0.2. The tests were expensive to run.

Also remove pytest-timeout since it breaks the Linux build.
2018-06-23 00:45:03 -07:00
James R. Barlow
54e74f84cc Remove special handling of TypeError from ruffus
split_pages would still run if repair_pdf failed, for some reason.
Since we are no longer splitting pages this is vestigial.
2018-06-23 00:41:20 -07:00
James R. Barlow
76e7e8dbbb Replace several uses of str(path) with fspath(path)
Helps make it more explicit. Did not do this to tests because use of paths
is more involved there.
2018-06-22 21:00:47 -07:00
James R. Barlow
324598e992 Remove helpers.universal_open()
This helper function only had a single usage. It was always an awkward
way to support Python 3.5 that I'd forget to use.
2018-06-22 17:56:20 -07:00
James R. Barlow
9e765ddf46 Rename _optimize to optimize.py 2018-06-22 17:51:57 -07:00
James R. Barlow
6ac9e92f17 Fix PEP8 docstring convention misuse in a few places 2018-06-22 17:51:25 -07:00
James R. Barlow
faaa4a1def Ghostscript, PDF/A: support pathlib 2018-06-22 17:45:10 -07:00
James R. Barlow
0aa51f0f3a Remove fitz from Travis 2018-06-18 15:38:41 -07:00
James R. Barlow
73431d9761 Remove obsolete _naive_find_text 2018-06-13 14:00:50 -07:00
James R. Barlow
45cb4525cf Remove other references to PyMuPDF 2018-06-13 01:02:53 -07:00
James R. Barlow
8c84c515b6 Use Ghostscript for text region detection
Ghostscript txtwrite seems to be quite effective at the task.

Eliminates dependency on fitz
2018-06-13 00:58:09 -07:00
James R. Barlow
1dfbbdebf4 Adjust for pikepdf API change v7.0.0rc2 2018-06-08 22:47:56 -07:00
James R. Barlow
740918daee Create debug envvar to override Creator or Producer
Note that Ghostscript always overrides Producer
2018-06-06 23:17:28 -07:00
jbarlow83
1d10eac764 Add wiki link to issue template
[ci skip]
2018-06-06 12:59:59 -07:00
jbarlow83
3f868118cd Remove gpg
[ci skip]
2018-06-06 12:58:02 -07:00
James R. Barlow
04d79b15b4 optimize: fix error in Py3.5 v7.0.0rc1 2018-06-06 12:25:32 -07:00
James R. Barlow
a13c398c06 Suppress some spurious tesseract errors 2018-06-05 23:26:28 -07:00
James R. Barlow
e3b3f716ee optimize: use tempdir for cmdline invocation 2018-06-05 21:20:54 -07:00
James R. Barlow
cf43c06f46 Use python-xmp-toolkit for xmp check
Eliminates PyPDF2 and defusedxml as dependencies.
2018-05-29 22:00:52 -07:00
James R. Barlow
74a5a18607 Tweak release notes v7.0.0b4 2018-05-28 14:52:06 -07:00
James R. Barlow
44241c6dd5 Travis: remove deploy to testpypi since it's broken 2018-05-27 01:49:18 -07:00
James R. Barlow
8fff496ffd Fix Py3.5 not understanding os.path.exists(Path(...)) v7.0.0b3 2018-05-26 22:55:22 -07:00
James R. Barlow
edf75c519c Update v7 release notes 2018-05-26 02:08:49 -07:00
James R. Barlow
9608b22d34 Remove all uses of PyPDF2 except PDF/A check
Leave PDF/A check alone for now, since pikepdf has no equivalent.
2018-05-26 02:07:18 -07:00
James R. Barlow
8ba4968c48 pdfinfo: more robustness 2018-05-26 01:54:25 -07:00
James R. Barlow
ffdd78f1a5 pdfinfo: Fix text_operators type not changed in related commit 2018-05-25 02:10:39 -07:00
James R. Barlow
ad9f8ca78e pdfinfo: reinstate stack normalization for q/Q 2018-05-25 01:28:26 -07:00
James R. Barlow
78a686ecb4 Consider qpdf behavior on algo4 a pass
qpdf opens files with null user password, so do the same.
2018-05-25 00:33:31 -07:00
James R. Barlow
59e786eb3c Remove old code to deal with single page only things 2018-05-25 00:32:55 -07:00
James R. Barlow
6d0461435f Use OperandGrouper whitelist 2018-05-24 22:52:33 -07:00
James R. Barlow
0a04a60f69 Document need for pdfinfo to be pickleable 2018-05-24 22:24:13 -07:00
James R. Barlow
68d8642988 Found out this test was extremely slow - no reason to actually use a large file 2018-05-24 22:22:51 -07:00
James R. Barlow
16f70ff054 Main changeset for pikepdf-based refactor pdfinfo 2018-05-24 22:22:01 -07:00
James R. Barlow
c00aeafff0 Add scratch file 2018-05-24 22:20:15 -07:00
James R. Barlow
83f35e00f3 Start removing PyPDF2 2018-05-21 01:28:21 -07:00
James R. Barlow
786a2ad65a Make optimize test do a little more 2018-05-18 17:50:39 -07:00