33 Commits

Author SHA1 Message Date
James R. Barlow
7ba04267b1 Remove shims to support old versions of pikepdf < 4 2021-11-13 00:43:20 -08:00
James R. Barlow
757b72b0af Revert "Remove apparently unused portion of a test"
This reverts commit d89a633ba73af4a6bdacda6b9a4c0638b39167bd.
2021-04-16 00:21:11 -07:00
James R. Barlow
d673126994
Fix ZeroDivisionError on files containing images drawn at scale 0
Fixes #761
2021-04-15 23:26:14 -07:00
James R. Barlow
d89a633ba7
Remove apparently unused portion of a test 2021-04-15 23:25:18 -07:00
James R. Barlow
f687180ecc
tests: tidy pdfinfo 2021-01-08 15:04:52 -08:00
James R. Barlow
0b3a526049
Partial fix crash on 'userunit' None (#700)
Our method of getting data from pdfminer would silently consume a StopIteration
if pdfminer returned no processed pages, leading to an odd error message.

We now handle an error from pdfminer properly and return a more
descriptive error of our own.

It would be possible for ocrmypdf to repair the file before sending it to
pdfminer, but this seems to be rare enough that we won't do that yet.
2021-01-01 01:11:32 -08:00
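
The "silently consumed StopIteration" mentioned in commit 0b3a526049 is a general Python pitfall and can be sketched as follows. This is only an illustration under assumptions, not the code from the commit: `NoPagesError`, `first_page_unsafe`, and `first_page_checked` are hypothetical names, and plain lists stand in for whatever pdfminer yields.

```python
class NoPagesError(Exception):
    """Hypothetical error type: the parser produced no pages."""


def first_page_unsafe(pages):
    # next() on an empty iterator raises StopIteration.
    return next(iter(pages))


def first_page_checked(pages):
    # Use a sentinel default so an empty iterator is detected explicitly
    # and reported with a descriptive error instead of StopIteration.
    sentinel = object()
    page = next(iter(pages), sentinel)
    if page is sentinel:
        raise NoPagesError("parser returned no processed pages")
    return page


# Stand-in data: the second "document" yields no pages.
documents = [["p1"], [], ["p3"]]

# map() propagates the StopIteration raised inside first_page_unsafe, and
# list() interprets it as the end of iteration, so the failure is swallowed
# and the result is quietly truncated.
truncated = list(map(first_page_unsafe, documents))

try:
    list(map(first_page_checked, documents))
except NoPagesError as exc:
    message = str(exc)
```

With the checked variant the empty document surfaces as an explicit, descriptive exception rather than as missing results further down the pipeline.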
James R. Barlow
aa0ec40102
Change license of all GPLv3 files to MPL-2.0
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
872bafad4b Reinstate quick test for text/no text
Partial revert of commit 991db17
2020-06-10 12:00:52 -07:00
James R. Barlow
64891c2fc3
Pre-release delinting 2020-06-09 15:27:14 -07:00
James R. Barlow
0f942fb714 Rename ocrmypdf.exec -> ocrmypdf._exec 2020-06-09 14:59:09 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution 2020-04-24 04:12:05 -07:00
James R. Barlow
57771f06a3
Refactor xy-pair for resolution to tuple 2020-04-16 15:38:33 -07:00
James R. Barlow
23bc3d3a29
tests: workaround for Ghostscript 9.52 txtwrite problem 2020-03-29 22:45:16 -07:00
James R. Barlow
c5edff2c2f Sort imports 2019-12-19 15:31:18 -08:00
James R. Barlow
4ab0a8ff35 Fix test_single_page_inline_image - remove temp file 2019-12-04 17:13:51 -08:00
James R. Barlow
6fbeb6347d Merge api (without plugins) 2019-07-27 02:04:01 -07:00
James R. Barlow
12769b96e5 Drop support for omitting pdfminer.six 2019-07-10 13:37:01 -07:00
James R. Barlow
c357d4146e Restructure ocrmypdf.pdfinfo 2019-06-20 03:10:41 -07:00
James R. Barlow
7d330afd81 Delinting 2019-01-02 13:34:45 -08:00
James R. Barlow
c771938907 Convert to f-strings where it makes sense 2018-12-31 15:01:19 -08:00
James R. Barlow
8c0009c5c8 Make pdfminer.six optional
Mainly since the current release of pdfminer.six lacks a sdist, blocking
homebrew packaging. Also in case other distros don't accept pdfminer.six.
2018-12-31 01:08:43 -08:00
James R. Barlow
0880b16491 Sort imports with isort 2018-12-30 01:28:15 -08:00
James R. Barlow
06308a22ce Reformat with black 2018-12-30 01:27:49 -08:00
James R. Barlow
13d20bd993 pdfinfo: tolerate PDFs that overflow and underflow the graphics stack 2018-12-15 15:10:29 -08:00
James R. Barlow
9e6b54c7ed Add test case for Type3 fonts with no Unicode mapping 2018-11-15 21:54:26 -08:00
James R. Barlow
d3b334c10f Test case: TrueType font without Unicode mapping 2018-11-15 16:22:53 -08:00
James R. Barlow
501ce726e7 Fix two failing tests 2018-11-06 11:16:08 -08:00
James R. Barlow
f564aaf485 Remove only_ocr_text 2018-10-28 22:41:18 -07:00
James R. Barlow
58cc70725e Reorganize around getting bboxes for visible/invisible text 2018-10-26 01:07:02 -07:00
James R. Barlow
16af753206 Add functional "redo OCR" feature
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.

Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
b18e66e2ca pdfinfo: learn to detect vector graphic objects 2018-10-18 01:21:51 -07:00
James R. Barlow
216d60ea2c pdfinfo: improve the regex 2018-07-04 00:59:32 -07:00