579 Commits

Author SHA1 Message Date
James R. Barlow
41eb54cc0a
Standardize tesseract.generate_hocr and _pdf parameters 2020-05-14 03:23:25 -07:00
James R. Barlow
12a2f78c4d
Fix validation of languages not using tesseract_env
And some related issues.
2020-05-14 03:19:22 -07:00
James R. Barlow
d372f1f7fa Remove "skip page" from tesseract interface
Breaks tests/test_main.py::test_tesseract_missing_tessdata because
conftest.py does not update options.tesseract_env before testing options
for some reason, and tesseract.has_textonly_pdf raises an exception
instead of returning False as the test assumes.
2020-05-12 04:09:42 -07:00
James R. Barlow
2541f6cf89
Fix missing jbig2enc reported as error with -O3 instead of warning
Fixes #558
2020-05-12 01:05:57 -07:00
James R. Barlow
977665d2b6
Delint some tests 2020-05-08 03:49:33 -07:00
James R. Barlow
fd7497f00d
Remove old function tesseract.v4() 2020-05-08 03:44:39 -07:00
James R. Barlow
1b086f60a9
tesseract.py: api cleanup 2020-05-06 12:37:44 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path 2020-05-06 02:53:47 -07:00
James R. Barlow
c85278b31d
Delinting 2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext 2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API 2020-05-02 03:34:31 -07:00
James R. Barlow
016dfd420c Add warning if problematic --tesseract-pagesegmode is selected
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
b840b16c82
Remove tesseract_badutf8.py
Should have been removed in 9db01c7
2020-04-28 02:35:23 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf 2020-04-26 05:33:26 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
7513f5425c Fix some broken tests 2020-04-26 03:49:20 -07:00
James R. Barlow
43d650e78c
Fix issue where only first PNG-style image would be optimized 2020-04-25 03:50:11 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution 2020-04-24 04:12:05 -07:00
James R. Barlow
57771f06a3
Refactor xy-pair for resolution to tuple 2020-04-16 15:38:33 -07:00
James R. Barlow
58abb5785c
pytest picky about list vs tuple 2020-04-15 03:16:51 -07:00
James R. Barlow
31b5f63f85 hocrtransform: cleanup/PEP8
Some API breaking changes.
2020-04-15 02:48:56 -07:00
James R. Barlow
957fb1494e
pytest picky about list vs tuple 2020-04-15 02:26:20 -07:00
James R. Barlow
2155bcacb4
Loosen test language requirements - eng/deu 2020-04-15 00:30:38 -07:00
James R. Barlow
346da95899 Suppress loglevel since we have color now 2020-04-15 00:09:36 -07:00
James R. Barlow
d146d2b65c The Great Logging Refactor
Remove all instances of logger object being passed as parameters.
This was a holdover from ruffus, and complicated a lot of simple things.
2020-04-14 23:59:33 -07:00
James R. Barlow
4a640b8dcd
Fix language argument not working as list
Fixes #523
2020-04-14 23:18:52 -07:00
James R. Barlow
9471bc8921
Fix versions with leading v, e.g. v5.0 2020-04-10 13:42:33 -07:00
James R. Barlow
d13d70fd56 Fix version checker failing for qpdf 10.0.0
Fixes #527
2020-04-10 13:00:19 -07:00
James R. Barlow
23bc3d3a29
tests: workaround for Ghostscript 9.52 txtwrite problem 2020-03-29 22:45:16 -07:00
James R. Barlow
8307832ce9 tests: add force OCR to a file with text that Ghostscript doesn't see
For gs 9.52 support.

Also refactor use of pikepdf.open() to use with blocks.
2020-03-29 22:44:27 -07:00
James R. Barlow
378e4dae3b
Expand documentation for subprocess.run() from test 2020-03-04 13:37:44 -08:00
James R. Barlow
b3b61c152c Handle malformed DocumentInfo (#497)
User submitted a PDF in which /Trailer /Info pointed to the XMP metadata
block instead of a DocumentInfo dictionary. Fix and add test.
2020-03-03 03:27:01 -08:00
James R. Barlow
4a27124eab Simplify metadata for invalid xml in output
Removes possibly non-free resource enron1.pdf.
2020-02-12 00:07:18 -08:00
James R. Barlow
ce97af5a79 Add OCR quality measurement API 2020-01-17 03:10:27 -08:00
James R. Barlow
61a2674317 Skip test that needs chmod when on Windows 2020-01-06 02:36:04 -08:00
James R. Barlow
9ad8cbf1f6 Fix assert that depends on POSIX-y file handling 2020-01-06 02:02:05 -08:00
James R. Barlow
9c5f0d0ec6 Eliminate last use of PyPDF2 from test suite 2020-01-04 16:32:01 -08:00
James R. Barlow
32041c43e1 tests: improve tesseract coverage 2020-01-04 02:35:14 -08:00
James R. Barlow
1037d73efb tests: use smaller files for ghostscript 2019-12-31 17:20:28 -08:00
James R. Barlow
aeb7b142a9 tests: skip tests not compatible with coverage
For reasons not entirely clear, stdout will get some data injected when
pytest-cov is running. Our tests that check for clean stdout need to
ignore this.

We check for an environment variable that is defined only when coverage is
running.
2019-12-31 17:10:51 -08:00
James R. Barlow
422ea9777e Remove session scope from fixtures
pytest seems to prepare os.environ in complex ways, so we want to ensure
these fixtures are not reused.
2019-12-31 17:09:23 -08:00
James R. Barlow
2f1c743227 Rewrite main pool loop
pytest-cov documentation recommends using explicit management of
multiprocessing.Pool rather than the context manager. This is supposed
to work better for collecting coverage data, particularly on Windows.
2019-12-31 16:23:41 -08:00
James R. Barlow
96ee21aee9 Try to set up subprocess coverage better 2019-12-31 15:39:45 -08:00
James R. Barlow
4b759af6ff tests: fix problems with ghostscript spoofers 2019-12-31 15:33:03 -08:00
James R. Barlow
25d2b0cda4 test: environment warnings/cleanup 2019-12-30 22:38:50 -08:00
James R. Barlow
c36e9950ae tests: test TqdmConsole 2019-12-30 17:51:09 -08:00
James R. Barlow
0c0d53b10f tests: AcroForm test case did not work correctly; fixed 2019-12-30 17:50:32 -08:00
James R. Barlow
63de7e1677 Improve error message for unreadable input files 2019-12-30 16:14:52 -08:00
James R. Barlow
b0e92760a2 tests: add coverage for helpers 2019-12-30 15:52:10 -08:00
James R. Barlow
c5edff2c2f Sort imports 2019-12-19 15:31:18 -08:00