James R. Barlow
41eb54cc0a
Standardize tesseract.generate_hocr and _pdf parameters
2020-05-14 03:23:25 -07:00
James R. Barlow
12a2f78c4d
Fix validation of languages not using tesseract_env
...
And some related issues.
2020-05-14 03:19:22 -07:00
James R. Barlow
d372f1f7fa
Remove "skip page" from tesseract interface
...
Breaks tests/test_main.py::test_tesseract_missing_tessdata because
conftest.py does not update options.tesseract_env before testing options
for some reason, and tesseract.has_textonly_pdf raises an exception
instead of returning False as the test assumes.
2020-05-12 04:09:42 -07:00
James R. Barlow
2541f6cf89
Fix missing jbig2enc reported as error with -O3 instead of warning
...
Fixes #558
2020-05-12 01:05:57 -07:00
James R. Barlow
977665d2b6
Delint some tests
2020-05-08 03:49:33 -07:00
James R. Barlow
fd7497f00d
Remove old function tesseract.v4()
2020-05-08 03:44:39 -07:00
James R. Barlow
1b086f60a9
tesseract.py: api cleanup
2020-05-06 12:37:44 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path
2020-05-06 02:53:47 -07:00
James R. Barlow
c85278b31d
Delinting
2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext
2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API
2020-05-02 03:34:31 -07:00
James R. Barlow
016dfd420c
Add warning if problematic --tesseract-pagesegmode is selected
...
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
b840b16c82
Remove tesseract_badutf8.py
...
Should have been removed in 9db01c7
2020-04-28 02:35:23 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf
2020-04-26 05:33:26 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
...
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
7513f5425c
Fix some broken tests
2020-04-26 03:49:20 -07:00
James R. Barlow
43d650e78c
Fix issue where only first PNG-style image would be optimized
2020-04-25 03:50:11 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution
2020-04-24 04:12:05 -07:00
James R. Barlow
57771f06a3
Refactor xy-pair for resolution to tuple
2020-04-16 15:38:33 -07:00
James R. Barlow
58abb5785c
pytest picky about list vs tuple
2020-04-15 03:16:51 -07:00
James R. Barlow
31b5f63f85
hocrtransform: cleanup/PEP8
...
Some API breaking changes.
2020-04-15 02:48:56 -07:00
James R. Barlow
957fb1494e
pytest picky about list vs tuple
2020-04-15 02:26:20 -07:00
James R. Barlow
2155bcacb4
Loosen test language requirements - eng/deu
2020-04-15 00:30:38 -07:00
James R. Barlow
346da95899
Suppress loglevel since we have color now
2020-04-15 00:09:36 -07:00
James R. Barlow
d146d2b65c
The Great Logging Refactor
...
Remove all instances of logger object being passed as parameters.
This was a holdover from ruffus, and complicated a lot of simple things.
2020-04-14 23:59:33 -07:00
James R. Barlow
4a640b8dcd
Fix language argument not working as list
...
Fixes #523
2020-04-14 23:18:52 -07:00
James R. Barlow
9471bc8921
Fix versions with leading v, e.g. v5.0
2020-04-10 13:42:33 -07:00
James R. Barlow
d13d70fd56
Fix version checker failing for qpdf 10.0.0
...
Fixes #527
2020-04-10 13:00:19 -07:00
James R. Barlow
23bc3d3a29
tests: workaround for Ghostscript 9.52 txtwrite problem
2020-03-29 22:45:16 -07:00
James R. Barlow
8307832ce9
tests: add force OCR to a file with text that Ghostscript doesn't see
...
For gs 9.52 support.
Also refactor use of pikepdf.open() to use with blocks.
2020-03-29 22:44:27 -07:00
James R. Barlow
378e4dae3b
Expand documentation for subprocess.run() from test
2020-03-04 13:37:44 -08:00
James R. Barlow
b3b61c152c
Handle malformed DocumentInfo (#497)
...
User submitted a PDF in which /Trailer /Info pointed to the XMP metadata
block instead of a DocumentInfo dictionary. Fix and add test.
2020-03-03 03:27:01 -08:00
James R. Barlow
4a27124eab
Simplify metadata for invalid xml in output
...
Removes possibly non-free resource enron1.pdf.
2020-02-12 00:07:18 -08:00
James R. Barlow
ce97af5a79
Add OCR quality measurement API
2020-01-17 03:10:27 -08:00
James R. Barlow
61a2674317
Skip test that needs chmod when on Windows
2020-01-06 02:36:04 -08:00
James R. Barlow
9ad8cbf1f6
Fix assert that depends on POSIX-y file handling
2020-01-06 02:02:05 -08:00
James R. Barlow
9c5f0d0ec6
Eliminate last use of PyPDF2 from test suite
2020-01-04 16:32:01 -08:00
James R. Barlow
32041c43e1
tests: improve tesseract coverage
2020-01-04 02:35:14 -08:00
James R. Barlow
1037d73efb
tests: use smaller files for ghostscript
2019-12-31 17:20:28 -08:00
James R. Barlow
aeb7b142a9
tests: skip tests not compatible with coverage
...
For reasons not entirely clear, stdout will get some data injected when
pytest-cov is running. Our tests that check for clean stdout need to ignore
this. We check for an environment variable that is defined only when
coverage is running.
2019-12-31 17:10:51 -08:00
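The coverage check described in aeb7b142a9 could be sketched roughly as follows. This is an illustration only, not the code from the commit: the helper name `running_under_coverage` is hypothetical, and the variable name `COV_CORE_SOURCE` is an assumption about what pytest-cov exports to subprocesses; the commit message does not say which variable is checked.

```python
import os


def running_under_coverage(environ=None, marker="COV_CORE_SOURCE"):
    """Return True if the marker variable is present in the environment.

    `marker` is an assumed name; substitute whichever variable the
    coverage tool actually defines while it is running.
    """
    env = os.environ if environ is None else environ
    return marker in env
```

A test that asserts on clean stdout could call this helper and skip itself (or relax its assertion) when it returns True.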
James R. Barlow
422ea9777e
Remove session scope from fixtures
...
pytest seems to prepare os.environ in complex ways, so we want to ensure
these fixtures are not reused.
2019-12-31 17:09:23 -08:00
James R. Barlow
2f1c743227
Rewrite main pool loop
...
pytest-cov documentation recommends using explicit management of
multiprocessing.Pool rather than the context manager. This is supposed to
work better for collecting coverage data, particularly on Windows.
2019-12-31 16:23:41 -08:00
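The explicit pool management mentioned in 2f1c743227 might look something like the sketch below. It is not the rewritten loop from the commit: the function name `map_with_explicit_pool` and the `pool_factory` parameter are hypothetical, introduced only so the close/join sequence is visible and can be exercised without starting worker processes.

```python
import multiprocessing


def map_with_explicit_pool(func, items, processes=2,
                           pool_factory=multiprocessing.Pool):
    """Map func over items, closing and joining the pool explicitly.

    The context manager form (`with Pool() as pool:`) calls terminate()
    on exit; here the pool is closed and joined so that workers can
    finish and exit normally.
    """
    pool = pool_factory(processes)
    try:
        results = pool.map(func, items)
    except BaseException:
        # On failure, stop the workers immediately before re-raising.
        pool.terminate()
        raise
    else:
        # On success, accept no more work and let the workers drain.
        pool.close()
    finally:
        # Wait for worker processes to exit in either case.
        pool.join()
    return results
```

With the default `pool_factory`, callers should invoke this from under an `if __name__ == "__main__":` guard and pass a picklable, module-level function, as usual for multiprocessing.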
James R. Barlow
96ee21aee9
Try to set up subprocess coverage better
2019-12-31 15:39:45 -08:00
James R. Barlow
4b759af6ff
tests: fix problems with ghostscript spoofers
2019-12-31 15:33:03 -08:00
James R. Barlow
25d2b0cda4
test: environment warnings/cleanup
2019-12-30 22:38:50 -08:00
James R. Barlow
c36e9950ae
tests: test TqdmConsole
2019-12-30 17:51:09 -08:00
James R. Barlow
0c0d53b10f
tests: AcroForm test case did not work correctly; fixed
2019-12-30 17:50:32 -08:00
James R. Barlow
63de7e1677
Improve error message for unreadable input files
2019-12-30 16:14:52 -08:00
James R. Barlow
b0e92760a2
tests: add coverage for helpers
2019-12-30 15:52:10 -08:00
James R. Barlow
c5edff2c2f
Sort imports
2019-12-19 15:31:18 -08:00