James R. Barlow
a9a473f2e5
Convert all tesseract cache usages to plugin
2020-06-05 17:55:18 -07:00
James R. Barlow
6268e2faff
Begin replacing tests/spoof/tesseract_cache with plugin
2020-06-05 17:27:10 -07:00
James R. Barlow
ec3f506500
Convert tesseract_badutf8 to plugin
2020-06-05 16:38:19 -07:00
James R. Barlow
c6b2fa8851
Remove unpaper spoof; no plugin needed
2020-06-02 02:42:14 -07:00
James R. Barlow
1b92f447c3
Convert tesseract_crash to plugin
2020-06-02 02:36:41 -07:00
James R. Barlow
82e7eb91d2
Tidy tesseract_noop
2020-06-02 01:50:02 -07:00
James R. Barlow
4f4ad0fb76
Convert tesseract_big_image_error to plugin
2020-06-02 01:49:47 -07:00
James R. Barlow
daca919775
Mark pdfminer.six 20200517 as supported
2020-06-02 00:11:02 -07:00
James R. Barlow
1598f2f0e5
Abolish spoof_tesseract_noop
2020-06-01 03:07:53 -07:00
James R. Barlow
2b23f7ec73
tesseract_noop: begin implementing with plugin
2020-06-01 02:45:49 -07:00
James R. Barlow
6528234608
Fix tesseract_ocr.py errors
2020-06-01 02:27:27 -07:00
James R. Barlow
aa060db5bc
Refactor tesseract_env variable into the plugin
...
Removed all cases except one in api.py, which isn't worth solving because
it should be removed anyway.
This also fixes a logic error in the OMP_THREAD_LIMIT decision: api.py
did not pass kwargs correctly, so they never worked before.
2020-05-26 02:14:06 -07:00
James R. Barlow
d43212d30b
Refactor --language argument into set
2020-05-25 03:20:10 -07:00
James R. Barlow
a0f9ca3a30
Move Tesseract options validation into plugin
2020-05-25 01:31:46 -07:00
James R. Barlow
9bccff4f88
Move Tesseract specific arguments to plugin
2020-05-16 03:24:31 -07:00
James R. Barlow
2bd586e093
Compare requested languages to OCR engine instead of tesseract directly
...
Also refactoring to facilitate validation that needs the plugin manager.
2020-05-16 01:50:37 -07:00
James R. Barlow
9af94ac9b7
pipeline: use OCR engine abstraction instead of Tesseract
2020-05-16 01:28:56 -07:00
James R. Barlow
8174089c8b
Begin transforming Tesseract into pluggable OCR engine
2020-05-14 03:54:21 -07:00
James R. Barlow
41eb54cc0a
Standardize tesseract.generate_hocr and _pdf parameters
2020-05-14 03:23:25 -07:00
James R. Barlow
12a2f78c4d
Fix validation of languages not using tesseract_env
...
And some related issues.
2020-05-14 03:19:22 -07:00
James R. Barlow
d372f1f7fa
Remove "skip page" from tesseract interface
...
Breaks tests/test_main.py::test_tesseract_missing_tessdata because
conftest.py does not update options.tesseract_env before testing options
for some reason, and tesseract.has_textonly_pdf raises an exception
instead of returning False as the test assumes.
2020-05-12 04:09:42 -07:00
James R. Barlow
6f5b75bcd0
Remove lru_cache on get_version
...
Does not play well with forking.
2020-05-12 03:51:48 -07:00
James R. Barlow
a2d3e0b53e
Convert remaining imports to absolute
2020-05-12 02:12:08 -07:00
James R. Barlow
7f67556995
ocrmypdf.__init__: Hide _HookimplMarker
2020-05-12 01:35:45 -07:00
James R. Barlow
db8c37e58c
Refactor ocrmypdf.exec.__init__.py
2020-05-12 01:34:10 -07:00
James R. Barlow
a87c81a64f
helpers: remove unnecessary isinstance test
2020-05-12 01:28:50 -07:00
James R. Barlow
4b986a5943
cli: make ArgumentParser._api_mode private
2020-05-12 01:28:36 -07:00
James R. Barlow
2fae9b655e
Remove **kwargs from check_external_program; deprecated
2020-05-12 01:07:01 -07:00
James R. Barlow
33b68454f3
watcher: clean up getenv casting
2020-05-08 03:49:49 -07:00
James R. Barlow
977665d2b6
Delint some tests
2020-05-08 03:49:33 -07:00
James R. Barlow
fd7497f00d
Remove old function tesseract.v4()
2020-05-08 03:44:39 -07:00
James R. Barlow
790ff58f67
Add fix for bug in Windows Python 3.6/3.7
...
TypeError: argument of type 'WindowsPath' is not iterable
2020-05-07 22:19:21 -07:00
James R. Barlow
4b98ce391b
docs: rename security->pdfsecurity so github won't misinterpret it
2020-05-07 03:54:27 -07:00
James R. Barlow
417dbd43f6
docs: plugin documentation
2020-05-07 03:53:37 -07:00
James R. Barlow
7a12908db9
Relocate example plugin
2020-05-07 03:27:39 -07:00
James R. Barlow
9462f0a28f
graft: more refactoring
2020-05-07 02:59:24 -07:00
James R. Barlow
e760622a5c
graft: refactor
2020-05-07 02:03:42 -07:00
James R. Barlow
1b086f60a9
tesseract.py: api cleanup
2020-05-06 12:37:44 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path
2020-05-06 02:53:47 -07:00
James R. Barlow
6f4286e1b1
New hook: filter_page_image
2020-05-06 02:24:07 -07:00
James R. Barlow
39888ae8c9
Rename install_cli to add_options
2020-05-06 01:10:09 -07:00
James R. Barlow
dd361ecd05
Support importing plugin by filename
2020-05-06 00:44:40 -07:00
James R. Barlow
32759c9025
Change argument from --plugins to --plugin
2020-05-06 00:43:40 -07:00
James R. Barlow
75c34b873a
optimize: convert from executor to progress pool
2020-05-03 02:04:57 -07:00
James R. Barlow
fe4296c53b
safe_symlink: remove deprecated params
2020-05-03 00:53:47 -07:00
James R. Barlow
c85278b31d
Delinting
2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext
2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API
2020-05-02 03:34:31 -07:00
James R. Barlow
8c9a8fc85c
pluginspec: avoid circular reference
2020-05-02 03:32:55 -07:00
James R. Barlow
23d558ad8c
Allow plugins to add command line arguments
2020-05-02 01:37:24 -07:00