2676 Commits

Author SHA1 Message Date
James R. Barlow
03da34ee24
Test files needed! 2020-05-16 17:04:44 -07:00
James R. Barlow
9bccff4f88
Move Tesseract specific arguments to plugin 2020-05-16 03:24:31 -07:00
James R. Barlow
2bd586e093
Compare requested languages to OCR engine instead of tesseract directly
Also refactor to facilitate validation that needs the plugin manager.
2020-05-16 01:50:37 -07:00
James R. Barlow
9af94ac9b7
pipeline: use OCR engine abstraction instead of Tesseract 2020-05-16 01:28:56 -07:00
James R. Barlow
8174089c8b
Begin transforming Tesseract into pluggable OCR engine 2020-05-14 03:54:21 -07:00
James R. Barlow
41eb54cc0a
Standardize tesseract.generate_hocr and _pdf parameters 2020-05-14 03:23:25 -07:00
James R. Barlow
12a2f78c4d
Fix validation of languages not using tesseract_env
And some related issues.
2020-05-14 03:19:22 -07:00
James R. Barlow
d372f1f7fa
Remove "skip page" from tesseract interface
Breaks tests/test_main.py::test_tesseract_missing_tessdata because
conftest.py does not update options.tesseract_env before testing options
for some reason, and tesseract.has_textonly_pdf raises an exception
instead of returning False as the test assumes.
2020-05-12 04:09:42 -07:00
James R. Barlow
6f5b75bcd0
Remove lru_cache on get_version
Does not play well with forking.
2020-05-12 03:51:48 -07:00
James R. Barlow
a2d3e0b53e
Convert remaining imports to absolute 2020-05-12 02:12:08 -07:00
James R. Barlow
7f67556995
ocrmypdf.__init__: Hide _HookimplMarker 2020-05-12 01:35:45 -07:00
James R. Barlow
db8c37e58c
Refactor ocrmypdf.exec.__init__.py 2020-05-12 01:34:10 -07:00
James R. Barlow
a87c81a64f
helpers: remove unnecessary isinstance test 2020-05-12 01:28:50 -07:00
James R. Barlow
4b986a5943
cli: make ArgumentParser._api_mode private 2020-05-12 01:28:36 -07:00
James R. Barlow
2fae9b655e
Remove **kwargs from check_external_program; deprecated 2020-05-12 01:07:01 -07:00
James R. Barlow
2541f6cf89
Fix missing jbig2enc reported as error with -O3 instead of warning
Fixes #558
2020-05-12 01:05:57 -07:00
James R. Barlow
33b68454f3
watcher: cleanup getenv casting 2020-05-08 03:49:49 -07:00
James R. Barlow
977665d2b6
Delint some tests 2020-05-08 03:49:33 -07:00
James R. Barlow
fd7497f00d
Remove old function tesseract.v4() 2020-05-08 03:44:39 -07:00
James R. Barlow
790ff58f67
Add fix for bug in Windows Python 3.6/3.7
TypeError: argument of type 'WindowsPath' is not iterable
2020-05-07 22:19:21 -07:00
James R. Barlow
4b98ce391b
docs: rename security->pdfsecurity so github won't misinterpret it 2020-05-07 03:54:27 -07:00
James R. Barlow
417dbd43f6
docs: plugin documentation 2020-05-07 03:53:37 -07:00
James R. Barlow
7a12908db9
Relocate example plugin 2020-05-07 03:27:39 -07:00
James R. Barlow
9462f0a28f
graft: more refactoring 2020-05-07 02:59:24 -07:00
James R. Barlow
e760622a5c
graft: refactor 2020-05-07 02:03:42 -07:00
James R. Barlow
1b086f60a9
tesseract.py: api cleanup 2020-05-06 12:37:44 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path 2020-05-06 02:53:47 -07:00
James R. Barlow
6f4286e1b1
New hook: filter_page_image 2020-05-06 02:24:07 -07:00
James R. Barlow
39888ae8c9
Rename install_cli to add_options 2020-05-06 01:10:09 -07:00
James R. Barlow
dd361ecd05
Support importing plugin by filename 2020-05-06 00:44:40 -07:00
James R. Barlow
32759c9025
Change argument from --plugins to --plugin 2020-05-06 00:43:40 -07:00
James R. Barlow
59440448ee
Merge branch 'master' of github.com:jbarlow83/OCRmyPDF 2020-05-04 01:38:26 -07:00
Peter Hogg
51b54893ce
docs: update Arch Linux install instructions (#540)
The python-pdfminer.six package is now available in the official Arch
repositories. The dependency will be automatically resolved when
installing the OCRmyPDF AUR package.
2020-05-04 01:37:58 -07:00
James R. Barlow
1f3665f614
docs: remove reference to brewfile 2020-05-03 16:10:26 -07:00
James R. Barlow
75c34b873a
optimize: convert from executor to progress pool 2020-05-03 02:04:57 -07:00
James R. Barlow
fe4296c53b
safe_symlink: remove deprecated params 2020-05-03 00:53:47 -07:00
James R. Barlow
c85278b31d
Delinting 2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext 2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API 2020-05-02 03:34:31 -07:00
James R. Barlow
8c9a8fc85c
pluginspec: avoid circular reference 2020-05-02 03:32:55 -07:00
James R. Barlow
23d558ad8c
Allow plugins to add command line arguments 2020-05-02 01:37:24 -07:00
James R. Barlow
be107b4fed
Set up filter_ocr_image hook 2020-05-01 02:56:41 -07:00
James R. Barlow
8d2535e327
Get pluggy to work with forking workers 2020-05-01 02:39:50 -07:00
James R. Barlow
5eb4fe0052
Refactor plugin setup to get_plugin_manager 2020-05-01 02:18:31 -07:00
James R. Barlow
d8ff4485f8
Move samefile to helpers 2020-05-01 02:18:11 -07:00
James R. Barlow
82bce463ae
Start pluggy-based plugin system 2020-05-01 02:15:23 -07:00
James R. Barlow
016dfd420c
Add warning if problematic --tesseract-pagesegmode is selected
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
b59e761a14
v9.8.0 release notes v9.8.0 2020-04-28 02:40:17 -07:00
James R. Barlow
17cd655752
Don't utf-8 decode tesseract --print-parameters
Output not guaranteed to be UTF-8.

Fixes #543.
2020-04-28 02:37:17 -07:00
James R. Barlow
b840b16c82
Remove tesseract_badutf8.py
Should have been removed in 9db01c7
2020-04-28 02:35:23 -07:00