James R. Barlow
03da34ee24
Test files needed!
2020-05-16 17:04:44 -07:00
James R. Barlow
9bccff4f88
Move Tesseract specific arguments to plugin
2020-05-16 03:24:31 -07:00
James R. Barlow
2bd586e093
Compare requested languages to OCR engine instead of tesseract directly
...
Also refactoring to facilitate validation that needs the plugin manager.
2020-05-16 01:50:37 -07:00
James R. Barlow
9af94ac9b7
pipeline: use OCR engine abstraction instead of Tesseract
2020-05-16 01:28:56 -07:00
James R. Barlow
8174089c8b
Begin transforming Tesseract into pluggable OCR engine
2020-05-14 03:54:21 -07:00
James R. Barlow
41eb54cc0a
Standardize tesseract.generate_hocr and _pdf parameters
2020-05-14 03:23:25 -07:00
James R. Barlow
12a2f78c4d
Fix validation of languages not using tesseract_env
...
And some related issues.
2020-05-14 03:19:22 -07:00
James R. Barlow
d372f1f7fa
Remove "skip page" from tesseract interface
...
Breaks tests/test_main.py::test_tesseract_missing_tessdata because
conftest.py does not update options.tesseract_env before testing options
for some reason, and tesseract.has_textonly_pdf raises an exception
instead of returning False as the test assumes.
2020-05-12 04:09:42 -07:00
James R. Barlow
6f5b75bcd0
Remove lru_cache on get_version
...
Does not play well with forking.
2020-05-12 03:51:48 -07:00
James R. Barlow
a2d3e0b53e
Convert remaining imports to absolute
2020-05-12 02:12:08 -07:00
James R. Barlow
7f67556995
ocrmypdf.__init__: Hide _HookimplMarker
2020-05-12 01:35:45 -07:00
James R. Barlow
db8c37e58c
Refactor ocrmypdf.exec.__init__.py
2020-05-12 01:34:10 -07:00
James R. Barlow
a87c81a64f
helpers: remove unnecessary isinstance test
2020-05-12 01:28:50 -07:00
James R. Barlow
4b986a5943
cli: make ArgumentParser._api_mode private
2020-05-12 01:28:36 -07:00
James R. Barlow
2fae9b655e
Remove **kwargs from check_external_program; deprecated
2020-05-12 01:07:01 -07:00
James R. Barlow
2541f6cf89
Fix missing jbig2enc reported as error with -O3 instead of warning
...
Fixes #558
2020-05-12 01:05:57 -07:00
James R. Barlow
33b68454f3
watcher: clean up getenv casting
2020-05-08 03:49:49 -07:00
James R. Barlow
977665d2b6
Delint some tests
2020-05-08 03:49:33 -07:00
James R. Barlow
fd7497f00d
Remove old function tesseract.v4()
2020-05-08 03:44:39 -07:00
James R. Barlow
790ff58f67
Add fix for bug in Windows Python 3.6/3.7
...
TypeError: argument of type 'WindowsPath' is not iterable
2020-05-07 22:19:21 -07:00
James R. Barlow
4b98ce391b
docs: rename security->pdfsecurity so github won't misinterpret it
2020-05-07 03:54:27 -07:00
James R. Barlow
417dbd43f6
docs: plugin documentation
2020-05-07 03:53:37 -07:00
James R. Barlow
7a12908db9
Relocate example plugin
2020-05-07 03:27:39 -07:00
James R. Barlow
9462f0a28f
graft: more refactoring
2020-05-07 02:59:24 -07:00
James R. Barlow
e760622a5c
graft: refactor
2020-05-07 02:03:42 -07:00
James R. Barlow
1b086f60a9
tesseract.py: api cleanup
2020-05-06 12:37:44 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path
2020-05-06 02:53:47 -07:00
James R. Barlow
6f4286e1b1
New hook: filter_page_image
2020-05-06 02:24:07 -07:00
James R. Barlow
39888ae8c9
Rename install_cli to add_options
2020-05-06 01:10:09 -07:00
James R. Barlow
dd361ecd05
Support importing plugin by filename
2020-05-06 00:44:40 -07:00
James R. Barlow
32759c9025
Change argument from --plugins to --plugin
2020-05-06 00:43:40 -07:00
James R. Barlow
59440448ee
Merge branch 'master' of github.com:jbarlow83/OCRmyPDF
2020-05-04 01:38:26 -07:00
Peter Hogg
51b54893ce
docs: update Arch Linux install instructions (#540)
...
The python-pdfminer.six package is now available in the official Arch
repositories. The dependency will be automatically resolved when
installing the OCRmyPDF AUR package.
2020-05-04 01:37:58 -07:00
James R. Barlow
1f3665f614
docs: remove reference to brewfile
2020-05-03 16:10:26 -07:00
James R. Barlow
75c34b873a
optimize: convert from executor to progress pool
2020-05-03 02:04:57 -07:00
James R. Barlow
fe4296c53b
safe_symlink: remove deprecated params
2020-05-03 00:53:47 -07:00
James R. Barlow
c85278b31d
Delinting
2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext
2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API
2020-05-02 03:34:31 -07:00
James R. Barlow
8c9a8fc85c
pluginspec: avoid circular reference
2020-05-02 03:32:55 -07:00
James R. Barlow
23d558ad8c
Allow plugins to add command line arguments
2020-05-02 01:37:24 -07:00
James R. Barlow
be107b4fed
Set up filter_ocr_image hook
2020-05-01 02:56:41 -07:00
James R. Barlow
8d2535e327
Get pluggy to work with forking workers
2020-05-01 02:39:50 -07:00
James R. Barlow
5eb4fe0052
Refactor plugin setup to get_plugin_manager
2020-05-01 02:18:31 -07:00
James R. Barlow
d8ff4485f8
Move samefile to helpers
2020-05-01 02:18:11 -07:00
James R. Barlow
82bce463ae
Start pluggy-based plugin system
2020-05-01 02:15:23 -07:00
James R. Barlow
016dfd420c
Add warning if problematic --tesseract-pagesegmode is selected
...
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
b59e761a14
v9.8.0 release notes
v9.8.0
2020-04-28 02:40:17 -07:00
James R. Barlow
17cd655752
Don't utf-8 decode tesseract --print-parameters
...
Output not guaranteed to be UTF-8.
Fixes #543.
2020-04-28 02:37:17 -07:00
James R. Barlow
b840b16c82
Remove tesseract_badutf8.py
...
Should have been removed in 9db01c7
2020-04-28 02:35:23 -07:00