2895 Commits

Author SHA1 Message Date
James R. Barlow
59440448ee Merge branch 'master' of github.com:jbarlow83/OCRmyPDF 2020-05-04 01:38:26 -07:00
Peter Hogg
51b54893ce
docs: update Arch Linux install instructions (#540)
The python-pdfminer.six package is now available in the official Arch
repositories. The dependency will be automatically resolved when
installing the OCRmyPDF AUR package.
2020-05-04 01:37:58 -07:00
James R. Barlow
1f3665f614
docs: remove reference to brewfile 2020-05-03 16:10:26 -07:00
James R. Barlow
75c34b873a
optimize: convert from executor to progress pool 2020-05-03 02:04:57 -07:00
James R. Barlow
fe4296c53b
safe_symlink: remove deprecated params 2020-05-03 00:53:47 -07:00
James R. Barlow
c85278b31d
Delinting 2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext 2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API 2020-05-02 03:34:31 -07:00
James R. Barlow
8c9a8fc85c
pluginspec: avoid circular reference 2020-05-02 03:32:55 -07:00
James R. Barlow
23d558ad8c
Allow plugins to add command line arguments 2020-05-02 01:37:24 -07:00
James R. Barlow
be107b4fed
Set up filter_ocr_image hook 2020-05-01 02:56:41 -07:00
James R. Barlow
8d2535e327
Get pluggy to work with forking workers 2020-05-01 02:39:50 -07:00
James R. Barlow
5eb4fe0052
Refactor plugin setup to get_plugin_manager 2020-05-01 02:18:31 -07:00
James R. Barlow
d8ff4485f8
Move samefile to helpers 2020-05-01 02:18:11 -07:00
James R. Barlow
82bce463ae
Start pluggy-based plugin system 2020-05-01 02:15:23 -07:00
James R. Barlow
016dfd420c Add warning if problematic --tesseract-pagesegmode is selected
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
b59e761a14
v9.8.0 release notes v9.8.0 2020-04-28 02:40:17 -07:00
James R. Barlow
17cd655752
Don't utf-8 decode tesseract --print-parameters
Output not guaranteed to be UTF-8.

Fixes #543.
2020-04-28 02:37:17 -07:00
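
The commit above concerns tool output that is not guaranteed to be UTF-8. A minimal sketch of the general idea, assuming a tab-separated layout and using a hypothetical helper `parameter_names` and made-up sample bytes (neither is taken from OCRmyPDF or from real Tesseract output): keep the output as bytes and split on byte delimiters, so a stray non-UTF-8 byte cannot raise.

```python
def parameter_names(raw: bytes) -> list:
    """Return the first tab-separated field of each non-empty line, as bytes.

    Hypothetical helper; the tab-separated layout is an assumption.
    """
    names = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        names.append(line.split(b"\t", 1)[0])
    return names


# Made-up sample: the description on the first line holds a lone 0xE9 byte,
# which is not valid UTF-8.
sample = b"alpha\t1\tcaf\xe9\nbeta\t0\tplain\n"

try:
    sample.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False  # strict decoding rejects the sample

names = parameter_names(sample)  # byte-level parsing still works
```

If text is needed later, individual fields can be decoded with `errors="replace"` at the point of use rather than decoding the whole output up front.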
James R. Barlow
b840b16c82
Remove tesseract_badutf8.py
Should have been removed in 9db01c7
2020-04-28 02:35:23 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf 2020-04-26 05:33:26 -07:00
James R. Barlow
168fc60774
Update release notes with v10 changes 2020-04-26 05:14:59 -07:00
James R. Barlow
c84d0f606d
ghostscript: remove deprecated argument from generate_pdfa 2020-04-26 05:11:11 -07:00
James R. Barlow
8b54ce338f
setup: remove deprecated message about removal of --force parameter 2020-04-26 05:09:42 -07:00
James R. Barlow
18c4aa10bf
Adjust number of workers for concurrent page scanning 2020-04-26 04:21:15 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
2c07515907 macOS - use spawn for multiprocessing
See bpo-33725. This is the default for 3.8, opt-in for 3.7 and older.
2020-04-26 03:49:40 -07:00
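
As an illustration of the commit above, a minimal sketch of opting in to the `spawn` start method on macOS with the standard `multiprocessing` module. The helper name `make_context` is hypothetical, and this is not the project's actual code.

```python
import multiprocessing as mp
import sys


def make_context(platform=sys.platform):
    """Return a multiprocessing context, forcing 'spawn' on macOS.

    Hypothetical helper: on other platforms the interpreter's default
    start method is left unchanged.
    """
    if platform == "darwin":
        # Python 3.8 made this the macOS default; older versions must opt in.
        return mp.get_context("spawn")
    return mp.get_context()


ctx = make_context("darwin")
chosen = ctx.get_start_method()
```

Using a context object rather than `multiprocessing.set_start_method()` keeps the choice local to the code that creates the pool, so it does not affect other libraries in the same process.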
James R. Barlow
27a3b80376 Use once-per-worker pikepdf init 2020-04-26 03:49:20 -07:00
James R. Barlow
8c381a0227 Replace task_initargs with use of partial() 2020-04-26 03:49:20 -07:00
James R. Barlow
86145a8c76 Something wrong with forking worker_pdf, just open it once per page for now 2020-04-26 03:49:20 -07:00
James R. Barlow
7513f5425c Fix some broken tests 2020-04-26 03:49:20 -07:00
James R. Barlow
af3c3c6466 Further refactoring of concurrency concerns 2020-04-26 03:49:20 -07:00
James R. Barlow
db3e75e33e Refactor multiprocessing pool 2020-04-26 03:49:13 -07:00
James R. Barlow
ce49fc26dd Do pikepdf.open() once instead of per worker 2020-04-26 03:42:13 -07:00
James R. Barlow
d0d0a98dca First cut at concurrent page scan
Improvement appears on 168 page file. Needs refactoring
2020-04-26 03:42:13 -07:00
James R. Barlow
3834d1a0bf
azure: use brew python instead 2020-04-26 00:58:38 -07:00
James R. Barlow
33e982b3fd
azure: add certifi, openssl for macOS 2020-04-26 00:37:14 -07:00
James R. Barlow
43d650e78c
Fix issue where only first PNG-style image would be optimized 2020-04-25 03:50:11 -07:00
James R. Barlow
b4c65c5781
Update requirements 2020-04-25 03:49:34 -07:00
James R. Barlow
d96867e6ab watcher: add polling and log level adjustment 2020-04-24 04:14:44 -07:00
James R. Barlow
0a5108e704 install: clarify that old ocrmypdf should be removed from Ubuntu 18.04
Closes #526
2020-04-24 04:14:19 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution 2020-04-24 04:12:05 -07:00
James R. Barlow
57771f06a3
Refactor xy-pair for resolution to tuple 2020-04-16 15:38:33 -07:00
James R. Barlow
58abb5785c
pytest picky about list vs tuple v9.7.2 2020-04-15 03:16:51 -07:00
James R. Barlow
509e75eaff
v9.7.2 release notes 2020-04-15 02:56:46 -07:00
James R. Barlow
0c50eedb2a Support pdfminer.six 20200402 2020-04-15 02:55:22 -07:00
James R. Barlow
4581027246 Drop support for pdfminer.six 20181108
This version required a patch that has since been mainlined, and also did
not declare its dependency on chardet correctly. We can remove both hacks
now.
2020-04-15 02:50:36 -07:00
James R. Barlow
31b5f63f85 hocrtransform: cleanup/PEP8
Some API breaking changes.
2020-04-15 02:48:56 -07:00
James R. Barlow
957fb1494e
pytest picky about list vs tuple 2020-04-15 02:26:20 -07:00
James R. Barlow
9e3e4f2687
Improve help text about aborting due to text 2020-04-15 02:17:55 -07:00
James R. Barlow
2155bcacb4
Loosen test language requirements - eng/deu 2020-04-15 00:30:38 -07:00