James R. Barlow
59440448ee
Merge branch 'master' of github.com:jbarlow83/OCRmyPDF
2020-05-04 01:38:26 -07:00
Peter Hogg
51b54893ce
docs: update Arch Linux install instructions (#540)
...
The python-pdfminer.six package is now available in the official Arch
repositories. The dependency will be automatically resolved when
installing the OCRmyPDF AUR package.
2020-05-04 01:37:58 -07:00
James R. Barlow
1f3665f614
docs: remove reference to brewfile
2020-05-03 16:10:26 -07:00
James R. Barlow
75c34b873a
optimize: convert from executor to progress pool
2020-05-03 02:04:57 -07:00
James R. Barlow
fe4296c53b
safe_symlink: remove deprecated params
2020-05-03 00:53:47 -07:00
James R. Barlow
c85278b31d
Delinting
2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext
2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API
2020-05-02 03:34:31 -07:00
James R. Barlow
8c9a8fc85c
pluginspec: avoid circular reference
2020-05-02 03:32:55 -07:00
James R. Barlow
23d558ad8c
Allow plugins to add command line arguments
2020-05-02 01:37:24 -07:00
James R. Barlow
be107b4fed
Set up filter_ocr_image hook
2020-05-01 02:56:41 -07:00
James R. Barlow
8d2535e327
Get pluggy to work with forking workers
2020-05-01 02:39:50 -07:00
James R. Barlow
5eb4fe0052
Refactor plugin setup to get_plugin_manager
2020-05-01 02:18:31 -07:00
James R. Barlow
d8ff4485f8
Move samefile to helpers
2020-05-01 02:18:11 -07:00
James R. Barlow
82bce463ae
Start pluggy-based plugin system
2020-05-01 02:15:23 -07:00
James R. Barlow
016dfd420c
Add warning if problematic --tesseract-pagesegmode is selected
...
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
b59e761a14
v9.8.0 release notes
v9.8.0
2020-04-28 02:40:17 -07:00
James R. Barlow
17cd655752
Don't utf-8 decode tesseract --print-parameters
...
Output not guaranteed to be UTF-8.
Fixes #543.
2020-04-28 02:37:17 -07:00
James R. Barlow
b840b16c82
Remove tesseract_badutf8.py
...
Should have been removed in 9db01c7
2020-04-28 02:35:23 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf
2020-04-26 05:33:26 -07:00
James R. Barlow
168fc60774
Update release notes with v10 changes
2020-04-26 05:14:59 -07:00
James R. Barlow
c84d0f606d
ghostscript: remove deprecated argument from generate_pdfa
2020-04-26 05:11:11 -07:00
James R. Barlow
8b54ce338f
setup: remove deprecated message about removal of --force parameter
2020-04-26 05:09:42 -07:00
James R. Barlow
18c4aa10bf
Adjust number of workers for concurrent page scanning
2020-04-26 04:21:15 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
...
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
2c07515907
macOS - use spawn for multiprocessing
...
See bpo-33725. spawn is the default on macOS for Python 3.8, opt-in for 3.7 and older.
2020-04-26 03:49:40 -07:00
James R. Barlow
27a3b80376
Use once-per-worker pikepdf init
2020-04-26 03:49:20 -07:00
James R. Barlow
8c381a0227
Replace task_initargs with use of partial()
2020-04-26 03:49:20 -07:00
James R. Barlow
86145a8c76
Something wrong with forking worker_pdf, just open it once per page for now
2020-04-26 03:49:20 -07:00
James R. Barlow
7513f5425c
Fix some broken tests
2020-04-26 03:49:20 -07:00
James R. Barlow
af3c3c6466
Further refactoring of concurrency concerns
2020-04-26 03:49:20 -07:00
James R. Barlow
db3e75e33e
Refactor multiprocessing pool
2020-04-26 03:49:13 -07:00
James R. Barlow
ce49fc26dd
Do pikepdf.open() once instead of per worker
2020-04-26 03:42:13 -07:00
James R. Barlow
d0d0a98dca
First cut at concurrent page scan
...
Improvement appears on 168 page file. Needs refactoring
2020-04-26 03:42:13 -07:00
James R. Barlow
3834d1a0bf
azure: use brew python instead
2020-04-26 00:58:38 -07:00
James R. Barlow
33e982b3fd
azure: add certifi, openssl for macOS
2020-04-26 00:37:14 -07:00
James R. Barlow
43d650e78c
Fix issue where only first PNG-style image would be optimized
2020-04-25 03:50:11 -07:00
James R. Barlow
b4c65c5781
Update requirements
2020-04-25 03:49:34 -07:00
James R. Barlow
d96867e6ab
watcher: add polling and log level adjustment
2020-04-24 04:14:44 -07:00
James R. Barlow
0a5108e704
install: clarify that old ocrmypdf should be removed from Ubuntu 18.04
...
Closes #526
2020-04-24 04:14:19 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution
2020-04-24 04:12:05 -07:00
James R. Barlow
57771f06a3
Refactor xy-pair for resolution to tuple
2020-04-16 15:38:33 -07:00
James R. Barlow
58abb5785c
pytest picky about list vs tuple
v9.7.2
2020-04-15 03:16:51 -07:00
James R. Barlow
509e75eaff
v9.7.2 release notes
2020-04-15 02:56:46 -07:00
James R. Barlow
0c50eedb2a
Support pdfminer.six 20200402
2020-04-15 02:55:22 -07:00
James R. Barlow
4581027246
Drop support for pdfminer.six 20181108
...
This version required a patch that has since been mainlined, and also
did not declare its dependency on chardet correctly. We can remove both
hacks now.
2020-04-15 02:50:36 -07:00
James R. Barlow
31b5f63f85
hocrtransform: cleanup/PEP8
...
Some API breaking changes.
2020-04-15 02:48:56 -07:00
James R. Barlow
957fb1494e
pytest picky about list vs tuple
2020-04-15 02:26:20 -07:00
James R. Barlow
9e3e4f2687
Improve help text about aborting due to text
2020-04-15 02:17:55 -07:00
James R. Barlow
2155bcacb4
Loosen test language requirements - eng/deu
2020-04-15 00:30:38 -07:00