James R. Barlow
d3c54fbbde
For --rotate-pages, rasterize preview at half DPI instead of 200 DPI
...
Ensures that time is not wasted on previews at higher resolution than
the input, as was sometimes the case.
2017-05-29 13:01:18 -07:00
James R. Barlow
28341b755f
Refactor common test fixtures
2017-05-29 12:47:55 -07:00
James R. Barlow
4b5cd420e1
Add new test file
2017-05-29 12:16:08 -07:00
James R. Barlow
9b50ede977
Partially solve ghostscript rasterize_pdf producing wrong file size
...
Kludge. Assumes JPEG for now. Messy.
2017-05-25 01:17:43 -07:00
James R. Barlow
82cf010333
Error out if trying to produce PDF/A >200" due to Ghostscript limitation
2017-05-25 00:07:29 -07:00
James R. Barlow
6ff6c8614f
--output-type=pdf now outputs /UserUnit PDFs at the correct size
...
This currently distorts the output size because Tesseract assumes it
knows the DPI better than we do.
Does not work for Ghostscript, because it emerges that Ghostscript
honors /UserUnit for rasterizing but not in pdfwrite (resolve/wontfix).
https://bugs.ghostscript.com/show_bug.cgi?id=690781
Ghostscript’s output would need to be patched in a PDF/A safe way for
this to work. Temporary route may be to block Ghostscript if
/UserUnit.
2017-05-24 23:26:07 -07:00
James R. Barlow
148b632b4f
Prove multiprocessing works, although it is still racy in some places
2017-05-23 16:32:13 -07:00
James R. Barlow
75f2262659
Ensure JobContext stuff is actually tested for IPC consistency
2017-05-19 17:57:07 -07:00
James R. Barlow
d9005a1074
pdfinfo: replace most remaining dict-style access
2017-05-19 16:17:36 -07:00
James R. Barlow
08e47117a3
Rename pageinfo to pdfinfo
2017-05-19 15:48:23 -07:00
James R. Barlow
8694f8d2eb
Replace magic strings colorspace and encoding with Enums
2017-05-18 22:32:27 -07:00
James R. Barlow
56d2aae963
Refactor from ImageInfo index to attribute accessing
2017-05-18 18:39:14 -07:00
James R. Barlow
caee5b1428
Access PageInfo instance variables instead of dictionary
2017-05-18 17:12:04 -07:00
James R. Barlow
cd04ae6949
Refactor PdfInfo(str(filename)) -> PdfInfo(filename)
2017-05-18 16:43:50 -07:00
James R. Barlow
6a0b68298f
Refactor pdf_get_all_pageinfo to PdfInfo
2017-05-18 16:31:18 -07:00
James R. Barlow
5de107d44c
tesseract_cache: update explanatory notes
2017-05-14 23:54:09 -07:00
James R. Barlow
048ae40e75
Update copyrights
2017-05-14 23:38:28 -07:00
James R. Barlow
234183ecd2
Fix: Tesseract 3.04 is sensitive to order of configuration commands
...
“txt hocr” is not acceptable and does not produce the expected output .txt,
while “hocr txt” works fine, so switch the order everywhere.
Should fix #169
2017-05-14 23:27:46 -07:00
James R. Barlow
e1e9135e93
Test suite: tidy up imports
2017-05-14 23:15:29 -07:00
James R. Barlow
cb06359c0b
Turn on Tesseract 4 cache in test suite
...
Travis is too slow without it, and perhaps it’s overly paranoid to
never cache Tess4. Maybe nuke the cache occasionally to be safe…
2017-05-12 11:42:27 -07:00
James R. Barlow
b0e95842b8
Fix Travis CI errors while looking around for Tess4
2017-05-12 00:40:00 -07:00
James R. Barlow
21982cf1cb
baiona_gray remove alpha channel
2017-05-11 23:23:37 -07:00
James R. Barlow
edc01408da
Update the .png files, again, hopefully without corruption
2017-05-11 23:20:50 -07:00
James R. Barlow
96045e98f4
Update develop with master changes
...
We’re well out of the “trivial updates” zone
2017-05-11 22:54:27 -07:00
James R. Barlow
01b7205e2c
Ensure skipped pages are explained in sidecars
2017-05-11 00:43:36 -07:00
James R. Barlow
c8a4cbcf17
Fix test suite breakage after sidecar feature added
...
Forgot to update tesseract spoofers to account for change in tesseract
parameters. Also the change to outputting multiple files in the collate
steps affected how ruffus passes information into downstream consumers
of those files.
2017-05-11 00:17:24 -07:00
James R. Barlow
183eafa587
Implement sidecar text files (#126)
2017-05-10 15:22:44 -07:00
James R. Barlow
01a1c2b576
Implement --pdfa-image-compression to control Ghostscript’s compression
...
Fixes #163
2017-05-09 16:37:29 -07:00
James R. Barlow
c97ea1f2a9
Update high DPI test case to confirm the output image is not downsampled
2017-05-06 22:34:01 -07:00
James R. Barlow
bf04f03c4c
Fix corrupt test file “typewriter.png”
...
This file is not currently used in any tests, but could be, so replace
corrupt version with a useful one.
2017-05-06 22:28:34 -07:00
James R. Barlow
93e802f473
Fix issue #163, color and grayscale images JPEG compressed when not needed
2017-05-06 22:27:25 -07:00
James R. Barlow
aa859a4139
Fix #156 - NoneType has no ‘getObject’ for pages with no /Contents
2017-05-01 15:46:15 -07:00
James R. Barlow
b9b12e2879
Ensure that ocrmypdf stops and reports an error if Ghostscript fails
...
Past behavior was to continue and let ruffus puke eventually
2017-05-01 15:44:21 -07:00
James R. Barlow
554fcc8b9d
Add test case for #152
2017-04-18 15:20:25 -07:00
James R. Barlow
7b7e3a3e03
Enable lossless reconstruction for --pdf-renderer tess4 where appropriate
2017-03-29 23:44:12 -07:00
James R. Barlow
1e7fbd4202
Fix issues with --pdf-renderer tess4 page skipping
...
If tess4 renderer needed to skip OCR on a page it would end up
duplicating the page contents onto the new page, rather than creating
a blank OCR layer and placing it on the output page. This created
duplicated content in output files.
2017-03-29 23:43:26 -07:00
James R. Barlow
89599b4812
Drop Python 3.4 compatibility
2017-03-29 15:46:53 -07:00
James R. Barlow
88ef2718f1
Reject high Unicode metadata at command line
...
Ghostscript 9.21 does not seem to accept Unicode above U+FFFF. Previous
versions did, but it now exits with a rangecheck error (-15).
Reject on the command line for now. Complete fix would also need to
check input PDF’s metadata.
2017-03-28 11:08:38 -07:00
James R. Barlow
e71e8ca3ad
Workaround for GS VMerror -25 bug
...
Avoid inserting docinfo keys that would be translated to null strings,
to avoid running afoul of
https://bugs.ghostscript.com/show_bug.cgi?id=697684
2017-03-28 11:05:43 -07:00
James R. Barlow
199de96cff
Ghostscript 9.21 seems to have a regression related to Unicode metadata
2017-03-24 15:15:46 -07:00
James R. Barlow
8ddbe81513
Fix issue #147: unpaper loses DPI information, affects --pdf-renderer tess4
2017-03-24 13:23:03 -07:00
James R. Barlow
f035cb1088
Fixed issue #142 — closed streams raise an exception on fork attempt
2017-03-13 15:52:57 -07:00
James R. Barlow
72660d0dec
macOS: skip the one test that needs poppler, to save installing poppler
2017-03-11 17:03:26 -08:00
James R. Barlow
4a1fec8328
Improvements to macOS test and work on homebrew tap autobrew
...
Squashed commits:
[3f06c1e] Try setting up homebrew tap autobuilding
[01532f1] Strict mode error in brew
2017-03-11 17:00:54 -08:00
James R. Barlow
7cd2770a13
Fix issue #137 - proportions of non-square resolution distorted
...
Distortion mainly affected --force-ocr
2017-02-26 17:13:16 -08:00
James R. Barlow
d1a0065ef8
Create test case for Form XObjects
2017-02-14 12:51:15 -08:00
James R. Barlow
9f800736bc
Fix running_in_docker() check failing on newer Docker
...
This test has to work to ensure spoof/tesseract_cache.py has a writable
directory to put cache into. Otherwise those tests fail.
2017-02-13 02:16:06 -08:00
James R. Barlow
a0657ad937
Prevent use of --pdf-renderer tess4 on tesseract 3
2017-02-06 13:49:43 -08:00
James R. Barlow
005216bc57
Support ocrmypdf-tess4
2017-01-29 18:26:52 -08:00
James R. Barlow
8c17c9918e
Add documentation and test cases for --tesseract-config
...
This parameter has existed for a long time but never really got any
attention.
2017-01-28 22:06:51 -08:00