2676 Commits

Author SHA1 Message Date
James R. Barlow
8d323ae510 Merge branch 'feature/pagesegmode' into develop 2016-01-11 17:23:00 -08:00
James R. Barlow
3b53e9adac Use tesseract cache for -psm 2016-01-11 17:22:50 -08:00
James R. Barlow
074c1d71b4 Activate --tesseract-pagesegmode 2016-01-11 17:19:32 -08:00
James R. Barlow
1fca9a004d Adjust command line parameters
Was splitting each argument to --tesseract-config into a list of single
character strings
2016-01-11 16:57:19 -08:00
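Note on 1fca9a004d: the commit does not say what caused the splitting. One common way this symptom arises in Python is extending a list with a string, which iterates over its characters. A minimal sketch of that failure mode and its fix, using hypothetical helper names that are not taken from the OCRmyPDF source:

```python
def build_args_buggy(configs):
    """Hypothetical helper: shows how a string can be split into characters."""
    args = []
    for cfg in configs:
        # list.extend() iterates its argument; a str yields single characters
        args.extend(cfg)
    return args


def build_args_fixed(configs):
    """Hypothetical helper: keeps each config name as one list element."""
    args = []
    for cfg in configs:
        # append() adds the whole string as a single item
        args.append(cfg)
    return args
```

If the real cause was different (for example an argparse option definition), the same check applies: each value passed to --tesseract-config should come out as one list element.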
James R. Barlow
b485a1ef78 Override ruffus' handling of --jobs
Ruffus treats omitted parameter as -j1. For our purposes it makes more
sense for omitting the parameter to mean "use all CPUs". As such we
must be able to distinguish -j1 from the parameter -j being omitted.

Telling ruffus to ignore the argument actually just makes it not auto
generate the argument. We can add an argument back with the same name.
2016-01-09 19:07:48 -08:00
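Note on b485a1ef78: distinguishing an omitted -j from an explicit -j1 can be sketched with plain argparse by using None as the default. This is an illustration only; the function name is hypothetical, and the actual change re-adds the argument on top of the parser that ruffus generates, which is not shown here.

```python
import argparse
import os


def parse_jobs(argv):
    """Return the job count; an omitted -j means 'use all CPUs'."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not given" apart from an explicit "-j 1"
    parser.add_argument("-j", "--jobs", type=int, default=None)
    args = parser.parse_args(argv)
    if args.jobs is None:
        # os.cpu_count() may return None on some platforms; fall back to 1
        return os.cpu_count() or 1
    return args.jobs
```

With this shape, parse_jobs(["-j", "1"]) yields 1, while parse_jobs([]) yields the number of CPUs reported by the operating system.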
James R. Barlow
326ef7a3ac Merge branch 'hotfix/v3.1.1' into develop
# Conflicts:
#	RELEASE_NOTES.rst
2016-01-09 18:55:04 -08:00
James R. Barlow
12bc58b5b6 Merge branch 'hotfix/v3.1.1' v3.1.1 2016-01-09 18:45:40 -08:00
James R. Barlow
6af0815681 Bump version 2016-01-09 18:45:06 -08:00
James R. Barlow
66c2b9b78e Merge branch 'hotfix/v3.1.1' into develop 2016-01-09 18:38:09 -08:00
James R. Barlow
d03c056cb1 Supporting all languages bloats the image by an extra 1 GB
Make it a special image
2016-01-04 16:49:06 -08:00
James R. Barlow
3f94d628fa Dockerfile: remove manual build of unpaper
Fortunately unpaper now exists as binary package, eliminating the need
to install all of the build machinery and build it from source.
2016-01-04 15:07:12 -08:00
James R. Barlow
a64c7dbe99 Update dockerfile: include all languages
Also update ignore files
2016-01-04 14:27:16 -08:00
James R. Barlow
61b3ccb57c Place ruffus database in temporary folder
Because we don't really use ruffus' checkpoint feature, putting the
database in a permanent location does not help anything, but it does
produce large database files and causes problems when
.ruffus_history.sqlite needs to be in a writable location.
2016-01-04 13:23:47 -08:00
James R. Barlow
424b4b33b1 Just go right ahead and demand Python 3.4 2016-01-04 12:56:51 -08:00
James R. Barlow
e510f89792 Python 2 warning message 2015-12-21 09:38:38 -08:00
James R. Barlow
49cd6cc619 Off by one error in page info calculation 2015-12-21 09:35:02 -08:00
James R. Barlow
9aa3d340d4 Tell Travis about the cache 2015-12-17 14:02:13 -08:00
James R. Barlow
09782242c8 Adjust test cases to use cache and noop more effectively
This reduces total execution time to 164s on my machine, down from
about double that.
2015-12-17 14:00:17 -08:00
James R. Barlow
9ec4aa039d Add tesseract caching to speed up tests 2015-12-17 12:52:12 -08:00
James R. Barlow
ecebe2f24b Let some tests use the spoofed tesseract
Where getting OCR doesn't matter
2015-12-17 11:56:09 -08:00
James R. Barlow
7313a77c2a Implement pdf renderer side of tess spoof 2015-12-17 11:41:54 -08:00
James R. Barlow
45113676a3 Add Tesseract spoofing 2015-12-17 11:36:47 -08:00
James R. Barlow
102bd07019 Check for encrypted PDF and complain appropriately 2015-12-17 10:37:54 -08:00
James R. Barlow
9622e31da9 Use envvars in a new test case
And get rid of the messy binary replacement spoofing
2015-12-17 09:29:01 -08:00
James R. Barlow
1731ce2a44 Environment variables can now override default programs 2015-12-17 09:05:10 -08:00
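Note on 1731ce2a44: the commit title gives no details, so the following is only a sketch of how an environment variable might override a default program name. The OCRMYPDF_ prefix and the get_program helper are assumptions for illustration, not confirmed by this log.

```python
import os


def get_program(name, environ=None):
    """Return the executable to run for a tool such as 'qpdf'.

    Assumption: an override variable is named OCRMYPDF_<NAME>.
    """
    environ = os.environ if environ is None else environ
    env_var = "OCRMYPDF_" + name.upper()
    # Fall back to the bare program name, resolved through PATH at run time
    return environ.get(env_var, name)
```

Passing the environment as a parameter keeps the lookup easy to test, which matches the use described in 9622e31da9, where test cases substitute programs through environment variables.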
James R. Barlow
276f421c44 Did a quick test of Ghostscript vs QPDF at PDF page splitting
qpdf won so hard it wasn't funny, even though it must be called once
per page to do the job. Perhaps Ghostscript interprets it as a call to
render the page?

time bash qpdf-test.fish ../tests/resources/multipage.pdf
        0.07 real         0.02 user         0.03 sys

time gs -sDEVICE=pdfwrite -dSAFER -o '%06d.pdf' ../tests/resources/multipage.pdf
        5.12 real         5.06 user         0.04 sys
2015-12-17 08:49:08 -08:00
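Note on 276f421c44: the contents of qpdf-test.fish are not shown, so this is a guess at the general shape of "one qpdf call per page". The sketch only builds the argument lists, using qpdf's --empty and --pages options; the function name is hypothetical, the page count is assumed to be known in advance, and the output naming mirrors the '%06d.pdf' pattern from the Ghostscript command.

```python
def qpdf_split_commands(input_pdf, num_pages):
    """Build one qpdf command line per page (nothing is executed here)."""
    commands = []
    for page in range(1, num_pages + 1):
        output = "%06d.pdf" % page
        # Start from an empty PDF and pull in exactly one page of the input
        commands.append(
            ["qpdf", "--empty", "--pages", input_pdf, str(page), "--", output]
        )
    return commands
```

Each returned list could be handed to subprocess.run() in turn; timing that loop is what the comparison above measures on the qpdf side.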
James R. Barlow
133357779a All subprocess invocations refactored out of main.py 2015-12-17 08:31:18 -08:00
James R. Barlow
5d8167b232 Move PDF validation check to qpdf.py 2015-12-17 08:28:00 -08:00
James R. Barlow
e76ae8c46c Move more qpdf calls into qpdf.py 2015-12-17 08:24:48 -08:00
James R. Barlow
53a7c0e668 Refactor qpdf subprocess calls into module 2015-12-17 08:19:53 -08:00
James R. Barlow
4ca243e490 Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars 2015-12-17 07:27:26 -08:00
jbarlow83
9f37446155 Merge pull request #34 from shemgp/master
Don't exit when qpdf repairs the file successfully but displays warning
2015-12-16 20:46:47 -08:00
Shem Pasamba
d7c7559b05 Use boolean instead of integers 2015-12-17 11:23:27 +08:00
Shem Pasamba
b2b66d1344 Don't exit when qpdf repair was successful 2015-12-17 11:20:20 +08:00
James R. Barlow
5d111a3c04 Refactor tesseract --pdfrenderer calls to tesseract.py 2015-12-16 17:48:26 -08:00
James R. Barlow
10416f847f Migrate tesseract-hocr code to tesseract module, because modularity 2015-12-16 17:36:11 -08:00
James R. Barlow
79b3472b26 All tests passed, bump version v3.1 2015-12-04 04:31:01 -08:00
James R. Barlow
f1b2f1ae08 Merge branch 'feature/pdfa-2' into develop 2015-12-04 04:04:08 -08:00
James R. Barlow
ee7d97ae8c Trivial 2015-12-04 04:03:38 -08:00
James R. Barlow
7d9f473bb1 Remove eval() call by introspecting ExitCode 2015-12-04 03:34:53 -08:00
James R. Barlow
e77a5e5e75 We don't want threads. Really. Do. Not. Want. 2015-12-04 03:11:38 -08:00
James R. Barlow
6ab19af122 Comments 2015-12-04 03:09:39 -08:00
James R. Barlow
276fe49867 Better error messages for input file not found or invalid
Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.
2015-12-04 03:07:53 -08:00
James R. Barlow
acb31abe86 Fix issue #20 - fails on uppercase .PDF 2015-12-04 02:14:09 -08:00
James R. Barlow
4f964a3c8a Introduce --pdf-renderer auto
Tess 3.03 has various quality problems like wrong DPI that are fixed
in Tess 3.04. Idea here is to introduce an option to let OCRmyPDF
select the rendering backend based on the options and system.

However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract does not pass all test cases, mainly
the one where --tesseract-timeout is triggered, and some others.
2015-12-02 23:20:31 -08:00
James R. Barlow
df1fda7438 pageinfo: workaround PyPDF extractText limitations on hidden text
It appears that extractText() does not find all text. At a glance it
may be that Tesseract's PDF renderer generates a font and uses glyphs
that map to different Unicode code points than PyPDF expects, so it
discards the content and finds nothing. As a proxy in lieu of better
PDF parsing, assume that a "GlyphLessFont" means there is text there.

I had previously found it does not work to check for the presence of a
font on the page. Some PDF generators create a font resource entry even if
the font is never called for.
2015-12-02 23:16:36 -08:00
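Note on df1fda7438: the heuristic described in the message can be sketched without any PDF library, assuming the caller has already extracted the page text and the names of the fonts listed in the page resources. The function name and its inputs are hypothetical; the real pageinfo code reads these from PyPDF objects.

```python
def page_probably_has_text(extracted_text, font_names):
    """Heuristic: decide whether a page already carries a text layer."""
    # Text that the extractor did find is taken at face value
    if extracted_text.strip():
        return True
    # Proxy from the commit message: a font named "GlyphLessFont" is
    # treated as a sign of hidden text even when extraction finds nothing
    return any("GlyphLessFont" in name for name in font_names)
```

Note that the check looks for that one font name rather than for any font at all, since, as the message says, some PDF generators list fonts that are never used.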
James R. Barlow
d6124c1787 pageinfo: improve robustness of text test for Tesseract produced PDFs 2015-12-02 03:12:52 -08:00
James R. Barlow
80d89b5420 Set /Creator metadata to OCRmyPDF
with reference to Tess version and settings
2015-12-02 02:19:39 -08:00
James R. Barlow
74059eecf1 Choose PDF/A-2b by default instead of A-1b 2015-12-02 01:48:10 -08:00
James R. Barlow
78697341a2 pytest: don't run tests that happened to be part of pyvenv 2015-12-02 01:19:43 -08:00