OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2025-08-17 21:22:07 +00:00

Author	SHA1	Message	Date
James R. Barlow	8d323ae510	Merge branch 'feature/pagesegmode' into develop	2016-01-11 17:23:00 -08:00
James R. Barlow	3b53e9adac	Use tesseract cache for -psm	2016-01-11 17:22:50 -08:00
James R. Barlow	074c1d71b4	Activate --tesseract-pagesegmode	2016-01-11 17:19:32 -08:00
James R. Barlow	1fca9a004d	Adjust command line parameters. Was splitting each argument to --tesseract-config into a list of single-character strings.	2016-01-11 16:57:19 -08:00
James R. Barlow	b485a1ef78	Override ruffus' handling of --jobs. Ruffus treats an omitted parameter as -j1. For our purposes it makes more sense for omitting the parameter to mean "use all CPUs". As such we must be able to distinguish -j1 from the parameter -j being omitted. Telling ruffus to ignore the argument actually just makes it not auto-generate the argument. We can add an argument back with the same name.	2016-01-09 19:07:48 -08:00
James R. Barlow	326ef7a3ac	Merge branch 'hotfix/v3.1.1' into develop # Conflicts: # RELEASE_NOTES.rst	2016-01-09 18:55:04 -08:00
James R. Barlow	12bc58b5b6	Merge branch 'hotfix/v3.1.1' v3.1.1	2016-01-09 18:45:40 -08:00
James R. Barlow	6af0815681	Bump version	2016-01-09 18:45:06 -08:00
James R. Barlow	66c2b9b78e	Merge branch 'hotfix/v3.1.1' into develop	2016-01-09 18:38:09 -08:00
James R. Barlow	d03c056cb1	Supporting all languages bloats the image by an extra 1 GB. Make it a special image.	2016-01-04 16:49:06 -08:00
James R. Barlow	3f94d628fa	Dockerfile: remove manual build of unpaper. Fortunately unpaper now exists as a binary package, eliminating the need to install all of the build machinery and build it from source.	2016-01-04 15:07:12 -08:00
James R. Barlow	a64c7dbe99	Update dockerfile: include all languages. Also update ignore files.	2016-01-04 14:27:16 -08:00
James R. Barlow	61b3ccb57c	Place ruffus database in temporary folder. Because we don't really use the ruffus checkpoint feature, putting the database in a permanent location does not help anything, but does cause large database files and problems if the .ruffus_history.sqlite wanted to be in a writable location.	2016-01-04 13:23:47 -08:00
James R. Barlow	424b4b33b1	Just go right ahead and demand Python 3.4	2016-01-04 12:56:51 -08:00
James R. Barlow	e510f89792	Python 2 warning message	2015-12-21 09:38:38 -08:00
James R. Barlow	49cd6cc619	Off by one error in page info calculation	2015-12-21 09:35:02 -08:00
James R. Barlow	9aa3d340d4	Tell Travis about the cache	2015-12-17 14:02:13 -08:00
James R. Barlow	09782242c8	Adjust test cases to use cache and noop more effectively This reduces total execution time to 164s on my machine, down from about double that.	2015-12-17 14:00:17 -08:00
James R. Barlow	9ec4aa039d	Add tesseract caching to speed up tests	2015-12-17 12:52:12 -08:00
James R. Barlow	ecebe2f24b	Let some tests use the spoofed tesseract Where getting OCR doesn't matter	2015-12-17 11:56:09 -08:00
James R. Barlow	7313a77c2a	Implement pdf renderer side of tess spoof	2015-12-17 11:41:54 -08:00
James R. Barlow	45113676a3	Add Tesseract spoofing	2015-12-17 11:36:47 -08:00
James R. Barlow	102bd07019	Check for encrypted PDF and complain appropriately	2015-12-17 10:37:54 -08:00
James R. Barlow	9622e31da9	Use envvars in a new test case And get rid of the messy binary replacement spoofing	2015-12-17 09:29:01 -08:00
James R. Barlow	1731ce2a44	Environment variables can now override default programs	2015-12-17 09:05:10 -08:00
James R. Barlow	276f421c44	Did a quick test of Ghostscript vs QPDF at PDF page splitting. qpdf won so hard it wasn't funny, even though it must be called once per page to do the job. Perhaps Ghostscript interprets it as a call to render the page? Timings: "time bash qpdf-test.fish ../tests/resources/multipage.pdf" gave 0.07 real, 0.02 user, 0.03 sys; "time gs -sDEVICE=pdfwrite -dSAFER -o '%06d.pdf' ../tests/resources/multipage.pdf" gave 5.12 real, 5.06 user, 0.04 sys.	2015-12-17 08:49:08 -08:00
James R. Barlow	133357779a	All subprocess invocations refactored out of main.py	2015-12-17 08:31:18 -08:00
James R. Barlow	5d8167b232	Move PDF validation check to qpdf.py	2015-12-17 08:28:00 -08:00
James R. Barlow	e76ae8c46c	Move more qpdf calls into qpdf.py	2015-12-17 08:24:48 -08:00
James R. Barlow	53a7c0e668	Refactor qpdf subprocess calls into module	2015-12-17 08:19:53 -08:00
James R. Barlow	4ca243e490	Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars	2015-12-17 07:27:26 -08:00
jbarlow83	9f37446155	Merge pull request #34 from shemgp/master Don't exit when qpdf repairs the file successfully but displays warning	2015-12-16 20:46:47 -08:00
Shem Pasamba	d7c7559b05	Use boolean instead of integers	2015-12-17 11:23:27 +08:00
Shem Pasamba	b2b66d1344	Don't exit when qpdf repair was successful	2015-12-17 11:20:20 +08:00
James R. Barlow	5d111a3c04	Refactor tesseract --pdfrenderer calls to tesseract.py	2015-12-16 17:48:26 -08:00
James R. Barlow	10416f847f	Migrate tesseract-hocr code to tesseract module, because modularity	2015-12-16 17:36:11 -08:00
James R. Barlow	79b3472b26	All tests passed, bump version v3.1	2015-12-04 04:31:01 -08:00
James R. Barlow	f1b2f1ae08	Merge branch 'feature/pdfa-2' into develop	2015-12-04 04:04:08 -08:00
James R. Barlow	ee7d97ae8c	Trivial	2015-12-04 04:03:38 -08:00
James R. Barlow	7d9f473bb1	Remove eval() call by introspecting ExitCode	2015-12-04 03:34:53 -08:00
James R. Barlow	e77a5e5e75	We don't want threads. Really. Do. Not. Want.	2015-12-04 03:11:38 -08:00
James R. Barlow	6ab19af122	Comments	2015-12-04 03:09:39 -08:00
James R. Barlow	276fe49867	Better error messages for input file not found or invalid. Not as good as finding a general way to deal with ruffus exceptions, but better than nil.	2015-12-04 03:07:53 -08:00
James R. Barlow	acb31abe86	Fix issue #20 - fails on uppercase .PDF	2015-12-04 02:14:09 -08:00
James R. Barlow	4f964a3c8a	Introduce --pdf-renderer auto. Tess 3.03 has various quality problems like wrong DPI that are fixed in Tess 3.04. Idea here is to introduce an option to let OCRmyPDF select the rendering backend based on the options and system. However, we're not ready for tesseract as the main renderer. Setting pdf-renderer to tesseract does not pass all test cases, mainly the one where --tesseract-timeout is triggered, and some others.	2015-12-02 23:20:31 -08:00
James R. Barlow	df1fda7438	pageinfo: workaround PyPDF extractText limitations on hidden text. It appears that extractText() does not find all text. At a glance it may be that Tesseract's PDF renderer generates a font and uses glyphs that map to different Unicode code points than PyPDF expects, so it discards the content and finds nothing. As a proxy in lieu of better PDF parsing, assume that a "GlyphLessFont" means there is text there. I had previously found it does not work to check for the presence of a font on page. Some PDF generators create a font resource entry even if the font is never called for.	2015-12-02 23:16:36 -08:00
James R. Barlow	d6124c1787	pageinfo: improve robustness of text test for Tesseract produced PDFs	2015-12-02 03:12:52 -08:00
James R. Barlow	80d89b5420	Set /Creator metadata to OCRmyPDF with reference to Tess version and settings	2015-12-02 02:19:39 -08:00
James R. Barlow	74059eecf1	Choose PDF/A-2b by default instead of A-1b	2015-12-02 01:48:10 -08:00
James R. Barlow	78697341a2	pytest: don't run tests that happened to be part of pyvenv	2015-12-02 01:19:43 -08:00


2676 Commits
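
Commit b485a1ef78 above describes needing to tell an explicit -j1 apart from -j being omitted, so that omission can mean "use all CPUs". The following is a minimal sketch of that idea using only the standard library's argparse. It is not the project's actual code and does not touch ruffus; the names build_parser and resolve_jobs are hypothetical, chosen only for this illustration.

```python
import argparse
import os


def build_parser():
    # Hypothetical parser for illustration; not OCRmyPDF's real one.
    parser = argparse.ArgumentParser(prog="example")
    # default=None acts as a sentinel meaning "the user did not pass -j".
    parser.add_argument(
        "-j", "--jobs", type=int, default=None, metavar="N",
        help="number of worker processes (default: all CPUs)",
    )
    return parser


def resolve_jobs(argv):
    """Return the job count to use for the given argument list."""
    args = build_parser().parse_args(argv)
    if args.jobs is None:
        # -j omitted: fall back to every CPU (cpu_count() may return None).
        return os.cpu_count() or 1
    # -j given explicitly, including -j1: respect it as-is.
    return args.jobs
```

With this arrangement, resolve_jobs(["-j", "1"]) yields one job, while resolve_jobs([]) yields the machine's CPU count. A None default is used instead of a numeric one because any integer default would be indistinguishable from the same value typed by the user.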