MuPDF Integrated OCR Support #761

JorjMcKie · 2020-12-10T12:46:13Z

Maintainer

MuPDF v1.18.0 contains support for integrated OCR via Tesseract.
In principle, this is done by internally making a page image and then letting Tesseract analyze it. The recognized text is then handed on to the legacy MuPDF textpage functions.
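To make the flow concrete, here is a rough sketch of the same idea done from the outside: take an already rendered page image and hand it to the `tesseract` command-line program. This is not how MuPDF does it internally (MuPDF links Tesseract as a library); the helper names `build_tesseract_command` and `ocr_image`, the file name and the 300 dpi default are illustrative assumptions only.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, language: str = "eng", dpi: int = 300) -> List[str]:
    """Build the argument list for the tesseract CLI (hypothetical helper).

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", language, "--dpi", str(dpi)]


def ocr_image(image_path: str, language: str = "eng", dpi: int = 300) -> str:
    """Run tesseract on a rendered page image and return the recognized text.

    Assumes a separately installed tesseract executable on PATH;
    raises FileNotFoundError if it is missing and CalledProcessError
    if tesseract exits with an error.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, language, dpi),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With a page image saved as, say, `page.png`, `ocr_image("page.png")` would return the recognized text, which is roughly what MuPDF then feeds into its textpage functions.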

I have not yet provided support for this in PyMuPDF. The reasons are similar to those for the missing PDF signing feature:

- Tesseract and adequate recognition training data must be installed separately.
- MuPDF supports either an independent local Tesseract installation or an integrated Tesseract.
- For preparing wheels, this presents a significant challenge, because a decision between these alternatives must be made: wheels with or without Tesseract.
  - PyMuPDF installations that use an independent Tesseract must be told where to look for it. It is not yet clear how this would work.
  - Integrated Tesseract installations will be large and probably not well trained, or at least the recognition training approach is unclear.

My current plan is to aim for a PyMuPDF that also supports OCR whenever it detects that Tesseract is installed. The PyMuPDF user would then have the option to install and train Tesseract independently if and when required.
PyMuPDF would have to react appropriately if OCR functions are used and Tesseract cannot be found.
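A minimal sketch of what that detection and reaction could look like, purely as an illustration: it assumes the location of the training data is communicated via Tesseract's `TESSDATA_PREFIX` environment variable, which is only one possible mechanism. The names `OCRUnavailableError`, `find_tessdata` and `require_ocr` are hypothetical and not part of PyMuPDF.

```python
import os
from typing import Mapping, Optional


class OCRUnavailableError(RuntimeError):
    """Hypothetical error for OCR calls made without a usable Tesseract setup."""


def find_tessdata(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the folder named by TESSDATA_PREFIX, or None if unset or not a directory.

    Assumption: the user points to the training data via this variable.
    """
    env = os.environ if env is None else env
    path = env.get("TESSDATA_PREFIX")
    if path and os.path.isdir(path):
        return path
    return None


def require_ocr(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the tessdata folder, or fail with a clear message if none is found."""
    tessdata = find_tessdata(env)
    if tessdata is None:
        raise OCRUnavailableError(
            "OCR requested, but no Tesseract training data was found. "
            "Install Tesseract and set TESSDATA_PREFIX to its tessdata folder."
        )
    return tessdata
```

An OCR function would call `require_ocr()` first, so users without Tesseract get an explicit error instead of a crash, while all non-OCR features keep working.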

Your comments are welcome!

bserg66 · 2020-12-17T12:31:57Z

Good idea: call an independent Tesseract from PyMuPDF, with default settings best suited to PyMuPDF.
I made an OCR file with OCRmyPDF (Tesseract) and got a huge amount of garbage after PyMuPDF compression (see our discussions, "Most terrible example").

Joe5522 · 2022-06-29T18:29:28Z

Have you considered using Windows Media OCR? Just a thought.
