MuPDF Integrated OCR Support #761
JorjMcKie
started this conversation in
Announcements
Replies: 2 comments
-
Good idea: call an independently installed Tesseract from PyMuPDF, with default settings chosen to suit PyMuPDF best.
-
Have you considered using Windows Media OCR? Just a thought.
-
MuPDF v1.18.0 contains support for integrated OCR via Tesseract.
In principle, this is done by internally making a page image and then letting Tesseract analyze it. The recognized text is then handed on to the legacy MuPDF textpage functions.
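As a rough illustration of that flow (render a page image, let Tesseract analyze it, pass the text on), here is a minimal Python sketch. It is not how MuPDF does this internally and it is not a PyMuPDF API: `ocr_page`, `render_page` and `tesseract_cli` are hypothetical names, and the sketch assumes the `tesseract` command-line tool is installed and on the PATH.

```python
import subprocess
from typing import Callable


def tesseract_cli(image_path: str, language: str = "eng") -> str:
    """Run the Tesseract command-line tool on an image and return its text.

    Assumes a 'tesseract' executable is on the PATH. Passing 'stdout' as
    the output base makes Tesseract print the recognized text.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", language],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_page(
    render_page: Callable[[], str],
    recognize: Callable[[str], str] = tesseract_cli,
) -> str:
    """Two-step OCR flow: make a page image, then let an OCR engine read it.

    render_page is a hypothetical callable that writes the page image to
    disk and returns its file path; recognize turns that image into text.
    """
    image_path = render_page()    # step 1: make the page image
    return recognize(image_path)  # step 2: analyze it and hand the text on
```

Passing the OCR step in as a callable keeps the flow testable on machines without Tesseract, for example `ocr_page(lambda: "page.png", lambda path: "dummy text")`.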
I have not yet provided support for this in PyMuPDF. The reasons are similar to those for the missing PDF signing feature:
My current assessment is to aim for a PyMuPDF that also supports OCR, if it determines that Tesseract is installed. So the PyMuPDF user would have the option to independently install and train Tesseract if / when required.
PyMuPDF would have to appropriately react if OCR functions are used and Tesseract cannot be found.
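A minimal sketch of what such a reaction could look like, assuming Tesseract's presence can be detected by looking for its executable on the PATH. That detection method is an assumption (an integrated build might need to check for language data files instead), and `OCRUnavailableError`, `find_tesseract` and `require_tesseract` are hypothetical names, not part of PyMuPDF.

```python
import shutil
from typing import Optional


class OCRUnavailableError(RuntimeError):
    """Raised when an OCR function is called but Tesseract cannot be found."""


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if absent."""
    return shutil.which(executable)


def require_tesseract(executable: str = "tesseract") -> str:
    """Return the Tesseract path, or fail with a clear message.

    A hypothetical guard that each OCR function would call first, so that
    non-OCR features keep working on systems without Tesseract.
    """
    path = find_tesseract(executable)
    if path is None:
        raise OCRUnavailableError(
            f"OCR requested, but no '{executable}' executable was found. "
            "Install Tesseract to use OCR functions."
        )
    return path
```

With a guard like this, only the OCR functions fail when Tesseract is missing, and they fail with an explicit message rather than an obscure low-level error.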
Your comments are welcome!