Extending PyMuPDF with OCRmyPDF #963
JorjMcKie started this conversation in Announcements
Replies: 2 comments
-
Today I added a sample script that can be used to OCR a PDF dynamically using the Python package version of OCRmyPDF.
-
In the coming version 1.19.0, OCR will be brought to an entirely new level: MuPDF v1.19.0 contains integrated OCR support using Tesseract. This will include,
-
As mentioned earlier, MuPDF v1.18.0 contains integrated support for optionally using the Tesseract OCR engine in text extractions.
In PyMuPDF v1.18.*, access to this feature is not yet implemented, but it will be seriously considered for the next version, 1.19.0.
If you would like to combine PyMuPDF's text extraction capabilities with OCR features today, have a look at this folder's example scripts!
The examples use page.get_text("dict") and check whether any line or span text contains characters unrecognized by MuPDF, that is, the character chr(65533) (U+FFFD, the Unicode replacement character). In any such case, an OCR engine (Tesseract OCR or the Python package easyocr, respectively) is used to try to recognize the text.
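The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the code of the example scripts: the helper name spans_needing_ocr and the hand-built sample dictionary are assumptions made for this sketch. The sample only mimics the nesting of blocks, lines and spans returned by page.get_text("dict").

```python
REPLACEMENT_CHAR = chr(65533)  # U+FFFD, stands in for an unrecognized character


def spans_needing_ocr(page_dict):
    """Return (bbox, text) for every text span that contains chr(65533).

    Hypothetical helper for illustration; page_dict is expected to have the
    layout of page.get_text("dict"): blocks -> lines -> spans.
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if REPLACEMENT_CHAR in span.get("text", ""):
                    hits.append((tuple(span["bbox"]), span["text"]))
    return hits


# Hand-built stand-in for page.get_text("dict"), for illustration only.
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Readable text", "bbox": (72, 72, 150, 84)},
            {"text": "Gar\ufffdled", "bbox": (72, 90, 130, 102)},
        ]}]},
        {"type": 1, "bbox": (0, 0, 10, 10)},  # image block, no "lines"
    ]
}

for bbox, text in spans_needing_ocr(sample):
    print(bbox, repr(text))
```

With a real document, the dictionary would come from page.get_text("dict"), and each returned rectangle could then be rendered to an image and handed to the OCR engine of your choice. That second step is deliberately left out of this sketch.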