OCR with PyMuPDF #1275

Sep 22, 2021 · 3 comments · 3 replies
The base library MuPDF offers integrated use of Tesseract, which gives it a somewhat "dynamic" OCR feature, but this is not (yet) supported by PyMuPDF.
You can, however, install OCRmyPDF, import it in your Python script and invoke it page by page using PyMuPDF, which results in similar behaviour.
The basic approach would be to make a 1-page PDF, pass it to ocrmypdf, receive that temporary PDF back with its new text layer, and then extract the text from it.
While this does work in principle, I don't yet have a ready-to-go code snippet ...
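
The approach described above could be sketched roughly as follows. This is only a minimal illustration, not a tested recipe from the PyMuPDF maintainers: the function name `ocr_page_text`, its parameters and the temporary file names are made up for this example. It assumes that PyMuPDF (imported as `fitz`), OCRmyPDF and Tesseract are installed, and it goes through temporary files on disk rather than in-memory buffers.

```python
import tempfile
from pathlib import Path


def ocr_page_text(pdf_path, page_number, **ocr_options):
    """Return the OCR text of one page (0-based) of the PDF at pdf_path.

    Hypothetical helper for illustration. Extra keyword arguments are
    forwarded unchanged to ocrmypdf.ocr().
    """
    # Third-party imports are deferred so that this module can still be
    # imported on machines where the OCR tooling is not installed.
    import fitz  # PyMuPDF
    import ocrmypdf

    with tempfile.TemporaryDirectory() as tmp:
        single = Path(tmp) / "page.pdf"      # 1-page input for OCRmyPDF
        ocred = Path(tmp) / "page_ocr.pdf"   # output with the text layer

        # 1. Copy the requested page into a new, empty 1-page PDF.
        src = fitz.open(pdf_path)
        one_page = fitz.open()
        one_page.insert_pdf(src, from_page=page_number, to_page=page_number)
        one_page.save(str(single))
        one_page.close()
        src.close()

        # 2. Let OCRmyPDF add a text layer to that temporary PDF.
        #    Note: OCRmyPDF refuses pages that already contain text unless
        #    told otherwise (for example via force_ocr=True).
        ocrmypdf.ocr(str(single), str(ocred), **ocr_options)

        # 3. Re-open the result with PyMuPDF and extract the text.
        result = fitz.open(str(ocred))
        text = result[0].get_text()
        result.close()

    return text
```

It could then be called once per page, for example `ocr_page_text("scan.pdf", 0)` for the first page of a hypothetical file `scan.pdf`. Running OCRmyPDF once per page is slow, so this only makes sense when a few selected pages need OCR.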
