Extending PyMuPDF with OCRmyPDF #963

JorjMcKie · 2021-03-22T11:02:29Z

JorjMcKie
Mar 22, 2021
Maintainer

As mentioned earlier, MuPDF v1.18.0 contains integrated support for optionally using the Tesseract OCR engine in text extraction.
In PyMuPDF v1.18.*, access to this feature is not yet implemented, but it will be seriously considered for the next version, 1.19.0.

If you would like to combine PyMuPDF's text extraction capabilities with OCR features today, have a look at this folder's example scripts!

The examples use page.get_text("dict") and check whether line or span text contains characters that MuPDF could not recognize, that is, character code chr(65533). In any such case, an OCR engine (Tesseract OCR or the Python package easyocr, respectively) is used to try to recognize the text.
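The check described above can be sketched roughly as follows. This is not the code from the example scripts, only a minimal illustration of the idea. The function name `spans_needing_ocr` and the `sample` dictionary are hypothetical; the sample is a hand-made stand-in for the output of page.get_text("dict"), so that the sketch runs without PyMuPDF installed.

```python
REPLACEMENT_CHAR = chr(65533)  # U+FFFD, emitted for characters MuPDF cannot map


def spans_needing_ocr(page_dict):
    """Return (bbox, text) for every span whose text contains U+FFFD.

    `page_dict` is expected to have the layout of page.get_text("dict"):
    blocks -> lines -> spans, each span carrying "text" and "bbox".
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if REPLACEMENT_CHAR in span.get("text", ""):
                    hits.append((tuple(span["bbox"]), span["text"]))
    return hits


# Hypothetical, hand-made sample mimicking the "dict" layout.
sample = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {
            "type": 0,  # text block
            "bbox": (10.0, 10.0, 200.0, 52.0),
            "lines": [
                {"spans": [{"text": "readable text",
                            "bbox": (10.0, 10.0, 90.0, 22.0)}]},
                {"spans": [{"text": "unre\ufffdadable",
                            "bbox": (10.0, 40.0, 80.0, 52.0)}]},
            ],
        },
        {"type": 1, "bbox": (10.0, 60.0, 200.0, 300.0)},  # image block
    ],
}

for bbox, text in spans_needing_ocr(sample):
    print(bbox, repr(text))
```

With PyMuPDF installed, the same function could be given page.get_text("dict") directly. Each returned bbox would then be a candidate region to render (for example with page.get_pixmap(clip=...)) and hand to the OCR engine of your choice; that last step is left out here because it depends on the engine.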

JorjMcKie · 2021-09-22T12:57:34Z

JorjMcKie
Sep 22, 2021
Maintainer Author

Today I added a sample script that can be used to OCR a PDF dynamically using the Python package version of OCRmyPDF.
It demonstrates how to (1) OCR and then (2) text-extract a single PDF page - dynamically: without creating intermediate files.
This script could be used as a template for more sophisticated approaches as explained here.


JorjMcKie · 2021-10-10T19:52:14Z

JorjMcKie
Oct 10, 2021
Maintainer Author

In the coming version 1.19.0, OCR will be brought to an entirely new level:

MuPDF v1.19.0 contains integrated OCR support using Tesseract. This will include:

Converting images and pixmaps to PDF pages with a hidden OCR text layer. From these pages, text can be extracted as usual, or several such pages can be joined into a multi-page PDF.
Document pages can be OCR-ed on the fly, using special versions of the page.get_text() methods. The resulting text will be a mixture of "normal" and OCR-ed text: clever algorithms inside MuPDF determine the page regions that actually must be OCR-ed.

