OCR with PyMuPDF #1275
-
I noticed that when calling Is there a way to specify a document language and use the corresponding OCR config?
Replies: 3 comments 3 replies
-
If text can be extracted, then that text is already present in the file. It is not generated dynamically by some other mechanism.
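To tell pages that already carry usable text from pages that are scans, something like the following could work. This is only a sketch: `needs_ocr`, `pages_needing_ocr` and both thresholds are made-up names and values, and treating U+FFFD as the marker for unrecognized characters is an assumption about how the extracted text looks.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumed marker for characters that could not be mapped


def needs_ocr(text, min_chars=10, max_bad_ratio=0.2):
    """Guess whether a page's extracted text is too poor to use as-is.

    min_chars and max_bad_ratio are arbitrary illustrative thresholds.
    """
    stripped = "".join(text.split())  # ignore all whitespace
    if len(stripped) < min_chars:
        return True  # (almost) no text present, probably a scanned image
    bad = stripped.count(REPLACEMENT_CHAR)
    return bad / len(stripped) > max_bad_ratio


def pages_needing_ocr(path):
    """Return the numbers of the pages in a PDF that look like OCR candidates."""
    import fitz  # PyMuPDF, imported here so needs_ocr() is usable without it

    doc = fitz.open(path)
    return [page.number for page in doc if needs_ocr(page.get_text())]
```

Only the pages returned by `pages_needing_ocr` would then have to go through an OCR step; everything else can be extracted directly.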
-
While the base library MuPDF offers integrated use of Tesseract - thus providing a somewhat "dynamic" OCR feature - this is not (yet) supported by PyMuPDF.
-
I am about to publish a little script that does what I described. It looks like this:

"""
This is a basic script demonstrating the use of OCRmyPDF together with PyMuPDF.
It reads a PDF's pages and passes them to ocrmypdf one by one. One could at this
point insert some checks as to whether the page is actually an image, contains
no text, or contains text with many unrecognized characters or the like.
Each page is converted to a 1-page temporary PDF, which is then
- passed to ocrmypdf for OCR-ing
- text-extracted from the 1-page output PDF of the previous step
- returned as the extracted text
Instead of extracting the simple plain text format, one could also use any other
text extraction format like "dict" to get text position information.
"""
import io
import sys

import fitz
import ocrmypdf


def ocr_the_page(page):
    """Extract the text from the passed-in PDF page."""
    src = page.parent  # the page's document
    doc = fitz.open()  # make temporary 1-pager
    doc.insert_pdf(src, from_page=page.number, to_page=page.number)
    pdfbytes = doc.tobytes()
    inbytes = io.BytesIO(pdfbytes)  # transform to BytesIO object
    outbytes = io.BytesIO()  # let ocrmypdf store its result PDF here
    ocrmypdf.ocr(
        inbytes,  # input 1-pager
        outbytes,  # output 1-pager
        language="eng",  # modify as required
        output_type="pdf",  # only need simple PDF format
        # add more parameters, e.g. to enforce OCR-ing, etc.
    )
    ocr_pdf = fitz.open("pdf", outbytes.getvalue())  # read output as fitz PDF
    text = ocr_pdf[0].get_text()  # ...and extract text from the page
    return text  # return it


if __name__ == "__main__":
    doc = fitz.open(sys.argv[1])
    for page in doc:
        text = ocr_the_page(page)
        print("Text from page %i:" % page.number)
        print(text)

It already works under Windows. I am now testing it on Linux.
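The script's docstring mentions the "dict" format for getting text positions. A rough sketch of how that output could be flattened is below; `spans_with_bbox` is a hypothetical helper, and it assumes the usual blocks / lines / spans nesting of that format, with `type` 0 marking text blocks.

```python
def spans_with_bbox(page_dict):
    """Flatten a get_text("dict") result into (text, bbox) tuples."""
    result = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip anything that is not a text block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if text:  # ignore whitespace-only spans
                    result.append((text, tuple(span["bbox"])))
    return result
```

In the script above, this would presumably be called as `spans_with_bbox(ocr_pdf[0].get_text("dict"))` in place of the plain `get_text()` call.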
While the base library MuPDF offers integrated use of Tesseract - thus providing a somewhat "dynamic" OCR feature - this is not (yet) supported by PyMuPDF.
But you can install OCRmyPDF, import it in your Python script and invoke it page-by-page using PyMuPDF - resulting in a similar behaviour.
The basic approach would be to make a 1-page PDF, pass that to ocrmypdf, receive back that temp PDF with its new text layer and then extract the text.
While this does work in principle, I don't yet have a ready-to-go code snippet ...
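On the original question about the document language: with this approach the language is simply whatever gets passed to ocrmypdf. A possible sketch follows, where `tesseract_languages` and `ocr_pdf_bytes` are made-up helper names and the fallback to "eng" is an assumption. The codes are Tesseract's (e.g. "eng", "deu"), and the matching language data has to be installed.

```python
import io


def tesseract_languages(spec):
    """Turn 'eng+deu' or 'eng, deu' into ['eng', 'deu'], dropping duplicates."""
    langs = []
    for part in spec.replace(",", "+").split("+"):
        code = part.strip()
        if code and code not in langs:
            langs.append(code)
    return langs or ["eng"]  # assumption: fall back to English


def ocr_pdf_bytes(pdfbytes, languages="eng"):
    """OCR an in-memory PDF with the given languages, return the result as bytes."""
    import ocrmypdf  # imported lazily; needs Tesseract and its language data

    outbytes = io.BytesIO()
    ocrmypdf.ocr(
        io.BytesIO(pdfbytes),  # input PDF
        outbytes,  # output PDF with the new text layer
        language=tesseract_languages(languages),
        output_type="pdf",
    )
    return outbytes.getvalue()
```

For a German document with some English in it, `ocr_pdf_bytes(pdfbytes, "deu+eng")` would be the call to try.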