How to extract page as image and reserve the original resolution? #1305

void285 · 2021-10-01T10:09:39Z

Hello, I want to extract the pages of some PDF files as images and run OCR on them in a Python script. The images should have the same width, height, and color depth as the PDF page, and their file size should be as small as possible.

First I tried pdfpatcher version 0.6.2.3691 from https://www.cnblogs.com/pdfpatcher/ (in Chinese). It also uses MuPDF as its PDF engine, and uses https://freeimage.sourceforge.io/ as its image engine. I extracted a page from a PDF file as a JPEG image with the setting [same as the original file]. The resulting image is 2120x3012x24 with a file size of 390KB, which fits my needs well. However, I don't want to extract images manually and have them occupy disk space; I want to get the images at script runtime and upload them to the OCR engine.

I tried the code below, but could not get images with the same resolution and a similar file size to what pdfpatcher produces. First I tried:

doc = fitz.open(pdffile)
page = doc.loadPage(pno)
pix = page.getPixmap()
imgfile = "C:\\out.jpg"
pix.save(imgfile)

The resulting image is 509x723x24 and 130KB, which is too small.
After adding matrix=fitz.Matrix(2, 2) to page.getPixmap(), the resulting image is 1018x1446x24 and 440KB.
After adding matrix=fitz.Matrix(4, 4) to page.getPixmap(), the resulting image is 2035x2891x24 and 1.4MB.

I also tried the code from https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python?lq=1; the resulting image is 2120x3012x8 with a file size of 940KB.

import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

I want to know how to achieve my goal with PyMuPDF. Do I need to use matrix=fitz.Matrix(4, 4) plus some extra image compression, or is there a simpler way that I missed? Thank you!

void285 · 2021-10-01T10:17:33Z

void285
Oct 1, 2021
Author

The PDF file I use is a scanned file. I don't know much about the PDF format, but after extracting all pages as images with pdfpatcher, the total size of the images is the same as the size of the PDF file, so I guess each page stores something like an original width, height, color depth, and file size.

JorjMcKie · 2021-10-01T12:39:42Z

JorjMcKie
Oct 1, 2021
Maintainer

First of all, let me move this issue to Discussions - which seems more adequate.

Zoom values need not be integers: floats are allowed, so you can definitely find a value that suits your needs.
If you look at page.get_images() (the list of images defined for the page), you should see one item (maybe two, depending on the scanner) representing the scanned page.
You will also see the image width and height there, plus the colorspace, which helps determine suitable values for pixmap creation.

If you see only one image in said list, you do not need to make an extra pixmap of the page for your OCR engine.
Instead, just extract that image and hand over its binary representation. E.g.

>>> from pprint import pprint
>>> pprint(page.get_images())
[(1005, 0, 1945, 1004, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode')]
>>> img = doc.extract_image(1005)
>>> type(img["image"]), len(img["image"])
(<class 'bytes'>, 127889)
>>> # hand over img["image"] for OCR-ing
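
The check described above could be wrapped roughly as follows. This is only a sketch: the helper names single_image_xref and page_scan_bytes are made up for illustration, and doc and page are assumed to be an open PyMuPDF document and one of its pages. It relies only on page.get_images() and doc.extract_image() as shown in the session above, plus the "ext" key of the returned dictionary.

```python
def single_image_xref(image_list):
    """Return the xref of the only image in the list, or None otherwise."""
    if len(image_list) != 1:
        return None
    return image_list[0][0]  # the first item of each entry is the xref


def page_scan_bytes(doc, page):
    """Return (image bytes, file extension) for a page with exactly one image.

    Hypothetical helper: returns None when the page has zero or several
    images, so the caller can fall back to rendering a pixmap instead.
    """
    xref = single_image_xref(page.get_images())
    if xref is None:
        return None
    img = doc.extract_image(xref)  # dict holding the stored image data
    return img["image"], img["ext"]
```

Because the bytes are taken from the PDF unchanged, no re-encoding happens, so the size should match what is stored in the file.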

If the situation is not that simple, make a page pixmap using (in this case) the zoom value zoom = 1945 / page.rect.width. This should be the same as 1004 / page.rect.height, or at least approximately so.
When you create the pixmap, you can probably (depending on your scanned material) choose one with gray values only, which should be sufficient for OCR-ing text: pix = page.get_pixmap(colorspace=fitz.csGRAY, matrix=fitz.Matrix(zoom, zoom)).
Hope this helps
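
For readers who need the pixmap route, the zoom calculation might look like the sketch below. The function names zoom_factors and render_page_like_image are hypothetical. The sketch assumes a PyMuPDF release that has the snake_case methods used in this answer and Pixmap.tobytes(), takes the target size from the first entry of page.get_images(), and ignores page rotation.

```python
def zoom_factors(page_width, page_height, img_width, img_height):
    """Return (zoom_x, zoom_y) that scale a page in points to image pixels."""
    return img_width / page_width, img_height / page_height


def render_page_like_image(pdf_path, pno):
    """Render page `pno` in gray at the pixel size of its first embedded image."""
    import fitz  # PyMuPDF, imported here so zoom_factors() works without it

    doc = fitz.open(pdf_path)
    page = doc[pno]
    images = page.get_images()
    if not images:
        raise ValueError("page has no embedded image to take a size from")
    img_width, img_height = images[0][2], images[0][3]  # width, height items
    zx, zy = zoom_factors(page.rect.width, page.rect.height,
                          img_width, img_height)
    pix = page.get_pixmap(matrix=fitz.Matrix(zx, zy), colorspace=fitz.csGRAY)
    return pix.tobytes("png")  # in-memory PNG bytes, nothing written to disk
```

With the numbers from the question, and assuming the page is about 509 x 723 points (inferred from the unzoomed pixmap size), zoom_factors(509, 723, 2120, 3012) gives a factor slightly above 4.16 in both directions, which would explain why Matrix(4, 4) came out a little short of 2120x3012.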

1 reply

void285 Oct 1, 2021
Author

It works, thank you!
For others: the related documentation is here: https://pymupdf.readthedocs.io/en/latest/faq.html#how-to-extract-images-pdf-documents
