How to match PyMuPdf output to pdf2image output? #913

arun-rangarajan · 2021-02-19T21:53:44Z

Looking for help on this question, which I also asked on Stack Overflow.

I have a PDF-to-PNG converter built with pdf2image, and it is quite slow. For example, a 7-page PDF document takes 10 seconds to be split into PNG images, even with thread_count set to 4 on a 4-core machine (Standard B4ms Azure VM).

I tried PyMuPDF and it ran much faster (only 800 ms) with default scaling:
mat = fitz.Matrix(1, 1)

Then I realized that the PNGs produced with default scaling are much smaller than those from pdf2image, so I increased the scaling to match. I had to use a scale factor of 2.7777 for the pixel sizes to match those of pdf2image:
mat = fitz.Matrix(2.7777, 2.7777)

This took 3 seconds to run, which is still much faster than pdf2image.

The images output by PyMuPDF looked nearly identical to those of pdf2image to my eyes, but they actually differ. Our downstream processing (an object detection model), which uses these PNGs, also produces different results.

Looking at the pdf2image docs, we have just been using the default dpi of 200. How does one translate this setting to PyMuPDF to get exactly the same output? I tried setResolution of 200, but that didn't help.

JorjMcKie · 2021-02-19T23:06:06Z

Maintainer

"I tried setResolution of 200, but that didn't help."

But the resolution was correctly set to that value, wasn't it? So what was the difference?
Anyway, PyMuPDF also supports using Pillow for pixmap output. Try this: pix.pillowWrite("%02i.png" % page.number, dpi=(200, 200)).
The parameters of pillowWrite() are passed through to Pillow's Image.save() method unchanged. This should let you make the output match as closely as desired.

6 replies

JorjMcKie Feb 20, 2021
Maintainer

The zoom parameter controls how many pixels are created for each point of the page.
For a letter page with 612 x 792 points (which, at 72 points per inch, is 8.5 x 11.0 inches), the resulting pixmap has the same dimensions in pixels (612 x 792) if you use zoom = 1.
By assigning a dpi value, you control the printed page size. MuPDF's default of 96 dpi therefore leads to a print page size of 6.38 x 8.25 inches.
So, to get a letter-size print page at 96 dpi, you must use zoom = 96 / 72 = 1.3333, and similarly, for a letter-size print at 200 x 200 dpi, use zoom = 200 / 72 = 2.7777.
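
The arithmetic in this reply can be checked with a few lines of plain Python. This is only a sketch: the helper names (zoom_for_dpi, pixel_size, print_size_inches) are made up for illustration and are not part of PyMuPDF, and a real renderer may round fractional pixel sizes differently than round() does here.

```python
POINTS_PER_INCH = 72  # PDF page coordinates are measured in points


def zoom_for_dpi(dpi):
    """Zoom factor that yields `dpi` pixels per inch of page."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at `dpi`."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


def print_size_inches(width_px, height_px, dpi):
    """Physical size implied by a pixel count and a dpi tag."""
    return width_px / dpi, height_px / dpi


# US letter page: 612 x 792 points = 8.5 x 11 inches
print(zoom_for_dpi(200))                # about 2.7778
print(pixel_size(612, 792, 200))        # (1700, 2200)
print(print_size_inches(612, 792, 96))  # (6.375, 8.25): zoom = 1 tagged as 96 dpi
print(print_size_inches(*pixel_size(612, 792, 200), 200))  # (8.5, 11.0)
```

The last two lines show the two separate roles: zoom decides how many pixels exist, while the dpi tag only decides how large those pixels are meant to print.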

JorjMcKie Feb 20, 2021
Maintainer

But independently of zoom, also set dpi to determine the print page size.

arun-rangarajan Feb 22, 2021
Author

Thx again, @JorjMcKie.

I used a zoom of (200 / 72) = 2.7777 and set the dpi like this:
pix.pillowWrite(f"page-{i}.png", dpi=(200, 200))

Now the output PNGs are 1700 x 2200 px.

However, the pdf2image files are 10% to 20% larger than PyMuPDF's. Not sure if it's possible to make the outputs identical.
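
The setup described in this comment can be sketched as one function. Some assumptions: the function name render_pdf_to_pngs and the out_dir argument are hypothetical, and the code uses the snake_case method names of more recent PyMuPDF releases (get_pixmap, and pil_save in place of the pillowWrite spelling used in this thread). On an older install the camelCase names from the thread would be needed instead. Pillow must be installed for the save call to work.

```python
from pathlib import Path


def render_pdf_to_pngs(pdf_path, out_dir, dpi=200):
    """Render every page of a PDF to a PNG tagged with `dpi`."""
    # Imported here so the module still loads where PyMuPDF is missing.
    import fitz  # PyMuPDF

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    zoom = dpi / 72  # PDF pages are measured in points, 72 per inch
    mat = fitz.Matrix(zoom, zoom)

    written = []
    doc = fitz.open(str(pdf_path))
    try:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(matrix=mat, alpha=False)
            target = out_dir / f"page-{i}.png"
            # Keyword arguments are handed on to Pillow's Image.save().
            pix.pil_save(str(target), dpi=(dpi, dpi))
            written.append(target)
    finally:
        doc.close()
    return written
```

Calling render_pdf_to_pngs("input.pdf", "out") would then write out/page-0.png, out/page-1.png and so on. Matching pixel dimensions and dpi tags this way does not guarantee byte-identical or pixel-identical files, since the two libraries use different rendering engines.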

JorjMcKie Feb 22, 2021
Maintainer

I looked up pdf2image; they seem to use Pillow as well, apparently with a different PNG compression parameter.
Maybe consult the Pillow documentation; there may be a way to set the PNG compression level. Internally, PNG uses zlib (deflate) routines to compress image data, and these offer the usual levels 0 through 9, with a default of 6 (supposedly a good balance between speed and file size).
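
To illustrate why a difference in file size alone does not show that the pixels differ, here is a small sketch using Python's standard zlib module. The input is synthetic data standing in for raw image rows; it is not output from either library.

```python
import zlib

# Synthetic stand-in for raw pixel rows: a repeating grayscale gradient.
raw = bytes(range(256)) * 2000

sizes = {}
for level in (0, 1, 6, 9):
    packed = zlib.compress(raw, level)
    # Whatever the level, decompression restores the exact input bytes.
    assert zlib.decompress(packed) == raw
    sizes[level] = len(packed)

for level, size in sizes.items():
    print(f"level {level}: {size} bytes")
```

In Pillow, the PNG writer exposes this setting as the compress_level argument of Image.save(), so it can presumably be passed through pillowWrite() in the same way as dpi. Comparing the decoded pixel data of the two PNGs, rather than their file sizes, is the more reliable way to check whether the renderings really differ.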

arun-rangarajan Feb 22, 2021
Author

Thx @JorjMcKie. Will investigate this further.
