-
Hello, I want to extract pages of some pdf files as images and do OCR work with them in python script, I want to get image files of same width / height/ color depth with the pdf page, and images filesize as small as possible. First I tried pdfpatcher version 0.6.2.3691 from https://www.cnblogs.com/pdfpatcher/ (in Chinese), it also use mupdf as pdf engine, and use https://freeimage.sourceforge.io/ as image engine, I extract a page as jpeg image from a pdf file with settings of [same as the original file], the result file size is 2120x3012x24, and filesize is 390KB, it fits my need well, but I don't want to extract images manually and let images occupy my disk, I want to get images in script runtime and upload them to OCR engine. I tried code below, but can not get image files of same resolution and similar filesize as pdfpatcher. First I tried:
and the result image file is 509x723x24, 130KB, which is too small. I also tried code from https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python?lq=1, the result image is 2120x3012x8, and filesize is 940KB.
I want to know how to achieve my goal with PyMuPDF, do I need to use |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 1 reply
-
The pdf file I use is a scanned file, I don't know much about pdf format, but after extract all pages as images with pdfpatcher, filesize sum of all images is the same as the pdf file, so I guess there is something like page original width/height/color-depth and filesize. |
Beta Was this translation helpful? Give feedback.
-
First of all, let me move this issue to Discussions - which seems more adequate. You need not take integers as zoom values: floats are allowed. So you definitely can find a value that suits your need. If you see only one image in said list, you do not need to make an extra pixmap of the page for your OCR engine. >>> from pprint import pprint
>>> pprint(page.get_images())
[(1005, 0, 1945, 1004, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode')]
>>> img = doc.extract_image(1005)
>>> type(img["image"]), len(img["image"])
(<class 'bytes'>, 127889)
>>> # hand over img["image"] for OCR-ing If the situation is not that simple, make a page pixmap using the zoom value (in this case) |
Beta Was this translation helpful? Give feedback.
First of all, let me move this issue to Discussions - which seems more adequate.
You need not take integers as zoom values: floats are allowed. So you definitely can find a value that suits your need.
If you look at
page.get_images()
(list of images defined for the page), you should see 1 (maybe 2, depends on the scanner) item representing the scanned page.You will also see image width and height there. Plus colorspace, which helps determine the adequate value for pixmap creation.
If you see only one image in said list, you do not need to make an extra pixmap of the page for your OCR engine.
Instead, just extract that image and hand over its binary representation. E.g.