Get image that is not returned in getImageList #1202

timurlenk07 · 2021-08-11T06:43:14Z

I have a PDF of an electronically generated invoice (which I unfortunately cannot share) that I want to OCR by extracting all text and running OCR on the images. However, the PDF contains images that do not show up in getImageList, so I cannot run OCR on them. Does anyone know how an image could be embedded so that it shows up neither in PyMuPDF nor in Adobe Reader (in both cases, rendering the page shows the company logo, but I cannot find or select it in the GUI), and how I could still find it and extract its contents?

I know this is hard to answer without an actual sample. I will try to find something that I can share here, but in the meantime I would gladly accept any ideas about what the issue could be.

timurlenk07 · 2021-08-11T07:22:13Z

Author

Well, I might have found part of the solution after a day of struggle. What I thought were images are actually drawings, so this recipe can extract them as a new PDF. I have also seen other questions about extracting them as SVG images, but those are still unanswered.

Since I just want to "convert" the PDF to a textual form with bounding box information retained, is there a simple (or simpler) and foolproof way to extract all the drawings into an image file?

I have also seen strange artifacts with the recipe I linked above: drawn elements that were not visible in the original PDF. What could cause this, and how can I get rid of them? I have seen that some elements are only visible when printing, but in this case a printed version looks exactly the same as the original when viewed in Adobe Reader.

JorjMcKie · 2021-08-11T09:01:24Z

Maintainer

thus I cannot run OCR on it

... this is an unjustified conclusion and may be true only with some OCR tools.
At least with ocrmypdf you would nevertheless be able to OCR it.

9 replies

JorjMcKie Aug 11, 2021
Maintainer

"OCR" operation in-memory

ocrmypdf does support that too.

Another option is always to render your difficult page to a pixmap (at high resolution) and then feed that to your own OCR engine. This can all be done in memory, too.
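For illustration, here is a minimal sketch of the pixmap route, entirely in memory. It assumes a recent PyMuPDF with snake_case names (`get_pixmap`, `tobytes`); older releases used camelCase equivalents such as `getPixmap`. The helper names `zoom_for_dpi` and `render_page_png` are made up for this example, and the OCR step itself is left to whatever engine you use.

```python
def zoom_for_dpi(dpi: int) -> float:
    """Return the scale factor for a target DPI.

    PDF user space is 72 units per inch, so 300 dpi means a zoom of 300 / 72.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / 72.0


def render_page_png(pdf_bytes: bytes, page_number: int = 0, dpi: int = 300) -> bytes:
    """Render one page of an in-memory PDF to PNG bytes (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        page = doc[page_number]
        # The rendered pixmap contains text, images and drawings alike,
        # exactly as a viewer would show the page.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        return pix.tobytes("png")
    finally:
        doc.close()
```

The returned PNG bytes can be handed straight to an OCR engine that accepts in-memory images, so nothing needs to touch the disk. Because the whole page is rasterized, it does not matter whether the logo is an image or a set of drawing commands.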

timurlenk07 Aug 11, 2021
Author

Another thing is that at the moment I can also speed up OCR using batch processing, which I am not sure is possible with ocrmypdf. So I would really prefer a simple solution to extract drawings as they appear in the rendered PDF.

timurlenk07 Aug 11, 2021
Author

Well, I might delve a bit deeper into ocrmypdf then, before adding more stupid comments 😄 thanks for your help

JorjMcKie Aug 11, 2021
Maintainer

So I would really prefer a simple solution to extract drawings as they appear in the rendered PDF.

There are severe problems with this request: it would mean detecting which object on the page covers which others (possibly only partly), also taking into account that the "covering" object may be semi-transparent. These are endless problems that you could only resolve by reinventing the better part of a PDF viewer.
PyMuPDF, at least, only extracts text, images, and drawings without caring about which covers what. You can of course look at the accompanying rectangles, e.g. in the list of drawing paths. Because their sequence is the same as that of the respective drawing commands on the page, you can deduce some information from that.
The real difficulties arise if a (non-transparent) drawing occurs after, say, some text that it covers: this cannot be detected right now.
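As a rough sketch of what "looking at the rectangles" could mean: the snippet below takes the bounding rectangles of the drawing paths in page order and reports which later path overlaps which earlier one. It assumes a recent PyMuPDF where `page.get_drawings()` returns a list of dicts with a `"rect"` entry; the function names are invented for this example. It only compares bounding boxes, so it says nothing about opacity, color, or text being covered.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rects_intersect(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share some area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def later_overlaps(rects: Sequence[Rect]) -> List[Tuple[int, int]]:
    """Return (earlier, later) index pairs whose rectangles overlap.

    Since the list order follows the drawing commands on the page,
    the 'later' path is painted on top of the 'earlier' one.
    """
    pairs = []
    for later in range(len(rects)):
        for earlier in range(later):
            if rects_intersect(rects[earlier], rects[later]):
                pairs.append((earlier, later))
    return pairs


def drawing_rects(page) -> List[Rect]:
    """Collect the bounding rectangles of a PyMuPDF page's drawing paths."""
    return [tuple(path["rect"]) for path in page.get_drawings()]
```

With a page object at hand, `later_overlaps(drawing_rects(page))` gives candidate "covering" pairs among the drawings. Whether the upper path actually hides the lower one still depends on fill, transparency, and color, which this sketch ignores.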

JorjMcKie Aug 11, 2021
Maintainer

"... the "covering" object may be semi-transparent ..."

Forgot to add: "... and have different colors" (!!!)
