Get image that is not returned in getImageList #1202
-
Well, I might have found part of the solution after a day of struggle. What I thought were images were actually drawings, so this recipe can extract them as a new PDF. I have seen other questions that try to extract them as SVG images, but those are still unanswered. Since I just want to "convert" the PDF to a textual form with bounding box information retained, is there a simple (or simpler) and foolproof way to extract all the drawings into an image file? I have also seen strange artifacts with the recipe I linked above: drawn elements that were not visible in the original PDF. What could cause this, and how can I get rid of them? I know that some elements are only visible when printing, but in this case a printed version looks exactly the same as the original when viewed in Adobe Reader.
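For what it's worth, one simple way to get the drawings into an image file is to rasterize only the region they occupy instead of rebuilding them path by path. Below is a minimal sketch, assuming a recent PyMuPDF where `page.get_drawings()` and `page.get_pixmap()` exist under these snake_case names (older releases used camelCase). `union_bbox` and `render_drawings` are hypothetical helper names, not part of the library. Note that rendering a clip captures everything painted in that region, including any text or images, not only the drawings.

```python
def union_bbox(rects):
    """Return the smallest (x0, y0, x1, y1) box that covers every rectangle.

    Each item may be a 4-tuple or a PyMuPDF Rect (which behaves like a sequence).
    Returns None when there are no rectangles.
    """
    boxes = [tuple(r) for r in rects]
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def render_drawings(pdf_path, page_number, out_path, zoom=2.0):
    """Render the area covered by a page's vector drawings to an image file.

    Hypothetical helper: returns the clip box used, or None if nothing was saved.
    """
    import fitz  # PyMuPDF, imported lazily so union_bbox works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        # Each drawing is a dict; its "rect" entry is the path's bounding box.
        bbox = union_bbox(d["rect"] for d in page.get_drawings())
        if bbox is None or bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
            return None  # no drawings, or a zero-area region
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(*bbox))
        pix.save(out_path)  # output format follows the file extension, e.g. .png
        return bbox
    finally:
        doc.close()
```

Something like `render_drawings("invoice.pdf", 0, "drawings.png")` would then write one PNG for the first page (the file names are placeholders). The returned box is in page coordinates, so it can be kept alongside the extracted text as bounding box information.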
-
... this is an unjustified conclusion and may be true only with some OCR tools.
-
I have a PDF document of an electronically generated invoice (which I am unfortunately unable to share) that I want to OCR by extracting all text and running OCR on the images. Unfortunately, the PDF has images that are not returned by `getImageList`, so I cannot run OCR on them. Does anyone have an idea how an image could be embedded without showing up in either PyMuPDF or Adobe Reader (in both cases, rendering the page shows the company logo, but I cannot find or select it in the GUI), and how I could still find it and extract its contents? I know this is hard to answer without an actual sample. I will try to find something that I can share here, but in the meantime I would gladly accept any ideas about what the issue could be.
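One way to narrow this down without a shareable sample is to count what each extraction route reports for the page. If the logo turns up as vector drawings, or as an image block in the text dictionary, rather than in the image list, that hints at how it is embedded. A rough sketch, assuming a recent PyMuPDF with snake_case method names (`get_images`, `get_text`, `get_drawings`; `getImageList` is the older spelling). `summarize_blocks` and `inspect_page` are hypothetical helpers, not library functions.

```python
def summarize_blocks(page_dict):
    """Split the blocks of a page.get_text("dict") result into text and image boxes."""
    text, images = [], []
    for block in page_dict.get("blocks", []):
        # In PyMuPDF's text dictionary, block type 0 is text and type 1 is an image.
        target = images if block.get("type") == 1 else text
        target.append(tuple(block["bbox"]))
    return {"text": text, "images": images}


def inspect_page(pdf_path, page_number=0):
    """Count images, image blocks, text blocks and drawings on one page."""
    import fitz  # PyMuPDF, imported lazily so summarize_blocks works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        blocks = summarize_blocks(page.get_text("dict"))
        return {
            "listed_images": len(page.get_images(full=True)),
            "image_blocks": len(blocks["images"]),
            "text_blocks": len(blocks["text"]),
            "drawings": len(page.get_drawings()),
        }
    finally:
        doc.close()
```

If `listed_images` is zero but `drawings` is large, the logo is most likely drawn with vector paths, which would also explain why it cannot be selected as an image in a viewer.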