I have a two-layered PDF: the background layer is an image and the front layer is text obtained from an OCR engine. I need to replace the image with another one while keeping the text layer unchanged. Or, if it is easier, extract the text layer and place it at the same coordinates on the other image. Is either of these possible with PyMuPDF? I have looked at issue #338 in order to remove the image and place a pixmap of the new image onto the PDF. You mention there that it's not possible to completely remove the image. Has a new feature been added since then that allows this?
First, please allow me to move this issue to the |
Has the page image been scanned in, and does the text represent the OCR-ed image content? Either way, there are two approaches:
Option two is probably simpler and leaves everything except images intact.
Many of the cautionary points in #338 have been addressed in the meantime and are no longer a major concern. So let's assume you want to do this.
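# `doc` is the open document and `page` is the page to modify
page.clean_contents()  # unify and normalize the page's content streams
xref = page.get_contents()[0]  # the xref of the clean, unified contents object
contlines = doc.xref_stream(xref).splitlines()
for i in range(len(contlines)):
    line = contlines[i]
    # blank out image invocations such as b"/Im0 Do"
    if line.startswith(b"/Im") and line.endswith(b" Do"):
        contlines[i] = b""
cont = b"\n".join(contlines)  # join with newlines (not b""), so operators stay on separate lines
doc.update_stream(xref, cont)
page.clean_contents()

The above should remove all page references to any image on it. Now do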
Sorry, I made a mistake. The line
cont = b"".join(contlines)
should have been this, because I am joining the lines again:
cont = b"\n".join(contlines)
Then it will work.
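To see why the separator matters, here is a minimal sketch of the same line filter applied to plain bytes, without PyMuPDF. The function name `strip_image_invocations` and the sample content stream are hypothetical and only for illustration. Like the snippet above, it assumes that image invocations sit on their own line and that image names start with `/Im`. Unlike the snippet, it drops the matching lines instead of blanking them.

```python
def strip_image_invocations(stream: bytes) -> bytes:
    """Return the content stream without lines such as b"/Im0 Do"."""
    kept = []
    for line in stream.splitlines():
        # Same test as in the thread: an image name followed by the Do operator.
        if line.startswith(b"/Im") and line.endswith(b" Do"):
            continue  # skip the image invocation
        kept.append(line)
    # Rejoin with newlines so neighbouring operators stay separated.
    return b"\n".join(kept)


# Hypothetical cleaned content stream: one image plus one line of text.
sample = b"q\n612 0 0 792 0 0 cm\n/Im0 Do\nQ\nBT\n/F1 12 Tf\n(Hello) Tj\nET"

cleaned = strip_image_invocations(sample)

# Joining with an empty separator fuses operators into invalid tokens.
fused = b"".join(sample.splitlines())
```

In `fused`, the end of one line runs straight into the start of the next (for example `Do` followed directly by `Q`), which a PDF interpreter can no longer parse as separate operators. With a real document, the result of such a filter would be written back via `doc.update_stream(xref, cont)` as shown above.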