Replacing image/extracting text layer from pdf? #952

vs-777 · 2021-03-16T10:58:26Z

vs-777
Mar 16, 2021

I have a two-layered PDF: the background layer is an image and the front layer is text obtained from an OCR engine. I need to replace the image with another one while keeping the text layer the same. Or, if it is easier, extract the text layer and place it at the same coordinates on the other image. Is either of these possible with PyMuPDF? I have looked at issue #338 with a view to removing the image and placing a pixmap of the new image onto the PDF. You mention there that it's not possible to completely remove the image; has a new feature been added since then that allows this?


JorjMcKie · 2021-03-16T14:14:59Z

JorjMcKie
Mar 16, 2021
Maintainer

First, please allow me to move this issue to the Discussions. That seems appropriate given its nature, and it should also reach a larger community of people there.


JorjMcKie · 2021-03-16T14:27:38Z

JorjMcKie
Mar 16, 2021
Maintainer

Has the page image been scanned in, and does the text represent the OCR-ed image content?

Either way, there are two approaches:

1. Extract the text into a dictionary via page.get_text("dict", flags=0). Then make a redaction annotation that covers the full page, and apply the redaction with the option images=fitz.PDF_REDACT_IMAGE_REMOVE. This removes all text, images and links on the page. Then insert your desired new image, followed by re-insertion of the text stored in the dict - or in the reverse sequence, depending on whether the text should be visible.
2. Locate the image display command(s) in the page's /Contents area(s) and remove them. The existing image(s) are then no longer referenced, which causes their removal when saving with at least garbage=3. Then insert your new image into the page - either in the foreground (covering the text, which we did not touch here) or in the background.

Option two is probably simpler and leaves everything - except images - intact.
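
For comparison, here is a rough sketch of what option one might look like. It is an illustration under assumptions, not a tested recipe: the helper names collect_spans and swap_image_via_redaction are made up for this example, the text is re-inserted with the default Helvetica font (so glyph widths may differ from the original OCR layer and non-Latin text may not survive), and the fallback to the span's bbox is only an approximation of the baseline.

```python
def collect_spans(text_dict):
    """Flatten page.get_text("dict") output into (x, y, size, text) tuples."""
    spans = []
    for block in text_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                if not span["text"].strip():
                    continue  # skip whitespace-only spans
                # Prefer the span's baseline origin; fall back to the
                # bottom-left corner of its bbox if "origin" is missing.
                x, y = span.get("origin", (span["bbox"][0], span["bbox"][3]))
                spans.append((x, y, span["size"], span["text"]))
    return spans


def swap_image_via_redaction(page, new_image_path, invisible=True):
    """Option one: wipe the page, add the new image, re-insert the text."""
    import fitz  # PyMuPDF; imported here so collect_spans() works without it

    spans = collect_spans(page.get_text("dict", flags=0))
    page.add_redact_annot(page.rect)  # one redaction covering the whole page
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE)
    page.insert_image(page.rect, filename=new_image_path)
    for x, y, size, text in spans:
        # render_mode=3 writes invisible text, as is common for OCR layers
        page.insert_text((x, y), text, fontsize=size,
                         render_mode=3 if invisible else 0)
```

Because fonts and exact glyph positions are not carried over, this route changes more of the page than option two does.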


JorjMcKie · 2021-03-16T14:52:28Z

JorjMcKie
Mar 16, 2021
Maintainer

Many of the cautionary points in #338 have been addressed in the meantime and are no longer a major concern. So let's assume you want to do this.
Then the detailed steps are:

1. Consolidate and unify the page's /Contents via page.clean_contents(). This collects any multiple such objects into only one, which then also has a predictable, standardized syntax - like no multiple consecutive spaces, each command on its own line, etc.
2. Make the list page.get_images(True). Every item in this list represents an image displayed on the page. One entry of each item is the image's reference name in the (now single) /Contents object. The referencing command there looks like b"/Im0 Do" and is the full content of one line.
3. Replace each /Contents line looking like this by b"". Then replace the page's /Contents by this modification.
4. Execute page.clean_contents() again. This will remove the no longer referenced images from the page definition.

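# assumes an open document and page, e.g. doc = fitz.open(...), page = doc[n]
page.clean_contents()  # merge all /Contents streams into one, in normalized syntax
xref = page.get_contents()[0]  # the xref of the clean, unified contents object
contlines = doc.xref_stream(xref).splitlines()
for i, line in enumerate(contlines):
    # an image invocation occupies a whole line, e.g. b"/Im0 Do"
    if line.startswith(b"/Im") and line.endswith(b" Do"):
        contlines[i] = b""
cont = b"".join(contlines)  # BUG, corrected later in this thread: must be b"\n".join(contlines)
doc.update_stream(xref, cont)
page.clean_contents()  # drops the images that are no longer referenced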

The above should remove every reference to an image from the page. Now do page.insert_image(...) with the desired overlay parameter.
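
As a variation on the snippet above, the image names reported by page.get_images() could be used instead of the b"/Im" prefix test, followed by the insertion of the new image. This is only a sketch: the function names drop_image_invocations and replace_page_images are invented for this example, it assumes the name is entry 7 of each get_images(full=True) item, and it joins the lines with b"\n" (see the correction later in this thread).

```python
def drop_image_invocations(stream, names):
    """Blank out lines of the form b"/<name> Do" for the given image names."""
    targets = {b"/" + name.encode() + b" Do" for name in names}
    lines = stream.splitlines()
    kept = [b"" if line.strip() in targets else line for line in lines]
    return b"\n".join(kept)  # the newline keeps neighbouring operators apart


def replace_page_images(doc, page, new_image_path, overlay=False):
    """Remove every image shown on `page`, then insert one full-page image."""
    page.clean_contents()  # one normalized /Contents stream
    xref = page.get_contents()[0]
    # item[7] is assumed to be the image's reference name, e.g. "Im0"
    names = [item[7] for item in page.get_images(full=True)]
    cleaned = drop_image_invocations(doc.xref_stream(xref), names)
    doc.update_stream(xref, cleaned)
    page.clean_contents()  # drop the now unreferenced images from the page
    # overlay=False puts the new image behind the untouched text layer
    page.insert_image(page.rect, filename=new_image_path, overlay=overlay)
```

A possible call, with new.png as a placeholder file name: open the file with doc = fitz.open("in.pdf"), run replace_page_images(doc, doc[0], "new.png"), then save with doc.save("out.pdf", garbage=3, deflate=True).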

1 reply

vs-777 Mar 25, 2021
Author

Hi, thanks a lot for the reply. I've just tried this, but get the following messages when executing this script:

mupdf: unknown keyword: 'q2574.514'
mupdf: unknown keyword: 'cmQqBT99.648'
mupdf: unknown keyword: 'Tf3'
mupdf: unknown keyword: 'Tr-.011'
mupdf: unknown keyword: 'TJETBT117.834'
mupdf: unknown keyword: 'Tf1'
mupdf: unknown keyword: 'TJ120.626'
mupdf: unknown keyword: 'Tz1'
mupdf: unknown keyword: 'TJ126.43'
mupdf: unknown keyword: 'Tz1'
mupdf: unknown keyword: 'TJ120.376'
mupdf: unknown keyword: 'Tz1'
mupdf: unknown keyword: 'TJ127.97199'
etc...

The output pdf indeed has no image, but it seems the text layer is missing as well. I've attached the input and output pdfs, maybe something stands out?
in.pdf
out.pdf

JorjMcKie · 2021-03-25T09:55:52Z

JorjMcKie
Mar 25, 2021
Maintainer

Sorry, I made a mistake: line

cont = b"".join(contlines)

should have been this, because I am joining the lines again:

cont = b"\n".join(contlines)

Then it will work.
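
The effect of the missing newline can be reproduced without PyMuPDF. The content lines below are made up for illustration (they are not taken from the attached file), but they show how joining with b"" fuses an operator with the operand that follows it, which is the kind of token reported in the "unknown keyword" messages:

```python
# Hypothetical lines of a cleaned /Contents stream
lines = [b"q", b"2574.514 0 0 3300 0 0 cm", b"/Im0 Do", b"Q", b"BT"]
lines[2] = b""  # blank out the image invocation, as in the script

fused = b"".join(lines)        # the original, faulty join
separated = b"\n".join(lines)  # the corrected join

print(fused.split()[0])       # first token when lines are glued together
print(separated.split()[:2])  # first tokens when the newlines are restored
```

With the newline restored, "q" and "2574.514" stay separate tokens, so the interpreter sees the operators it expects.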

2 replies

JorjMcKie Mar 25, 2021
Maintainer

Don't forget to save the updated PDF with garbage>=3 and deflate=True.

vs-777 Mar 25, 2021
Author

This worked. Thank you very much @JorjMcKie
