Removing Optional Content conditions from Images and Form XObjects #1425

Jmuccigr · 2021-11-28T18:59:06Z

Jmuccigr
Nov 28, 2021

I'm trying to remove/delete specific images from a file and they're repeated on multiple pages. When I go after their bbox with page.get_image_bbox, only the first instance reports a valid bbox; the others return Rect(1.0, 1.0, -1.0, -1.0). I see some warnings in the docs that may refer to this, but is there a way around this?

JorjMcKie · 2021-11-28T19:19:35Z

JorjMcKie
Nov 28, 2021
Maintainer

How do you remove those images?

2 replies

Jmuccigr Nov 29, 2021
Author

I don’t. That’s the point of my question, but I’m attempting to do it by replacing them with annotations.

my question though is why I’m getting those rect coordinates.

JorjMcKie Nov 29, 2021
Maintainer

I don’t. That’s the point of my question, but I’m attempting to do it by replacing them with annotations.

my question though is why I’m getting those rect coordinates.

Ah, now I got you.
The problem with the list page.get_images() is that it seems to imply that images referenced therein are indeed displayed by the page.
But this is not necessarily the case. The list only enumerates the image definitions of the page object.

If an image is only referenced but not displayed by the page, then that rectangle is returned. Unlike in earlier versions (before 1.19.0), this rectangle is now classified as invalid (or empty, respectively).
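A small sketch of how such a placeholder rectangle could be recognized before it is used for anything. It works on plain (x0, y0, x1, y1) tuples, and is_displayable is a hypothetical helper name, not a PyMuPDF function; a Rect should unpack into four numbers the same way.

```python
def is_displayable(bbox):
    """Return True if (x0, y0, x1, y1) encloses a non-zero area."""
    x0, y0, x1, y1 = bbox
    return x1 > x0 and y1 > y0


# The rectangle reported for an image that is referenced but not displayed
placeholder = (1.0, 1.0, -1.0, -1.0)
# A made-up bbox of an image that is actually shown
shown = (204.0, 179.0, 408.0, 443.0)

print(is_displayable(placeholder))  # False
print(is_displayable(shown))        # True
```

Filtering on this check before creating annotations avoids working with rectangles that do not correspond to anything visible.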

JorjMcKie · 2021-11-29T21:03:03Z

JorjMcKie
Nov 29, 2021
Maintainer

This situation may be somewhat confusing. You can rectify it by executing page.clean_contents() immediately after loading the page.
This method does a lot of things, among them synchronizing the page definition with its paint commands.
If you execute page.get_images() afterwards, you should never get an invalid image bbox.

1 reply

Jmuccigr Nov 29, 2021
Author

Hmmm, page.clean_contents() seems to rename all the images as well, and not uniquely, so I get pages where doc.get_page_images reports this:

(24, 0, 3210, 5032, 1, 'DeviceGray', '', 'Im1', 'CCITTFaxDecode'), 
(21, 0, 67, 1752, 8, 'DeviceGray', '', 'Im1', 'FlateDecode'), 
(17, 18, 154, 34, 8, 'DeviceGray', '', 'Im2', 'FlateDecode'), 
(16, 0, 272, 31, 8, 'DeviceGray', '', 'Im3', 'FlateDecode')

Maybe it's reusing names if the image compression is different?

In any case since I'm using multiple occurrences to decide which images to delete, this kills it for me, or at least makes it a lot more difficult.

JorjMcKie · 2021-11-29T21:06:23Z

JorjMcKie
Nov 29, 2021
Maintainer

To remove an image, make a redaction annotation using its bbox. When done with all images on that page, execute page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE).

0 replies

JorjMcKie · 2021-11-30T09:30:45Z

JorjMcKie
Nov 30, 2021
Maintainer

In any case since I'm using multiple occurrences to decide which images to delete, this kills it for me, or at least makes it a lot more difficult.

The docs give an explanation: images may be displayed by the page directly, or by "Form XObjects" which are invoked by the page. This type of nesting is indicated if you specify page.get_images(True). This will produce an extended list: each item then has the xref of the invoking XObject as its last entry, or 0 if the page displays the image directly.
This makes the list unique again.
Once you have your list, best use page.get_image_rects(xref) to compute bboxes. This method returns a list of bboxes, because the same image may be displayed more than once on the same page ...
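To illustrate the extended list, here is a minimal sketch that groups image xrefs by the object that invokes them. The item values are illustrative only, and group_by_invoker is a hypothetical helper name; the only thing relied upon is that the first entry is the image xref and the last entry is the invoker's xref (0 for the page itself).

```python
from collections import defaultdict

# Illustrative items in the shape returned by page.get_images(True)
full_items = [
    (24, 0, 3210, 5032, 1, "DeviceGray", "", "IxCBK", "CCITTFaxDecode", 0),
    (21, 0, 67, 1752, 8, "DeviceGray", "", "PxCBA", "FlateDecode", 15),
    (17, 18, 154, 34, 8, "DeviceGray", "", "PxCBF", "FlateDecode", 15),
    (16, 0, 272, 31, 8, "DeviceGray", "", "PxCBG", "FlateDecode", 15),
]


def group_by_invoker(items):
    """Map invoker xref (0 = the page itself) to the image xrefs it references."""
    groups = defaultdict(list)
    for item in items:
        groups[item[-1]].append(item[0])
    return dict(groups)


groups = group_by_invoker(full_items)
print(groups)  # {0: [24], 15: [21, 17, 16]}
```

Grouping this way shows at a glance which images belong to the page and which come in through a Form XObject.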

1 reply

Jmuccigr Dec 1, 2021
Author

Yeah, this doesn't seem to work for me and I may be doing it wrong.

Here's the code, where the repeatedImages contains the xrefs for the images, some of which are repeated. I have it spit out the images for each page and then try to get the bbox for each. You'll see that only one page seems to contain the bbox for each image and two images aren't included at all:

The code:

repeatedImages = [21, 17, 16, 48, 24, 32, 40]
print("repeatedimages: ",repeatedImages)
for p in range(len(doc)):
    page = doc[p]
    page.get_images(True)
    page.clean_contents()
    print("page ",p," :",doc.get_page_images(p))
    for r in repeatedImages:
        print(r,page.get_image_rects(r))

And here's the output:

repeatedimages:  [21, 17, 16, 48, 24, 32, 40]
page  0  : [(21, 0, 67, 1752, 8, 'DeviceGray', '', 'Im1', 'FlateDecode'), (48, 0, 749, 1041, 8, 'DeviceRGB', '', 'Im2', 'DCTDecode')]
21 [Rect(2.0, 301.79998779296875, 20.17300033569336, 777.0)]
17 []
16 []
48 [Rect(204.0, 179.0, 408.0, 443.0)]
24 []
32 []
40 []
page  1  : [(24, 0, 3210, 5032, 1, 'DeviceGray', '', 'Im1', 'CCITTFaxDecode'), (21, 0, 67, 1752, 8, 'DeviceGray', '', 'Im1', 'FlateDecode'), (17, 18, 154, 34, 8, 'DeviceGray', '', 'Im2', 'FlateDecode'), (16, 0, 272, 31, 8, 'DeviceGray', '', 'Im3', 'FlateDecode')]
21 []
17 []
16 []
48 []
24 [Rect(18.17300033569336, 0.0, 403.4729919433594, 604.0)]
32 []
40 []
page  2  : [(32, 0, 3166, 5042, 1, 'DeviceGray', '', 'Im1', 'CCITTFaxDecode'), (21, 0, 67, 1752, 8, 'DeviceGray', '', 'Im1', 'FlateDecode'), (17, 18, 154, 34, 8, 'DeviceGray', '', 'Im2', 'FlateDecode'), (16, 0, 272, 31, 8, 'DeviceGray', '', 'Im3', 'FlateDecode')]
21 []
17 []
16 []
48 []
24 []
32 [Rect(18.17300033569336, 0.84002685546875, 398.1730041503906, 606.0)]
40 []
page  3  : [(40, 0, 3208, 5040, 1, 'DeviceGray', '', 'Im1', 'CCITTFaxDecode'), (21, 0, 67, 1752, 8, 'DeviceGray', '', 'Im1', 'FlateDecode'), (17, 18, 154, 34, 8, 'DeviceGray', '', 'Im2', 'FlateDecode'), (16, 0, 272, 31, 8, 'DeviceGray', '', 'Im3', 'FlateDecode')]
21 []
17 []
16 []
48 []
24 []
32 []
40 [Rect(18.17300033569336, 0.1400146484375, 403.1730041503906, 605.0)]

JorjMcKie · 2021-12-01T05:30:39Z

JorjMcKie
Dec 1, 2021
Maintainer

Confusion!

from pprint import pprint

for page in doc:
    page.clean_contents()  # do this before!
    # keep only the xrefs we are interested in
    xreflist = [item[0] for item in page.get_images(True) if item[0] in repeatedImages]
    for xref in xreflist:
        print()
        print("Rects of xref", xref)
        pprint(page.get_image_rects(xref))

1 reply

Jmuccigr Dec 1, 2021
Author

Nope. Here's the slightly tweaked code:

repeatedImages = [16,17,21,24,32,40]
for page in doc:
    page.clean_contents()  # do this before!
    print("-----\n",page,"\ncount of all images = ", len(doc.get_page_images(page)))
    xreflist = [item[0] for item in page.get_images(True) if item[0] in repeatedImages]
    for xref in xreflist:
        print()
        print("Rects of xref", xref)
        print(page.get_image_rects(xref))

And the output where images 16 and 17 never get reported bboxes.

-----
 page 0 of /Users/john_muccigrosso/Downloads/test.pdf 
count of all images =  2

Rects of xref 21
[Rect(2.0, 301.79998779296875, 20.17300033569336, 777.0)]
-----
 page 1 of /Users/john_muccigrosso/Downloads/test.pdf 
count of all images =  4

Rects of xref 24
[Rect(18.17300033569336, 0.0, 403.4729919433594, 604.0)]

Rects of xref 21
[]

Rects of xref 17
[]

Rects of xref 16
[]
-----
 page 2 of /Users/john_muccigrosso/Downloads/test.pdf 
count of all images =  4

Rects of xref 32
[Rect(18.17300033569336, 0.84002685546875, 398.1730041503906, 606.0)]

Rects of xref 21
[]

Rects of xref 17
[]

Rects of xref 16
[]
-----
 page 3 of /Users/john_muccigrosso/Downloads/test.pdf 
count of all images =  4

Rects of xref 40
[Rect(18.17300033569336, 0.1400146484375, 403.1730041503906, 605.0)]

Rects of xref 21
[]

Rects of xref 17
[]

Rects of xref 16
[]

JorjMcKie · 2021-12-02T14:35:13Z

JorjMcKie
Dec 2, 2021
Maintainer

Can you please let me have that ominous PDF?

1 reply

Jmuccigr Dec 3, 2021
Author

Sure. I'm testing on a few pages from a HathiTrust PDF: test.pdf.

JorjMcKie · 2021-12-03T21:02:06Z

JorjMcKie
Dec 3, 2021
Maintainer

Aha! Everything is in order:
Except for the first page (which displays 2 images), all the others display only 1, although the image list seems to suggest otherwise.
As explained in a previous post, that image list is nothing but a report about the page object definition. It does not inspect the page's appearance (painting) source.
And those two pieces of information need not coincide.
Because all the pages reference so-called Form XObjects (which can act like pages themselves: displaying images and / or text), the object definitions of those XObjects are of course referenced completely, even when the calling page only uses text from inside them.
So, for your purposes, you should use a different report: page.get_image_info(). This one does inspect the page's appearance source and thus only names images that are actually displayed. Because of the additional effort, its response time is longer.
You can do this:

>>> for page in doc:
	print("-"*60)
	pprint(page.get_image_info(xrefs=True))  # detect XREFs where possible

	
------------------------------------------------------------
[{'bbox': (2.0, 301.79998779296875, 20.17300033569336, 777.0),
  'bpc': 8,
  'colorspace': 1,
  'cs-name': 'DeviceGray',
  'digest': b'\x9f\xb0\xd5\xac\xc9\xe9\xdc\x83\xe0\xaem\xd4m\xe4\xcb\xf7',
  'height': 1752,
  'number': 0,
  'size': 40832,
  'transform': (18.17300033569336,
                0.0,
                -0.0,
                475.20001220703125,
                2.0,
                301.79998779296875),
  'width': 67,
  'xref': 21,
  'xres': 96,
  'yres': 96},
 {'bbox': (204.0, 179.0, 408.0, 443.0),
  'bpc': 8,
  'colorspace': 3,
  'cs-name': 'DeviceRGB',
  'digest': b'43r\xcb\xa0\x0cu\xfd\xa9Wr\x03\xe9H\xd1\xe0',
  'height': 1041,
  'number': 3,
  'size': 166089,
  'transform': (204.0, 0.0, -0.0, 264.0, 204.0, 179.0),
  'width': 749,
  'xref': 48,
  'xres': 96,
  'yres': 96}]
------------------------------------------------------------
[{'bbox': (18.17300033569336, 0.0, 403.4729919433594, 604.0),
  'bpc': 1,
  'colorspace': 1,
  'cs-name': 'DeviceGray',
  'digest': b'\xc7\t\xaaL\xd8\xe6\xedq\xbd,\xd5\xd7_\xee\xde\xc0',
  'height': 5032,
  'number': 0,
  'size': 13364,
  'transform': (385.29998779296875, 0.0, -0.0, 604.0, 18.17300033569336, 0.0),
  'width': 3210,
  'xref': 24,
  'xres': 96,
  'yres': 96}]
------------------------------------------------------------
[{'bbox': (18.17300033569336, 0.84002685546875, 398.1730041503906, 606.0),
  'bpc': 1,
  'colorspace': 1,
  'cs-name': 'DeviceGray',
  'digest': b'\xbf\x84Ix\xbfG\xd8\xf8/S\xd4\x08^`\xe6\x12',
  'height': 5042,
  'number': 0,
  'size': 55398,
  'transform': (380.0,
                0.0,
                -0.0,
                605.1599731445312,
                18.17300033569336,
                0.84002685546875),
  'width': 3166,
  'xref': 32,
  'xres': 96,
  'yres': 96}]
------------------------------------------------------------
[{'bbox': (18.17300033569336, 0.1400146484375, 403.1730041503906, 605.0),
  'bpc': 1,
  'colorspace': 1,
  'cs-name': 'DeviceGray',
  'digest': b'\xb7\xc1U<|\xdaQ\xa7\xbe\x8e\xe6\xae\xb8^\xbd\x89',
  'height': 5040,
  'number': 0,
  'size': 54654,
  'transform': (385.0,
                0.0,
                -0.0,
                604.8599853515625,
                18.17300033569336,
                0.1400146484375),
  'width': 3208,
  'xref': 40,
  'xres': 96,
  'yres': 96}]
>>>

This method works for all supported document types, not only PDF. Specifying xrefs=True delivers xref information for PDF only, and only if such an xref actually exists (which is not always the case, e.g. for inline images).
If no xref can be found, for whatever reason, "xref": 0 will be reported.
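As a sketch of how the two reports could be compared, the following lists the images that a page references but does not display. The sample values are illustrative, and referenced_but_not_displayed is a hypothetical helper name; it only assumes that each info dictionary carries an "xref" key as described above.

```python
def referenced_but_not_displayed(listed_xrefs, image_info):
    """Return the xrefs from the image list that get_image_info() did not report."""
    displayed = {info["xref"] for info in image_info if info["xref"] > 0}
    return [xref for xref in listed_xrefs if xref not in displayed]


# Illustrative: xrefs taken from page.get_images() for one page ...
listed = [24, 21, 17, 16]
# ... and the (shortened) result of page.get_image_info(xrefs=True) for it
info = [{"xref": 24, "bbox": (18.17300033569336, 0.0, 403.4729919433594, 604.0)}]

hidden = referenced_but_not_displayed(listed, info)
print(hidden)  # [21, 17, 16]
```

A non-empty result is a hint that some images are only defined, or are hidden by some condition, rather than painted.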

1 reply

Jmuccigr Dec 4, 2021
Author

Here's the thing though: there are multiple images on those pages. Each has its own unique image of the page contents (for a total of 4), then there are the three repeated "stamps", one each for the library, Google (both on the bottom of the page), and finally the vertical text on the side. So that's 7 images, but your method reports 5.

I'm pretty sure it's the two at the bottom that don't show up (16 and 17 from above). They don't show up on page 1 at all, but they are on each of the remaining three pages.

JorjMcKie · 2021-12-04T21:19:17Z

JorjMcKie
Dec 4, 2021
Maintainer

Pages 2 and up show one full-page image each.
They all invoke a Form XObject which contains more images, which however are never displayed by the page.
You can consult the mutool CLI tool, or look at page.get_bboxlog() to confirm this.
If they are not displayed, then there obviously is no rectangle to associate with them either.
The one displayed image on page 2 covers the full page, by the way.
Text is written on top of it, but it is white and fully transparent, i.e. invisible, although it can be selected.

1 reply

Jmuccigr Dec 4, 2021
Author

I'm a bit confused. pdfimages reports multiple images on each page, which is consistent with doc.get_page_images:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image      67  1752  gray    1   8  image  no        21  0   265   265 13.1K  11%
   1     1 image     749  1041  rgb     3   8  jpeg   no        48  0   264   284  162K 7.1%
   2     2 image    3210  5032  gray    1   1  ccitt  yes       24  0   600   600 12.4K 0.6%
   2     3 image      67  1752  gray    1   8  image  no        21  0   402   402 13.1K  11%
   2     4 image     154    34  gray    1   8  image  no        17  0   145   145 1642B  31%
   2     5 smask     154    34  gray    1   8  image  no        17  0   145   145   28B 0.5%
   2     6 image     272    31  gray    1   8  image  no        16  0   145   145 1272B  15%
   3     7 image    3166  5042  gray    1   1  ccitt  yes       32  0   600   600 53.5K 2.7%
   3     8 image      67  1752  gray    1   8  image  no        21  0   408   408 13.1K  11%
   3     9 image     154    34  gray    1   8  image  no        17  0   148   148 1642B  31%
   3    10 smask     154    34  gray    1   8  image  no        17  0   148   148   28B 0.5%
   3    11 image     272    31  gray    1   8  image  no        16  0   148   148 1272B  15%
   4    12 image    3208  5040  gray    1   1  ccitt  yes       40  0   600   600 52.8K 2.7%
   4    13 image      67  1752  gray    1   8  image  no        21  0   403   403 13.1K  11%
   4    14 image     154    34  gray    1   8  image  no        17  0   146   146 1642B  31%
   4    15 smask     154    34  gray    1   8  image  no        17  0   146   146   28B 0.5%
   4    16 image     272    31  gray    1   8  image  no        16  0   146   146 1272B  15%

JorjMcKie · 2021-12-04T22:06:15Z

JorjMcKie
Dec 4, 2021
Maintainer

Now I have finally seen why we talked past each other:
The Form XObject is under control of an "Optional Content" condition. So the object is executed only if the respective condition is met.
Now the definition of the optional content is unclear to MuPDF: nothing can be found that lets it interpret the OC status of the Form XObject as ON.
So it is regarded as not being present.
You can confirm all this by:

>>> page=doc[1]
>>> pprint(page.get_images(True))
[(24, 0, 3210, 5032, 1, 'DeviceGray', '', 'IxCBK', 'CCITTFaxDecode', 0),
 (21, 0, 67, 1752, 8, 'DeviceGray', '', 'PxCBA', 'FlateDecode', 15),
 (17, 18, 154, 34, 8, 'DeviceGray', '', 'PxCBF', 'FlateDecode', 15),
 (16, 0, 272, 31, 8, 'DeviceGray', '', 'PxCBG', 'FlateDecode', 15)]
>>> # check whether xref 15 is under an OC condition
>>> doc.get_oc(15)
14
>>> # indeed: it is, but the condition is not ON
>>> # ... as far as MuPDF sees it
>>>

3 replies

JorjMcKie Dec 4, 2021
Maintainer

When looking at the OCProperties catalog configuration, I can understand why MuPDF is confused.
All you can do here is some hacking around:

- remove the OCProperties key from the catalog, which will automatically switch all content on
- remove the OC entry from each of the Form XObjects, which has the same effect, but is more tedious.

Let's have a look at what happens to page 2 images when we do this:

>>> catalog = doc.pdf_catalog()
>>> doc.xref_set_key(catalog, "OCProperties", "null")
>>> page=doc[1]
>>> imglist = page.get_images(True)
>>> for xref in [item[0] for item in imglist]:
	print("xref:", xref)
	pprint(page.get_image_rects(xref))

	
xref: 24
[Rect(18.17300033569336, 0.0, 403.4729919433594, 604.0)]
xref: 21
[Rect(1.3208199739456177, 314.26702880859375, 13.322450637817383, 628.0938720703125)]
xref: 17
[Rect(62.904052734375, 621.1595458984375, 139.181396484375, 638.0)]
xref: 16
[Rect(235.76637268066406, 622.6454467773438, 370.489990234375, 638.0)]
>>>

Jmuccigr Dec 4, 2021
Author

OK, I'll give this a shot. Thanks for your help.

Obviously I'm a bit of a novice here, but it seems to me that there should be a way to programmatically get at the same info that is being used to create the page's appearance without "hacking".

Jmuccigr Dec 5, 2021
Author

OK, running with this one to start with, mostly because I understand it better. I'm going to play with your later suggestion next. If you'll indulge me as I learn this...

Here's my code:

#!/usr/local/bin/python3

import fitz
from pprint import pprint
from collections import Counter

images = []
repeatedImages = []

fname = '/Users/john_muccigrosso/Downloads/test.pdf'

doc = fitz.open(fname)

catalog = doc.pdf_catalog()
doc.xref_set_key(catalog, "OCProperties", "null")

# Go through the pages to find and count all the images
for p in range(len(doc)):
    pg = doc[p]
    pg.get_images(True)
    for i in range(len(doc.get_page_images(p))):
        images.append(doc.get_page_images(p)[i][0])
counts = Counter(images)
countsSorted = counts.most_common()
for c in range(len(countsSorted)):
    if countsSorted[c][1] > 1:
        repeatedImages.append(countsSorted[c][0])
print("repeatedimages: ",repeatedImages)
print("-"*60)

for pgno in range(len(doc)):
    page = doc[pgno]
    print("-"*60,"\nPage",pgno)     
    #pprint(page.get_images())
    pprint(page.get_image_info())
    for i in range(len(page.get_images())):
        print(i)
        if(page.get_images()[i][0] in repeatedImages):
            print("image ",i," = ",page.get_images()[i][0])
            rect = page.get_image_info()[i]['bbox']
            pprint(rect)
            page.add_redact_annot(rect, "")
    page.apply_redactions()
 
doc.save("image-removed.pdf", garbage=3, deflate=True)

I'm using get_image_info to get the bboxes because I'm not sure how to use the Rect that gets returned from get_image_bbox or page.get_image_rects(xref): it doesn't seem to work with add_redact_annot. The problem is that get_images and get_image_info don't necessarily sort the images the same way, so on page 0 the wrong image gets deleted. Is there a way to convert the Rect to something usable rather than calling get_image_info?

JorjMcKie · 2021-12-05T08:50:46Z

JorjMcKie
Dec 5, 2021
Maintainer

Obviously I'm a bit of a novice here, but it seems to me that there should be a way to programmatically get at the same info that is being used to create the page's appearance without "hacking".

Understandable. Maybe this is better:

import fitz
from pprint import pprint

doc = fitz.open("test.pdf")
for page in doc:
    print("Images on page {}".format(page.number))
    imglist = page.get_images(True)
    for item in imglist:
        xref_img = item[0]
        xref_xob = item[-1]
        # remove any Optional Content information
        # from images and Form XObjects
        doc.set_oc(xref_img, 0)
        if xref_xob > 0:
            doc.set_oc(xref_xob, 0)
    # the following should now be complete ...
    imgs = [(img["xref"],img["bbox"]) for img in page.get_image_info(xrefs=True) if img["xref"]>0]
    pprint(imgs)
    print("-"*60)

0 replies

JorjMcKie · 2021-12-05T09:02:19Z

JorjMcKie
Dec 5, 2021
Maintainer

And this one also removes those images at the bottom:

import fitz
from pprint import pprint

doc = fitz.open("test.pdf")
for page in doc:
    print("Images on page {}".format(page.number))
    imglist = page.get_images(True)
    for item in imglist:
        xref_img = item[0]
        xref_xob = item[-1]
        # remove any Optional Content information
        # from images and Form XObjects
        doc.set_oc(xref_img, 0)
        if xref_xob > 0:
            doc.set_oc(xref_xob, 0)
    imgs = [
        (img["xref"], img["bbox"])
        for img in page.get_image_info(xrefs=True)
        if img["xref"] > 0
    ]
    pprint(imgs)
    print("-" * 60)
    for item in imgs:
        if item[0] in (16, 17):
            page.add_redact_annot(item[1])
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_PIXELS)
doc.ez_save("removed.pdf")

The option fitz.PDF_REDACT_IMAGE_REMOVE cannot be used here, because those images mostly overlap each other, so redactions regularly erase them all ...

1 reply

Jmuccigr Dec 5, 2021
Author

OK, this seems to do it (with that file at least). I run this after I check for repeated images and save their xrefs. Then I just delete those from the file and save it under a new name.

JorjMcKie · 2021-12-05T20:53:34Z

JorjMcKie
Dec 5, 2021
Maintainer

I run this after I check for repeated images and save their xrefs.

Yes, identifying the images you need to delete is the part that requires the wits.
Deleting them should then work fairly safely ...

3 replies

Jmuccigr Dec 5, 2021
Author

That's my hope! Should I be using another technique to delete them? From what I was finding, this seems to be the easiest way (though it would be nice to get rid of those boxes entirely).

JorjMcKie Dec 5, 2021
Maintainer

Well, you expressed a general dislike of "hacky" approaches 😂😉 just kidding.
But if you are processing files from the same source ("digitized by ..."), there is hope that you will always find Form XObjects named "CBJ", which contain the images in question:

>>> for page in doc:
	print(doc.xref_object(page.xref))

	
<<
  /Type /Page
  /Contents [ 67 0 R ]
  /MediaBox [ 0 0 612 792 ]
  /Parent 2 0 R
  /Resources <<
    /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
    /XObject <<
      /CBN 47 0 R
    >>
  >>
>>
<<
  /Type /Page
  /Contents [ 23 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R ]
  /MediaBox [ 0 0 404.17 638 ]
  /Parent 2 0 R
  /Resources <<
    /ExtGState <<
      /CBH 5 0 R
    >>
    /Font <<
      /DeSaCBI~1638056977 6 0 R
    >>
    /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
    /XObject <<
      /CBJ 15 0 R  % <==
      /IxCBK 24 0 R
    >>
  >>
>>
<<
  /Type /Page
  /Contents [ 31 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R ]
  /MediaBox [ 0 0 398.17 640 ]
  /Parent 2 0 R
  /Resources <<
    /ExtGState <<
      /CBH 5 0 R
    >>
    /Font <<
      /DeSaCBI~1638056977 6 0 R
    >>
    /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
    /XObject <<
      /CBJ 15 0 R  % <==
      /IxCBL 32 0 R
    >>
  >>
>>
<<
  /Type /Page
  /Contents [ 39 0 R 41 0 R 42 0 R 43 0 R 44 0 R 45 0 R ]
  /MediaBox [ 0 0 403.17 639 ]
  /Parent 2 0 R
  /Resources <<
    /ExtGState <<
      /CBH 5 0 R
    >>
    /Font <<
      /DeSaCBI~1638056977 6 0 R
    >>
    /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
    /XObject <<
      /CBJ 15 0 R  % <==
      /IxCBM 40 0 R
    >>
  >>
>>
>>>

All the pages' appearance sources contain the invocation statement b"/CBJ Do".
You just need to remove this statement from that source, choose adequate file save options ... and you are done.

JorjMcKie Dec 5, 2021
Maintainer

Here is a snippet that does the job for the example file:

import fitz

doc = fitz.open("test.pdf")
for page in doc:
    for xref in page.get_contents():  # xrefs of the page's appearance sources
        cont = bytearray(doc.xref_stream(xref))  # read the source as a modifiable bytearray
        pos = cont.find(b"/CBJ Do")  # find the critical command
        if pos < 0:
            continue  # not in here
        cont[pos : pos + 7] = b""  # remove it
        doc.update_stream(xref, cont)  # write back the source
        print("removed CBJ from {}".format(xref))
    page.clean_contents()  # synchronize obj definition with appearance stream
doc.ez_save("no-cbj.pdf", garbage=4, pretty=True)
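The snippet above removes only the first match in each stream and expects exactly one space between the name and the operator. A slightly more tolerant variant of the byte manipulation could look like this sketch; strip_xobject_invocation is a hypothetical helper name, and whether other files really separate name and operator differently is an assumption. It works on plain bytes, so its result would still have to be written back with doc.update_stream(xref, ...).

```python
import re
from typing import Tuple


def strip_xobject_invocation(stream: bytes, name: str) -> Tuple[bytes, int]:
    """Remove every '/<name> Do' command; return the new stream and the match count."""
    pattern = re.compile(rb"/" + re.escape(name.encode("ascii")) + rb"\s+Do\b")
    return pattern.subn(b"", stream)


# A made-up content stream fragment invoking two different XObjects
sample = b"q\n/CBJ Do\nQ\n/IxCBK Do\n"
cleaned, removed = strip_xobject_invocation(sample, "CBJ")
print(removed)  # 1
print(cleaned)  # b'q\n\nQ\n/IxCBK Do\n'
```

Returning the match count makes it easy to skip the write-back for streams that did not contain the command.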

JorjMcKie · 2021-12-05T20:57:28Z

JorjMcKie
Dec 5, 2021
Maintainer

If you are dealing with files built following the same kind of pattern, you could simply filter images located at the bottom of the pages ...
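A minimal sketch of such a position filter, working on (xref, bbox) pairs like the ones built from page.get_image_info(xrefs=True). The function name images_near_bottom and the 90% threshold are assumptions for illustration; it relies on page coordinates having their origin at the top-left, with y growing downwards. The sample values correspond to a page that is 638 points high.

```python
def images_near_bottom(imgs, page_height, fraction=0.9):
    """Return xrefs of images whose top edge lies in the lowest part of the page."""
    limit = page_height * fraction
    return [xref for xref, bbox in imgs if bbox[1] >= limit]


# Sample (xref, bbox) pairs for one page
imgs = [
    (24, (18.17300033569336, 0.0, 403.4729919433594, 604.0)),
    (21, (1.3208199739456177, 314.26702880859375, 13.322450637817383, 628.0938720703125)),
    (17, (62.904052734375, 621.1595458984375, 139.181396484375, 638.0)),
    (16, (235.76637268066406, 622.6454467773438, 370.489990234375, 638.0)),
]

bottom = images_near_bottom(imgs, page_height=638.0)
print(bottom)  # [17, 16]
```

The threshold would need tuning per source, since the stamps may sit at different heights in other files.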

0 replies

JorjMcKie · 2021-12-05T21:37:00Z

JorjMcKie
Dec 5, 2021
Maintainer

If you are not sure about the reference name of such a Form XObject, you may want to look for a repeating reference name across pages:

>>> doc=fitz.open("test.pdf")
>>> for page in doc:
	print("page {}".format(page.number), page.get_xobjects())

	
page 0 [(47, 'CBN', 0, Rect(0.0, 0.0, 612.0, 792.0))]
page 1 [(15, 'CBJ', 0, Rect(0.0, 0.0, 612.0, 792.0))]  % CBJ invoked on all but the front page ...
page 2 [(15, 'CBJ', 0, Rect(0.0, 0.0, 612.0, 792.0))]
page 3 [(15, 'CBJ', 0, Rect(0.0, 0.0, 612.0, 792.0))]
>>>
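The same search can be done programmatically. This sketch counts on how many pages each reference name occurs, using tuples shaped like the get_xobjects() output above (with the Rect replaced by a plain tuple so that the example runs without a document). repeated_xobject_names is a hypothetical helper name.

```python
from collections import Counter

# One list per page, items shaped like page.get_xobjects() entries
xobjects_per_page = [
    [(47, "CBN", 0, (0.0, 0.0, 612.0, 792.0))],
    [(15, "CBJ", 0, (0.0, 0.0, 612.0, 792.0))],
    [(15, "CBJ", 0, (0.0, 0.0, 612.0, 792.0))],
    [(15, "CBJ", 0, (0.0, 0.0, 612.0, 792.0))],
]


def repeated_xobject_names(pages, min_pages=2):
    """Return reference names that occur on at least min_pages pages."""
    counts = Counter()
    for xobjects in pages:
        # use a set so a name is counted once per page
        counts.update({name for _, name, _, _ in xobjects})
    return sorted(name for name, n in counts.items() if n >= min_pages)


print(repeated_xobject_names(xobjects_per_page))  # ['CBJ']
```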

2 replies

Jmuccigr Dec 5, 2021
Author

Thanks. This is where my lack of knowledge about the internal structure of PDFs comes into play. I'm used to pulling out images via pdfimages or creating PDFs with images, not manipulating existing files. Tips for a readable intro would be welcome.

JorjMcKie Dec 6, 2021
Maintainer

I would not start below the level found in the official spec, https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf.
The first few dozen pages are a gentle enough intro for someone like you I think.
