Removing Optional Content conditions from Images and Form XObjects #1425
-
How do you remove those images?
-
This situation may be somewhat confusing. You can rectify it by executing |
-
To remove an image, make a redaction annotation using its bbox. When done with all images on that page, execute |
-
The docs give an explanation: images may be displayed by the page directly, or by "Form XObjects" which are invoked by the page. This type of nesting is indicated if you specify |
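A minimal sketch of how this nesting can be inspected. It assumes the list comes from `page.get_images(True)` and that the last item of each entry is the xref of the invoking Form XObject (0 when the page shows the image directly), which is how the scripts later in this thread read it. The helper name `split_by_referencer` is made up for illustration.

```python
from collections import defaultdict


def split_by_referencer(imglist):
    """Split entries of page.get_images(True) into direct and nested images.

    Returns (direct, nested): `direct` lists the xrefs of images shown by
    the page itself, `nested` maps each Form XObject xref to the image
    xrefs it displays.
    """
    direct = []
    nested = defaultdict(list)
    for item in imglist:
        xref_img = item[0]   # xref of the image
        xref_xob = item[-1]  # xref of the invoking Form XObject, 0 if none
        if xref_xob > 0:
            nested[xref_xob].append(xref_img)
        else:
            direct.append(xref_img)
    return direct, dict(nested)
```

Called as `split_by_referencer(page.get_images(True))`, a non-empty second result means that some images are only reachable through a Form XObject.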
-
Confusion!
# "repeatedimages" is not defined in this thread; presumably the collection of xrefs of the repeated images
for page in doc:
    page.clean_contents()  # do this before!
    xreflist = [item[0] for item in page.get_images(True) if item[0] in repeatedimages]
    for xref in xreflist:
        print()
        print("Rects of xref", xref)
        pprint(page.get_image_rects(xref))
-
Can you please let me have that ominous PDF?
-
Aha! Everything is in order:
>>> for page in doc:
        print("-"*60)
        pprint(page.get_image_info(xrefs=True))  # detect XREFs where possible

------------------------------------------------------------
[{'bbox': (2.0, 301.79998779296875, 20.17300033569336, 777.0),
'bpc': 8,
'colorspace': 1,
'cs-name': 'DeviceGray',
'digest': b'\x9f\xb0\xd5\xac\xc9\xe9\xdc\x83\xe0\xaem\xd4m\xe4\xcb\xf7',
'height': 1752,
'number': 0,
'size': 40832,
'transform': (18.17300033569336,
0.0,
-0.0,
475.20001220703125,
2.0,
301.79998779296875),
'width': 67,
'xref': 21,
'xres': 96,
'yres': 96},
{'bbox': (204.0, 179.0, 408.0, 443.0),
'bpc': 8,
'colorspace': 3,
'cs-name': 'DeviceRGB',
'digest': b'43r\xcb\xa0\x0cu\xfd\xa9Wr\x03\xe9H\xd1\xe0',
'height': 1041,
'number': 3,
'size': 166089,
'transform': (204.0, 0.0, -0.0, 264.0, 204.0, 179.0),
'width': 749,
'xref': 48,
'xres': 96,
'yres': 96}]
------------------------------------------------------------
[{'bbox': (18.17300033569336, 0.0, 403.4729919433594, 604.0),
'bpc': 1,
'colorspace': 1,
'cs-name': 'DeviceGray',
'digest': b'\xc7\t\xaaL\xd8\xe6\xedq\xbd,\xd5\xd7_\xee\xde\xc0',
'height': 5032,
'number': 0,
'size': 13364,
'transform': (385.29998779296875, 0.0, -0.0, 604.0, 18.17300033569336, 0.0),
'width': 3210,
'xref': 24,
'xres': 96,
'yres': 96}]
------------------------------------------------------------
[{'bbox': (18.17300033569336, 0.84002685546875, 398.1730041503906, 606.0),
'bpc': 1,
'colorspace': 1,
'cs-name': 'DeviceGray',
'digest': b'\xbf\x84Ix\xbfG\xd8\xf8/S\xd4\x08^`\xe6\x12',
'height': 5042,
'number': 0,
'size': 55398,
'transform': (380.0,
0.0,
-0.0,
605.1599731445312,
18.17300033569336,
0.84002685546875),
'width': 3166,
'xref': 32,
'xres': 96,
'yres': 96}]
------------------------------------------------------------
[{'bbox': (18.17300033569336, 0.1400146484375, 403.1730041503906, 605.0),
'bpc': 1,
'colorspace': 1,
'cs-name': 'DeviceGray',
'digest': b'\xb7\xc1U<|\xdaQ\xa7\xbe\x8e\xe6\xae\xb8^\xbd\x89',
'height': 5040,
'number': 0,
'size': 54654,
'transform': (385.0,
0.0,
-0.0,
604.8599853515625,
18.17300033569336,
0.1400146484375),
'width': 3208,
'xref': 40,
'xres': 96,
'yres': 96}]
>>>
This method works for all supported document types, not only PDF. Specifying |
-
Pages 2 and up show one full page image each.
-
Now I finally see why we were talking past each other:
>>> page=doc[1]
>>> pprint(page.get_images(True))
[(24, 0, 3210, 5032, 1, 'DeviceGray', '', 'IxCBK', 'CCITTFaxDecode', 0),
 (21, 0, 67, 1752, 8, 'DeviceGray', '', 'PxCBA', 'FlateDecode', 15),
 (17, 18, 154, 34, 8, 'DeviceGray', '', 'PxCBF', 'FlateDecode', 15),
 (16, 0, 272, 31, 8, 'DeviceGray', '', 'PxCBG', 'FlateDecode', 15)]
>>> # check whether xref 15 is under an OC condition
>>> doc.get_oc(15)
14
>>> # indeed: it is, but the condition is not ON
>>> # ... as far as MuPDF sees it
>>>
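The check above can be wrapped into a small helper. This is only a sketch: `oc_controlled` is a hypothetical name, and it assumes that `doc` and `page` are an open PyMuPDF document and one of its pages, that `page.get_images(True)` lists the invoking Form XObject as the last item of each entry, and that `doc.get_oc()` returns 0 when the object has no Optional Content entry.

```python
def oc_controlled(doc, page):
    """Map xrefs of images / Form XObjects on `page` to their OC xref.

    Only objects that actually carry an Optional Content entry are
    included in the result.
    """
    found = {}
    seen = set()
    for item in page.get_images(True):
        # item[0]: image xref, item[-1]: invoking Form XObject (0 if none)
        for xref in (item[0], item[-1]):
            if xref <= 0 or xref in seen:
                continue
            seen.add(xref)
            oc = doc.get_oc(xref)  # assumed: 0 means "no OC condition"
            if oc:
                found[xref] = oc
    return found
```

For the page inspected above, the result should contain an entry for xref 15, matching the `doc.get_oc(15)` call.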
-
Understandable. Maybe this is better:
import fitz
from pprint import pprint

doc = fitz.open("test.pdf")
for page in doc:
    print("Images on page {}".format(page.number))
    imglist = page.get_images(True)
    for item in imglist:
        xref_img = item[0]
        xref_xob = item[-1]
        # remove any Optional Content information
        # from images and Form XObjects
        doc.set_oc(xref_img, 0)
        if xref_xob > 0:
            doc.set_oc(xref_xob, 0)
    # the following should now be complete ...
    imgs = [(img["xref"], img["bbox"]) for img in page.get_image_info(xrefs=True) if img["xref"] > 0]
    pprint(imgs)
    print("-" * 60)
-
And this one also removes those images at the bottom:
import fitz
from pprint import pprint

doc = fitz.open("test.pdf")
for page in doc:
    print("Images on page {}".format(page.number))
    imglist = page.get_images(True)
    for item in imglist:
        xref_img = item[0]
        xref_xob = item[-1]
        # remove any Optional Content information
        # from images and Form XObjects
        doc.set_oc(xref_img, 0)
        if xref_xob > 0:
            doc.set_oc(xref_xob, 0)
    imgs = [
        (img["xref"], img["bbox"])
        for img in page.get_image_info(xrefs=True)
        if img["xref"] > 0
    ]
    pprint(imgs)
    print("-" * 60)
    for item in imgs:
        if item[0] in (16, 17):
            page.add_redact_annot(item[1])
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_PIXELS)
doc.ez_save("removed.pdf")
The option |
-
Yes, identifying the images you need to delete is the part that requires the wits.
-
If your files are all built following the same kind of pattern, you could simply filter for images located at the bottom of the pages ...
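One way to sketch such a positional filter, assuming the entries come from `page.get_image_info(xrefs=True)` and the page height from `page.rect.height`. The helper name `images_near_bottom` and the 10 % band are arbitrary choices for illustration and would need tuning for real files.

```python
def images_near_bottom(image_info, page_height, band=0.1):
    """Return (xref, bbox) of images lying entirely in the bottom band.

    image_info:  list of dicts as returned by page.get_image_info(xrefs=True)
    page_height: height of the page, e.g. page.rect.height
    band:        fraction of the page height counted as "bottom" (assumption)
    """
    limit = page_height * (1 - band)  # y grows downwards in PyMuPDF
    hits = []
    for img in image_info:
        x0, y0, x1, y1 = img["bbox"]
        # skip entries without a known xref; keep images starting below the limit
        if img["xref"] > 0 and y0 >= limit:
            hits.append((img["xref"], img["bbox"]))
    return hits
```

The returned bboxes can then be passed to `page.add_redact_annot()`, as in the script earlier in this thread.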
-
If you are not sure about the reference name of such a Form XObject, you may want to look for a reference name that repeats across pages:
>>> doc=fitz.open("test.pdf")
>>> for page in doc:
        print("page {}".format(page.number), page.get_xobjects())

page 0 [(47, 'CBN', 0, Rect(0.0, 0.0, 612.0, 792.0))]
page 1 [(15, 'CBJ', 0, Rect(0.0, 0.0, 612.0, 792.0))] % CBJ invoked on all but the front page ...
page 2 [(15, 'CBJ', 0, Rect(0.0, 0.0, 612.0, 792.0))]
page 3 [(15, 'CBJ', 0, Rect(0.0, 0.0, 612.0, 792.0))]
>>>
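The same comparison can be automated. A rough sketch, assuming each entry of `page.get_xobjects()` is a tuple whose first two items are the xref and the reference name, as in the output above; `repeated_xobjects` is a hypothetical helper name.

```python
from collections import defaultdict


def repeated_xobjects(xobjects_per_page, min_pages=2):
    """Find Form XObjects that are invoked on several pages.

    xobjects_per_page: one list per page, each as returned by
                       page.get_xobjects()
    Returns {(xref, name): [page numbers]} for objects seen on at least
    `min_pages` pages.
    """
    pages = defaultdict(list)
    for pno, xobjects in enumerate(xobjects_per_page):
        for item in xobjects:
            key = (item[0], item[1])  # (xref, reference name)
            if pno not in pages[key]:
                pages[key].append(pno)
    return {key: pnos for key, pnos in pages.items() if len(pnos) >= min_pages}
```

A possible call is `repeated_xobjects([page.get_xobjects() for page in doc])`; any key it returns is a candidate for the Form XObject that carries the repeated images.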
-
I'm trying to remove/delete specific images from a file and they're repeated on multiple pages. When I go after their bbox with page.get_image_bbox, only the first instance reports a valid bbox; the others return Rect(1.0, 1.0, -1.0, -1.0). I see some warnings in the docs that may refer to this, but is there a way around this?