Metadata gets updated in the PDF structure, but doesn't reflect in the Adobe Reader (some properties) #1254

SirishaGorasa · 2021-09-12T10:16:58Z


Issue:

I update the metadata of my PDF document with the set_metadata() function and save it back to file. Before setting the new metadata, I remove all existing metadata using

doc.scrub(attached_files=False, clean_pages=False, embedded_files=False, hidden_text=False, javascript=False, metadata=True,
                redactions=False, redact_images=0, remove_links=False, reset_fields=False, reset_responses=False, thumbnails=False, xml_metadata=False)

Note: setting xml_metadata to True or False in the `scrub()` method gave the same result.

Here is how the metadata looks inside the PDF structure:

<<
  /Author (Martie Shrader)
  /CreationDate (D:19040229023654Z)
  /Creator (Adobe\256 PageMaker\256 6.5)
  /Keywords (untagged keywords)
  /ModDate (D:20150217093420-07'00')
  /Producer (Acrobat PDFWriter 4.05  for Power Macintosh)
  /Subject (untagged subject)
  /Title (Upload Title)
  /Trapped null
>>

In this example, the keywords property does not show up in Adobe Reader (it works for some PDFs).

Please let me know if I am doing something wrong, or whether there is anything else I need to do to make it work.

Thanks in advance!

JorjMcKie (Maintainer) · 2021-09-12T11:10:44Z

Which version are you using?
Platform presumably Mac OSX?


SirishaGorasa (Author) · 2021-09-12T11:21:45Z

> Which version are you using?
> Platform presumably Mac OSX?

PyMuPDF Details:

Name: PyMuPDF
Version: 1.18.15
Summary: PyMuPDF is a Python binding for the document renderer and toolkit MuPDF
Home-page: https://github.com/pymupdf/PyMuPDF
Author: Jorj McKie
Author-email: [email protected]
License: GNU AFFERO GPL 3.0

Platform: Windows 10, python 3.9.2


JorjMcKie (Maintainer) · 2021-09-12T11:33:33Z

Ok, looks good.
Which code did you use to update the metadata?
I cannot reproduce this. When I run

import fitz
doc = fitz.open()
page = doc.new_page()
m = doc.metadata
m["keywords"] = "kw1 kw2"
doc.set_metadata(m)
doc.save("x.pdf")

the file shows no irregularities in any viewer.


JorjMcKie (Maintainer) · 2021-09-12T11:35:54Z

If you want, please share the file and let me have a look.


SirishaGorasa (Author) · 2021-09-12T11:36:05Z

untagged.pdf

Could you please check whether the same code works on the attached PDF?


JorjMcKie (Maintainer) · 2021-09-12T12:25:08Z

Hm, the same thing is happening for me 🤔.
All other viewers I use show the keywords correctly: Firefox, PDF XChange, Foxit Reader, Nitro 5, evince (Linux).
And needless to say, the PDF object also looks harmless:

Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import fitz
>>> doc=fitz.open("untagged-mod.pdf")
>>> from pprint import pprint
>>> pprint(doc.metadata)
{'author': 'Martie Shrader',
 'creationDate': 'D:19040229023654Z',
 'creator': 'Adobe® PageMaker® 6.5',
 'encryption': None,
 'format': 'PDF 1.6',
 'keywords': 'kw1,kw2',
 'modDate': "D:20150217093420-07'00'",
 'producer': 'Acrobat PDFWriter 4.05  for Power Macintosh',
 'subject': '',
 'title': '',
 'trapped': ''}
>>> print(doc.xref_object(-1)) # the trailer
<<
  /Size 28
  /Info 14 0 R
  /Root 15 0 R
  /ID [ <182A58473815524CCFA528451E12F3C6> <5BE0747C168AEBD2D843611781631B79> ]
>>
>>> print(doc.xref_object(14)) # info object
<<
  /Author (Martie Shrader)
  /CreationDate (D:19040229023654Z)
  /Creator (Adobe\256 PageMaker\256 6.5)
  /Keywords (kw1,kw2)
  /ModDate (D:20150217093420-07'00')
  /Producer (Acrobat PDFWriter 4.05  for Power Macintosh)
  /Subject null
  /Title null
  /Trapped null
>>
>>>

I am out of good advice, I must say ...


JorjMcKie (Maintainer) · 2021-09-12T12:31:49Z

Ha! I found the reason:
Because the file has XML metadata, Adobe Acrobat uses it for the properties display.
There, the keywords are empty. If you do doc.del_xml_metadata(), the display works!
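
For anyone hitting the same symptom, here is a minimal sketch of how one might detect this mismatch before deleting anything. The helper names `xmp_keywords`, `xmp_is_stale` and `drop_stale_xmp` are made up for illustration; only `doc.metadata`, `get_xml_metadata()`, `del_xml_metadata()` and `save()` are PyMuPDF calls. It assumes the XML metadata parses as well-formed XML and stores keywords under `pdf:Keywords`, which may not hold for every file.

```python
import xml.etree.ElementTree as ET

# Adobe PDF namespace used for pdf:Keywords in XMP packets
PDF_NS = "http://ns.adobe.com/pdf/1.3/"


def xmp_keywords(xmp_source):
    """Return the pdf:Keywords value found in an XMP packet, or "" if absent."""
    if not xmp_source.strip():
        return ""
    root = ET.fromstring(xmp_source)
    tag = "{%s}Keywords" % PDF_NS
    for elem in root.iter():
        # pdf:Keywords can appear as a child element ...
        if elem.tag == tag:
            return (elem.text or "").strip()
        # ... or as an attribute of rdf:Description
        value = elem.get(tag)
        if value is not None:
            return value.strip()
    return ""


def xmp_is_stale(doc):
    """True if XML metadata exists and its keywords differ from the Info dictionary."""
    xmp = doc.get_xml_metadata()
    if not xmp.strip():
        return False  # no XML metadata, so viewers fall back to the Info dictionary
    info_keywords = (doc.metadata or {}).get("keywords") or ""
    return xmp_keywords(xmp) != info_keywords


def drop_stale_xmp(src, dst):
    """Hypothetical helper: delete XML metadata only when it contradicts the Info dict."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    doc = fitz.open(src)
    if xmp_is_stale(doc):
        doc.del_xml_metadata()
    doc.save(dst)
```

Calling `drop_stale_xmp("untagged-mod.pdf", "untagged-fixed.pdf")` would write a copy without the XML metadata if the two sources disagree; the output file name is just an example.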


SirishaGorasa (Author) · 2021-09-12T13:14:35Z

Thanks @JorjMcKie.
Is it recommended to remove XML metadata? Is there any way to automatically generate XML metadata for a PDF from our original metadata?


JorjMcKie (Maintainer) · 2021-09-12T13:32:12Z

> Is it recommended to remove xml metadata?

Well, at least I recommend it. But you are on your own regarding any risk of losing potentially important information.
As I wrote in the documentation: I do not include XML manipulation features, on purpose.
If you want, develop something that interprets the PDF's XML metadata extracted via doc.get_xml_metadata(), and feeds the information to wherever you like, e.g. the standard metadata.
You can also go the other way round and update the XML metadata via doc.set_xml_metadata(xml source) with information taken from doc.metadata.

But again: PyMuPDF is not in the business of dealing with XML syntax - use e.g. lxml if you need something there.
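
As a rough illustration of the "other way round" direction, the sketch below builds a small XMP packet from a `doc.metadata`-style dictionary using only the standard library. The function `build_xmp` and helper `q` are hypothetical names, not part of PyMuPDF. The mapping covers only keywords, producer, title and author and follows the usual XMP conventions (`pdf:Keywords`, `pdf:Producer`, `dc:title` as a language alternative, `dc:creator` as a sequence). Whether a given viewer accepts the result is an assumption to verify on your own files.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly used in XMP packets for PDF files
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for _prefix, _uri in NS.items():
    ET.register_namespace(_prefix, _uri)  # keep readable prefixes in the output


def q(prefix, name):
    """Build a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (NS[prefix], name)


def build_xmp(meta):
    """Return an XMP packet (str) built from a doc.metadata-style dict."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # Simple text properties; empty or missing values are skipped
    if meta.get("keywords"):
        ET.SubElement(desc, q("pdf", "Keywords")).text = meta["keywords"]
    if meta.get("producer"):
        ET.SubElement(desc, q("pdf", "Producer")).text = meta["producer"]

    # dc:title is a language alternative with an x-default entry
    if meta.get("title"):
        alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
        ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = meta["title"]

    # dc:creator is an ordered list of names
    if meta.get("author"):
        seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
        ET.SubElement(seq, q("rdf", "li")).text = meta["author"]

    body = ET.tostring(xmpmeta, encoding="unicode")
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    return header + "\n" + body + "\n" + '<?xpacket end="w"?>'
```

With a document open in PyMuPDF, the packet could then be stored with `doc.set_xml_metadata(build_xmp(doc.metadata))` before saving. For anything beyond these few fields, a dedicated XML or XMP library is probably the better choice, as suggested above.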


SirishaGorasa (Author) · 2021-09-12T14:20:27Z

Ok! But when we set xml_metadata=True in the doc.scrub() method, isn't it supposed to remove any XML metadata present in the document?


JorjMcKie (Maintainer) · 2021-09-12T14:37:16Z

> Isn't it supposed to remove any xml metadata present in the document?

Sure, and it does. I just tested it: I saved the resulting file, added keywords information and, voilà, Adobe showed them.
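
The sequence described here might look roughly like the sketch below. The wrapper `replace_metadata` is a hypothetical name. The `scrub()` arguments mirror the call from the original question with `xml_metadata=True`, and the check after scrubbing is an extra safeguard added for illustration, not something PyMuPDF requires.

```python
def replace_metadata(doc, new_meta):
    """Remove Info and XML metadata from an open PyMuPDF document, then set new_meta."""
    # Same switches as in the question, but with xml_metadata=True
    doc.scrub(
        attached_files=False, clean_pages=False, embedded_files=False,
        hidden_text=False, javascript=False, metadata=True,
        redactions=False, redact_images=0, remove_links=False,
        reset_fields=False, reset_responses=False, thumbnails=False,
        xml_metadata=True,
    )
    # Fail loudly if XML metadata survived, e.g. because scrub() was skipped upstream
    if doc.get_xml_metadata():
        raise RuntimeError("XML metadata still present after scrub()")
    doc.set_metadata(new_meta)
```

A typical use would be to open the file with `fitz.open(...)`, copy `doc.metadata`, change the `"keywords"` entry, pass the dictionary to `replace_metadata(doc, m)` and save to a new file.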

1 reply

SirishaGorasa Sep 12, 2021
Author

Oh yeah, I was just checking for some permissions before I scrub the data. That might be the issue in my case. Thanks for checking!
