Hey again Jorj :),

Is there a way to get CID values for a given character? I know that MuPDF internally coerces characters to unicode, which is why we don't get CID values back, whereas e.g. pdfminer will output the CID for characters it can't map to unicode.

I'm trying to render some text outside of PyMuPDF using font files, and as it turns out, some font files use ADBC (Adobe Custom) encoding, so the only way to look up certain character glyphs is to have a CID on hand.

If it is possible to extract CIDs, as a feature request, could we potentially have that added as an option when extracting a rawdict?

Best,
Presumably, the CID means the glyph id.

Yes, you can access that: there is a (still internal) function page._getTexttrace(). I currently use it in doc.subset_fonts() to handle cases where the unicode cannot be determined, but the glyph id can.

I have no official documentation yet, so here is the output for the first page of PyMuPDF's PDF documentation:

>>> import fitz
>>> from pprint import pprint
>>> doc = fitz.open("pymupdf.pdf")
>>> page = doc[0]
>>> pprint(page._getTexttrace())  # a list of dictionaries of the page's text spans
[{'ascender': 0.9490000009536743,  # font ascender
  'bidi': 0,  # ignore for now
  # list of character information:
  'chars': ((80,  # unicode
             15,  # glyph id <== this is what you want
             (237.02999877929688, 366.3789978027344),  # origin coordinates of the character
             16.532995485742617),  # char width / advance
            (121, 48, (253.56298828125, 366.3789978027344), 13.781627368237878),
            (77,
             12,
             (267.3446044921875, 366.3789978027344),
             20.647654271642637),
            ...,),
  'color': (0.0,),  # text color
  'colorspace': 1,  # colorspace.n
  'descender': -0.3070000112056732,  # font descender
  'dir': (1.0, 0.0),  # writing direction (cosine, sine) of the angle
  'font': 'NimbusSanL-Bold',  # font name
  'linewidth': 0.9959999918937683,  # current global line width
  'opacity': 1.0,  # alpha
  'scissor': (1.0, 1.0, -1.0, -1.0),  # ignore this
  'size': 24.787099838256836,  # font size
  'spacewidth': 6.890813684118939,  # width of space character
  'type': 0,  # standard text, if 4: hidden text
  'wmode': 0},  # writing mode, 0 = horizontal, 1 = vertical
 ...
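
To pull just the glyph ids out of that structure, something like the following might work. It is only a sketch: the helper name `glyph_ids_by_font` is made up for illustration, and it assumes each span is a dict with `'font'` and `'chars'` keys whose `'chars'` entries are `(unicode, glyph id, origin, advance)` tuples, exactly as in the output above. Whether the glyph id matches the CID your font file expects is also an assumption you would need to verify for your fonts.

```python
def glyph_ids_by_font(trace):
    """Collect (unicode, glyph id) pairs per font name.

    `trace` is assumed to look like the texttrace output shown above:
    a list of span dicts, each with 'font' and 'chars' entries.
    """
    result = {}
    for span in trace:
        pairs = result.setdefault(span["font"], [])
        for char in span["chars"]:
            # char = (unicode, glyph id, origin, advance)
            ucs, gid = char[0], char[1]
            pairs.append((ucs, gid))
    return result


# Sample data copied from the first span above, truncated to three characters
# and to the two keys this helper actually reads.
sample_trace = [
    {
        "font": "NimbusSanL-Bold",
        "chars": (
            (80, 15, (237.02999877929688, 366.3789978027344), 16.532995485742617),
            (121, 48, (253.56298828125, 366.3789978027344), 13.781627368237878),
            (77, 12, (267.3446044921875, 366.3789978027344), 20.647654271642637),
        ),
    }
]

for font, pairs in glyph_ids_by_font(sample_trace).items():
    for ucs, gid in pairs:
        print(font, repr(chr(ucs)), gid)
```

On a real document you would pass in the result of `page._getTexttrace()` instead of `sample_trace`. Since that function is internal and undocumented, its name and output layout may change between versions.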