Is there a way to get CID values for a given character? #1140

inf3rnus · 2021-07-12T23:08:35Z

Hey again Jorj :),

Is there a way to get CID values for a given character?

I know that MuPDF internally coerces characters to Unicode, which is why we don't get back values like (cid: 80). Still, part of me feels the API would offer some way to retrieve that information, but I figured you'd know best.

By contrast, pdfminer emits placeholders of that form for characters it can't map to Unicode.

I'm trying to render some text outside of PyMuPDF using font files. As it turns out, some font files use the ADBC (Adobe Custom) encoding, so the only way to look up certain glyphs is to have a CID on hand.

If it is possible to extract CIDs, could that be added, as a feature request, as an option when extracting a rawdict?

Best,
Aaron

Answered by JorjMcKie

JorjMcKie · 2021-07-12T23:25:32Z

Maintainer

Presumably, by CID you mean the glyph id.
Yes, you can access that: there is a (so far internal) method, page._getTexttrace(). I currently use it in doc.subset_fonts() to handle cases where the unicode cannot be determined but the glyph id can.
There is no official documentation yet, so here is its output for the first page of PyMuPDF's PDF documentation:

>>> import fitz
>>> from pprint import pprint
>>> doc=fitz.open("pymupdf.pdf")
>>> page=doc[0]
>>> pprint(page._getTexttrace())  # a list of dictionaries of the page's text spans
[{'ascender': 0.9490000009536743,  # font ascender
  'bidi': 0,  # ignore for now
  # list of character information:
  'chars': ((80,  # unicode
             15,  # glyph id  <== this is what you want
             (237.02999877929688, 366.3789978027344),  # origin coordinates of the character
             16.532995485742617),  # char width / advance
            (121, 48, (253.56298828125, 366.3789978027344), 13.781627368237878),
            (77,
             12,
             (267.3446044921875, 366.3789978027344),
             20.647654271642637),
   ...,),
  'color': (0.0,),  # text color
  'colorspace': 1,  # colorspace.n
  'descender': -0.3070000112056732,  # font descender
  'dir': (1.0, 0.0),  # writing direction (cosine, sine) of the angle
  'font': 'NimbusSanL-Bold',  # font name
  'linewidth': 0.9959999918937683,  # current global line width
  'opacity': 1.0,  # alpha
  'scissor': (1.0, 1.0, -1.0, -1.0),  # ignore this
  'size': 24.787099838256836,  # font size
  'spacewidth': 6.890813684118939,  # width of space character
  'type': 0,  # standard text, if 4: hidden text
  'wmode': 0},  # writing mode, 0 = horizontal, 1 = vertical
...
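
To make the layout above concrete, here is a minimal sketch of pulling the (unicode, glyph id) pairs out of such a list, grouped by font. It assumes each entry in 'chars' is a tuple whose first two items are the unicode and the glyph id, exactly as annotated in the output above. The helper name glyphs_by_font is hypothetical and not part of PyMuPDF, and the sample data is copied from the first span shown so that the snippet runs without a PDF at hand.

```python
from collections import defaultdict


def glyphs_by_font(spans):
    """Collect (unicode, glyph id) pairs per font name.

    `spans` is assumed to have the shape shown above: a list of dicts,
    each with a 'font' name and a 'chars' tuple of
    (unicode, glyph id, origin, advance) entries.
    """
    result = defaultdict(list)
    for span in spans:
        for char in span["chars"]:
            ucs, gid = char[0], char[1]
            # Keep pairs in a list rather than a dict keyed by unicode:
            # several glyphs may share one unicode value when the real
            # character could not be determined.
            result[span["font"]].append((ucs, gid))
    return dict(result)


# Sample data copied from the first span of the output above (truncated).
sample_spans = [
    {
        "font": "NimbusSanL-Bold",
        "chars": (
            (80, 15, (237.02999877929688, 366.3789978027344), 16.532995485742617),
            (121, 48, (253.56298828125, 366.3789978027344), 13.781627368237878),
            (77, 12, (267.3446044921875, 366.3789978027344), 20.647654271642637),
        ),
    }
]

table = glyphs_by_font(sample_spans)
for ucs, gid in table["NimbusSanL-Bold"]:
    print(f"{chr(ucs)!r} (U+{ucs:04X}) -> glyph id {gid}")
```

With a real document, the same helper could presumably be fed the list returned by page._getTexttrace() directly. Since that method is internal, its name and output layout may differ between PyMuPDF versions.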

2 replies

inf3rnus Jul 12, 2021
Author

Freakin' brilliant!

Thank you!

-Aaron

JorjMcKie Jul 13, 2021
Maintainer

Thank you 😎.
That internal method also respects the global setting established by set_subset_fontnames().
