Character and Unicode mapping is incorrect for CID fonts with embedded CMaps #1072

dhdaines · 2024-12-13T17:33:46Z

In theory, pdfminer.six has a CMapParser which is capable of parsing embedded CMaps defined in the Encoding field of a Type0 font specification.

In practice, it doesn't do that at all... it only parses ToUnicode CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code

This is a problem because some PDFs actually define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. pdfminer.six therefore cannot get the right widths, etc., for characters in PDFs that use these, because it cannot map them to any CIDs.

There is a more visible problem, which is that it is also unable to extract any text from them. This is because its handling of ToUnicode CMaps is actually entirely incorrect (and unfortunately PLAYA has inherited this, which I am in the process of fixing at the moment).

Specifically, pdfminer.six assumes that the mapping from a byte sequence in an object stream to a Unicode string goes like this:

b'ABC' => [cid(A), cid(B), cid(C)] => ["A", "B", "C"]

This is incorrect. Instead, ToUnicode is intended to map byte sequences directly to Unicode characters, so:

b'ABC' => ["A", "B", "C"]

The Encoding CMap (which could be an embedded one as noted above) does a separate mapping of byte sequences to CIDs which has nothing to do with text extraction. This only happens to work most of the time in pdfminer.six because either there is no CMap, or the CMap is an identity CMap, so the input bytes and the CIDs are the same, or one of the predefined Unicode CMaps is used (see below).
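
To make the difference concrete, here is a minimal sketch in Python. The tables, the one-byte codespace and the function names are all invented for illustration; this is not pdfminer.six code.

```python
# Toy tables for a hypothetical Type0 font; none of these values come from a real PDF.
# Encoding CMap: character code -> CID (used for glyph selection and widths).
EXOTIC = {b"\x41": 100, b"\x42": 101, b"\x43": 102}
# An identity encoding, where the CID equals the value of the character code.
IDENTITY = {bytes([i]): i for i in range(256)}
# ToUnicode CMap: character code -> Unicode string (used for text extraction).
TO_UNICODE = {b"\x41": "A", b"\x42": "B", b"\x43": "C"}


def split_codes(data):
    """Split a byte string into one-byte character codes (toy codespace)."""
    return [data[i:i + 1] for i in range(len(data))]


def text_via_cid(data, encoding, to_unicode):
    """Chained lookup: code -> CID, then the CID is used as the ToUnicode key."""
    return [to_unicode.get(bytes([encoding[code]])) for code in split_codes(data)]


def text_direct(data, to_unicode):
    """Direct lookup: the character code itself is the ToUnicode key."""
    return [to_unicode.get(code) for code in split_codes(data)]


print(text_via_cid(b"ABC", EXOTIC, TO_UNICODE))    # chained lookup, exotic encoding
print(text_via_cid(b"ABC", IDENTITY, TO_UNICODE))  # chained lookup, identity encoding
print(text_direct(b"ABC", TO_UNICODE))             # direct lookup
```

With the made-up `EXOTIC` encoding the chained lookup finds nothing, while with the identity encoding it happens to agree with the direct lookup, which is why the problem usually goes unnoticed.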

Here are some samples from pdf.js that illustrate the problem (pdfminer.six cannot extract text from them):

https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue2931.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue9534_reduced.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue18117.pdf

dhdaines · 2024-12-15T21:27:50Z

The pdf.js code is really quite clear for this.

From an input byte string, first it reads variable-width character codes according to the ranges defined in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3454
The CID (called widthCode here but it's the CID) is looked up in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3350
The Unicode string representation is looked up in the ToUnicode map: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3363

And then some other stuff happens ;-) but the important point here is that Encoding and ToUnicode maps, while they both have the form of CMaps, are really totally separate and different things.
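
The three steps above can be sketched roughly as follows. This is not a port of the pdf.js code: the codespace ranges, both tables and all function names are illustrative assumptions, and the fallback for unmatched bytes is a simplification.

```python
# Codespace ranges as (low, high) byte strings of equal length (illustrative values).
CODESPACE = [(b"\x00", b"\x80"), (b"\x81\x40", b"\xfe\xfe")]
# Two separate tables, both keyed by character code (made-up entries).
ENCODING = {b"\x41": 34, b"\x81\x40": 633}            # character code -> CID
TO_UNICODE = {b"\x41": "A", b"\x81\x40": "\u3000"}    # character code -> Unicode


def in_range(code, low, high):
    """True if code has the range's length and every byte lies within bounds."""
    return len(code) == len(low) and all(
        lo <= c <= hi for c, lo, hi in zip(code, low, high)
    )


def read_codes(data, codespace):
    """Step 1: split data into variable-width character codes."""
    max_len = max(len(low) for low, _ in codespace)
    pos = 0
    while pos < len(data):
        for n in range(1, max_len + 1):
            code = data[pos:pos + n]
            if any(in_range(code, low, high) for low, high in codespace):
                break
        else:
            code = data[pos:pos + 1]  # nothing matched: consume one byte
        yield code
        pos += len(code)


def decode(data):
    """Steps 2 and 3: look up the CID and the Unicode string independently."""
    return [
        (code, ENCODING.get(code), TO_UNICODE.get(code))
        for code in read_codes(data, CODESPACE)
    ]


for code, cid, text in decode(b"A\x81\x40A"):
    print(code, cid, repr(text))
```

The CID never feeds into the Unicode lookup here: both tables are indexed by the same character code.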

dhdaines · 2024-12-15T23:29:54Z

The source of the confusion here is that Adobe's "standards" are contradictory (see below), combined with the special case (represented by pdfminer/cmap/to-unicode-*) of Unicode conversion for predefined CMaps. This is indeed done by mapping the CID to a "Unicode value" (presumably a code point in UCS-2) using a special CMap.

But this particular CMap is not a ToUnicode map, it is simply a special CMap whose CID values can be interpreted as Unicode code points. See PDF 1.7 section 9.10.2.

This logic is implemented in conformance with the PDF 1.7 specification in pdf.js here: https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L3796
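
A rough sketch of the order of preference described in section 9.10.2: use the ToUnicode CMap on the character code if the font has one, and only otherwise go through the CID and the special UCS2 CMap for the font's character collection. The dictionary layout, the key names and all table values are invented for illustration.

```python
def unicode_for(code, font):
    """Return the Unicode string for one character code, or None if unknown.

    `font` is a hypothetical dict with optional keys:
      "to_unicode": character code (bytes) -> str
      "encoding":   character code (bytes) -> CID (predefined CMap)
      "ucs2":       CID -> Unicode code point (the special CID-keyed CMap)
    """
    to_unicode = font.get("to_unicode")
    if to_unicode is not None:
        # A ToUnicode CMap is keyed by the character code, not by the CID.
        return to_unicode.get(code)
    ucs2 = font.get("ucs2")
    if ucs2 is not None:
        # Predefined CMap case: code -> CID -> "Unicode value".
        cid = font["encoding"].get(code)
        value = ucs2.get(cid)
        return None if value is None else chr(value)
    return None


# Made-up fonts exercising each branch.
with_tounicode = {"to_unicode": {b"\x01": "fi"}}
predefined = {"encoding": {b"\x81\x40": 633}, "ucs2": {633: 0x3000}}

print(unicode_for(b"\x01", with_tounicode))
print(repr(unicode_for(b"\x81\x40", predefined)))
```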

dhdaines · 2025-01-06T16:40:50Z

The plot thickens here - if you read Adobe Technical Note #5411, which to their credit, the authors of pdfminer.six clearly did, and which is referenced in the PDF 1.7 specification, then you would assume that ToUnicode maps are intended to apply to CIDs:

In order to derive content from PDFs that embed CIDFonts based on other character collections, a “ToUnicode” mapping file must be created, and properly installed for use with Distiller. This “ToUnicode” mapping file shall become part of the PDF, to ensure portability. This file, which follows CMap-style syntax, maps CIDs to Unicode UTF-16BE character codes.
Because a “ToUnicode” mapping file is used to convert from CIDs (which begin at decimal 0, which is expressed as 0x0000 in hexadecimal notation) to Unicode code points, the following “codespacerange” definition, without exception, shall always be used:
1 begincodespacerange
<0000> <FFFF>
endcodespacerange

But this is entirely wrong! If you continue reading the PDF 1.7 standard, which the authors of pdf.js did (probably after encountering many curious PDFs), it goes on to say something totally different:

The CMap file shall contain begincodespacerange and endcodespacerange operators that are consistent with the encoding that the font uses. In particular, for a simple font, the codespace shall be one byte long.
It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.

Note: character codes, which are not the same thing as CIDs and are obviously not always two bytes!

Cue Spiderman pointing at Spiderman image with both Spidermen labeled "Adobe"!

I don't really know why the PDF 1.4 standard added a reference to that utterly misleading technical note, but it should be ignored.

That said, the correct definition of ToUnicode is really a strict superset of the one in the technical note - basically you just have to actually respect the codespace ranges, and it covers both cases, and this is what pdf.js does.
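
A minimal sketch of what respecting the codespace ranges means for a ToUnicode CMap. The parser is a toy built on regular expressions, not a real CMap parser: it assumes well-formed hex strings, ignores the array form of bfrange, and increments the whole destination value rather than only its last byte. All names and the sample CMap are made up.

```python
import re

# A made-up ToUnicode CMap with a one-byte codespace, as for a simple font.
SAMPLE = """
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<01> <0048>
endbfchar
1 beginbfrange
<02> <04> <0061>
endbfrange
"""


def sections(text, name):
    """Yield the hex strings (as bytes) inside each begin<name>/end<name> block."""
    for body in re.findall(rf"begin{name}(.*?)end{name}", text, re.S):
        yield [bytes.fromhex(h) for h in re.findall(r"<([0-9A-Fa-f]+)>", body)]


def parse_tounicode(text):
    """Return (codespace ranges, mapping from character code to str)."""
    codespace, mapping = [], {}
    for items in sections(text, "codespacerange"):
        codespace += list(zip(items[0::2], items[1::2]))
    for items in sections(text, "bfchar"):
        for code, dst in zip(items[0::2], items[1::2]):
            mapping[code] = dst.decode("utf-16-be")
    for items in sections(text, "bfrange"):
        for low, high, dst in zip(items[0::3], items[1::3], items[2::3]):
            first, last = int.from_bytes(low, "big"), int.from_bytes(high, "big")
            base = int.from_bytes(dst, "big")
            for offset in range(last - first + 1):
                code = (first + offset).to_bytes(len(low), "big")
                value = (base + offset).to_bytes(len(dst), "big")
                mapping[code] = value.decode("utf-16-be")
    return codespace, mapping


def decode(data, codespace, mapping):
    """Split data by the codespace ranges, then map each code to Unicode."""
    lengths = sorted({len(low) for low, _ in codespace})
    out, pos = [], 0
    while pos < len(data):
        for n in lengths:
            code = data[pos:pos + n]
            if any(
                len(code) == len(low)
                and all(lo <= c <= hi for c, lo, hi in zip(code, low, high))
                for low, high in codespace
            ):
                break
        else:
            code = data[pos:pos + 1]  # nothing matched: consume one byte
        out.append(mapping.get(code, "\ufffd"))
        pos += len(code)
    return "".join(out)


codespace, mapping = parse_tounicode(SAMPLE)
print(decode(b"\x01\x02\x03\x04", codespace, mapping))
```

A CMap written the way the technical note demands, with a single `<0000> <FFFF>` range, goes through the same code path: the codes simply come out two bytes long.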

dhdaines changed the title ~~Encoding CMaps are not actually parsed~~ Embedded CMaps are not actually parsed Dec 15, 2024

dhdaines changed the title ~~Embedded CMaps are not actually parsed~~ Embedded CMaps are not actually parsed, and character codes are not mapped Dec 15, 2024

dhdaines changed the title ~~Embedded CMaps are not actually parsed, and character codes are not mapped~~ Character and Unicode mapping is incorrect for CID fonts with embeded CMaps Dec 15, 2024

dhdaines mentioned this issue Dec 15, 2024

ToUnicode maps should map character codes, not CIDs dhdaines/playa#28

Closed

dhdaines mentioned this issue Dec 16, 2024

Use PLAYA instead of pdfminer jsvine/pdfplumber#1226

Draft

dhdaines mentioned this issue Jan 6, 2025

Correctly implement ToUnicode according to the PDF standard and not that bogus technical note (that the PDF standard refers to...) dhdaines/playa#41

Merged
