Character and Unicode mapping is incorrect for CID fonts with embedded CMaps #1072

Open
dhdaines opened this issue Dec 13, 2024 · 3 comments

dhdaines commented Dec 13, 2024

In theory pdfminer.six has a CMapParser which is capable of parsing embedded CMaps defined in the Encoding field of a Type0 font specification.

In practice, it doesn't do that at all... it only parses ToUnicode CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code

This is a problem because some PDFs will actually define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. So pdfminer.six is not able to get the right widths, etc, for characters in PDFs that use these because it cannot map them to any CIDs.

There is a more visible problem, which is that it is also unable to extract any text from them. This is because its handling of ToUnicode CMaps is actually entirely incorrect (and unfortunately PLAYA has inherited this, which I am in the process of fixing at the moment).

Specifically, pdfminer.six assumes that the mapping from a byte sequence in an object stream to a Unicode string goes like this:

b'ABC' => [cid(A), cid(B), cid(C)] => ["A", "B", "C"]

This is incorrect. Instead, ToUnicode is intended to map byte sequences directly to Unicode characters, so:

b'ABC' => ["A", "B", "C"]

The Encoding CMap (which could be an embedded one as noted above) does a separate mapping of byte sequences to CIDs which has nothing to do with text extraction. The pdfminer.six approach only happens to work most of the time because one of three things holds: there is no CMap; the CMap is an identity CMap, so the input bytes and the CIDs are the same; or one of the predefined Unicode CMaps is used (see below).
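
To make the difference concrete, here is a minimal sketch in Python. The tables and function names are hypothetical toy values invented for illustration (they are not pdfminer.six APIs and not taken from any real PDF); the point is only that a ToUnicode lookup keyed by CID fails as soon as the Encoding CMap is not the identity.

```python
# Hypothetical toy tables, keyed by integer character code.
ENCODING = {0x41: 5, 0x42: 6, 0x43: 7}            # character code -> CID
IDENTITY = {0x41: 0x41, 0x42: 0x42, 0x43: 0x43}   # identity Encoding CMap
TOUNICODE = {0x41: "A", 0x42: "B", 0x43: "C"}     # character code -> Unicode


def decode_via_cid(data, encoding):
    """Chain described as incorrect: code -> CID -> ToUnicode[CID]."""
    return [TOUNICODE.get(encoding[code], "\ufffd") for code in data]


def decode_direct(data):
    """Chain described as correct: code -> ToUnicode[code]."""
    return [TOUNICODE.get(code, "\ufffd") for code in data]


print(decode_direct(b"ABC"))
print(decode_via_cid(b"ABC", IDENTITY))   # agrees only because CID == code
print(decode_via_cid(b"ABC", ENCODING))   # replacement characters
```

With the identity table both functions agree, which is why the bug stays hidden for most PDFs.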

Here are some samples from pdf.js that illustrate the problem (pdfminer.six cannot extract text from them):

https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue2931.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue9534_reduced.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue18117.pdf

@dhdaines dhdaines changed the title Encoding CMaps are not actually parsed Embedded CMaps are not actually parsed Dec 15, 2024
@dhdaines dhdaines changed the title Embedded CMaps are not actually parsed Embedded CMaps are not actually parsed, and character codes are not mapped Dec 15, 2024
@dhdaines dhdaines changed the title Embedded CMaps are not actually parsed, and character codes are not mapped Character and Unicode mapping is incorrect for CID fonts with embeded CMaps Dec 15, 2024

dhdaines commented Dec 15, 2024

The pdf.js code is really quite clear for this.

  1. From an input byte string, first it reads variable-width character codes according to the ranges defined in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3454
  2. The CID (called widthCode here but it's the CID) is looked up in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3350
  3. The Unicode string representation is looked up in the ToUnicode map: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3363

And then some other stuff happens ;-) but the important point here is that Encoding and ToUnicode maps, while they both have the form of CMaps, are really totally separate and different things.
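
The three steps above can be sketched roughly as follows in Python. This is not a port of the pdf.js code; `read_codes`, `decode`, the codespace ranges and the CID values are all assumptions made up for the example. It splits the input into variable-width character codes using the codespace ranges, then looks up the CID and the Unicode string independently, both keyed by the character code.

```python
from typing import Dict, Iterator, List, Tuple

CodespaceRange = Tuple[bytes, bytes]  # (low, high), same length


def read_codes(data: bytes, ranges: List[CodespaceRange]) -> Iterator[bytes]:
    """Step 1: split data into character codes using codespace ranges."""
    pos = 0
    while pos < len(data):
        for low, high in ranges:
            chunk = data[pos:pos + len(low)]
            # Every byte must lie within the corresponding low/high bytes.
            if len(chunk) == len(low) and all(
                lo <= c <= hi for lo, c, hi in zip(low, chunk, high)
            ):
                yield chunk
                pos += len(chunk)
                break
        else:
            # No range matched: consume a single byte (a simplification).
            yield data[pos:pos + 1]
            pos += 1


def decode(
    data: bytes,
    ranges: List[CodespaceRange],
    encoding: Dict[bytes, int],
    tounicode: Dict[bytes, str],
) -> List[Tuple[int, str]]:
    """Steps 2 and 3: CID and Unicode are both looked up by character code."""
    return [
        (encoding.get(code, 0), tounicode.get(code, ""))
        for code in read_codes(data, ranges)
    ]


# Hypothetical mixed one- and two-byte encoding.
RANGES = [(b"\x00", b"\x80"), (b"\x81\x40", b"\x9f\xfc")]
ENCODING = {b"A": 34, b"\x82\xa0": 843}
TOUNICODE = {b"A": "A", b"\x82\xa0": "\u3042"}

print(decode(b"A\x82\xa0A", RANGES, ENCODING, TOUNICODE))
```

The CID is what a renderer would use for widths and glyph selection; the Unicode string is what text extraction would use. Neither lookup depends on the other.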

dhdaines commented Dec 15, 2024

The source of the confusion here is that Adobe's "standards" are contradictory (see below), compounded by the special case (represented by pdfminer/cmap/to-unicode-*) of Unicode conversion for predefined CMaps. That conversion is indeed done by mapping the CID to a "Unicode value" (presumably a code point in UCS-2) using a special CMap.

But this particular CMap is not a ToUnicode map, it is simply a special CMap whose CID values can be interpreted as Unicode code points. See PDF 1.7 section 9.10.2.

This logic is implemented in conformance with the PDF 1.7 specification in pdf.js here: https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L3796
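
As a rough illustration of that special case, the sketch below follows the procedure in PDF 1.7 section 9.10.2 as I read it: map the character code to a CID with the font's CMap, build the name of a second CMap from the registry and ordering, and map the CID through that second CMap. The function names and the table contents are hypothetical; a real implementation would load the named CMap resource instead of using a dict.

```python
from typing import Dict, Optional


def ucs2_cmap_name(registry: str, ordering: str) -> str:
    """Name of the CID -> Unicode CMap for a character collection."""
    return f"{registry}-{ordering}-UCS2"


def predefined_to_unicode(
    code: bytes,
    encoding_cmap: Dict[bytes, int],
    ucs2_cmap: Dict[int, str],
) -> Optional[str]:
    """Character code -> CID -> Unicode, for predefined CMaps only."""
    cid = encoding_cmap.get(code)
    if cid is None:
        return None
    return ucs2_cmap.get(cid)


# Hypothetical stand-ins for a predefined Encoding CMap and its UCS2 CMap.
ENCODING_CMAP = {b"\x82\xa0": 843}
UCS2_CMAP = {843: "\u3042"}

print(ucs2_cmap_name("Adobe", "Japan1"))
print(predefined_to_unicode(b"\x82\xa0", ENCODING_CMAP, UCS2_CMAP))
```

Note that the second table is keyed by CID, unlike a ToUnicode map, which is keyed by character code.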

dhdaines commented Jan 6, 2025

The plot thickens here - if you read Adobe Technical Note #5411, which to their credit, the authors of pdfminer.six clearly did, and which is referenced in the PDF 1.7 specification, then you would assume that ToUnicode maps are intended to apply to CIDs:

In order to derive content from PDFs that embed CIDFonts based on other character
collections, a “ToUnicode” mapping file must be created, and properly installed for use with
Distiller. This “ToUnicode” mapping file shall become part of the PDF, to ensure portability.
This file, which follows CMap-style syntax, maps CIDs to Unicode UTF-16BE character
codes.
Because a “ToUnicode” mapping file is used to convert from CIDs (which begin at decimal 0,
which is expressed as 0x0000 in hexadecimal notation) to Unicode code points, the following
“codespacerange” definition, without exception, shall always be used:
1 begincodespacerange
<0000> <FFFF>
endcodespacerange

But this is entirely wrong! If you continue reading the PDF 1.7 standard, which the authors of pdf.js did (probably after encountering many curious PDFs), it goes on to say something totally different:

The CMap file shall contain begincodespacerange and endcodespacerange operators that are
consistent with the encoding that the font uses. In particular, for a simple font, the codespace shall be one
byte long.
It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping
from character codes to Unicode character sequences expressed in UTF-16BE encoding.

Note that it says character codes, which are not the same thing as CIDs and are obviously not always two bytes!

Cue Spiderman pointing at Spiderman image with both Spidermen labeled "Adobe"!

I don't really know why the PDF 1.4 standard added a reference to that utterly misleading technical note, but it should be ignored.

That said, the correct definition of ToUnicode is really a strict superset of the one in the technical note - basically you just have to actually respect the codespace ranges, and it covers both cases, and this is what pdf.js does.
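
A small sketch of why respecting the codespace ranges covers both cases. The parser below is deliberately naive (regular expressions over `codespacerange` and `bfchar` sections only, no `bfrange`) and is not how pdfminer.six or pdf.js parse CMaps; the two CMap fragments are invented for the example. The same decoder handles the fixed two-byte codespace from the technical note and a one-byte codespace.

```python
import re
from typing import Dict, List, Tuple

HEX = re.compile(r"<([0-9A-Fa-f]+)>")


def section(cmap: str, name: str) -> List[bytes]:
    """All hex strings found between begin<name> and end<name>."""
    values: List[bytes] = []
    for body in re.findall(rf"begin{name}(.*?)end{name}", cmap, flags=re.S):
        values.extend(bytes.fromhex(h) for h in HEX.findall(body))
    return values


def parse_tounicode(cmap: str) -> Tuple[List[Tuple[bytes, bytes]], Dict[bytes, str]]:
    cs = section(cmap, "codespacerange")
    ranges = list(zip(cs[0::2], cs[1::2]))
    bf = section(cmap, "bfchar")
    # Destination strings are UTF-16BE, keyed by the raw character code.
    mapping = {src: dst.decode("utf-16-be") for src, dst in zip(bf[0::2], bf[1::2])}
    return ranges, mapping


def to_text(data: bytes, ranges, mapping) -> str:
    out = []
    pos = 0
    while pos < len(data):
        # Code width comes from the first matching codespace range.
        width = next(
            (
                len(lo)
                for lo, hi in ranges
                if len(data) - pos >= len(lo)
                and all(a <= b <= c for a, b, c in zip(lo, data[pos:pos + len(lo)], hi))
            ),
            1,
        )
        out.append(mapping.get(data[pos:pos + width], "\ufffd"))
        pos += width
    return "".join(out)


TWO_BYTE = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0048>
<0004> <0069>
endbfchar
"""

ONE_BYTE = """
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""

print(to_text(b"\x00\x03\x00\x04", *parse_tounicode(TWO_BYTE)))
print(to_text(b"\x01\x02", *parse_tounicode(ONE_BYTE)))
```

Hard-coding a two-byte code width, as the technical note implies, would misread the second input.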
