Unicode support in notes #5

HermannDppes opened this issue Aug 2, 2022 · 2 comments

@HermannDppes
Contributor

The (non-)problem(s)

While investigating pdfpc/#586 to understand the lower-level mechanisms that break when using LuaLaTeX, I found that Unicode support in \pdfpcnote is as follows:

  1. Full Unicode support in XeLaTeX.
  2. Partial Latin-1 support in PDFLaTeX.
  3. No non-ASCII support in LuaLaTeX.

While 1. is exactly what one would expect, 2. and especially 3. are not, and these limitations should probably at least be documented or, better yet, fixed. (Had I found fixing them easy, this would be a pull request, not an issue.)

Current state

Looking at the generated PDF files after uncompressing them with pdftk, we see that XeTeX converts /Contents (ä β 你好) to /Contents <feff00e4002003b200204f60597d> on its own, whereas PDFTeX and LuaTeX appear to do no conversion at all. With LuaTeX we therefore get a raw UTF-8 encoded string in the PDF file, which violates the specification (unless the string happens to be pure ASCII) and is consequently not rendered as intended. PDFTeX does much the same, but under some circumstances the expansion of the argument happens to be valid ISO-8859-1, which is an allowed encoding for text strings.
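For reference, the hex string XeTeX emits is nothing more than the UTF-16BE code units of the input, prefixed with the byte-order mark:

feff  U+FEFF  byte-order mark
00e4  U+00E4  ä
0020  U+0020  space
03b2  U+03B2  β
0020  U+0020  space
4f60  U+4F60  你
597d  U+597D  好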

More specifically for PDFLaTeX: when using [utf8]{inputenc} or [latin1]{inputenc} together with [T1]{fontenc}, the majority of the non-ASCII symbols in the encoding come through. Without the font encoding, or with [utf8x]{inputenc}, the notes get quite horribly mangled. I have not experimented much with non-Latin-1 characters, but with input encoding UTF-8 and font encoding T2A, at least the letter Б was not rendered properly. A minimal document exercising the working combination is sketched below.
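For concreteness (I am assuming here that the package is loaded as \usepackage{pdfpcnotes}; adjust to however you load it):

\documentclass{scrartcl}
\usepackage[utf8]{inputenc}% [latin1]{inputenc} behaves the same
\usepackage[T1]{fontenc}% dropping this line mangles the note
\usepackage{pdfpcnotes}

\begin{document}
	Hühner wären Vögel.
	\pdfpcnote{Hühner wären Vögel.}
\end{document}

Compile with pdflatex and open the result in pdfpc to check the note.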

This time (see later for a time when I did not) I kept all my files documenting what works out of the box; see pdfpcnotes-unicode.tar.gz for these files. You do not have to build them yourself: just open all.pdf in pdfpc to see most of the things I mentioned so far.

Ways forward

I think the best way to handle non-ASCII notes would be to implement logic that does what XeTeX already does, i.e. convert whatever input is given to the macro to UTF-16 (big-endian, with byte-order mark) and output it as hex.

Why UTF-16?

The only valid encodings for text strings (at least as of PDF 1.5) are UTF-16 (BE+BOM) and ISO-8859-1. Having a special case for the less comprehensive encoding accomplishes nothing (and might incur the fringe problem of strings starting with þÿ being read as Unicode; see the example below).
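Spelled out byte by byte: the Latin-1 string þÿOZ is the byte sequence FE FF 4F 5A, which a conforming reader instead parses as UTF-16BE:

þ ÿ O Z   ->  FE FF 4F 5A
FE FF     ->  byte-order mark
4F 5A     ->  U+4F5A, a single CJK character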

Why hex?

It will most likely turn out to be easier; but if emitting the string “directly” as a literal turns out to be easier after all, this preference of mine reverses.
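For comparison, the same /Contents entry in the two allowed forms. In the literal form the UTF-16BE output would contain raw NUL and other control bytes, and any stray 0x28/0x29/0x5C bytes would need escaping; the hex form is nothing but hexadecimal digits between angle brackets:

/Contents (þÿ...raw UTF-16BE bytes, escaping required...)
/Contents <FEFF00E4002003B200204F60597D>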

Package stringenc

The package stringenc does a lot of this work, and in fact its internal representation of strings is already the hex expansion of UTF-16BE, but it has two problems:

  • It does not provide byte-order marks and actually removes them from its inputs. This is easily solved by using the lower-level macros to get the internal representation and prefixing it with FEFF.
  • It relies on \EdefEscapeHex from pdfescape, which I do not fully understand, but which does not naively do what would be needed.

Using stringenc to convert from ISO-8859-1 to the format described above would, on LuaTeX, give the advantage of at least supporting some additional characters (presumably those that already work with PDFTeX), and probably no advantage otherwise. As an example,

\documentclass{scrartcl}
\usepackage{stringenc}
\makeatletter
% Convert #3 to the hex expansion of UTF-16BE with byte-order mark and
% store the result in #2; the source encoding defaults to
% \inputencodingname and can be overridden via the optional argument.
\newcommand{\toutfxvibebomhex}[3][\inputencodingname]{%
  \EdefSanitize\SE@from{#1}%
  \EdefEscapeHex\PC@result{#3}%
  \expandafter\SE@ConvertFrom\expandafter\PC@result\expandafter{\PC@result}\SE@from%
  \edef#2{FEFF\PC@result}% prefix the byte-order mark that stringenc strips
}
\protected\def\pdfannot {\pdfextension annot }% LuaTeX spelling of the primitive
\newcommand{\pdfpcnote}[1]{%
  {%
    \edef\\{\string\n}% let \\ expand to a literal \n before #1 is expanded
    \toutfxvibebomhex[iso-8859-1]\tmp@a{#1}%
    \pdfannot width 0pt height 0pt depth 0pt {%
       /Subtype /Text%
       /Contents <\tmp@a>% hex string: UTF-16BE code units with BOM
       /F 6%
    }%
  }%
}
\makeatother

\begin{document}
	Hi!
	Hühner wären Vögel.
	\pdfpcnote{Hühner wären Vögel.}
\end{document}

works and displays the note correctly in pdfpc after being compiled with LuaLaTeX. Do you want a pull request patching in at least that much?

Beyond \EdefEscapeHex

The command \EdefEscapeHex recreates the PDFTeX primitive \pdfescapehex with a different interface. This apparently means returning the ISO-8859-1 encoding of the expansion of its argument in hex, dropping everything not covered by the encoding (going by my rather non-comprehensive testing, as the documentation was not very useful to me). But by reusing the code that pdfescape employs to mimic this behaviour in the absence of the primitive, I had some success getting the UTF-8 hex expansion of UTF-8 input.
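For reference, the primitive itself is expandable and can be checked directly under pdfTeX:

% \pdfescapehex expands to the hex digits of its (expanded) argument:
\edef\x{\pdfescapehex{abc}}% \x is now 616263
\message{\x}% prints 616263 to the terminal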

If there is interest in it, I'll see if I can recreate these partial successes and give some exposition of the problems encountered, but I was not meticulous about keeping things that did not work, so I do not have the relevant files lying around right now.

@HermannDppes
Contributor Author

Unless someone finds some issue with it, #6 fixes one of the issues above, but I still think that

  1. it would be preferable to have better Unicode support on PDFTeX than whatever we have right now.
  2. the Unicode capabilities, or the per-engine limitations, should be documented.

@HermannDppes
Contributor Author

  1. it would be preferable to have better Unicode support on PDFTeX than whatever we have right now.

So in HermannDppes@8c0e774 I figured out that there may be a way to piggyback on the fact that hyperref has, by necessity, a way to encode text strings. Do the maintainers like the idea of having Unicode support enough that taking on hyperref as an unconditional dependency for PDFTeX seems like a good idea? (I always load hyperref because I rely on some of its functionality in pretty much every document, so to me it seems like a no-cost deal, but I am not sure that generalizes …)
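For illustration, a minimal sketch of what such piggybacking might look like; I am assuming hyperref loaded with its unicode option and its \pdfstringdef, and the scratch macro name \pdfpc@tmp is mine, not necessarily what the commit uses:

\makeatletter
\newcommand{\pdfpcnote}[1]{%
  {%
    % hyperref turns #1 into a properly encoded and escaped PDF text string
    \pdfstringdef\pdfpc@tmp{#1}%
    \pdfannot width 0pt height 0pt depth 0pt {%
       /Subtype /Text%
       /Contents (\pdfpc@tmp)%
       /F 6%
    }%
  }%
}
\makeatother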

If it does not, would this be a candidate for a package option?
