The (non-)problem(s)
Investigating pdfpc/#586 to understand the lower-level mechanisms that break when using LuaLaTeX, I found that the Unicode support in \pdfpcnote is as follows:
1. Full Unicode support in XeLaTeX.
2. Partial Latin-1 support in PDFLaTeX.
3. No non-ASCII support in LuaLaTeX.
While 1. is exactly what one would suspect, both 2. and very much 3. are not what I'd expect, and these limitations should probably at least be documented or, better yet, fixed. (Though if I had found fixing them easy, this would be a pull request, not an issue.)
Current state
Looking at the generated PDF files after uncompressing them with pdftk, we see that XeTeX converts /Contents (ä β 你好) to /Contents <feff00e4002003b200204f60597d> on its own (feff being the byte-order mark, 00e4 = ä, 0020 = space, 03b2 = β, 4f60 597d = 你好), whereas PDFTeX and LuaTeX seem to do no conversion. This means that with LuaTeX we simply get a UTF-8 encoded string in the PDF file, which violates the specification (unless it happens to be exclusively ASCII) and is consequently not rendered as intended. PDFTeX does much the same, but under some circumstances the expansion of the argument happens to be valid ISO-8859-1, which is an allowed encoding for text strings.
More specifically for PDFLaTeX: when using [utf8]{inputenc} or [latin1]{inputenc} together with [T1]{fontenc}, the majority of the non-ASCII symbols in the encoding come through. When not using the font encoding, or when using [utf8x]{inputenc}, the notes get quite horribly mangled. I have not tried much in the way of non-Latin-1 characters, but with input encoding UTF-8 and font encoding T2A, at least the letter Б was not rendered properly.
This time (see below for a time when I did not) I kept all my files documenting what works out of the box, so see pdfpcnotes-unicode.tar.gz for them. You do not have to build them yourself; you can just open all.pdf in pdfpc to see most of the things I mentioned so far.
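For concreteness, a minimal test file along these lines should reproduce the behaviour described above (a sketch only; it assumes the pdfpc package that provides \pdfpcnote, and the real files are in the tarball):

```latex
% Minimal PDFLaTeX test sketch.  Swap the inputenc/fontenc lines to try the
% combinations discussed above; non-Latin-1 characters may already fail at
% compile time depending on the preamble.
\documentclass{beamer}
\usepackage[utf8]{inputenc}  % or [latin1]{inputenc}, or [utf8x]{inputenc}
\usepackage[T1]{fontenc}     % or [T2A]{fontenc}, or leave it out entirely
\usepackage{pdfpc}           % provides \pdfpcnote
\begin{document}
\begin{frame}{Unicode test}
  A slide.
  \pdfpcnote{ä Б}% inspect the /Contents entry in the uncompressed PDF
\end{frame}
\end{document}
```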
Ways forward
I think the best way to handle non-ASCII notes would be to implement logic for doing what XeTeX already does, i.e. converting whatever input is given to the macro to UTF-16 (big endian, with byte-order mark) and outputting it as hex.
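For the LuaTeX case in isolation, the conversion is easy to sketch on the Lua side (this is only an illustration of the target format, not the engine-independent solution argued for here; it assumes a LuaTeX new enough to ship Lua 5.3's utf8 library and uses the luacode package):

```latex
% LuaLaTeX-only sketch: turn a UTF-8 string into hex-encoded UTF-16BE with BOM.
\usepackage{luacode}
\begin{luacode*}
function pdfpc_note_hex(s)
  local out = { "FEFF" }                          -- byte-order mark
  for _, cp in utf8.codes(s) do
    if cp < 0x10000 then
      table.insert(out, string.format("%04X", cp))
    else                                          -- astral plane: surrogate pair
      cp = cp - 0x10000
      table.insert(out, string.format("%04X%04X",
        0xD800 + math.floor(cp / 0x400), 0xDC00 + cp % 0x400))
    end
  end
  return table.concat(out)
end
\end{luacode*}
% Usage sketch: the result is what should end up in /Contents <...>:
% \directlua{tex.sprint(pdfpc_note_hex("\luaescapestring{ä β 你好}"))}
```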
Why UTF-16?
The only valid encodings for text strings (at least as of PDF 1.5) are UTF-16 (BE+BOM) and ISO-8859-1 (strictly speaking PDFDocEncoding, which mostly coincides with Latin-1). Having a special case for the less comprehensive encoding accomplishes nothing (and might incur the fringe problem of strings starting with þÿ being read as Unicode: þ ÿ are the bytes 0xFE 0xFF, i.e. a byte-order mark, so e.g. þÿOZ would be read as the single character 佚, because O Z are the bytes 0x4F 0x5A, i.e. U+4F5A).
Why hex?
It will most likely turn out to be easier to implement; if writing the string “directly” turns out to be easier instead, this preference of mine reverses.
Package stringenc
The package stringenc does a lot of this work, and in fact its internal representation of strings is already a hex representation of UTF-16BE, but it has two problems:
1. It does not provide byte-order marks and actually removes them from its inputs. This is easily solved by using the lower-level macros to get the internal representation and prefixing it with FEFF.
2. It relies on \EdefEscapeHex from pdfescape, which I do not fully understand, but which does not naively do what would be needed.
Using stringenc to convert from iso-8859-1 to the format described above would give an advantage on LuaTeX of at least supporting some additional characters (presumably those that already work through PDFTeX) and probably no advantage otherwise. As an example, a small patch doing exactly this works and displays the note correctly in pdfpc after being compiled with LuaLaTeX (a rough sketch of the idea follows below). Do you want a pull request patching in at least that much?
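The conversion could look roughly like this (the \StringEncodingConvert calling convention and the encoding names are my reading of the stringenc documentation and may need adjusting; the helper macro is made up, and the annotation-writing step is only indicated in a comment):

```latex
% Sketch: build a UTF-16BE hex string (with BOM) from Latin-1 input.
\RequirePackage{pdfescape}   % \EdefEscapeHex
\RequirePackage{stringenc}   % \StringEncodingConvert
\makeatletter
\newcommand{\pdfpc@notehex}[1]{%
  \EdefEscapeHex\pdfpc@hex{#1}%          % hex of the argument's bytes
  \StringEncodingConvert\pdfpc@hex\pdfpc@hex{iso-8859-1}{utf16be}%
  \edef\pdfpc@hex{FEFF\pdfpc@hex}%       % prepend the byte-order mark
  % \pdfpc@hex would then replace the literal string currently written
  % into the annotation's /Contents entry, as /Contents <\pdfpc@hex>.
}
\makeatother
```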
Beyond \EdefEscapeHex
The command \EdefEscapeHex recreates the PDFTeX primitive \pdfescapehex with a different interface. This apparently means returning the ISO-8859-1 encoding of the expansion of its argument in hex, dropping everything that is not covered by the encoding. (Based on my rather non-comprehensive testing, as the documentation was not very useful to me.) But reusing the code that pdfescape employs to mimic this behaviour in the absence of the primitive, I had some success getting the UTF-8 hex expansion of UTF-8 input.
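The basic interface can be probed with a throwaway document like this (a toy example of mine, not taken from the pdfescape documentation):

```latex
% \result ends up holding the hex digits of the expansion of the argument;
% for plain ASCII input this is the same on every engine.
\documentclass{article}
\usepackage{pdfescape}
\EdefEscapeHex\result{abc}
\typeout{hex of abc: \result}% logs: hex of abc: 616263
\begin{document}
Interesting things only happen for non-ASCII arguments.
\end{document}
```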
If there is interest in it, I'll see if I can recreate these partial successes and give some exposition of the problems encountered, but I was not meticulous about keeping things that don't work, so I do not have the relevant files lying around right now.
It would be preferable to have better Unicode support on PDFTeX than whatever we have right now.
So in HermannDppes@8c0e774 I figured out that there may be a way to piggyback off of the fact that hyperref has, by necessity, a way to encode text strings (sketched below). Do the maintainers like the idea of having Unicode support enough that taking on hyperref as an unconditional dependency for PDFTeX seems like a good idea? (I always load hyperref because I rely on some of its functionality in pretty much every document, so to me it seems like a no-cost deal, but I'm not sure that generalizes …)
If it does not, would this be a candidate for a package option?
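To make the idea concrete, piggybacking on hyperref might look roughly like this (helper names are made up; \pdfstringdef is hyperref's documented interface for encoding PDF text strings, and as far as I understand it the unicode option makes it produce UTF-16BE with a byte-order mark; the annotation-writing part is only schematic and should mirror what the package currently emits):

```latex
% Sketch: delegate the text-string encoding to hyperref (PDFTeX engine).
\RequirePackage[unicode]{hyperref}
\makeatletter
\renewcommand{\pdfpcnote}[1]{%
  \pdfstringdef\pdfpc@note{#1}%             % hyperref encodes the note text
  \pdfannot width 0pt height 0pt depth 0pt {% schematic; mirror pdfpc.sty here
    /Subtype /Text /Contents (\pdfpc@note)%
  }%
}
\makeatother
```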