
What encoding is used for string types? #45

Open
nielsbasjes opened this issue Aug 2, 2019 · 4 comments

Comments

nielsbasjes (Contributor) commented Aug 2, 2019

I'm implementing some software to read SunSpec, and I have not been able to find which character encoding is used for string-typed fields.

The documentation I have been able to find only shows a simple example that is limited to the English letters of ASCII.

Given only this example, it is still possible that the actual encoding is something like UTF-8.

Does anyone know where I can find the chosen encoding?

altendky (Contributor) commented Aug 2, 2019

This has been a concern for me since they started their own support for Python 3. The docs loosely indicate one byte per character (ASCII or ISO-8859-1/Latin-1 being the plausible options in that case), and Latin-1 was selected in pysunspec because it 'made tests pass' (every byte sequence is valid Latin-1, as far as I know). IIRC many of the string fields are only 8 bytes or so, which leaves little room for i18n via UTF-8. Mostly I think this is an area that is not clearly understood, so the docs are not clear either. I would probably vote for longer strings and UTF-8, but given the present sizes you almost have to leave the encoding unspecified at the SunSpec level and learn it out of band from the manufacturer, which would allow a byte-length-optimized encoding to be selected. Of course, that is a mess to deal with.
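
As a minimal Python sketch (not taken from pysunspec itself), this is why decoding as Latin-1 never fails while strict ASCII and UTF-8 can reject raw register bytes:

```python
# Minimal sketch: every possible byte value is a valid Latin-1 character,
# so decoding arbitrary register bytes as Latin-1 can never raise, whereas
# strict ASCII and UTF-8 reject some byte sequences.
raw = bytes(range(256))  # every byte value a register dump could contain

latin1_text = raw.decode("latin-1")  # always succeeds
assert len(latin1_text) == 256

for codec in ("ascii", "utf-8"):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        print(f"{codec} rejects byte 0x{raw[exc.start]:02x} at offset {exc.start}")
```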

ghost commented Aug 6, 2019

This was an oversight in the original spec, but most people are treating it as ASCII. It is one byte per character, and the chosen character encoding is Latin-1, to get the largest character set while still maintaining backwards compatibility.
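
As a rough sketch of that convention (one byte per character, decoded as Latin-1), reading a fixed-width string field could look like the snippet below; the 8-byte width and NUL padding are illustrative assumptions, not values taken from the spec:

```python
# Hypothetical helper: decode a fixed-width SunSpec-style string field as
# Latin-1, dropping trailing NUL padding. Field width and padding scheme
# are assumptions for illustration only.
def decode_string_field(raw: bytes) -> str:
    return raw.rstrip(b"\x00").decode("latin-1")

print(decode_string_field(b"Solar\x00\x00\x00"))        # -> 'Solar'
print(decode_string_field(b"Caf\xe9\x00\x00\x00\x00"))  # 0xE9 -> 'é' under Latin-1
```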

nielsbasjes (Contributor, Author)

Hmm, ok. So the current status is that essentially only the lower 128 values of a byte are implicitly defined as ASCII.
The upper 128 values are 'fuzzy': no explicit encoding has been defined in the standard, and people have essentially picked whatever came to mind while writing their software.

My proposal would be to define the standard to use UTF-8 for strings.

By using UTF-8, the current ASCII characters remain unchanged (i.e. 1 character = 1 byte), which makes it backwards compatible with everything I have seen so far.
In addition, it creates the option of having Asian characters (Chinese and such), at the cost of multiple bytes per character. Since a lot of companies in the solar space are of Asian origin, this seems like a sensible option.

Also, UTF-8 is such a common standard that all programming languages and environments support it.
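
As a small sketch of the backwards-compatibility argument (plain Python, illustrative only): pure-ASCII text encodes to identical bytes under ASCII and UTF-8, while non-Latin text simply costs more bytes per character:

```python
# Pure ASCII text is byte-for-byte identical under ASCII and UTF-8,
# so existing ASCII-only fields would be unaffected by a switch to UTF-8.
ascii_name = "Inverter"
assert ascii_name.encode("utf-8") == ascii_name.encode("ascii")

# Non-Latin text stays representable, at the cost of multiple bytes per character.
chinese_name = "逆变器"  # "inverter" in Chinese
encoded = chinese_name.encode("utf-8")
print(len(chinese_name), "characters ->", len(encoded), "bytes")  # 3 characters -> 9 bytes
```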

A good read about character encoding: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

@silvia2019 @altendky What do you think?

altendky (Contributor) commented Aug 7, 2019

I agree that anything but UTF-* is very much 'Western' or even 'English' oriented (and that that is a negative thing). And yes, UTF-8 is what I would choose as well (vs. UTF-16-LE or whatever Windows uses, etc.). Practically speaking, though, I think the strings are too short to support other character sets well, since those get even fewer usable characters, so the string lengths should be reviewed as well.

I guess that's a 'yes I like utf-8' plus 'consider longer strings'.

Given that the documentation only specified one byte per character but no encoding, I think that can be ignored when considering backwards compatibility. You had to just know the proper encoding out of band, and once you have that you don't also need to know how many bytes per character, so it was really completely undefined. I'll take your word for it that ASCII is what tended to be used in practice.
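
To illustrate the 'fewer usable characters' point, here is a quick sketch; the 8-byte field width is an assumed example, not a value from the spec:

```python
# How many characters of different scripts fit in a byte-limited field
# under UTF-8? The 8-byte width below is an illustrative assumption.
FIELD_BYTES = 8

for sample in ("ABCDEFGH",      # ASCII: 1 byte per character
               "Größe äö",      # Latin with diacritics: up to 2 bytes per character
               "太阳能逆变器"):  # CJK: 3 bytes per character
    encoded = sample.encode("utf-8")
    fits = len(encoded) <= FIELD_BYTES
    print(f"{sample!r}: {len(sample)} chars, {len(encoded)} bytes, fits in field: {fits}")
```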
