
What encoding is used for string types? #45

Open
nielsbasjes opened this issue Aug 2, 2019 · 4 comments

Comments

nielsbasjes (Contributor) commented Aug 2, 2019

I'm implementing some software to read SunSpec, and I have not been able to find which character encoding is used for string-typed fields.

The documentation I have been able to find only shows a simple example that is limited to the English letters of ASCII.

Given only this example, it is still possible that the actual encoding is something like UTF-8.

Does anyone know where I can find the chosen encoding?

altendky (Contributor) commented Aug 2, 2019

This has been a concern for me since they started their own support for Python 3. The docs loosely indicate one byte per character (ASCII or ISO-8859-1/Latin-1 being the plausible options in that case), and Latin-1 was selected in pysunspec because it 'made tests pass' (every byte sequence is valid Latin-1, as far as I know). IIRC many of the string fields are only 8 bytes or so, which leaves little room for i18n via UTF-8. Mostly I think this is an area that is not clearly understood, so the docs are not clear either. I would probably vote for longer strings and UTF-8, but given the present sizes you almost have to leave the encoding unspecified at the SunSpec level and learn it out of band from the manufacturer, which would allow a byte-length-optimized encoding to be selected. Of course, that is a mess to deal with.
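
As a minimal Python sketch (not taken from pysunspec itself), this is why decoding as Latin-1 never fails while strict ASCII and UTF-8 can reject raw register bytes:

```python
# Minimal sketch: every possible byte value is a valid Latin-1 character,
# so decoding arbitrary register bytes as Latin-1 can never raise, whereas
# strict ASCII and UTF-8 reject some byte sequences.
raw = bytes(range(256))  # every byte value a register dump could contain

latin1_text = raw.decode("latin-1")  # always succeeds
assert len(latin1_text) == 256

for codec in ("ascii", "utf-8"):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        print(f"{codec} rejects byte 0x{raw[exc.start]:02x} at offset {exc.start}")
```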

ghost commented Aug 6, 2019

This was an oversight in the original spec, but most people are treating it as ASCII. It is one byte per character, and the chosen character encoding is Latin-1, to get the largest character set while still maintaining backwards compatibility.
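
As a rough sketch of that convention (one byte per character, decoded as Latin-1), reading a fixed-width string field could look like the snippet below; the 8-byte width and NUL padding are illustrative assumptions, not values taken from the spec:

```python
# Hypothetical helper: decode a fixed-width SunSpec-style string field as
# Latin-1, dropping trailing NUL padding. Field width and padding scheme
# are assumptions for illustration only.
def decode_string_field(raw: bytes) -> str:
    return raw.rstrip(b"\x00").decode("latin-1")

print(decode_string_field(b"Solar\x00\x00\x00"))        # -> 'Solar'
print(decode_string_field(b"Caf\xe9\x00\x00\x00\x00"))  # 0xE9 -> 'é' under Latin-1
```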

nielsbasjes (Contributor, Author)

Hmm, ok. So the current status is that essentially only the lower 128 values of a byte are implicitly defined as ASCII.
The upper 128 values are 'fuzzy': no explicit encoding has been defined in the standard, and people have essentially picked whatever came to mind while writing their software.

My proposal would be to define the standard to use UTF-8 for strings.

By using UTF-8, the current ASCII characters remain unchanged (i.e. 1 character = 1 byte), which makes it backwards compatible with everything I have seen so far.
In addition, it creates the option of having Asian characters (Chinese and such), at the cost of multiple bytes per character. Since a lot of companies in the solar space are of Asian origin, this seems like a sensible option.

Also, UTF-8 is such a common standard that all programming languages and environments support it.
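
As a small sketch of the backwards-compatibility argument (plain Python, illustrative only): pure-ASCII text encodes to identical bytes under ASCII and UTF-8, while non-Latin text simply costs more bytes per character:

```python
# Pure ASCII text is byte-for-byte identical under ASCII and UTF-8,
# so existing ASCII-only fields would be unaffected by a switch to UTF-8.
ascii_name = "Inverter"
assert ascii_name.encode("utf-8") == ascii_name.encode("ascii")

# Non-Latin text stays representable, at the cost of multiple bytes per character.
chinese_name = "逆变器"  # "inverter" in Chinese
encoded = chinese_name.encode("utf-8")
print(len(chinese_name), "characters ->", len(encoded), "bytes")  # 3 characters -> 9 bytes
```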

A good read about character encoding: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

@silvia2019 @altendky What do you think?

altendky (Contributor) commented Aug 7, 2019

I agree that anything but UTF-* is very much 'Western' or even 'English' oriented (and that that is a negative thing). And yes, UTF-8 is what I would choose as well (vs. UTF-16-LE or whatever Windows uses, etc.). Practically speaking, though, I think the strings are too short to support other character sets well, since those get even fewer usable characters, so the string lengths should be reviewed as well.

I guess that's a 'yes I like utf-8' plus 'consider longer strings'.

Given that the documentation only specified one byte per character but no encoding, I think that can be ignored when considering backwards compatibility. You had to just know the proper encoding out of band, and once you have that you don't also need to know how many bytes per character, so it was really completely undefined. I'll take your word for it that ASCII is what tended to be used in practice.
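
To illustrate the 'fewer usable characters' point, here is a quick sketch; the 8-byte field width is an assumed example, not a value from the spec:

```python
# How many characters of different scripts fit in a byte-limited field
# under UTF-8? The 8-byte width below is an illustrative assumption.
FIELD_BYTES = 8

for sample in ("ABCDEFGH",      # ASCII: 1 byte per character
               "Größe äö",      # Latin with diacritics: up to 2 bytes per character
               "太阳能逆变器"):  # CJK: 3 bytes per character
    encoded = sample.encode("utf-8")
    fits = len(encoded) <= FIELD_BYTES
    print(f"{sample!r}: {len(sample)} chars, {len(encoded)} bytes, fits in field: {fits}")
```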
