String, char, unsigned integers, and character encodings. #3
This has been a major pain in the neck for years, and this proposal sounds like just what the doctor ordered.
I support all four points of John's proposal, leaving aside any quibbles over glyph/encoding/repertoire/character set/etc. terminology. When the defining document is written you will probably need references to ISO 10646 (Unicode), ISO 646 (ASCII), and possibly the US version, ANSI X3.4-1968, aka US-ASCII.
Dear John,

Thanks for this. These new facilities are certainly attractive and will clarify things.

Regarding signed and unsigned, I agree it is sensible to choose the one which is appropriate. I see that ubyte is unsigned, but http://www.unidata.ucar.edu/software/netcdf/docs/cdl_data_types.html does not say that byte is signed. Should it be (a) recommended to use byte for signed 8-bit data, (b) assumed that byte is signed (like short and int) when reading?

As far as CF-2 is concerned, we can include your recommendation to use signed and unsigned types as appropriate, but I don't think any automatic checking can be done, can it, because the checker could not know whether the user had made the right choice for their data.

How do you indicate in CDL that a text attribute is a string rather than a char array? I presume that in CDL you can declare "string array(dimension)" - is that right? I agree that this would be neater than a 2D char array. I would be happy to see this adopted in CF-2, but we would have to check carefully in the standard document to see if it causes any problems, and some careful wording would be needed. The main use of string arrays in variables (rather than text attributes) is for string-valued auxiliary coordinate variables, I think. If we made this change, the CF checker could issue a warning for the use of a 2D char array where a string array could be used. (It gives warnings for recommendations that have not been followed.)

Is there any way to check your final point? I mean, can you tell if a char variable is being used for data other than ASCII text? I suppose if a byte exceeds 127 (read as unsigned!) it is not ASCII - anything else?

Best wishes

Jonathan

PS. It seems that in this browser (Firefox 15.0) the Preview doesn't work. I hope this comes out all right!
Up to now, we have used valid_min and valid_max to distinguish between signed and unsigned data; i.e. netCDF byte data is ambiguous, not inherently signed. There may be a slight violation of CF in that the values of valid_min and valid_max are not bytes but integers, but I think that is not particularly confusing. Please don't break this.
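For instance, under this convention an unsigned byte variable might be declared like this (a minimal CDL sketch; the variable and dimension names are made up):

```
netcdf unsigned_byte_example {
dimensions:
  y = 4 ;
  x = 4 ;
variables:
  byte counts(y, x) ;
    // valid_min/valid_max are stored as ints; a maximum above 127
    // signals that the byte values are meant to be read as unsigned
    counts:valid_min = 0 ;
    counts:valid_max = 255 ;
}
```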
Hi Jonathan:
I think that the answer is that byte is signed. I will check with Russ about clarifying the document.
Agree, there's no way to tell if the stored data values actually match the type declaration.
I think ncgen now stores string attributes as strings, rather than chars, for netCDF-4. I will double-check that. Programmatically you can do either char or strings, but strings are the right way because the encoding is defined (namely UTF-8).
Yes, one can have multidimensional arrays of strings.
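In CDL (netCDF-4 enhanced model, i.e. ncgen -k netCDF-4) that looks something like this; a minimal sketch with made-up names:

```
netcdf string_attr_example {
dimensions:
  station = 5 ;
variables:
  string station_name(station) ;   // string-valued auxiliary coordinate
    // an attribute explicitly typed as string:
    string station_name:long_name = "station name" ;
}
```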
Agree.
I'm not sure of the exact definition of ASCII. Probably we could use some help (Chris Little, are you there?) in deciding exactly what we should recommend if you have to use char data. I'm not sure what the use case is for writing new files; for backwards compatibility, you would just stick to CF-1.x. Note that not all sequences of bytes are valid UTF-8, and that could be checked for string-type data. Thanks for your comments, as usual covering some corners I hadn't noticed. John
Hi Benno:
I don't see this implemented anywhere in the Java or C libraries. Did you mean that your code implements that? In any case, if you are writing new files in netCDF-4, using CF-2, you would not want to do that. You would want to use signed or unsigned datatypes to indicate signed or unsigned data. Regards,
Regarding the signedness of the byte data type: in the C library for netCDF-3, a byte is signed for the purposes of type conversion (e.g. asking for the data as float). The _Unsigned attribute is not yet implemented in the C library, as it turns out to be complicated. However, we have a JIRA ticket (https://bugtracking.unidata.ucar.edu/browse/NCF-274) on it, so we hope to do it eventually. We will fix http://www.unidata.ucar.edu/software/netcdf/docs/cdl_data_types.html to make that clearer.
Note the ticket https://bugtracking.unidata.ucar.edu/browse/NCF-274. My point is pretty much implicit in John's two responses: netCDF-3 handles byte data as signed.
Dr. M. Benno Blumenthal
Hmm, you're right - I never noticed (or probably just forgot) about that; thanks for the reminder. I agree that any conversion program had better understand the signedness conventions of the source file. Having the signed/unsigned types will solve this sort of confusion.
This discussion is interesting and the proposal has my support. The UTF-8 binding especially is highly welcome, as this issue has caused many problems when working with strings (e.g. station or PI names) in data sets.
A little UTF-8 tutorial: all ASCII characters are represented by octets with the high-order bit set to zero, so plain 7-bit ASCII text is already valid UTF-8. If the high bit is set to 1, the octet begins a multi-byte sequence of 2, 3 or 4 octets encoding a single Unicode character, and the leading bits of the first octet say how many octets the sequence uses. This means there is not a simple mapping of n characters to n octets, but for western alphabets the total will probably be significantly fewer than 2n octets. If there are a lot of Chinese/Japanese/Korean characters, around 3n octets could be needed; for that use case UTF-16 would be more compact, representing most of the n characters in 2n octets. This suggests the question: what will CF-2 do if the characters are not in UTF-8? UTF-16 is the obvious case to address. I recommend not allowing any other 8-bit encodings, such as Windows code pages, ISO 2022, Mac, etc. HTH, Chris
I support John's suggestion of only allowing UTF-8 and definitely agree with Chris on excluding other 8-bit encodings. While UTF-16 might have some limited advantages in particular cases [http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16], the extra hassle of two encodings isn't worth it. It's highly likely that the largest contributor to the size of a netCDF file will be the data rather than the text, so a few bytes saved in some languages won't make any significant difference.
I think UTF-8 is an extremely sensible idea for String types, which are variable-length by default. It might be worth specifically saying char[] is ASCII, not UTF-8: it would be awkward to deal with a variable-length octet encoding for a fixed-width type like char[].
I would like to propose another point on String types. When used as attribute types we now have the option of String arrays. I would propose that any attribute of type String[] should be interpreted within CF as a whitespace-separated single String value. This will avoid having to deal with the difference between the classic single-string form and the string-array form of the same attribute (example drawn from CF-1.7.2 draft; both forms are sketched below).
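For illustration, the two forms would look something like this; a sketch with made-up flag values and meanings, not the exact example from the draft:

```
netcdf flag_example {
dimensions:
  time = 10 ;
variables:
  byte status(time) ;
    status:flag_values = 0b, 1b, 2b ;
    // classic form: one whitespace-separated attribute
    status:flag_meanings = "good suspect bad" ;
    // netCDF-4 string-array alternative, one element per flag:
    // string status:flag_meanings = "good", "suspect", "bad" ;
}
```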
I don't think it's a good idea to treat Strings differently than other data types. If you want to define an attribute (such as flag_meanings) as a String array, then each element of the array should be treated as a separate value, just as it would be for an array of numbers.
Sincerely, Bob Simons
This makes sense to me and I would prefer to treat string arrays like other arrays. Øystein Godøy
OK, I can see how String-array attributes could be useful. My desire is to find some general principles that might ease backward compatibility with NetCDF-3 as we begin to adopt NetCDF-4 features. For instance, with the example I gave before, which of these should be considered valid CF? The first is valid now; for the sake of argument let's say we decide to allow the second form. We would need to state rules for both NetCDF-3 and NetCDF-4. Wouldn't it be better if we had a rule for all attributes? As a tool developer, I would like not to have to work around the difference between the two forms. We actually have 3 options for encoding a string attribute which could naturally be considered a sequence of string values. I'll illustrate this with a 2D lat/lon coordinate system:
The three options are sketched below; in NetCDF-3 only the first is available, and it would then be nice to allow the others in NetCDF-4.
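A CDL sketch of the three options (made-up names, paraphrasing the idea rather than quoting any spec):

```
netcdf coords_example {
dimensions:
  y = 10 ;
  x = 10 ;
variables:
  float lat(y, x) ;
  float lon(y, x) ;
  float temp(y, x) ;
    // option 1 (NetCDF-3): char attribute, whitespace-separated
    temp:coordinates = "lat lon" ;
    // option 2 (NetCDF-4): a single string with the same content
    // string temp:coordinates = "lat lon" ;
    // option 3 (NetCDF-4): a string array, one element per name
    // string temp:coordinates = "lat", "lon" ;
}
```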
I'm confused. You seem to be lamenting the annoyance of allowing two ways of encoding the same values. One correction / one preference for notation: you refer to a "String[0]" ...
Sincerely, Bob Simons
As a general principle, it would be much better if CF used the netCDF-native expression of multiple values (e.g. using an array for a list of values) rather than forcing software to interpret coded strings (e.g. space-separated token parsing). That way the information could be more easily transformed into other systems (XML, OGC, semantic technologies) because the transformation is less dependent on the particular content. Clearly this goes against backward compatibility; the question is whether we could simplify the transformation problem enough to make it worth it. Clearly such a change would be an investment in the somewhat distant future.
My earlier comments were a bit confused; I will try to disentangle my thoughts here. @BobSimons: point taken on the notation. In the README.md of this group we say we should take care when introducing new features, because it is easier for users if there is only one way to do something (charter, item 4). I think NetCDF-4 strings are possibly the simplest example of NetCDF-4 introducing more than one way of doing something that previously had only one way. There are 2 separate points: (1) the char type versus the new string type and its encoding, and (2) whether a list of values should be a string array or a single whitespace-separated string.
I think no. 1 is dealt with by John. I agree with Benno that no. 2 allows us to be more structured, which is good, but we need to be mindful of backward compatibility. If an attribute is encoded as a string array, software written for the classic whitespace-separated form will not understand it.
I am strongly for allowing unsigned bytes and other unsigned data types. This is especially important since we often have data that is packed. Currently this seems impossible when following the CF standard strictly, since the section about packed data states that the packed variable must be of type byte, short or int (all of them signed types).
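With unsigned types allowed, packing into an unsigned byte would be straightforward; a hedged CDL sketch with made-up names and scaling:

```
netcdf packed_example {
dimensions:
  time = 1 ;
  y = 10 ;
  x = 10 ;
variables:
  ubyte sst_packed(time, y, x) ;
    // unpacked value = packed_value * scale_factor + add_offset,
    // using the full 0..255 range of the unsigned byte
    sst_packed:scale_factor = 0.15f ;
    sst_packed:add_offset = 271.15f ;
    sst_packed:units = "K" ;
}
```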
I am strongly for using string arrays for the flag_meanings attribute. I understand that will break backward compatibility, but isn't that the whole point of a major revision change? We need to remember that CF netCDF files now contain a lot more than simple gridded data, and status fields can carry more complicated information. For my work, a flag descriptor that is less restrictive about the formatting of the text string would be a great benefit. Currently the individual flag_meanings must not contain white space, which makes more complicated flag descriptions difficult to read: when a description references a variable name containing an underscore, the variable name gets lost, and long descriptions become hard to read. I would also point out that flag_values and flag_masks are vectors while flag_meanings is not, so matching them up requires parsing the flag_meanings attribute; making flag_meanings a string vector lets the indexes match automatically. If backward compatibility is the goal, we could introduce a new attribute name (e.g. flag_definitions) to keep flag_meanings of type char[] while using a string[] array for CF-2.0. I suggest skipping the notion of a string[] joined by space characters for flag_meanings entirely, to reduce the options to only one.
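For instance, a string-array flag_meanings whose entries may contain spaces and underscores might look like this (a sketch; variable name and flags are made up):

```
netcdf flag_strings_example {
dimensions:
  time = 10 ;
variables:
  byte qc_flag(time) ;
    qc_flag:flag_values = 0b, 1b, 2b ;
    // one array element per meaning, so free text stays unambiguous
    // and the indexes line up with flag_values automatically
    string qc_flag:flag_meanings = "good data",
        "failed climatology check on sea_surface_temperature",
        "value outside valid_range" ;
}
```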
@rsignell-usgs and @kenkehoe -- should we migrate this issue to the latest cf-conventions issue tracker? |
From: John Caron
Background:
In the classic model, data using the "byte" data type are interpreted as signed when converting to other numeric types. However, the byte type is sometimes used for unsigned data. Unidata introduced the "_Unsigned" attribute to let the user specify this, but not all libraries look for this attribute.
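In CDL the workaround looks like this (a minimal sketch; the variable name is made up):

```
netcdf unsigned_attr_example {
dimensions:
  y = 4 ;
  x = 4 ;
variables:
  byte radiance(y, x) ;
    // readers that honor this attribute treat the bytes as 0..255
    radiance:_Unsigned = "true" ;
}
```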
Sometimes the "char" data type is intended to mean unsigned byte data. More typically it is used for encoding text data, but the character encoding is undefined. Probably "printable ASCII" is a reasonable assumption. Char data are fixed length arrays only, and one must specify the length using a global, shared dimension, which is unneeded and clutters the dimension namespace.
The NetCDF-4 enhanced model adds strings and unsigned integer types, so we have the opportunity to clarify. A lot of work on character encodings has been done in the last 20 years with Unicode, and we should leverage that. UTF-8 is a variable-length encoding of Unicode that has ASCII as a subset, allows any language to be encoded, and has become the dominant encoding on the web. NetCDF libraries assume strings are UTF-8 encoded. If your text is ASCII, you are using UTF-8 already.
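The string equivalent needs no length dimension (same made-up names as the char sketch above):

```
netcdf string_name_example {
dimensions:
  station = 10 ;
variables:
  string station_name(station) ;   // variable-length, UTF-8 encoded
}
```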
Also see:
CDL Data Types
Developing Conventions for NetCDF-4 : Use of Strings
Proposal:
1. Use the unsigned integer types (ubyte, ushort, uint, uint64) for unsigned data, rather than the _Unsigned attribute.
2. Use the string type, not char arrays, for text in both variables and attributes.
3. Strings are encoded as UTF-8.
4. If the char type is used (for backwards compatibility), it should contain only ASCII text.