This repository has been archived by the owner on Apr 24, 2020. It is now read-only.

String, char, unsigned integers, and character encodings. #3

Open
JohnLCaron opened this issue Oct 15, 2014 · 23 comments
@JohnLCaron

From: John Caron

Background:

In the classic model, data using the "byte" data type are interpreted as signed when converting. However, the byte data type is sometimes used for unsigned data. Unidata introduced the "_Unsigned" attribute to allow the user to specify this. Not all libraries look for this attribute.

Sometimes the "char" data type is intended to mean unsigned byte data. More typically it is used for encoding text data, but the character encoding is undefined. Probably "printable ASCII" is a reasonable assumption. Char data are fixed length arrays only, and one must specify the length using a global, shared dimension, which is unneeded and clutters the dimension namespace.

The NetCDF-4 enhanced model adds Strings and unsigned integer types, so we have the opportunity to clarify. A great deal of work on character encodings has been done in the last 20 years with Unicode, and we should leverage it. UTF-8 is a variable-length encoding of Unicode that has ASCII as a subset, allows any language to be encoded, and has become the dominant encoding on the web. NetCDF libraries assume Strings are UTF-8 encoded. If your text is ASCII, you are using UTF-8 already.

Also see:

CDL Data Types

Developing Conventions for NetCDF-4 : Use of Strings

Proposal:

  1. Use the unsigned or signed integer data types when your data is unsigned or signed, respectively.
  2. Do not use the _Unsigned attribute.
  3. Use the String data type for text data, encoded in UTF-8. Any language (aka character set) is allowable.
  4. The char data type is deprecated. If you must use it, use it only for ASCII text data.
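To make the proposal concrete, here is a minimal CDL sketch (the variable and dimension names are illustrative, not part of the proposal):

```cdl
netcdf example {
dimensions:
  station = 10 ;
variables:
  ubyte count(station) ;          // point 1: unsigned data uses an unsigned type
  string station_name(station) ;  // point 3: variable-length UTF-8 text,
                                  //          no string-length dimension needed
}
```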
@rsignell-usgs
Member

This has been a major pain in the neck for years, and this proposal sounds like just what the doctor ordered.

@chris-little

I support all four points of John's proposal, leaving aside any quibbles over glyph/encoding/repertoire/character set/etc terminology. When the defining document is written you will probably need references to ISO10646 (Unicode), ISO646 (ASCII) and possibly the US version: ANSI_X3.4-1968 aka US-ASCII

@JonathanGregory

Dear John

Thanks for this. These new facilities are certainly attractive and will clarify things.

Regarding signed and unsigned, I agree it is sensible to choose the one which is appropriate. I see that ubyte is unsigned, but http://www.unidata.ucar.edu/software/netcdf/docs/cdl_data_types.html does not say that byte is signed. Should it be (a) recommended to use byte for signed 8-bit data, or (b) assumed that byte is signed (like short and int) when reading? As far as CF-2 is concerned, we can include your recommendation to use signed and unsigned types as appropriate, but I don't think any automatic checking can be done, can it, because the checker could not know whether the user had made the right choice for their data.

How do you indicate in CDL that a text attribute (such as the history etc. which you mention) is of string type rather than char type? There are many text attributes in CF e.g. standard_name. Do you think these should be strings rather than char?

I presume that in CDL you can declare "string array(dimension)" in CDL - is that right? I agree that this would be neater than a 2D char array. I would be happy to see this adopted in CF-2 but we would have to check carefully in the standard document to see if it causes any problems, and some careful wording would be needed. The main use of string '''arrays''' in variables (rather than text attributes) is for string-valued auxiliary coordinate variables, I think. If we made this change, the CF checker could issue a warning for the use of a 2D char array where a string array could be used. (It gives warnings for recommendations that have not been followed.)
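For comparison, the two forms discussed here might look like this in CDL (a sketch; the names are illustrative):

```cdl
dimensions:
  station = 3 ;
  name_strlen = 32 ;  // needed only for the char form
variables:
  // classic form: 2-D char array with an extra length dimension
  char station_name(station, name_strlen) ;
  // netCDF-4 form: 1-D string array, no length dimension
  string station_id(station) ;
```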

Is there any way to check your final point? I mean, can you tell if a char variable is being used for data other than ASCII text? I suppose if it exceeds 127b (unsigned!) it is not ASCII - anything else?

Best wishes

Jonathan

PS. It seems that in this browser (Firefox 15.0) the Preview doesn't work. I hope this comes out all right!

@bennoblumenthal

Up to now, we have used valid_max and valid_min to distinguish between signed and unsigned data; i.e. netCDF byte data is ambiguous, not inherently signed. There may be a slight violation of CF in that the values of valid_max and valid_min are not bytes but integers, but I think that is not particularly confusing.

Please don't break this.

@JohnLCaron
Author

Hi Jonathan:

Regarding signed and unsigned, I agree it is sensible to choose the one which is appropriate. I see that ubyte is unsigned, but http://www.unidata.ucar.edu/software/netcdf/docs/cdl_data_types.html does not say that byte is signed. Should it be (a) recommended to use byte for signed 8-bit data, or (b) assumed that byte is signed (like short and int) when reading?

I think that the answer is that byte is signed. I will check with Russ about clarifying the document.

As far as CF-2 is concerned, we can include your recommendation to use signed and unsigned types as appropriate, but I don't think any automatic checking can be done, can it, because the checker could not know whether the user had made the right choice for their data.

Agree, there's no way to tell if the stored data values actually match the type declaration.

How do you indicate in CDL that a text attribute (such as the history etc. which you mention) is of string type rather than char type? There are many text attributes in CF e.g. standard_name. Do you think these should be strings rather than char?

I think ncgen now stores string attributes as strings, rather than chars, for netCDF-4; I will double-check that. Programmatically you can use either char or string, but string is the right way because the encoding is defined (namely UTF-8).
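For reference, CDL lets you make the attribute type explicit with a leading type name; a sketch, with illustrative attribute names (I believe current ncgen accepts this form for netCDF-4 output):

```cdl
variables:
  float T(yc, xc) ;
    // untyped: may be stored as char (classic) or string, depending on ncgen
    T:long_name = "temperature" ;
    // explicitly string-typed attribute, netCDF-4 only
    string T:coordinates = "lon lat" ;
```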

I presume that in CDL you can declare "string array(dimension)" in CDL - is that right?

yes, one can have multidimensional arrays of strings.

I agree that this would be neater than a 2D char array. I would be happy to see this adopted in CF-2 but we would have to check carefully in the standard document to see if it causes any problems, and some careful wording would be needed. The main use of string '''arrays''' in variables (rather than text attributes) is for string-valued auxiliary coordinate variables, I think. If we made this change, the CF checker could issue a warning for the use of a 2D char array where a string array could be used. (It gives warnings for recommendations that have not been followed.)

agree

Is there any way to check your final point? I mean, can you tell if a char variable is being used for data other than ASCII text? I suppose if it exceeds 127b (unsigned!) it is not ASCII - anything else?

I'm not sure of the exact definition of ASCII. We could probably use some help (Chris Little, are you there?) in deciding exactly what we should recommend if you have to use char data. I'm not sure what the use case is for writing new files; for backwards compatibility, you would just stick to CF-1.x.

Note that not all sequences of bytes are valid UTF-8, and that could be checked for string type data.
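Both checks are easy to sketch in Python (these helpers are hypothetical, not part of any netCDF library):

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_printable_ascii(data: bytes) -> bool:
    """True if every byte is printable ASCII or common whitespace."""
    return all(32 <= b <= 126 or b in (9, 10, 13) for b in data)

# Plain ASCII is valid UTF-8 by construction.
assert is_valid_utf8(b"station_42") and is_printable_ascii(b"station_42")
# A byte above 127 rules out ASCII; 0xFF can never appear in UTF-8 at all.
assert not is_printable_ascii(b"caf\xe9") and not is_valid_utf8(b"\xff\xfe")
```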

Thanks for your comments, as usual covering some corners I hadn't noticed.

John

@JohnLCaron
Author

Hi Benno:

Up to now, we have used valid_max and valid_min to distinguish between signed and unsigned data; i.e. netCDF byte data is ambiguous, not inherently signed. There may be a slight violation of CF in that the values of valid_max and valid_min are not bytes but integers, but I think that is not particularly confusing. Please don't break this.

I don't see this implemented anywhere in the Java or C libraries. Did you mean that your own code implements it?

In any case, if you are writing new files in netCDF-4, using CF-2, you would not want to do that. You would want to use signed or unsigned datatypes to indicate signed or unsigned data.

Regards,
John

@JohnLCaron
Author

signedness of the byte data type.

In the C library for netCDF-3, a byte is signed for the purposes of type conversion (e.g. asking for the data as float). The _Unsigned attribute is not yet implemented in the C library, as it turns out to be complicated. However, we have a JIRA ticket (https://bugtracking.unidata.ucar.edu/browse/NCF-274) for it, so we hope to address it eventually.

We will fix http://www.unidata.ucar.edu/software/netcdf/docs/cdl_data_types.html to make that clearer.

@bennoblumenthal

Note that the ticket https://bugtracking.unidata.ucar.edu/browse/NCF-274 quotes the user guide as recommending valid_min and valid_max -- so yes, my code implements it, but I did not make it up: it is a pretty obvious usage of CF attributes.

My point is pretty much implicit in John's two responses: netCDF-3 handles byte sign via attribute(s), so if we want to have byte sign as a pair of datatypes in netCDF-4, anything that converts between netCDF-3 and netCDF-4 (including the netCDF library!) had better do it correctly. That seems to imply awareness of the valid_min/valid_max CF attributes, which is a road not traveled by the netCDF library. Alternatively, the library can have its own attribute and we can implement conversion of attributes between conventions, my favorite topic (namespaces, anyone?).


@JohnLCaron
Author

Hmm, you're right; I never noticed (or probably just forgot) that, thanks for the reminder. I agree that any conversion program had better understand the signedness conventions of the source file.

Having the signed/unsigned types will solve this sort of confusion.

@steingod

This discussion is interesting and I support the proposal. The UTF-8 binding in particular is highly welcome, as this issue has caused many problems when working with strings (e.g. station or PI names) in datasets.

@chris-little

A little UTF-8 tutorial: all ASCII characters are represented by octets with the high-order bit set to zero, so 'plain' 7-bit ASCII is already UTF-8. If the high bit is set to 1, the octet begins a multi-octet sequence of 2, 3, or 4 octets encoding one Unicode character, and the leading octet's high-order bits indicate how many octets the sequence uses.

This means there is no simple mapping of n characters to n octets, but for Western alphabets the total will probably be well under 2n octets.

If there are a lot of Chinese/Japanese/Korean characters, up to 3n octets could be needed; for that use case UTF-16 is more compact, since most of the n characters would be represented in 2n octets.

This suggests the question: what will CF-2 do if the text is not in UTF-8? UTF-16 is the obvious case to address. I recommend not allowing any other 8-bit encodings, such as Windows code pages, ISO 2022, MacRoman, etc.

HTH, Chris
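The per-character arithmetic above can be checked directly in Python (the sample characters are illustrative):

```python
# UTF-8 uses 1 to 4 octets per character, depending on the code point.
for ch, n_octets in {"A": 1, "é": 2, "語": 3, "🌍": 4}.items():
    assert len(ch.encode("utf-8")) == n_octets

# For ASCII text, UTF-8 and ASCII are byte-for-byte identical.
assert "plain text".encode("utf-8") == "plain text".encode("ascii")

# For CJK text, UTF-8 needs 3n octets where UTF-16 needs only 2n.
cjk = "気象データ"  # 5 characters
assert len(cjk.encode("utf-8")) == 15
assert len(cjk.encode("utf-16-le")) == 10
```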

@mikeggrant-pml

I support John's suggestion of only allowing UTF-8 and definitely agree with Chris on excluding other 8-bit encodings. While UTF-16 might have some limited advantages in particular cases [http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16], the extra hassle of two encodings isn't worth it. It's highly likely that the largest contributor to the size of a netCDF file will be the data rather than the text, so a few bytes saved in some languages won't make a significant difference.

@stephenpascoe

I think UTF-8 is an extremely sensible idea for String types, which are variable-length by default. It might be worth explicitly saying that char[] is ASCII, not UTF-8: a variable-length octet encoding would be awkward for a fixed-width type like char[].

@stephenpascoe

I would like to propose another point about String types. When used as attribute types we now have the option of String arrays. I propose that any attribute of type String[] be interpreted within CF as a single whitespace-separated String. This would avoid having to deal with the difference between forms like:

(EDIT: corrected number of values)

sensor_status_qc:flag_meanings = "low_battery
                                      processor_fault
                                      memory_fault
                                      disk_fault
                                      software_fault
                                      maintenance_required"

and

sensor_status_qc:flag_meanings = "low_battery", "processor_fault", "memory_fault", "disk_fault", "software_fault", "maintenance_required"

(example drawn from CF-1.7.2 draft)
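If both forms were allowed, a reader could normalize them to the same token list; a sketch (the helper name is hypothetical):

```python
def flag_meanings_tokens(attr):
    """Normalize a flag_meanings attribute to a list of tokens.

    Accepts the classic whitespace-separated single string or the
    proposed String[] form (any sequence of strings).
    """
    if isinstance(attr, str):
        return attr.split()  # split() collapses runs of spaces and newlines
    return list(attr)

single = """low_battery processor_fault memory_fault
            disk_fault software_fault maintenance_required"""
array = ["low_battery", "processor_fault", "memory_fault",
         "disk_fault", "software_fault", "maintenance_required"]
assert flag_meanings_tokens(single) == flag_meanings_tokens(array)
```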

@BobSimons

I don't think it's a good idea to treat Strings differently from other data types when it comes to arrays. I think a String[] should be a String[]. Saying all String[] attributes are to be stored as a single space-separated String is very constraining (e.g., for history).

If you want to define an attribute (such as flag_meanings) as a String containing a space-separated list, then isn't it better to simply define the attribute as a String containing a space-separated list?


@steingod

steingod commented Dec 7, 2014

This makes sense to me and I would prefer to treat string arrays like other arrays.


@stephenpascoe

OK, I can see how String array attributes could be useful. My desire is to find some general principles that might ease backward compatibility with NetCDF3 as we begin to adopt NetCDF4 features.

For instance, with the example I gave before, which of these should be considered valid CF? The first is valid now; for the sake of argument let's say we decide to allow the second form. We would need to state rules for both NetCDF3 and NetCDF4. Wouldn't it be better if we had one rule for all attributes? As a tool developer, I would like not to have to work around the difference between char[] and String on a case-by-case basis, particularly as I want my code to work for both NetCDF3 and NetCDF4.

We actually have 3 options for encoding a string attribute which could naturally be considered a sequence of string values. I'll illustrate this with a 2D Lat/Lon coordinate system:

dimensions:
  xc = 128 ;
  yc = 64 ;
variables:
  float T(yc,xc) ;
    T:coordinates = "lon lat" ;
  float xc(xc) ;
    xc:axis = "X" ;
  float yc(yc) ;
    yc:axis = "Y" ;
  float lon(yc,xc) ;
    lon:units = "degrees_east" ;
  float lat(yc,xc) ;
    lat:units = "degrees_north" ;

In NetCDF3 T:coordinates must be of type char[]. In NetCDF4 it could also be of type String[0], and John's proposal is that this should be the default, with char[] deprecated. Even so, we are going to have NetCDF files in circulation for many years with both forms (even in NetCDF4 it will be very easy to use char[] by mistake). These two cases have to be handled differently in programming APIs (at least in Python and C), so allowing both carries an overhead for tool developers.

And then it would be nice to allow T:coordinates = "lon", "lat";. That would make three different representations of the same information. If we had a general rule for interoperability between these types then we could start using String[] now without breaking backward compatibility.
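One possible shape for such a general rule is for tool code to normalize all three representations to the same list; a sketch (the function name is hypothetical):

```python
def coordinate_names(attr):
    """Normalize a `coordinates` attribute to a list of variable names.

    Covers the three forms discussed: char[] (often read back as bytes),
    a scalar String, and a String array.
    """
    if isinstance(attr, bytes):      # netCDF-3 char[] attribute
        attr = attr.decode("utf-8")
    if isinstance(attr, str):        # scalar string, space-separated
        return attr.split()
    return [str(name) for name in attr]  # String[] array

assert coordinate_names(b"lon lat") == ["lon", "lat"]
assert coordinate_names("lon lat") == ["lon", "lat"]
assert coordinate_names(["lon", "lat"]) == ["lon", "lat"]
```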

@BobSimons

I'm confused. You seem to be lamenting the annoyance of allowing two forms, then suggesting that we also support a third form. I'll leave these decisions to John Caron.

One correction / one preference for notation: you refer to a "String[0]" as equivalent to a char[]. I think it is better to refer to a "String" (a scalar String, a String with no dimensions). As a variable definition, String[0] is a string array with one dimension, where the dimension size is 0; in other words, a String array with 0 Strings.


@bennoblumenthal

As a general principle, it would be much better if CF used the native netCDF expression of multiple values (e.g. an array for a list of values) than forcing software to interpret coded strings (e.g. space-separated token parsing). That way the information could be more easily transformed into other systems (XML, OGC, semantic technologies), because the transformation would be less dependent on the particular content. Clearly this goes against backward compatibility; the question is whether it would simplify the transformation problem enough to make it worth it. Clearly such a change would be an investment in the somewhat distant future.

@stephenpascoe

My earlier comments were a bit confused. I will try and disentangle my thoughts here now.

@BobSimons I meant String[1], not String[0], in that all attributes are 1-d arrays of values underneath. Sorry, that must have been completely mystifying to read. I'll just use String here.

In the README.md of this group we say we should take care when introducing new features, because it is easier for users if there is only one way to do something (charter, item 4). NetCDF4 strings are possibly the simplest example of NetCDF4 introducing more than one way of doing something that previously had only one.

There are 2 separate points:

  1. string attributes that were of type char[] can now also be String.
  2. we can now store 1-d arrays of strings.

I think no. 1 is dealt with by John's proposal: char[] is deprecated and we should use String. If we can help keep char[] out of NetCDF4 by raising warnings in the CF checker, etc., all the better IMO.

I agree with Benno that no. 2 allows us to be more structured, which is good, but we need to be mindful of backward compatibility. If an attribute is encoded as String[] in NetCDF4, are we abandoning NetCDF3, or is there an equivalent encoding as char[]? It would be good to start using String[], but without some general equivalence rules that are compatible with NetCDF3 I believe any suggestion will be resisted.

@cpaulik

cpaulik commented Jun 14, 2016

I am strongly in favor of allowing unsigned bytes and other unsigned data types. This is especially important since we often have data that is packed. Currently this seems impossible when following the CF standard strictly, since the section about packed data states that valid_min, valid_max, or valid_range must have the same data type as the packed data, whereas the section about data types states that they can be used to specify an unsigned range.
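For illustration, reinterpreting a signed byte value as unsigned is a simple bit mask; the point of the proposal is that with a true ubyte type, readers would not have to do this at all (a sketch, not any library's API):

```python
def as_unsigned_byte(value: int) -> int:
    """Reinterpret a signed 8-bit value (-128..127) as unsigned (0..255)."""
    return value & 0xFF

# Packed data written as signed bytes but meant to span 0..255:
assert as_unsigned_byte(-1) == 255
assert as_unsigned_byte(-128) == 128
assert as_unsigned_byte(100) == 100
```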

@kenkehoe

I am strongly in favor of using string arrays for the flag_meanings attribute. I understand that will break backward compatibility, but isn't that the whole point of a major revision? We need to remember that CF netCDF files now contain much more than simple gridded data, and status fields can carry complicated information. For my work, a flag descriptor that is less restrictive about the formatting of the text string will be a great benefit. Currently the individual flag_meanings must not contain white space, which makes complicated flag descriptions difficult to read: for example, a referenced variable name containing an underscore gets lost, and long descriptions become harder to read. I would also argue that flag_values and flag_masks are vectors that do not natively match flag_meanings, which forces parsing of the flag_meanings attribute; using flag_meanings as a string vector lets the indexes match automatically.

If backward compatibility is the goal, we could introduce a new attribute name (e.g. flag_definitions) to keep flag_meanings of type char[] while using a string[] array for CF-2.0. I suggest entirely skipping the notion of a space-separated string[] for flag_meanings, to reduce the options to only one.

@dblodgett-usgs
Contributor

@rsignell-usgs and @kenkehoe -- should we migrate this issue to the latest cf-conventions issue tracker?
