CF quantization/bitgrooming variable attributes required #765

Open
durack1 opened this issue Dec 17, 2024 · 7 comments

durack1 (Contributor) commented Dec 17, 2024

There are some additional steps required to get the CF standards aligned with the new functionality in CMOR 3.9 that uses the new netCDF 4.9.x lossy compression - this will require adding a couple of variable attributes if such compression is used - see cf-convention/cf-conventions#403.

It would be great to get this technical work finalized for the CMOR 3.10 release, along with the branded-variable changes - all of these changes apply to the netCDF variable.

ping @czender @taylor13 @mauzey1

czender commented Dec 17, 2024

Hi All,
Paul and I discussed this at AGU, so thanks @durack1 for getting the ball rolling on this. The clearest summary of what would need to be done in CMOR to support the new CF Quantization Convention is in Examples 8.8 and 8.9 of the convention itself. Stealing from those examples, CMOR would need to add a "container variable" like

 char quantization_info ;
      quantization_info:algorithm = "bitround" ;
      quantization_info:implementation = "libnetcdf version 4.9.2" ;

to any file that includes a quantized variable, and point to the container variable from every variable that is quantized, like this:

float ps(time,lat,lon) ;
      ps:quantization = "quantization_info" ;
      ps:quantization_nsb = 9 ;
      ps:standard_name = "surface_air_pressure" ;
      ps:units = "Pa" ;

The name of the container variable is up to you. That is it for the required portions of the Convention.
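For concreteness, a minimal netCDF4-python sketch that writes both required pieces, reusing the container name and attribute values from the CDL above (the tooling choice and the file name are assumptions here; CMOR's own API would do the equivalent):

    # Minimal sketch, assuming netCDF4-python; the container name and NSB
    # value are choices, not mandated by the Convention.
    from netCDF4 import Dataset

    nc = Dataset("example.nc", "w")          # hypothetical file name
    nc.createDimension("time", None)
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)

    qi = nc.createVariable("quantization_info", "c", ())  # scalar char container
    qi.algorithm = "bitround"
    qi.implementation = "libnetcdf version 4.9.2"

    ps = nc.createVariable("ps", "f4", ("time", "lat", "lon"))
    ps.quantization = "quantization_info"    # points at the container variable
    ps.quantization_nsb = 9
    ps.standard_name = "surface_air_pressure"
    ps.units = "Pa"
    nc.close()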
However, nothing in the Convention prevents implementors from adding additional quantization information, so it is also possible to write error metrics for the quantization into the per-variable metadata. For example, NCO automatically writes the maximum relative error (MRE) metric for the bitround algorithm like this:

 variables:
    float ps(time,lat,lon) ;
      ps:standard_name = "surface_air_pressure" ;
      ps:units = "Pa" ;
      ps:quantization = "quantization_info" ;
      ps:quantization_nsb = 13 ;
      ps:quantization_maximum_relative_error = 6.103516e-05f ;

(The MRE for bitround is 2^-(NSB+1).) Again, error metrics are completely optional. The thought is that some users might like to know this info.
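A quick arithmetic check of that relation against the value NCO wrote above (Python):

    # MRE = 2^-(NSB+1); NSB = 13 reproduces quantization_maximum_relative_error
    nsb = 13
    print(2.0 ** -(nsb + 1))   # 6.103515625e-05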
Adding this quantization convention would allow CMOR to trim a lot of size from datasets in a CF-compliant way! I'm happy to chime in on any questions or implementation issues that may arise.

taylor13 (Collaborator) commented Dec 18, 2024

Thanks for this nice summary of what's needed. We'll work on it. A few comments/questions:

I'd like to come up with a "container" name that might be a little more obvious to most users. In this context what is the meaning of "quantization"? Could we say "rounding_method" or "rounding_info" or "rounding_specs" instead?

It isn't clear to me whether the netCDF library actually reduces the size of each variable (from say 32 bits to whatever), or if the floating point values occupy the same space in memory as the original, but if you compress the data after it's written, that's when you get much better reduction (because of all the 0-valued bits at the end of each floating point mantissa). I assume the library actually directly reduces the size of the file. (This question arose because the CF conventions document says "strictly speaking, quantization only pre-conditions data for more efficient compression by a subsequent compressor." So the question is: does the netCDF library automatically invoke the compressor, or does the user have to apply a compressor to the file written by netCDF?)

For CMIP, should we require a specific algorithm be used to bit groom? Should everyone using CMOR invoke the same algorithm? What algorithm do you recommend? (Yours, I presume.) Which one is yours? The CF document says BitRound, BitGroom, and Granular BitRound can all be directly invoked from the netCDF library, so I'm guessing we want to limit the options to one of these.

I think it would be nice if CMOR calculated the "MRE" from the "nsb" and recorded it in the file, as in your example, but I don't think that should be a requirement for CMIP data. Any other opinions about this?

czender commented Dec 18, 2024

Responses interleaved...

I'd like to come up with a "container" name that might be a little more obvious to most users. In this context what is the meaning of "quantization"? Could we say "rounding_method" or "rounding_info" or "rounding_specs" instead?

In this context "quantization" means setting or rounding (the two are distinct) the least significant bits to 0 or to 1. It is legal for CMOR to call it "rounding", though that would be slightly misleading because some (older) quantization methods (e.g., BitGroom, BitShave) do not round per se, they simply truncate (set trailing bits to zero). "Rounding" also has some related meanings (e.g., IEEE rounding methods/directions) and some think of it as a base-10 operation that turns a float into an int (e.g., the C-library function round()). Thus I think "quantization" is more precise than "rounding". Maybe that's splitting hairs, though. Choose the container name you think is best. Any name will be CF-compliant.

It isn't clear to me whether the netCDF library actually reduces the size of each variable (from say 32 bits to whatever), or if the floating point values occupy the same space in memory as the original, but if you compress the data after it's written, that's when you get much better reduction (because of all the 0-valued bits at the end of each floating point mantissa). I assume the library actually directly reduces the size of the file. (This question arose because the CF conventions document says "strictly speaking, quantization only pre-conditions data for more efficient compression by a subsequent compressor." So the question is: does the netCDF library automatically invoke the compressor, or does the user have to apply a compressor to the file written by netCDF?)

Quantization by itself does not reduce the dataset size. Some form of compression must be applied after quantization to reduce the netCDF dataset size. Almost always this will be lossless compression. Quantization is a pre-filter for compression, much like the Shuffle filter is. In contrast to Shuffle, though, Quantization leaves the number in IEEE754 format so no decoder is necessary.
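To illustrate the point, a NumPy sketch with made-up data (plain bit shaving shown for brevity rather than true BitRound): the quantized floats occupy exactly the same number of bytes, but their zeroed trailing bits make a subsequent lossless codec far more effective.

    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    keep = 9                                   # NSB: explicit significand bits kept
    mask = ~np.uint32((1 << (23 - keep)) - 1)  # float32 has 23 explicit bits
    q = (x.view(np.uint32) & mask).view(np.float32)  # shave trailing bits to 0

    print(x.nbytes, q.nbytes)                  # same size in memory: 4000000 4000000
    print(len(zlib.compress(x.tobytes())),     # quantized bytes compress
          len(zlib.compress(q.tobytes())))     #   far smaller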

For CMIP, should we require a specific algorithm be used to bit groom? Should everyone using CMOR invoke the same algorithm? What algorithm do you recommend? (Yours, I presume.)

It's not clear to me how this will shake out. If CMIP endorses quantization, I think it is likely to require the BitRound algorithm because that allows the finest granularity of precision and also has easily accessible implementations in pure Python and in Julia (i.e., these would not require the netCDF library). Furthermore, CMIP decision-makers may be more comfortable thinking in terms of bits. On balance I too prefer BitRound. However, many climate researchers and CMIP users might prefer quantization algorithms characterized by decimal digits because they are used to thinking in terms of significant digits. I think the best algorithm for that is Granular BitRound (not BitGroom!). This is not the place to decide the algorithm for CMIP, though.
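For readers curious what BitRound actually does, a rough pure-Python/NumPy sketch (real implementations, e.g. in libnetCDF and NCO, also handle NaN/Inf and round ties to even, which this omits):

    import numpy as np

    def bitround(x, nsb):
        """Keep nsb explicit float32 significand bits, rounding to nearest
        (ties rounded up here for simplicity)."""
        ui = np.asarray(x, dtype=np.float32).view(np.uint32)
        shift = 23 - nsb                    # bits to discard
        half = np.uint32(1 << (shift - 1))  # half of the last retained ulp
        mask = ~np.uint32((1 << shift) - 1)
        return ((ui + half) & mask).view(np.float32)

    print(bitround(101325.0, 9))            # 101376.0 (9 significant bits kept)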

I think it would be nice if CMOR calculated the "MRE" from the "nsb" and recorded it in the file, as in your example, but I don't think that should be a requirement for CMIP data. Any other opinions about this?

I'm curious to see the responses to this question.

taylor13 (Collaborator) commented

Thanks for the feedback. I hope you don't mind a few follow-up questions:

You explained that the quantization does not reduce dataset size. So do we envision for CMIP the following?

  1. a data provider bitgrooms an array using some method of his/her choosing.
  2. the data provider writes the array to a file through the netCDF API
  3. the data provider compresses the netCDF file (using gzip?) and then publishes it on ESGF

Or

  1. A data provider sends an array (without any quantization) through the netCDF library, along with quantization directives, and netCDF executes a bitgrooming algorithm and writes the data to a file
  2. the data provider compresses the netCDF file and then publishes it on ESGF

or some other procedure?

Under either method, is it correct that the data user will have to uncompress the netCDF file before reading it (through the netCDF API)?

Regarding the last question just above, I didn't know there were commonly used alternatives to the netCDF API for reading netCDF files. Is it true that you can read the file directly using python through some other method? Also, this is the first I've heard of Julia ... looks like it's becoming popular in some circles. Is it attractive mostly for high performance, highly parallelized operations?

@czender
Copy link

czender commented Dec 18, 2024

Responses interleaved:

You explained that the quantization does not reduce dataset size. So do we envision for CMIP the following?

1. a data provider bitgrooms an array using some method of his/her choosing.
2. the data provider writes the array to a file through the netCDF API
3. the data provider compresses the netCDF file (using gzip?) and then publishes it on ESGF

The above is certainly possible. It's hard to guess, though; based on CMIP6 workflows I think providers are most likely to use the netCDF library (rather than "some other method") for all these features. However, if "some other method" includes Xarray, then a goodly fraction of providers will indeed use it because Xarray and Pangeo are so popular.

1. A data provider sends an array (without any quantization) through the netCDF library, along with quantization directives, and netCDF executes a bitgrooming algorithm and writes the data to a file
2. the data provider compresses the netCDF file and then publishes it on ESGF

or some other procedure?

I think the libnetCDF method is most likely. However, the simplest approach within the netCDF API is that the provider does quantization immediately followed by lossless compression. This obviates the need to store intermediate results to disk and is exactly analogous to what providers did for CMIP6. Recall that CMIP6 data uses the Shuffle filter and DEFLATE. The proposed CMIP7 workflow simply inserts one more filter (quantization) between the Shuffle and lossless compression steps. The netCDF4 library executes Shuffle then Quantize then Lossless on one chunk of data at a time without writing to disk in between. For the lossless codec I recommend Zstandard because its compression is faster and nearly as efficient as DEFLATE while its decompression is much faster than DEFLATE, leading to a more pleasant user experience.
I attach the poster I showed at AGU which, in the fourth box from the top on the LHS, shows where and how the CMIP7 procedure would differ from the CMIP6 procedure using the libnetCDF approach. The CMIP6 format requires setting up two sequential codecs (Shuffle and DEFLATE) in libnetCDF, whereas the proposed CMIP7 format requires setting up three sequential codecs (e.g., Shuffle, Quantize, Zstandard) in libnetCDF.
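In netCDF4-python terms, that chain is a single variable definition (a sketch; assumes netCDF4-python >= 1.6 and a libnetCDF 4.9+ build with the Zstandard plugin available; the NSB of 9 and the file name are illustrative only):

    from netCDF4 import Dataset

    nc = Dataset("cmip7_style.nc", "w")     # hypothetical file name
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)
    # Shuffle -> Quantize (BitRound) -> Zstandard, applied chunk by chunk
    # at write time with no intermediate file on disk.
    ps = nc.createVariable("ps", "f4", ("lat", "lon"),
                           compression="zstd", complevel=3, shuffle=True,
                           quantize_mode="BitRound", significant_digits=9)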

Other possibilities include doing the quantization and compression with a separate post-processing toolkit such as NCO (briefly shown in the poster), Xarray, or CMOR. (I don't know whether CDO has added hooks to the quantization functions in libnetCDF.) Since NCO and CMOR both use the netCDF library, those methods should probably be considered part of the preceding workflow too.

Under either method, is it correct that the data user will have to uncompress the netCDF file before reading it (through the netCDF API)?

Yes in that the data must be uncompressed before it can be used. No in that the user does not need to do anything by hand. The netCDF API automatically decompresses the data, identically to how it works with CMIP6 datasets. That is part of the beauty of quantization and of the netCDF API. Xarray also hides the messiness of decompression from the user. Hence the requirements on the user will not change no matter how the producers decide to implement their CMIP7 workflows.
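For example, reading a quantized, Zstandard-compressed variable looks no different from reading any other (a sketch; assumes the reader's libnetCDF can load the same codec plugins, and reuses the hypothetical file from the earlier sketch):

    from netCDF4 import Dataset

    with Dataset("cmip7_style.nc") as nc:
        ps = nc["ps"][:]   # Shuffle/Zstandard undone transparently on read;
                           # the quantized values are ordinary IEEE754 floats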

Regarding the last question just above, I didn't know there were commonly used alternatives to the netCDF API for reading netCDF files. Is it true that you can read the file directly using python through some other method?

Yes. Xarray provides its own interface for writing netCDF4-compliant files in HDF5 or Zarr back-end formats. This interface does not depend at all on the netCDF library.

Also, this is the first I've heard of Julia ... looks like it's becoming popular in some circles. Is it attractive mostly for high performance, highly parallelized operations?

I defer this question to more knowledgeable Julia users. I'm not sure whether Julia uses a wrapper around libnetCDF or (like Xarray) provides its own independent backend to write netCDF4-compliant files.

taylor13 (Collaborator) commented

Thanks @czender! Now I understand everything I need to understand.

czender commented Dec 18, 2024

Oops, forgot to attach the poster mentioned above...
pst_cf_agu_202412.pdf
