CF quantization/bitgrooming variable attributes required #765
Hi All,
The Convention requires adding a quantization container variable to any file that includes a quantized variable, and pointing to the container variable from every variable that is quantized. The name of the container variable is up to you. That is it for the required portions of the Convention.
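As a sketch, the attribute layout described here might look like the following. The attribute names follow the proposal in cf-convention/cf-conventions#403; the container name `quantization_info` and the specific values are illustrative assumptions, not requirements:

```python
# Hypothetical CF quantization metadata, expressed as plain Python dicts.
# The container variable is a scalar whose attributes describe the method;
# each quantized variable points back at it via a "quantization" attribute.
container_name = "quantization_info"              # name is up to the producer
container_attrs = {
    "algorithm": "granular_bitround",             # or bitgroom, bitround
    "implementation": "libnetcdf version 4.9.2",  # illustrative value
}
quantized_var_attrs = {
    "quantization": container_name,   # link to the container variable
    "quantization_nsd": 3,            # significant digits retained (per variable)
}
print(quantized_var_attrs["quantization"])
```

The per-variable precision attribute (`quantization_nsd` for decimal digits, or `quantization_nsb` for bits) lives on the quantized variable itself, so different variables in one file can use different precisions while sharing one container.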
(The MRE for |
thanks for this nice summary of what's needed. We'll work on it. A few comments/questions:

I'd like to come up with a "container" name that might be a little more obvious to most users. In this context, what is the meaning of "quantization"? Could we say "rounding_method", "rounding_info", or "rounding_specs" instead?

It isn't clear to me whether the netCDF library actually reduces the size of each variable (from, say, 32 bits to something smaller), or whether the floating-point values occupy the same space in memory as the original, and the real reduction comes only when you compress the data after it's written (because of all the 0-valued bits at the end of each floating-point mantissa). I assume the library directly reduces the size of the file. (This question arose because the CF conventions document says "strictly speaking, quantization only pre-conditions data for more efficient compression by a subsequent compressor." So the question is: does the netCDF library automatically invoke the compressor, or does the user have to apply a compressor to the file written by netCDF?)

For CMIP, should we require a specific algorithm be used to bit groom? Should everyone using CMOR invoke the same algorithm? What algorithm do you recommend? (Yours, I presume — which one is yours?) The CF document says BitRound, BitGroom, and Granular BitRound can all be invoked directly from the netCDF library, so I'm guessing we want to limit the options to one of these.

I think it would be nice if CMOR calculated the "MRE" from the "nsb" and recorded it in the file, as in your example, but I don't think that should be a requirement for CMIP data. Any other opinions about this?
Responses interleaved...
In this context "quantization" means setting or rounding (they are distinct operations) the least-significant bits to 0 or to 1. It is legal for CMOR to call it "rounding", though that would be slightly misleading because some (older) quantization methods (e.g., BitGroom, BitShave) do not round per se; they simply truncate (set trailing bits to zero). "Rounding" also has related meanings (e.g., IEEE rounding methods/directions), and some think of it as a base-10 operation that turns a float into an int (e.g., via a C-library rounding function).
Quantization by itself does not reduce the dataset size. Some form of compression must be applied after quantization to reduce the netCDF dataset size; almost always this will be lossless compression. Quantization is a pre-filter for compression, much like the Shuffle filter is. In contrast to Shuffle, though, quantization leaves the numbers in IEEE 754 format, so no decoder is necessary.
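The pre-conditioning effect is easy to demonstrate. Zeroing trailing mantissa bits (a crude "bit shave", used here purely for illustration) leaves the array exactly the same size in memory, but a lossless compressor then shrinks it far more than the original random-mantissa data:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000).astype(np.float32)

# Crude bit shave: zero the trailing 16 of the 23 explicit mantissa bits.
shaved = (data.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

raw_size = len(zlib.compress(data.tobytes()))
shaved_size = len(zlib.compress(shaved.tobytes()))
# The quantized array is still IEEE 754 float32 and the same uncompressed
# size, but it compresses much better because of the runs of zero bits.
print(shaved_size < raw_size)
```

The same principle applies to the real BitGroom/BitRound filters; they just choose the retained bits more carefully than this sketch does.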
It's not clear to me how this will shake out. If CMIP endorses quantization, I think it is likely to require the BitRound algorithm, because that allows the finest granularity of precision and also has easily accessible implementations in pure Python and in Julia (i.e., these would not require the netCDF library). Furthermore, CMIP decision-makers may be more comfortable thinking in terms of bits. On balance I too prefer BitRound. However, many climate researchers and CMIP users might prefer quantization algorithms characterized by decimal digits, because they are used to thinking in terms of significant digits. I think the best algorithm for that is Granular BitRound (not BitGroom!). This is not the place to decide the algorithm for CMIP, though.
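The "pure Python" accessibility mentioned above is real: BitRound reduces to a few lines of bit manipulation on the IEEE 754 representation. The sketch below rounds half-up on the dropped bits; production implementations (e.g., in libnetCDF) additionally do round-half-to-even tie-breaking and handle special values, which are omitted here:

```python
import numpy as np

def bitround(values, keepbits):
    """Keep `keepbits` explicit mantissa bits of float32 data (0 < keepbits < 23).

    Minimal sketch: no NaN/Inf handling, round-half-up instead of half-even.
    """
    ui = np.asarray(values, dtype=np.float32).view(np.uint32)
    drop = 23 - keepbits                           # mantissa bits to discard
    half = np.uint32(1) << np.uint32(drop - 1)     # half of the dropped range
    mask = np.uint32(0xFFFFFFFF) << np.uint32(drop)
    # Add half, then truncate: rounds to the nearest representable value.
    return ((ui + half) & mask).view(np.float32)

x = np.array([3.14159265], dtype=np.float32)
print(bitround(x, 10))  # close to pi, with only 10 mantissa bits retained
```

The Julia implementations referenced above follow the same idea; none of this requires linking against libnetCDF.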
I'm curious to see the responses to this question.
Thanks for the feedback. I hope you don't mind a few follow-up questions: You explained that quantization does not reduce dataset size. So do we envision for CMIP the following?
Or
or some other procedure? Under either method, is it correct that the data user will have to uncompress the netCDF file before reading it (through the netCDF API)? Regarding the last question just above, I didn't know there were commonly used alternatives to the netCDF API for reading netCDF files. Is it true that you can read the file directly using Python through some other method? Also, this is the first I've heard of Julia ... looks like it's becoming popular in some circles. Is it attractive mostly for high-performance, highly parallelized operations?
Responses interleaved:
The above is certainly possible. It's hard to guess, though; based on CMIP6 workflows, I think providers are most likely to use the netCDF library (rather than "some other method") for all these features. However, if "some other method" includes Xarray, then a goodly fraction of providers will indeed use it, because Xarray and Pangeo are so popular.
I think the libnetCDF method is most likely. However, the simplest approach within the netCDF API is for the provider to do quantization immediately followed by lossless compression. This obviates the need to store intermediate results to disk and is exactly analogous to what providers did for CMIP6. Recall that CMIP6 data uses the Shuffle filter and DEFLATE. The proposed CMIP7 workflow simply inserts one more filter (quantization) between the Shuffle and lossless-compression steps. The netCDF4 library executes Shuffle, then Quantize, then Lossless on one chunk of data at a time, without writing to disk in between. For the lossless codec I recommend Zstandard because its compression is faster and nearly as efficient as DEFLATE, while its decompression is much faster than DEFLATE, leading to a more pleasant user experience. Other possibilities include doing the quantization and compression with a separate post-processing toolkit such as NCO (briefly shown in the poster), Xarray, or CMOR. (I don't know whether CDO has added hooks to the quantize functions in libnetCDF.) Since NCO and CMOR both use the netCDF library, those methods should probably be considered part of the preceding workflow too.
Yes in that the data must be uncompressed before it can be used. No in that the user does not need to do anything by hand. The netCDF API automatically decompresses the data, identically to how it works with CMIP6 datasets. That is part of the beauty of quantization and of the netCDF API. Xarray also hides the messiness of decompression from the user. Hence the requirements on the user will not change no matter how the producers decide to implement their CMIP7 workflows.
Yes. Xarray provides its own interface to writing netCDF4-compliant files in HDF5 or Zarr back-end formats. This interface does not depend at all on the netCDF library.
I defer this question to more knowledgeable Julia users. I'm not sure whether Julia uses a wrapper around libnetCDF or (like Xarray) provides its own independent backend to write netCDF4-compliant files.
Thanks @czender ! Now I understand everything I need to understand.
Oops, forgot to attach the poster mentioned above...
There are some additional steps required to get the CF standards aligned with the new functionality in CMOR 3.9 that uses the new netCDF 4.9.x lossy compression. This will require the addition of a couple of variable attributes when such compression is used - see cf-convention/cf-conventions#403.
It would be great to get this technical work finalized for the CMOR 3.10 release, along with the branded-variable changes - all changes occurring to the netCDF variable.
ping @czender @taylor13 @mauzey1