Metadata to encode quantization properties #403
Comments
Dear Charlie @czender

Thanks to you and others at the hackathon for considering this. I agree with you that it is a good idea to split up the description of the compression into a number of pieces, rather than gluing them together in something like `_QuantizeBitGroomNumberOfSignificantDigits`. I may be misunderstanding your CDL, but I can only see two metadata schemes sketched out: a container variable with attributes, or a string-valued attribute with key-value pairs. Either of those is CF-compatible. Of those two schemes, I think the container variable is better, because (a) it can be shared by many data variables, thus reducing redundancy, (b) it makes the data variables themselves more readable, (c) since some of the values are numeric, it feels nicer not to have to extract them from a string.

What would it mean if the … ? Would the vocabulary for these schemes be defined by an appendix of the convention, or by other documents (like the standard name table), do you think?

Best wishes

Jonathan
Apologies everyone for my long absence from updating this issue. Thanks @JonathanGregory for your comments last October. I updated the issue in December 2022, and then @davidhassell provided more feedback in March 2023 that included comments on issues Jonathan had raised earlier (above). I responded to David's feedback in May 2023 and I wanted to retain the content of that offline exchange so that others could follow our discussion. With his permission, below I quote David's comments from March (double indented) and my responses to those (single indented).
And let me also respond directly to Jonathan's question which has not yet been directly addressed:
Yes
The proposal is currently only fully fleshed out for Quantization. Other potential lossy compression algorithms include Sz and Zfp. I know little about those algorithms other than their homepages. Whether they might comprise their own "families", or both be members of a single family, remains to be determined.
Hi @davidhassell and @JonathanGregory, I am back in Santander today and all next week, courtesy of @cofinoa :) My goal next week is to draft and submit the CF PR for this issue. It would be helpful to have some guidance on a few high-level questions beforehand, so I start off on the right foot with the PR:
There is also one recent update to report. NCO has supported the method number 2 draft proposal since version 5.2.0. So if you use NCO to perform lossy compression, the output dataset will contain the container variable and the minimal amount of lossy metadata in the proposed format. My upcoming EGU24 poster, EGU24-13651, shows some of this. An example follows.
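A minimal CDL sketch of the kind of header this produces. The variable names are illustrative, and the attribute spellings follow the form the proposal eventually took; the exact names written by a given NCO version may differ:

```
netcdf nco_output {
dimensions:
  time = 4 ;
variables:
  float ts(time) ;
    ts:standard_name = "air_temperature" ;
    ts:units = "K" ;
    ts:quantization = "quantization_info" ;   // points at the shared container variable
    ts:quantization_nsd = 3 ;                 // number of significant digits retained
  // Container variable shared by all variables quantized the same way
  int quantization_info ;
    quantization_info:algorithm = "granular_bitround" ;
    quantization_info:implementation = "libnetcdf version 4.9.2" ;
}
```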
Dear Charlie @czender

Thanks for the update and for being willing to work on this. There's not a definite rule about what to do with the first posting in the issue when it gets outdated. My own view is that it's good to keep the proposal as originally stated, because it is needed in order to understand how the discussion began. If I were you, I would make a new posting to this issue, perhaps with some of the headings and text repeated from the first posting if that suits your purposes, and edit the first posting itself to give a link to the new version, e.g. under the latest moderator summary section. But I think any way of proceeding which is clear and suitable would be acceptable, to be honest!

I think your plans for a new subsection and appendix are fine. Whether to have an appendix depends on how long the detail is. If it's not huge, it could be in the subsection, e.g. all the detail about … .

That's great news that you've already implemented the draft proposal! I hope you enjoy your time in Spain.

Cheers

Jonathan
Hi Charlie, This is good news! I agree with Jonathan's comments. In particular (largely just restating items Jonathan said :))
All the best,
Hi @czender - is the PR ready for review?
@davidhassell not yet. It still needs formatting, examples, and clean-up. Maybe next week. Don't worry, I'll request your review when ready.
Dear All, I submitted #519 to implement the core of this proposal. As noted above, it does not include anything about lossy compression error metrics. It could, but I was running out of steam. I would appreciate any feedback on the core of #519 before deciding whether to extend it to include error metrics. One issue with error metrics is that there is a large menagerie of them. I think someone with a better background in statistics could do a better job than me, at least with the "Post metrics". There is one (and only one, AFAICT) "Pre metric" that is easy to implement and to understand. Specifically, it is the maximum relative error incurred by BitRound. This error is simply `2^(-NSB)`.
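In symbols (my rendering of the bound just stated, with $x$ a raw value, $\hat{x}$ its BitRound-quantized counterpart, and NSB the number of significant bits retained):

$$\left|\frac{\hat{x}-x}{x}\right| \;\le\; 2^{-\mathrm{NSB}}$$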
Here at EGU24 today, a few people mentioned they would be more inclined to use lossy compression if the error metrics were included with the data. Any of the metrics mentioned above could be added if the software has access to both the quantized and raw arrays. So you can imagine, e.g., correlations, SNR, etc. being computed and added by software that wants to do so (and has access to raw and quantized data). Thoughts on whether to add the `maximum_relative_error`, and whether to add more complex metrics, to the PR are welcome. Personally I am only inclined to add the `maximum_relative_error` myself, and only to add it to the convention once the initial PR has been finalized (because I'm learning, again, how much work it is to draft these conventions!).
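For illustration only (these formulas are standard definitions, not text from the PR), two of the simpler post metrics for raw values $x_i$ and quantized values $\hat{x}_i$, $i = 1, \dots, N$, would be:

$$\mathrm{mean\_absolute\_error} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{x}_i - x_i\right|, \qquad \mathrm{rmsd} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^2}$$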
Hi Charlie, I think it makes sense to wait to add error metrics until after the initial PR is finished. It sounds like a topic that warrants a separate discussion of its own. (And it also sounds like that will help keep this issue from stalling out, which is good.)
Hi Charlie,
Hi Charlie, I'm currently reviewing the text, but before I carry on, I thought it would be good to mention the term lossy compression variable. I'm not sure that this is a good name. I think that it's fine to describe quantization in chapter 8, because its primary use case is to compress, but the container variable described is not generic to lossy compression (e.g. lossy compression by coordinate subsampling also has a container variable, called the interpolation variable). Rather, it is specific to quantization, so wouldn't quantization variable be better? I would follow this through to other uses of "lossy compression", e.g. I might call the per-variable attribute … . What do you think? Cheers,
Hi David, Thanks for editing this. Let me explain the intent, and then you and @JonathanGregory can decide. A bit further on in the "family" section, the PR says: "Other potential families of lossy algorithms include rounding, packing, zfp, and fpzip." And that's not a comprehensive list. Other lossy algorithm families could be, e.g., discrete cosine transform, layer packing, logarithmic packing... AFAICT, the … . If we change the name to … . Note that we discussed what to name the container variable at the 2022 workshop. I think there were folks on both sides of the question you are (re-?)raising. The current PR implements what we hammered out at that workshop. Of course, it's better to change things now if in hindsight you/we think the workshop conclusions could be improved.
As a data producer / consumer, I strongly favor having a single `lossy_compression` attribute. It's much easier to check for that and then decide how to deal with it than to check for a whole bunch of different indicators, and the failure state is much better if the compression uses an algorithm that you didn't know existed. It's also more communicative to folks who may have never encountered the issue before; if I'm scanning through the headers of a file, the term `lossy_compression` … .
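For instance (a hypothetical header excerpt, using Seth's preferred spelling), a scan with `ncdump -h` would surface a single well-known attribute:

```
float pr(time, lat, lon) ;
  pr:lossy_compression = "compression_info" ;   // one attribute name to check for, whatever the algorithm
```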
Hi Charlie and Seth, You both put the reasons for using `lossy_compression` very well. However I'm concerned that we are being misleading in the case that a creator wants to remove false precision for scientific reasons, but doesn't want to use native compression (or indeed can't - netCDF-3). In that case we haven't done any lossy compression, but have done a lossy filter (is that the right word?). I retract my suggested renaming, but (thinking out loud, here!) wonder if we need two containers: "lossy_compression", and another one for algorithms that have changed the data without compressing it ("filter", perhaps?). The former would not be needed at this time as there is no use case, yet. Thanks,
Hi David, That's an interesting point, and it will deserve thoughtful consideration if and when we get a use case for it. There are a lot of potential uses for filtering, and I think if we're going to represent them in CF, we want to set ourselves up to use an umbrella approach that can handle all of them in a unified way, rather than dealing with each of them completely independently of the others. So we should try to remember this (potential) use case if/when it comes up. (I think CF is starting to bump up against the issue of explosive combinatorics making namespaces unmanageable in a number of areas, and I'm keen to forestall more of these problems in the future.) But since we aren't currently (AFAIK) considering any use cases for filtering, we should leave that aside for now. Cheers,
Hi All, @davidhassell your suggestion is absolutely correct. And thank you @sethmcg for your thoughts. Here's my response to your suggestion: @JonathanGregory raised a similar point, above, which I copy along with my response here:
It might be better to segregate, as you suggest, these dual roles into different container variables. Let's consider the implications of a distinct container variable to describe transformations that do no compression. One example is the Shuffle filter. Byte-Shuffle and bit-Shuffle rearrange (bytes and bits, respectively) to enhance subsequent compression, similar to quantization. However, shuffling leaves the values in a non-IEEE format that cannot be stored in a netCDF3 file. A reverse transformation (unshuffle?) is required before using the data, so Shuffle must be, and is, implemented as an HDF5 filter so that it is done transparently for the user. Quantization is IEEE-compatible and has no such requirement, so it's fine for netCDF3 files. Which raises the question: do we really want a container variable for transformations that do no compression? Which transformations are you thinking of? Sure, quantization fits the bill. But is that enough? Are there other transformation algorithms you have in mind that would justify creating a generic container variable to hold them? Deciding on the right level of generality is indeed hard, as there are bound to be trade-offs. Perhaps quantization is just in a category by itself, which would support your recent suggestion of naming the container "quantization" instead of "lossy compression". I wrote the current PR based on the discussion at the 2022 workshop. Let me reiterate that I'll go along with (and rewrite the PR for) whatever you and @JonathanGregory and any other interested people like @sethmcg and @cofinoa decide is best for CF.
We seem to be stalled on this issue. Would everyone interested in seeing this proposal come to fruition, in some form, please respond with what changes, if any, you would like to see in the current PR? Helpful responses include, though are not limited to:
Mahalo,
Hi Charlie,
Dear Charlie @czender

I'm sorry that I've kept you waiting. Below are some minor review comments. None of these imply any change to the proposed convention; they're just about the text. I hope they make sense and you can work out what each applies to.

Regarding the question of what the container attribute should be called, I tend to think that "lossy compression" sounds too general, because Sect 8.3 is about "lossy compression" as well, and the basic "packing" of Sect 8.1 is a kind of lossy compression. Furthermore, as you write in the preamble of the new section, this convention is actually not about compression at all, but about modifying the data to make it suitable for compression. The title of the new section being "Lossy Compression by Quantization" makes me prefer "quantization" as a name for the container, and dispense with … .

You also sensibly remark that there may be other algorithms that come along, which could also be described as lossy compression and might want similar container variables. That's true, but also they might not! I don't think we are tying our hands. The emergence of a new use-case in future, which needs a similar treatment but can't be described as quantization, might suggest a suitable name for a container that could handle both quantization and the new method. We could then, for example, make quantization into an alias.

The new attributes should be described in Appendix A, within which you will also need to invent a new "Type" value for the container variable.

Many thanks for developing this addition. I support its going ahead. I hope it won't take long.

Best wishes

Jonathan

Convention

- "codec". Could you replace this word with less technical English?
- "define a metadata framework to provide quantization properties". Maybe replace "provide" with "describe" or "record"?
- "irrecoverable". It's possible it still exists, but that is beside the point. Could we say "differ from the original unquantized data, which are not stored in the dataset and may no longer exist"?
- I suggest omitting the paragraph "Software can use ...". This isn't part of the purpose or description of the convention; it's an explanation of why you proposed this form of it. It's useful in making the proposal, but we don't usually preserve such reasoning in the conventions document.
- "If and when … algorithms". Actually we would probably not do that, since we have a generous interpretation of backward compatibility, in order to minimise misinterpretations, in principle 9 of Sect 1.2: "Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions." Following this principle, we would never make … .
- "The final attribute". As well as "free-form", you could say "unstandardised".
- "Each variable that". I think this should be "each data variable". It might also be useful, for the sake of clarity, to state in the preamble that this convention is only for data variables, if that's the case. "... must include at least two attributes to describe the lossy compression. First, all such data variables must have a … ".
- "This section briefly describes ... for BitRound". I think "thus" should be omitted before "bias-free". This word implies that BitRound is free of bias because the IEEE adopted it!

Conformance

- I would say, "The value of … ".
- "The value of implementation". I would omit this, because it can't be verified by a checker program (unless it's truly intelligent). If you think it's essential to include, I would say, "must be a string that concisely … ".
- "The value of … ".
Thank you for these comments @JonathanGregory. I'll try to incorporate them in time for the 1.12 release.
@davidhassell and @JonathanGregory and @sethmcg I have addressed and merged the comments from Jonathan's recent review in #519. I hope I interpreted all the suggestions as intended. Please take another look and LMK what further changes are desired. (FYI, I have also updated the reference implementation in NCO to adhere to this updated version.)
Hi Charlie, I'm away for a week from tomorrow, but look forward to taking a good look when I get back.
Dear Charlie @czender

Many thanks for making changes following my comments. I hope they made sense to you. You have some text explaining why it would not be a good idea to quantize grid metrics. That's new text, isn't it? I think this is useful background to include, although it's not strictly part of the convention. However, it would be helpful to clarify that this text is intended to explain why quantization is not allowed for coordinate variables, bounds variables, etc. (as indicated by Appendix A, where the …). If so, I would suggest moving and modifying (in bold) a few sentences in the preamble, like this:
Finally, you have the sentence, "In general, we recommend against quantizing any … ". I hadn't previously noted the point that it's only for floating-point data (which makes sense, of course). I think that ought to be a requirement in the conformance document.

Best wishes and thanks for your patience

Jonathan
@JonathanGregory Thanks for your additional suggestions. Responses interleaved...
No, that text was in the original PR.
Yes, you have.
Agreed. Done.
Agreed. Done.
Yes, I think that CF should recommend against quantizing all data variables that are in … ("Data variables that appear in … ").
I just added this to the conformance document: "Only floating-point type data variables can be quantized." The PR now contains all of the above changes. Let's keep this momentum going and resolve all the issues. More feedback welcome. Charlie
Dear Charlie Thanks very much for explanations and improvements. I think this is essentially fine! Hoping that you have patience, I would like to make the following minor suggestions:
Seth was already happy and I hope that he still is. I expect we will hear soon from David. Best wishes Jonathan
Aloha @JonathanGregory,
Dear All, I have now heard from and addressed comments from a number of helpful reviewers. The associated PR appears likely to be merged without any further substantive changes. This is a last call for feedback before that occurs.
Thanks, Charlie. In that case, since the proposal has already received sufficient support to be adopted, according to the rules, we can "start the clock", by saying it will be accepted (and merged) in three weeks from today (on 9th August), if no-one raises any concerns before then.
Three weeks ago I said that this proposal would be accepted if no further comments were made. Two weeks ago @davidhassell and Charlie @czender had a discussion in the pull request #519 about two matters of substance. For the sake of clarity I give here a digest of the conversation. It would be good if an agreement can swiftly be reached, so that the proposal can be accepted in time for the next release. David proposed a requirement that "Variables that are referenced by any other variable must not have a … ".
To which Charlie replied: Your suggested replacement sentence might be OK. However, it's hard for me to parse, and I do not understand the logic of excluding variables that belong to a domain from being quantized. (NB: I just learned about domain variables today, so I do not understand them that well.) I agree that coordinate (of any type) variables that define a domain should never be quantized. Is that what you mean by variables that are "part of a domain"? To me "variables that are part of a domain" naturally includes data variables that occupy the domain, and I do not see the logic in proscribing quantization of these data variables. Please explain your reasoning or suggest a rewording that allows data variables in a domain to be quantized. Or are data variables and domain variables mutually exclusive, so that "data variables in a domain" is oxymoronic?

David and Charlie agreed that ancillary variables, which are data variables in their own right, should not be excluded from quantization, although they are referred to by the data variable. On the other hand, … . If my understanding is correct, I suggest that the simplest and clearest thing may be to spell out which kinds of variable can or should not be quantized. This could be done by replacing … with … .
David suggested deleting the conformance recommendation, because Charlie's text states that the implementation attribute can also contain any information that is useful, beyond the library version number. But if any such information were included, a checker program would give a warning, because of the above recommendation.

Charlie replied: Ahh, I see. That's fair. My intent was to recommend that the implementation attribute begin with "library version number" or "client version number", and then be followed by any additional information. It seems like all conceivable implementations will have either an unambiguous client or library, together with a version. Is a recommendation to put those before any other disambiguating information not worth including? If you think checking that would be too hard or error-prone, then I agree the recommendation should be removed from the conformance doc.

I suggest that the optional additional information could be put within … . What do you think, David and Charlie? Please could you reply in this issue rather than in the pull request, to keep the record straight. Thanks. 😃
@JonathanGregory Thank you for moving this along so it has a chance of being included in the next CF release. I agree with your two suggestions, and will add notes to underscore some points you make in the summary of the discussion that @davidhassell and I have recorded in the PR. First I'll note that the discussion was in the PR, not on this issue page, because it started in response to specific wording suggestions in David's review. I am not familiar with the etiquette of CF discussions that are substantive, so I responded in the PR. Thank you for summarizing this discussion in the issue. Regarding … :
You suspect correctly. I share (with you and David) the concern that no coordinates or metrics be quantized. I thought that David's suggestion risked prohibiting quantization of any data variables that occupy a domain, which would be overly restrictive in my opinion. Finally, regarding the Conformance recommendation for implementation, … .

Charlie
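For concreteness, here is a minimal CDL sketch of the eligibility rules as digested above. All names are illustrative, and the attribute spellings follow the form the proposal eventually took:

```
netcdf eligibility {
dimensions:
  time = 4 ;
  bnds = 2 ;
variables:
  float ts(time) ;                  // data variable: may be quantized
    ts:ancillary_variables = "ts_stderr" ;
    ts:quantization = "quantization_info" ;
    ts:quantization_nsd = 3 ;
  float ts_stderr(time) ;           // ancillary variable: a data variable in its own right, so may also be quantized
    ts_stderr:quantization = "quantization_info" ;
    ts_stderr:quantization_nsd = 3 ;
  double time(time) ;               // coordinate variable: must not be quantized
    time:bounds = "time_bnds" ;
    time:units = "days since 2000-01-01" ;
  double time_bnds(time, bnds) ;    // bounds variable: must not be quantized
  int quantization_info ;
    quantization_info:algorithm = "granular_bitround" ;
    quantization_info:implementation = "libnetcdf version 4.9.2" ;
}
```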
Dear Charlie @czender Thanks for following up. I'm happy with the idea that … . Best wishes Jonathan
Hello Charlie and Jonathan, Thanks, Jonathan, for moderating here - it has brought me back to thinking about it. I'm very happy with the current suggestions, specifically:
I like this new, clear text.
I'm happy with this, and therefore that this should be a requirement in the conformance document. David
@JonathanGregory and @davidhassell and any others interested: The latest PR has been updated to include these suggestions. As far as I can tell, it includes all requested changes, reads pretty well, and the formatting looks good. Please point out any remaining issues at your leisure.
Dear Charlie @czender I'm entirely happy with the new version. Thanks for your hard work. We will merge it three weeks from today (5th September) if @davidhassell agrees and there are no further concerns raised. Best wishes Jonathan
@davidhassell I just merged two recent changes to … .
Thank you, Charlie @czender and @davidhassell. I will merge it now.
Marvelous! Thanks to Charlie for persisting with this over many years, and also to Antonio for working on the final PR.
Thanks for all your editorial help @JonathanGregory and @davidhassell. Pinging WIP members @durack1 @taylor13 @mauzey1 who know that quantization and Zstandard were recently enabled in CMOR (PCMDI/cmor#751). WIP folks: If CMOR were to implement this CF convention on top of the quantization/Zstandard work already done, then CMOR could implement the CMIP7 quantization proposal that I presented and we discussed in January. Seems like an opportune time to further consider that proposal. This could save a lot of disk space in a transparent, CF-compliant way without sacrificing any scientifically meaningful data!
Metadata to encode lossy compression properties
Moderator: @davidhassell
Moderator Status Review [last updated: 2024-04-18]
- Submit PR #519
- Write "Technical Proposal Current 2024 Draft"
- Upload "Original Technical Proposal Options Discussed at 2022 CF Workshop"
Requirement Summary
Final proposal should make the lossy compression properties of data variables clear to interested users.
Technical Proposal Summary Current 2024 Draft
The current draft is Option 2 of the five discussed at the 2022 CF workshop (those five options are retained below for completeness). The framework of attributes and controlled vocabularies (CVs) to encode lossy compression is thought to be sufficiently general and flexible to handle a wide variety of lossy compression algorithms. However, the initial PR describes the attributes and CVs required for only the four forms of quantization: DigitRound, BitGroom, BitRound, and Granular BitRound. The latter three algorithms are also currently implemented in the netCDF library.
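As a rough sketch of what these algorithms have in common (my own summary, not PR text): for a single-precision IEEE-754 value, quantization rounds the significand at some bit position and then overwrites the trailing bits with a fixed pattern (zeros in the case of BitRound), creating long runs of identical bits that a subsequent lossless codec can compress well:

$$x = \pm\, 1.b_1 b_2 \ldots b_{23} \times 2^{e} \;\;\longmapsto\;\; \hat{x} = \pm\, 1.b_1 \ldots b_{\mathrm{NSB}} \underbrace{0 \cdots 0}_{23-\mathrm{NSB}} \times 2^{e}$$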
Six properties of lossy compression were agreed to be good metadata candidates based on their utility to help data users understand precisely how the data were modified, and the statistical differences expected and/or achieved between the raw and the lossily compressed data. Keeping with CF precedents, the CV values are case-insensitive with white space replaced by underscores. The properties and CVs supported in the initial PR are:

1. `family`: `quantize`
2. `algorithm`: depends on `family` above. Allowed values for `quantize` currently include `bitgroom`, `bitround`, `granular_bitround`.
3. `implementation`: This property contains free-form text that concisely conveys the algorithm provenance, including the name of the library or client that performed the quantization, the software version, and the name of the author(s) if deemed relevant.
4. Parameters used in the application of `algorithm` to each variable. The CV of `bitround` parameters is `NSB` or `number_of_significant_bits` or `keep_bits`. The CV of the `bitgroom` and `granular_bitround` algorithms is `NSD` or `number_of_significant_digits`.

Additionally, two categories of error metrics merit consideration for inclusion as optional attributes. These categories are referred to as "Prior metrics" and "Post metrics". Both categories quantify the rounding errors incurred by quantizing the raw data. However, neither is included in the initial PR. (A CDL sketch of how the full attribute set might be arranged follows the list.)

- `prior_metrics`: Depends on `algorithm`. CV for `bitround` is `maximum_relative_error` (computed as `2^(-NSB)`). CV for `bitgroom` and `granular_bitround` could be `maximum_absolute_error` (computed as formula TBD).
- `post_metrics`: These metrics can be computed for all algorithms after the lossy compression has occurred. CV could include `maximum_absolute_error`, `DSSIM`, `data_structural_similarity_image_metric`, `mean_absolute_error`, `snr`, `signal_to_noise_ratio`, `standard_deviation_of_difference`, `rmsd`, `root_mean_square_difference`.
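One way these attributes could be arranged under the container-variable scheme, in CDL. The placement of per-variable parameters, and the metric attribute shown (the optional prior metric that the initial PR omits), are my sketch, not text from the PR:

```
netcdf quantization_draft {
dimensions:
  time = 4 ;
variables:
  float ts(time) ;
    ts:lossy_compression = "compression_info" ;
    ts:NSB = 9 ;                                // parameter from the bitround CV
    ts:maximum_relative_error = 0.001953125f ;  // optional prior metric: 2^(-9)
  int compression_info ;
    compression_info:family = "quantize" ;
    compression_info:algorithm = "bitround" ;
    compression_info:implementation = "libnetcdf version 4.9.2" ;
}
```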
.Benefits
Users (and producers) would benefit from knowing whether, how, and by how much their data has been distorted by lossy compression. Controlled vocabularies, algorithm parameters, and metrics will make this possible. Stakeholders include data repositories, MIP organizers, and downstream users.
Status Quo
The netCDF 4.9.2 library applies a single library attribute of the form `_QuantizeBitGroomNumberOfSignificantDigits = 3`. This lacks some desirable properties, such as extensibility, a controlled vocabulary, and algorithm provenance. Other than that, there are currently no known metadata standards for lossy compression being employed for CF-compliant geophysical datasets. (A sketch of this status quo follows.)
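In CDL, the status quo header (as `ncdump -h` would show it; variable name illustrative) looks like:

```
variables:
  float ts(time) ;
    ts:_QuantizeBitGroomNumberOfSignificantDigits = 3 ;
```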
Associated pull request
#519
Detailed Proposal
See #519.
Technical Proposal Options Discussed at 2022 CF Workshop
There was a consensus among all who participated in the relevant discussions at the 2022 CF workshop that CF should standardize the metadata that describes the properties of lossy compression algorithms that have been applied to data variables. This presentation gave background information on the topic, and notes from the ensuing breakout session and hackathon are available here. This issue summarizes the points of agreement reached and presents the outcome of these discussions for feedback from the wider CF community.
The hackathon produced five candidate schemes (below) for encoding lossy compression properties. We encourage interested researchers to indicate their opinions on any schemes they think are especially promising or poisonous for CF to (modify and) adopt. Suggestions for other approaches are also welcome.
Six properties of lossy compression were agreed to be good metadata candidates based on their utility to help data users understand precisely how the data were modified, and the statistical differences expected and/or achieved between the raw and the lossily compressed data. The properties are enumerated in the numbered list below.
The acceptable values for some properties would best be set by controlled vocabularies (CVs). Keeping with CF precedents, the CV values could be case-insensitive with white space replaced by underscores. Candidate CVs for the lossy algorithms introduced in netCDF 4.9.X follow. These CVs would be expanded to support other families, algorithms, parameters, and metrics:
1. `family`: `quantize` or `unknown`
2. `algorithm`: depends on `family` above. Allowed values for `quantize` currently include `BitGroom`, `BitRound`, `granular_bitround`.
3. `implementation`: A CV might not fit the implementation properties well?
4. `parameters`: Depends on `algorithm`. CV for `bitround` is `NSB` or `number_of_significant_bits` or `keep_bits`. CV for `bitgroom` and `granular_bitround` is `NSD` or `number_of_significant_digits`.
5. `prior_metrics`: Depends on `algorithm`. CV for `bitround` is `maximum_relative_error` (computed as formula TBD). CV for `bitgroom` and `granular_bitround` could be `maximum_absolute_error` (computed as formula TBD).
6. `post_metrics`: Independent of `algorithm`. CV could include `DSSIM`, `data_structural_similarity_image_metric`, `mean_absolute_error`, `snr`, `signal_to_noise_ratio`, `standard_deviation_of_difference`, `rmsd`, `root_mean_square_difference`.
The same lossy compression algorithm is often applied with either minor or no changes to the input parameters across the set of target data variables. Hence the optimal convention for recording the lossy compression needs to allow for per-variable differences in lossy compression properties, while ensuring that compression metadata does not overwhelm or reduce the visibility of other metadata such as `standard_name`, `units`, `_FillValue`, etc. (One way a shared container variable could accommodate such per-variable differences is sketched below.)
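Under a container-variable scheme (one of the candidates; all names are illustrative), a single container could be shared while each data variable records its own parameter:

```
netcdf shared_container {
dimensions:
  time = 4 ;
variables:
  float ts(time) ;
    ts:standard_name = "air_temperature" ;
    ts:units = "K" ;
    ts:lossy_compression = "compression_info" ;
    ts:NSD = 3 ;    // per-variable parameter
  float pr(time) ;
    pr:standard_name = "precipitation_flux" ;
    pr:units = "kg m-2 s-1" ;
    pr:lossy_compression = "compression_info" ;
    pr:NSD = 5 ;    // same algorithm, different parameter
  int compression_info ;
    compression_info:family = "quantize" ;
    compression_info:algorithm = "granular_bitround" ;
    compression_info:implementation = "libnetcdf version 4.9.2" ;
}
```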
The five current candidate metadata schemes are: