Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging attributes of different cubes before concatenation/merging #5050

Open
schlunma opened this issue Nov 4, 2022 · 9 comments
Open

Merging attributes of different cubes before concatenation/merging #5050

schlunma opened this issue Nov 4, 2022 · 9 comments

Comments

@schlunma
Copy link
Contributor

schlunma commented Nov 4, 2022

✨ Feature Request

Cube concatenation and merging fails if the cubes contain different attributes. A possible way out of this is iris.util.equalise_attributes, which simply deletes attributes that are not identical across cubes.

An alternative approach could be to "merge" attributes that differ, e.g., combine them in a single string with a given delimiter, e.g, " ".

Example

Merge the attributes exp: historical and exp: ssp585 to exp: historical ssp585.

I can image three possible locations for code like this:

  • A separate function in the util module, e.g, iris.util.merge_attributes.
  • As part of the concatenate/merge code.
  • As part of iris.common.resolve.

Is this something that might be interesting for Iris? If yes, I could try to tackle this.

Motivation

Using the existing function iris.util.equalise_attributes is not an optimal solution if one wants to keep attributes.

Additional context

A possible implementation is given in the ESMValTool repository here.

@pp-mo
Copy link
Member

pp-mo commented Nov 9, 2022

From offline "peloton" discussions ...

@bjlittle
we could introduce a callback that is specific to merge or concatenate, which allows users to have some fine grained controls in the process

@pp-mo
I don't see the motivation for this... Regarding Merging attributes of different cubes before concatenation/merging #5050, I don't think a callback is a particularly sensible approach, since it's not about controlling merge/concat, just the handling of attributes, which AFAICT can happily be done in advance ?
So, I think we really need callbacks in loading, because there are no data objects before the load which you can inspect or modify, but we need to specify changes that affect the loading process itself. Whereas, for merge/concat, we can just pre-adjust the cubes going in.
Or am I missing something here ?

@bjlittle
The difference is that the callback on load only sees one cube. There is no sense of controlling how metadata is combined (no matter how odd what the user wants to do)... I was just floating the idea of opening up the box of what merge/concatenate does to allow for user specialisation

@bjlittle
(merge and concatenate are impenetrable black boxes of magic, as they stand)

@bsherratt
I could imagine a callback to merge more along the lines of
cubes.merge(some_attr=lambda val1, val2: f"{val1} {val2}")
 ie just telling it exactly what you want for specific attributes
@bsherratt
and if there are still unresolvable differences then it would error/not merge as usual

@bjlittle
yup, something along those lines

@pp-mo re: "callback on load only sees one cube..."
Sorry, I was rather forgetting that the principal use of merge is on loading ...

@bjlittle
Copy link
Member

bjlittle commented Nov 9, 2022

@schlunma Thanks for raising this issue 👍

In general, merge and concatenate are black boxes of magic that the user can't influence. I've often wanted to relax that contract in a flexible and meaningful way so that the user can control some of the decisions that are currently hardwired.

Just to get people thinking about this in a general way... for example, what if we were to expose a callback-like capability to merge and concatenate which would allow the user to control certain behaviours, such has how pain-points like Cube.attributes are combined?

Clearly, needs a lot of thought, but to me, without thinking about this too deeply, seems attractive in principle.

The mechanics of how that's implemented is a separate concern, but at the moment I'm more interested to know whether such a capability would have genuine user benefit.

Just a thought...

@pp-mo
Copy link
Member

pp-mo commented Nov 9, 2022

Re: @bjlittle "The difference is that the callback on load only sees one cube"
and @bsherratt cubes.merge(some_attr=lambda val1, val2: f"{val1} {val2}")

That's nice, but the proposed form still only deals with cubes 2 at a time.
I'm not too sure what it looks like if we generalise that, but I guess we can provide a
"function : *cube_metadata --> result_metadata"
?

I certainly do agree that it makes sense to apply an operation to a specific attribute, as suggested -- ? or a set of them, somehow?. But anyway control it somehow : Applying it regardless to all attributes will be liable to mess up anything with controlled meaning (i.e. not CF terms, which Iris excludes from user attributes, but terms with meaning according to "some other conventions", like for a data repository)

One thing that also bothers me is that with multiple cubes (>=3), it's not too clear whether for general purposes we would always want to form a list of content options, or just make an entry for each distinct value (i.e. a "set" approach),
e.g. cube::calc_method = "standard; alt1; altx" or cube::calc_method = "alt1; alt1; alt1; standard; alt1; altx" .
NOTE: the linked implementation in ESMValTool is using the "set" approach.
This is obviously about specifics of this particular operation. But it may illuminate how a general facility needs to be designed.

@schlunma
Copy link
Contributor Author

schlunma commented Nov 10, 2022

Thanks all for having a look at this!

3 points from our (ESMValTool) perspective:

  1. From our perspective a simple function iris.util.merge_attributes would be enough, but of course a callback would also be great! Though I guess to keep it as general as possible the call signature should be something like callback(cubes) -> cubes.
  2. We are actually using concatenate directly, not within a load call. Thus, for us, a callback within concatenate would be totally fine, but I guess it would also be a good idea to allow setting this callback in the load functions.
  3. Regarding the set vs. list approach - we just opted to the set approach out of convenience, but would also be 100% happy with the list approach as long as all attributes for all cubes are equal.

@larsbarring
Copy link
Contributor

I had -- and still have -- a use case very similar to the what @schlunma describes (and that is why equalize_attributes now at least return the deleted attributes instead of just deleting them). But the question is in what order concatenate and merge does their magic on the set of individual, see here for a use case example.

Copy link
Contributor

github-actions bot commented Dec 1, 2024

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days time.

@github-actions github-actions bot added the Stale A stale issue/pull-request label Dec 1, 2024
@pp-mo
Copy link
Member

pp-mo commented Dec 2, 2024

Re-raising this is timely, as we're hoping to address some related questions under the banner of #5375, when we concentrate on some 'dragon tasks' over the next few weeks.

However I see that the needs here go beyond the use of 'equalise_attributes' or similar, in that it seems you may need an operation which will focus on some specific identified attributes.
I think that would probably still fit into the plans to provide a more general-purpose configurable "equalise" operations, as described here (and related comments)

What is the status of the outstanding need here @schlunma @larsbarring ?
Can you see whether/how it might fit into the proposed scheme ?

@larsbarring
Copy link
Contributor

larsbarring commented Dec 2, 2024

I believe our needs have been solved by several "pre-processor" tools that harmonise the data before going into the main analysis tools. A key part of the pre-processors is the take care of the output from the iris.util.equalise_attributes as part of the general handling of global attributes. For our purposes, where the data cube is spread across several time-slice files, the order of the individual files' attribute is straightforward in relation to the concatenation order. This is not necessarily the true in more general 2-d or even n-d cases.

But with respect coordinate attributes, that often contain floating point data, I still think that something else than exact equality is required.

ping @nCarolina

@github-actions github-actions bot removed the Stale A stale issue/pull-request label Dec 3, 2024
@schlunma
Copy link
Contributor Author

schlunma commented Dec 3, 2024

We have a custom solution in ESMValCore that basically does what I described in this PR description. Since we are quite pleased with this, I am happy to close this issue here in favor of #5375. If there was a similar feature within Iris, we would use it, but this is really not critical for us at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Status: No status
Development

No branches or pull requests

4 participants