
Add ability to derive fields from input datasets #29

Open · ealerskans opened this issue Oct 29, 2024 · 20 comments

@ealerskans (Contributor) commented Oct 29, 2024

One outstanding task is to implement the derivation of forcings from neural-lam here in mllam-data-prep.

I have had a look at the current forcing calculations from neural-lam:
https://github.com/mllam/neural-lam/blob/feature_calculate_forcings/create_forcings.py

Thanks to @observingClouds, I have attempted a first draft of the TOA radiation calculations in mllam-data-prep here:

So far I have only tested on the DANRA zarr files, and from this I have some observations:

  1. When calculating the TOA radiation we use lat, lon, and time, all of which are coordinates. Since coordinates are read eagerly rather than lazily in xarray, we cannot simply calculate the radiation lazily. And since the DANRA data covers 30 years, trying to do the calculations directly with xarray runs into memory issues, as it would try to allocate ~300 GB. Hauke's solution was therefore to create a new dataset using dask and chunking, so that the calculations can be done lazily. The implemented solution is just a first draft and includes some hard-coding. A few points to discuss/iterate over:
  • The chunking is currently hard-coded:
import dask.array as da
import xarray as xr

# Wrap the (eagerly loaded) coordinates in chunked dask arrays so that the
# TOA radiation can be computed lazily on a small helper dataset
ds_dict = {}
ds_dict["lat"] = (list(ds.lat.dims), da.from_array(ds.lat.values, chunks=(-1, -1)))
ds_dict["lon"] = (list(ds.lon.dims), da.from_array(ds.lon.values, chunks=(-1, -1)))
ds_dict["t"] = (list(ds.time.dims), da.from_array(ds.time.values, chunks=10))
ds_chunks = xr.Dataset(ds_dict)

and this is not optimal. Either we find a way to make this chunking more flexible (using the chunking key from the config file?), or there may be a smarter way to do this altogether. I have tested this approach for 1 year of DANRA data and it works with time chunks of 10. This is a comparison of the timings (based on the log output) when adding the TOA radiation as forcing compared to not adding it:

  • With TOA radiation: 1 min 23 s - 54 GB dataset (30 min 37 s on my own laptop)
  • Without TOA radiation: 1 min 13 s - 43 GB dataset (26 min 47 s on my own laptop)
  2. I have slightly modified config.py to include an additional attribute in the InputDataset class: derive_variables (default is False). This way I can add an extra section in the input config file for the forcings to be derived:
danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  variables:
    # calculate ToA radiation
    - toa_radiation
  derive_variables: True
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: "{var_name}"
  target_output_variable: forcing

This then requires the forcing variables to be specified in separate section(s).

  3. The quick implementation on my branch works like this:
  • Checks if the variable(s) should be derived or not using this new attribute
  • If they should, the necessary variables for calculating the forcings are extracted
  • The dataset is then subsetted
  • The forcing is calculated, where the function to use is chosen based on the variable name (see the sketch below)
    This is just a quick implementation to see that it works.
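A minimal sketch of that name-based dispatch (the function names and the mapping are only illustrative, not the actual implementation):

# map derived-variable names to the functions that compute them
DERIVED_VARIABLE_FUNCTIONS = {
    "toa_radiation": calc_toa_radiation,
}

def derive_variable(ds_subset, var_name):
    # pick the derivation function based on the variable name
    try:
        fn = DERIVED_VARIABLE_FUNCTIONS[var_name]
    except KeyError:
        raise NotImplementedError(f"No derivation implemented for `{var_name}`")
    return fn(ds_subset)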

I have not looked at the other forcings from create_forcings.py so far but those and perhaps others would also need to be implemented.

An aspect that I think has been discussed previously is whether the forcings should be calculated as part of mllam-data-prep or in a separate package, which mllam-data-prep could then call. I would like to start a discussion here about what the best approach would be (the implementation I have is just a proof of concept) and to get input from you: @joeloskarsson @sadamov @leifdenby @khintz

@observingClouds (Contributor) commented Oct 29, 2024

Hi @ealerskans,
Thanks for putting this together. The TOA forcing derived from coordinates is a really tough first derived product. If we manage to generalize from this, the rest is probably straightforward.

With regards to the generalization:

  • Definition of derived inputs
    We should try to define the forcing, e.g. toa_radiation, and its dependencies in the config file, so that we do not need to hard-code this into the functions. One way might be:
danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  variables:
    # calculate ToA radiation
    - toa_radiation:
        dependencies:
           - time
           - lat
           - lon
        method: calc_toa_radiation # reference function that takes variables as inputs
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: "{var_name}"
  target_output_variable: forcing

Checks if the variable(s) should be derived or not using this new attribute

This would not be necessary, as it becomes obvious from the presence of the dependencies keyword.

@observingClouds (Contributor) commented Oct 29, 2024

With regards to the chunking, we could introduce an optional chunking field also for the input. Currently chunking is only defined for the output, but for datasets that have bad chunking or need some extra care, defining input chunks can be an effective way to optimize the reading.

Instead of introducing a chunking parameter, I would even suggest adding an option that allows us to define arbitrary xarray kwargs, which would be used in

try:
    ds = xr.open_zarr(fp)
except ValueError:
    ds = xr.open_dataset(fp)
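For example (a sketch only; the open_dataset_kwargs key and the input_config dict are hypothetical names):

# pass any user-defined xarray kwargs (e.g. chunks) straight to the open call
open_kwargs = input_config.get("open_dataset_kwargs", {})
try:
    ds = xr.open_zarr(fp, **open_kwargs)
except ValueError:
    ds = xr.open_dataset(fp, **open_kwargs)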

@joeloskarsson (Contributor)

Great to see the progress on this! I agree that the TOA forcing is definitely the most complex, and it should be quite simple to extend to the other ones once this is working. There are some more complex static fields (e.g. the land-sea mask, implemented in https://github.com/mllam/neural-lam/blob/feature_calculate_forcings/create_forcings.py), but since that does not have a time dimension it is small and does not have to be implemented very efficiently.

I have no strong opinions regarding whether forcing pre-computation should sit here or in a separate package. I don't see much need for things to be spread over many packages if they are nicely separated into their own functions, no matter what package they are in. But if you find that structuring the code around this would be easier if it was its own package, then go for it. Just keep in mind that there are likely not that many different functions for forcing field and static field generation that have to be implemented.

@ealerskans (Contributor, Author) commented Oct 30, 2024

We should try to define the forcing, e.g. toa_radiation, and its dependencies in the config file, so that we do not need to hard-code this into the functions. One way might be:

danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  variables:
    # calculate ToA radiation
    - toa_radiation:
        dependencies:
           - time
           - lat
           - lon
        method: calc_toa_radiation # reference function that takes variables as inputs
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: "{var_name}"
  target_output_variable: forcing

@observingClouds Good idea! I agree. It will be much more flexible if the dependencies and the method for deriving the forcing variable are included in the config file. Then we would need to distinguish this type of input variable from the other two types of variables:

  • dict
    variables:
      u:
        altitude:
          values: [100,]
          units: m
    
  • list
    variables:
      - swavr0m
      - t2m
    

In the load_and_subset_dataset function, the dataset is loaded and subsetted differently depending on which of the two variable specifications above is given (i.e. whether it is a list or a dict). We would then also need to differentiate between variables defined as a list of strings (such as the swavr0m and t2m example above) and variables defined as a list containing a dictionary (like the derived toa_radiation variable). I also think that load_and_subset_dataset shouldn't be used to derive the forcings; that should instead happen somewhere in the create_dataset function, so we would also need a check there for whether a variable should be derived or not.
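A minimal sketch of how the two shapes could be told apart when parsing the variables list (names here are only illustrative):

# variables parsed from the yaml config: strings are "normal" variables,
# dicts describe variables to be derived
plain_variables = [entry for entry in variables if isinstance(entry, str)]
derived_variables = {
    name: spec
    for entry in variables
    if isinstance(entry, dict)
    for name, spec in entry.items()
}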

@observingClouds (Contributor)

Good point! I think we should change the config schema slightly and have the following options:

variables:
  u:
    coordinates:
      altitude:
        values: [100,]
        units: m
    dependencies:
      - time
      - lat
      - lon

where coordinates and dependencies are optional. If neither of those is given, then variables can simply be a list instead of a dict.

@ealerskans (Contributor, Author)

Hmm, that might actually be cleaner. What I have been playing around with (and which I think I have gotten to work) is to define the derived variables as a list containing a dictionary (as in your first example) and then differentiate between a list of strings ("normal" variables) and a list containing a dictionary (derived variables). But this perhaps seems a bit messy.

@ealerskans (Contributor, Author)

@observingClouds I thought a bit more about it, and what do we actually gain if we specify the dependencies for the derived variable like this:

variables:
  toa_radiation:
    dependencies:
      - time
      - lat
      - lon
    method: derive_toa_radiation

Initially I thought that this would be more flexible, since you could then choose the names of your variables, e.g. if your variable is named latitude instead of lat this would allow you to extract that one instead. But how would we then make sure that latitude is used in the function instead of lat? We would still need to refer to the variables in the function doing the calculation, unless we add some sort of mapping from latitude to lat. How would we go about generalizing this then? I think there are still some things to think over, including how far we can generalize the approach.

@observingClouds (Contributor)

Good point @ealerskans. derive_toa_radiation or any other method could have just positional arguments so that the order of the dependencies matches that of the function inputs. Or we make use of the dim_mapping section:

dim_mapping:
    time:
      method: rename
      dim: time
    latitude:
      method: rename
      dim: lat

@observingClouds (Contributor)

I'm curious to hear @leifdenby's thoughts, as he had a different solution in mind.

@leifdenby (Member)

Sorry for taking so long to reply here. I will try to make my thoughts brief:

  1. Awesome work @ealerskans and @observingClouds! It looks like you have already made it over most of the technical hurdles. I think your working implementation is great. I will just suggest some minor adjustments.
  2. chunking
    • I would as a start say we apply (i.e. do .chunk(...) on the input fields) the chunking defined for the output in the config, for the dimensions that coincide between the field being derived and the output variables. This makes sense to me because we are creating output variables here and the output chunking defines how we want the output to be stored.
    • I would raise a warning if after applying chunking the chunksize is larger than say order 1GB. We won't currently implement anything to mitigate this, but we could allow for chunksize definitions for the inputs in the config if this turns out to be an issue.
  3. Config file format: currently you have the following (I think; I am not quite sure how the suggestion from @observingClouds should fit, maybe we should discuss on a whiteboard):
danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  variables:
    # calculate ToA radiation
    - toa_radiation:
        dependencies:
           - time
           - lat
           - lon
        method: calc_toa_radiation # reference function that takes variables as inputs
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: "{var_name}"
  target_output_variable: forcing
  • config section: I had originally thought it would be nicer to introduce a derived section next to the input and output sections, just to make the separation clearer. But on reflection it seems simpler now to tie the derived variables to a single input dataset, and so this definition should then sit with a single input. However, this does mean that we won't be able to derive fields from two different inputs (let's say pressure and temperature are in two different files, then we won't be able to compute potential temperature for example).
  • Methods to use: I would still like to keep these derived variables separate, so how about instead of adding derived variables to the variables section of an input we introduce a derived_variables section? That removes the ambiguity about whether a variable should be taken from the input or not.
  • Function/method reference: I like the idea of naming the function used explicitly; if we could use the full namespace (e.g. module.submodule.function_name) then people could also use functions from external libraries. Maybe we should call this function rather than method, since "method" in Python typically refers to functions on a class (as in methods-of-an-object)?
  • Function arguments: rather than calling this dependencies I would use args and kwargs, as these are quite standard names in Python. With the latter we could allow people to rename arguments. I would suggest args should be a list of strings, kwargs would be Dict[str, str], and any argument string (either from args or a key of the kwargs dict) would be used to index into the input dataset as ds_input[key]. This would make it possible to get both variables and coordinates in an identical manner. We could add lat and lon as special derived coordinates that we will always ensure are present on ds_input (when Make mllam-data-prep aware of geographical coordinates (by introducing projection info) #33 is implemented), and mention in the README that these will be the latitude and longitude and will always be available.
  • Units and long-name: Downstream applications need to know the units of fields and their descriptions. I think we should add config attributes that allow the user to set these (just an optional attrs in the config, matching ds.attrs in xarray, would suffice). I would suggest we check whether the function creating the derived field returns an xr.DataArray. If it does, we check whether the units and long_name attributes are set. If this isn't the case we raise an exception asking the user to either explicitly set these in the config or change the function definition. We could also simply enforce that the function returns an xr.DataArray with the attributes set; that would be simpler to do for now.

So with all that the above config would become:

danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  derived_variables:
    # calculate ToA radiation
    - toa_radiation:
        method: mllam_data_prep.derived_variables.calculate_toa_radiation
        kwargs:
          time: time
          lat: lat
          lon: lon
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: "{var_name}"
  target_output_variable: forcing
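
On the function reference point above, a minimal sketch of resolving such a full-namespace string (the helper name is only illustrative):

import importlib

def get_derived_variable_function(function_namespace):
    # e.g. "mllam_data_prep.derived_variables.calculate_toa_radiation"
    module_name, _, function_name = function_namespace.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, function_name)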

@ealerskans (Contributor, Author)

@leifdenby thank you for the comments and the very nice suggestions, and also to you and @observingClouds for the discussion just now! I will try to summarize here what we discussed and the proposed solution, and then you can correct me if I misunderstood something or add anything I have missed.

  1. chunking:
    • For now we will apply the chunking (i.e. do .chunk(...) on the input fields) using the chunking defined for the output in the config file, for the dimensions that coincide between the field being derived and the output variables. This means that for e.g. toa_radiation, which depends on lat, lon and time, we would apply chunking along the time dimension only, and nothing for lat and lon as these are not dimensions of the output data.
    • After having done the calculation we will raise a warning if the chunksize is larger than e.g. 1 GB.
  2. config file format: The structure of the config file will look like what @leifdenby proposed:
danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  derived_variables:
    # calculate ToA radiation
    - toa_radiation:
        function: mllam_data_prep.derived_variables.calculate_toa_radiation
        kwargs:
          time: time
          lat: lat
          lon: lon
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: "{var_name}"
  target_output_variable: forcing
  • This means that to derive a variable, the input variables all need to come from the same input dataset and cannot come from different ones.
  • By defining a new section called derived_variables we can separate derived variables (which would go under the derived_variables section) from non-derived variables (which would go under the variables section) and have both in the same input data section.
  • function will be used for specifying which function to use to calculate the derived variable. By using the full namespace, people would be able to use functions from external libraries.
  • The function arguments will be called kwargs and will be a dictionary (Dict[str, str]), where the key will be used to subset/select variables from the input dataset and the value will be used for renaming.
  • To get lat and lon when we need them as arguments to the derived variables, or as variables to be included in the output dataset, we should call a function that can be used to get the lat/lon values (which either calculates them based on the projection info or selects them from the dataset), as proposed in Make mllam-data-prep aware of geographical coordinates (by introducing projection info) #33.
  • Add an optional attrs in the config for derived_variables where users can define e.g. units and long_name. Then we would check in the function that calculates the derived variables if the calculated variable is a data-array, and if it is we would then add these attributes.

I imagine then that this is how the workflow would be:

  1. In create_dataset.py, we open our input dataset, ds_input
  2. We do the subsetting/variables selection
    1. Check if we have derived_variables or if we have variables and subset differently depending on which we have
    2. Apply the chunking specified for the outputs to the variables, for dimensions that coincide between the derived variable and the output variables (e.g. the time dimension).
  3. Derive variables
    1. When calling the function used to calculate the variable, we want to pass only the variables needed. This way the function will be able to work for both scalars and data-arrays. We make use of the kwargs and function entries from the config when doing this.
    2. In the function where we calculate the derived variable we add a check to see if it is an xarray data-array or not. If it is, we add the optional attributes (if defined).
  4. After deriving a variable we check the chunksize of the derived variable, and if it is larger than e.g. 1 GB then we raise a warning to say that the chunksize is big and that the creation of the output zarr dataset might fail.
  5. We carry on with the rest of the flow in create_dataset.py such as mapping and stacking.

Hopefully this makes sense and is at least approximately what we discussed :)

Now I actually have some additional questions.

  1. For arguments using kwargs, this would then mean that the keys would be used to select the variables from the input dataset and the values would need to correspond to the arguments of the function used to calculate the derived variable, or?
  2. Would the way that we define the dependencies through kwargs for the derived variables work if we have a variable on a specific height level, e.g. u at 100 m? Is this something we want to take into consideration as well?
  3. When doing the chunking on the input data, I guess it should only be for the derived variables and not the other, non-derived variables?

@leifdenby changed the title from "Deriving forcings" to "Add ability to derive fields from input datasets" on Nov 12, 2024
@observingClouds (Contributor)

For arguments using kwargs, this would then mean that the keys would be used to select the variables from the input dataset and the values would need to correspond to the arguments of the function used to calculate the derived variable, or?

I would do it the other way around, such that you can do my_func_to_create_derived_variable(**kwargs), where kwargs is taken directly from the config file (though you need to actually replace the values with the actual variables from the input dataset).
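
In other words, something like this (a sketch; config_kwargs stands for the kwargs entry of the derived variable in the config):

# keys in the config are the function's argument names; the values name the
# variables/coordinates to pull from the input dataset
kwargs = {
    arg_name: ds_input[var_name] for arg_name, var_name in config_kwargs.items()
}
da_field = my_func_to_create_derived_variable(**kwargs)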

@leifdenby (Member)

  • Add an optional attrs in the config for derived_variables where users can define e.g. units and long_name. Then we would check in the function that calculates the derived variables if the calculated variable is a data-array, and if it is we would then add these attributes

I would change this to instead do two things:

  1. Internally, in any function implemented in mllam_data_prep.derived_variables.{function_name} that computes the derived field, we check if the variable resulting from the calculation is an xr.DataArray object (this will happen if any of the inputs to the function are xr.DataArrays). If this is the case, the function should set the units and long_name attrs. Checking that the resulting variable is an xr.DataArray ensures that the function still executes correctly when applied to just scalar values too.
  2. We check that the result is an xr.DataArray and that both units and long_name are set. If these attributes aren't set (say because the user is calling a function external to mllam_data_prep, or because they haven't added setting these attributes when creating their new function) then we require that the attrs have been defined in the config. If both the resulting field and the config have the attrs set, we raise a warning along the lines of "The units of the derived field {field_name} is being overwritten from {old_units} to {new_units}".
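
A minimal sketch of point 1, i.e. what such a function could look like internally (the formula itself is elided here and the _toa_radiation_formula helper is hypothetical):

import xarray as xr

def calculate_toa_radiation(lat, lon, time):
    toa_radiation = _toa_radiation_formula(lat, lon, time)  # hypothetical helper
    if isinstance(toa_radiation, xr.DataArray):
        # only set attrs when working with xarray objects, so that the function
        # also works when called with plain scalar values
        toa_radiation.attrs["units"] = "W m-2"
        toa_radiation.attrs["long_name"] = "TOA incoming radiation"
    return toa_radiation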
  1. Check if we have derived_variables or if we have variables and subset differently depending on which we have

I would simply iterate over the derived_variables and construct a ds_input_subset dataset for each variable to derive. Something like:

required_variables = list(derived_variable.kwargs.values())

# lat/lon are handled separately since they may come from the projection
# info (#33) rather than directly from the input dataset
latlon_coords_to_include = [
    c for c in ["lat", "lon"] if c in required_variables
]

kwargs = {}

if len(latlon_coords_to_include):
    latlon = get_latlon_coords_for_input(ds_input)  # implemented elsewhere, #33

    for c in latlon_coords_to_include:
        kwargs[c] = latlon[c]
        required_variables.remove(c)

kwargs.update({v: ds_input[v] for v in required_variables})

da_field = derived_field_method(**kwargs)

@observingClouds (Contributor)

When doing the chunking on the input data, I guess it should only be for the derived variables and not the other, non-derived variables?

Good question. If the output chunksize is larger than the input chunksize, I think performance can be negatively affected, but for now I would go with chunking all input data with the defined output chunks.
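
A minimal sketch of that, assuming output_chunking is the chunking dict from the output section of the config:

# apply the output chunking to the input dataset, but only for the
# dimensions the two actually share
chunks = {dim: size for dim, size in output_chunking.items() if dim in ds_input.dims}
ds_input = ds_input.chunk(chunks)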

@observingClouds (Contributor) commented Nov 12, 2024

Would the way that we define the dependencies through kwargs for the derived variables work if we have a variable on a specific height level, e.g. u at 100 m? Is this something we want to take into consideration as well?

Because we do the calculation of the derived variables lazily, the calculation would not be affected by a selection of the input variables and we could even do a selection on a derived variable.

So maybe after the variable has been derived, the derived_variables and variables sections should be internally combined, so that one can also do a selection on e.g. altitude levels of a derived variable.

We might want to match the derived_variables section with the variables section and e.g. not have a list but a dict:

inputs:
  danra_height_levels:
...
    variables:
      u:
        altitude:
          values: [100,]
          units: m
    derived_variables:
      toa_radiation:  # note the missing hyphen
        function: mllam_data_prep.derived_variables.calculate_toa_radiation
        kwargs:
          time: time
          lat: lat
          lon: lon

This would allow the following syntax:

all_vars = config['inputs']['danra_height_levels']['variables'] | config['inputs']['danra_height_levels']['derived_variables']

@leifdenby (Member)

Because we do the calculation of the derived variables lazily, the calculation would not be affected by a selection of the input variables and we could even do a selection on a derived variable.

We could in theory use one derived variable as input for another, but that would mean the order in which derived variables are added is important. As of Python 3.6 dicts preserve insertion order (https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-compactdict), so the order in the parsed yaml should match the order in the source yaml file. It seems a little extra complicated to me, but it would mean that you would start by creating ds_input_subset from ds_input (with only the selected variables) and then add the derived variables one by one to ds_input_subset. Maybe the code should raise an exception if adding a derived field would ever result in overwriting a variable already there? This would avoid collisions between variables given in both the variables and derived_variables subsections.
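
A minimal sketch of such a guard (variable names are only illustrative):

# refuse to overwrite anything already present, whether it came from the
# `variables` section or from an earlier derived variable
if var_name in ds_input_subset:
    raise ValueError(
        f"Derived variable `{var_name}` would overwrite an existing variable "
        "in the subset dataset"
    )
ds_input_subset[var_name] = da_field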

We might want to match the derived_variables section with the variables section and e.g. not have a list but a dict:

Yes! Sorry I missed this. Yes, this should be a dict just like variables in the config.

@leifdenby (Member) commented Nov 12, 2024

2. Would the way that we define the dependencies through kwargs for the derived variables work if we have a variable on a specific height level, e.g. u at 100 m? Is this something we want to take into consideration as well?

Hmm... good question. Either we a) allow the user to derive fields based on any of the variables in the loaded input dataset, in which case the derived variables should be derived before doing the subsetting, or b) we only allow deriving variables from the selected subsets. I think maybe option a) is best here, but that would mean we should avoid the possibility of deriving one variable from another derived variable. The code would start getting very complicated otherwise. As in: I advocate that we only allow deriving variables from the existing variables in the loaded input dataset.

@ealerskans (Contributor, Author)

I would simply iterate over the derived_variables and construct a ds_input_subset dataset for each variable to derive. Something like:

required_variables = list(derived_variable.kwargs.values())

# lat/lon are handled separately since they may come from the projection
# info (#33) rather than directly from the input dataset
latlon_coords_to_include = [
    c for c in ["lat", "lon"] if c in required_variables
]

kwargs = {}

if len(latlon_coords_to_include):
    latlon = get_latlon_coords_for_input(ds_input)  # implemented elsewhere, #33

    for c in latlon_coords_to_include:
        kwargs[c] = latlon[c]
        required_variables.remove(c)

kwargs.update({v: ds_input[v] for v in required_variables})

da_field = derived_field_method(**kwargs)

I see. Then if we want to create a ds_input_subset dataset for each variable to derive, load_and_subset_dataset would have to be modified to create a subset for each derived variable and also do the derivation there, instead of outside in create_dataset as I have done up until now. Either that, or a new function would be needed to load, subset and create the derived variables separately from the non-derived variables. To me it would make more sense to handle the derived variables separately from the loading and subsetting of the non-derived variables, but I would like to hear other opinions as well.

Either way, if we choose to have both derived_variables and variables in the config, then they both need to be made optional, with the requirement that at least one of them is included for each named dataset.
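
A minimal sketch of that validation, using a plain dataclass for illustration (the actual config classes may look different):

from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union

@dataclass
class InputDataset:
    path: str
    dims: List[str]
    variables: Optional[Union[List[str], Dict[str, Any]]] = None
    derived_variables: Optional[Dict[str, Any]] = None

    def __post_init__(self):
        # both sections are optional, but at least one of them must be given
        if self.variables is None and self.derived_variables is None:
            raise ValueError(
                "At least one of `variables` and `derived_variables` must be "
                "specified for each input dataset"
            )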

@ealerskans (Contributor, Author)

Hmm... good question. Either we a) allow the user to derive fields based on any of the variables in the loaded input dataset, in which case the derived variables should be derived before doing the subsetting, or b) we only allow deriving variables from the selected subsets. I think maybe option a) is best here, but that would mean we should avoid the possibility of deriving one variable from another derived variable. The code would start getting very complicated otherwise. As in: I advocate that we only allow deriving variables from the existing variables in the loaded input dataset.

I am not really sure I follow what the different options mean, except that option a) means that we cannot derive a variable from another derived variable. When you say that the "derived variables should be derived before doing the subsetting", what do you mean then? Which subsetting? Currently we are selecting the relevant variables (subsetting?) before calculating the derived variables, but I don't think that is what you are referring to?

@joeloskarsson (Contributor)

I'm working on removing the forcing computation from the neural-lam roadmap, as that should go in here. To not lose the list of links associated with that I'll put them here. Just for reference and to have them collected somewhere:
