# Add ability to derive fields from input datasets #29
Hi @ealerskans,

With regards to the generalization: it would not be necessary, as it becomes obvious from the presence of the …

With regards to the chunking, we could introduce an optional `chunking` field also for the input. Currently I would even say that instead of introducing a … (see `mllam_data_prep/ops/loading.py`, lines 20 to 23 at 0458840).
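For illustration, a minimal sketch of what such an optional input-level `chunking` field could look like at load time (the function name and config key here are assumptions for the sketch, not the current API):

```python
import xarray as xr


def load_input_dataset(fp, chunking=None):
    # hypothetical: apply an optional per-input chunking from the config,
    # e.g. {"time": 10}; otherwise keep the on-disk (zarr) chunks
    ds = xr.open_zarr(fp)
    if chunking is not None:
        ds = ds.chunk(chunking)
    return ds
```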
Great seeing the progress on this! I agree that the TOA forcing is definitely the most complex, and it should be quite simple to extend to the other ones once this is working. There are some more complex static fields (e.g. the land-sea mask, implemented in https://github.com/mllam/neural-lam/blob/feature_calculate_forcings/create_forcings.py), but since those do not have a time dimension they are small and do not have to be very efficiently implemented. I have no strong opinions on whether forcing pre-computation should sit here or in a separate package. I don't see much need for things to be spread over many packages if they are nicely separated into their own functions, no matter what package they are in. But if you find that structuring the code would be easier if it was its own package, then go for it. Just keep in mind that there are likely not that many different functions for forcing field and static field generation that have to be implemented.
@observingClouds Good idea! I agree. It will be much more flexible if the dependencies and the method for deriving the forcing variable are included in the config file. Then we would need to distinguish this type of input variable from the other two types of variables: …

In the …
Good point! I think we should change the config schema slightly and have the following options: …

where `coordinates` and `dependencies` are optional. If none of those are given, then `variables` will be a plain list of strings.
Hmm, that might actually be cleaner. What I have been playing around with (and which I think I have gotten to work) is to define the derived variables as a list of dictionaries (as in your first example), and then I differentiate between whether an entry is a string ("normal" variables) or a dictionary (derived variables). But this seems perhaps a bit messy.
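For reference, a minimal sketch of that differentiation (hypothetical parsing code, not the current implementation):

```python
# split the config's `variables` list into "normal" and derived variables
plain_vars, derived_vars = [], {}
for entry in config_input["variables"]:
    if isinstance(entry, str):
        plain_vars.append(entry)    # e.g. "u10m"
    else:  # a single-key dict describing a derived variable
        derived_vars.update(entry)  # e.g. {"toa_radiation": {...}}
```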
@observingClouds so I thought a bit more about it, and what do we actually gain if we specify the dependencies for the derived variable like this: …

Initially I thought that this would be more flexible, since then you can choose the names of your variables, e.g. if your variable is named …
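To make the flexibility argument concrete, a hypothetical example of what such a mapping buys you when the dataset's variable names differ from the function's argument names (the `latitude`/`longitude` names are invented for illustration):

```python
# the config's kwargs map function arguments to dataset variable names,
# e.g. for a dataset storing "latitude"/"longitude" instead of "lat"/"lon"
kwargs_spec = {"time": "time", "lat": "latitude", "lon": "longitude"}
call_kwargs = {arg: ds_input[var] for arg, var in kwargs_spec.items()}
da_toa = calculate_toa_radiation(**call_kwargs)
```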
Good point @ealerskans.
I'm curious to hear @leifdenby's thoughts, as he had a different solution in mind.
Sorry for taking so long to reply here. I will try to make my thoughts brief:
```yaml
danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  variables:
    # calculate ToA radiation
    - toa_radiation:
        dependencies:
          - time
          - lat
          - lon
        method: calc_toa_radiation  # reference function that takes variables as inputs
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: f"{var_name}"
  target_output_variable: forcing
```
So with all that, the above config would become:

```yaml
danra_additional_forcings:
  path: /dmidata/projects/cloudphysics/danra/data/v0.4.0/single_levels.zarr
  dims: [time, x, y]
  derived_variables:
    # calculate ToA radiation
    - toa_radiation:
        method: mllam_data_prep.derived_variables.calculate_toa_radiation
        kwargs:
          time: time
          lat: lat
          lon: lon
  dim_mapping:
    time:
      method: rename
      dim: time
    grid_index:
      method: stack
      dims: [x, y]
    forcing_feature:
      method: stack_variables_by_var_name
      name_format: f"{var_name}"
  target_output_variable: forcing
```
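The fully-qualified `method` string above could then be resolved to a callable with something like the following (a sketch, not necessarily how it ends up being implemented):

```python
import importlib


def resolve_method(fqn):
    # split "package.module.function" into module path and function name
    module_name, _, func_name = fqn.rpartition(".")
    return getattr(importlib.import_module(module_name), func_name)


calc = resolve_method("mllam_data_prep.derived_variables.calculate_toa_radiation")
```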
@leifdenby thank you for the comments and very nice suggestions, and also to you and @observingClouds for the discussion just now! I will try and summarize here what we discussed and the proposed solution and then you can correct me if I misunderstood something or add if I have missed anything.
I imagine then that this is how the workflow would be: …
Hopefully this makes sense and is at least approximately what we discussed :) Now I actually have some additional questions.
I would do it the other way around, such that you can do …
I would instead change this to do two things: …
I would simply iterate over the required variables:

```python
required_variables = dict(derived_variable.kwargs)  # function kwarg -> input variable name

# lat/lon are computed coordinates rather than variables stored in the input
# dataset, so pull them out of the mapping and handle them separately
latlon_coords_to_include = {
    c: required_variables.pop(c) for c in ["lat", "lon"] if c in required_variables
}

kwargs = {}
if latlon_coords_to_include:
    latlon = get_latlon_coords_for_input(ds_input)  # implemented elsewhere, #33
    for arg, coord in latlon_coords_to_include.items():
        kwargs[arg] = latlon[coord]
kwargs.update({arg: ds_input[var] for arg, var in required_variables.items()})
da_field = derived_field_method(**kwargs)
```
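(Popping `lat`/`lon` out of the mapping first means the remaining kwargs can be pulled straight from the input dataset, while the computed lat/lon coordinates come from `get_latlon_coords_for_input` in #33.)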
Good question. If the output chunksize is larger than the input chunksize, I think the performance can be negatively affected, but for now I would go with chunking all input data with the defined output chunks.
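A minimal sketch of that approach, assuming the output chunk sizes are available from the config (the exact config keys here are an assumption):

```python
# re-chunk the input with the output chunk sizes for the dims they share
output_chunks = config["output"]["chunking"]  # e.g. {"time": 10}
chunks = {dim: size for dim, size in output_chunks.items() if dim in ds_input.dims}
ds_input = ds_input.chunk(chunks)
```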
Because we do the calculation of the derived variables lazily, the calculation would not be affected by a selection of the input variables, and we could even do a selection on a derived variable. So maybe the selection could be applied after the variable has been derived. We might want to match the inputs:
```yaml
danra_height_levels:
  ...
  variables:
    u:
      altitude:
        values: [100,]
        units: m
  derived_variables:
    toa_radiation:  # note the missing hyphen
      function: mllam_data_prep.derived_variables.calculate_toa_radiation
      kwargs:
        time: time
        lat: lat
        lon: lon
```

This would allow the following syntax: `all_vars = config['inputs']['danra_height_levels']['variables'] | config['inputs']['danra_height_levels']['derived_variables']`
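As an aside on that syntax: the `|` dict-union operator requires Python 3.9+; on older versions the equivalent is `{**a, **b}`. A minimal sketch (the config dict here just mirrors the YAML above):

```python
inp = config["inputs"]["danra_height_levels"]
all_vars = inp["variables"] | inp["derived_variables"]  # Python 3.9+
# pre-3.9 equivalent:
# all_vars = {**inp["variables"], **inp["derived_variables"]}
```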
We could in theory use one derived variable as input for another, but that would mean the order in which derived variables are added is important. As of Python 3.6, dictionaries preserve insertion order, so the order in the config file would be respected.
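A minimal sketch of what relying on that ordering could look like (`derive_variable` is a hypothetical helper, not the package's API):

```python
# derive variables in config order; because dicts preserve insertion order,
# a later derived variable could in principle depend on an earlier one
for name, spec in config_input["derived_variables"].items():
    ds_input[name] = derive_variable(ds_input, spec)  # hypothetical helper
```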
Yes! Sorry I missed this. Yes, this should be a `dict`.
Hmm... good question. Either we a) allow the user to derive fields based on any of the variables in the loaded input dataset, in which case the derived variables should be derived before doing the subsetting, or b) we only allow deriving variables from the selected subsets. I think option a) is best here, but that would mean we should avoid the possibility of deriving one variable from another derived variable; the code would otherwise start getting very complicated. As in: I advocate that we only allow deriving variables from the existing variables in the loaded input dataset.
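A hypothetical sketch of the order of operations under option a) (the function names are illustrative, not the package's API):

```python
# option a): derive from the full loaded dataset first, subset afterwards
ds = load_input_dataset(path)                       # full input dataset
ds = derive_variables(ds, derived_specs)            # may use any loaded variable
ds = ds[selected_variables + list(derived_specs)]   # subsetting happens last
```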
I see. Then if we want to create a … Either way, if we are choosing to have both `variables` and `derived_variables` …
I am not really sure I follow what the different options mean, except that option a) means we cannot derive a variable from another derived variable. When you say that the "derived variables should be derived before doing the subsetting", what do you mean? Which subsetting? Currently we are selecting the relevant variables (subsetting?) before calculating the derived variables, but I don't think that is what you are referring to?
I'm working on removing the forcing computation from the neural-lam roadmap, as that should go in here. To not lose the list of links associated with that, I'll put them here, just for reference and to have them collected somewhere: …
One outstanding task is to implement the derivation of forcings from `neural-lam` here in `mllam-data-prep`. I have had a look at the current forcing calculations from `neural-lam`: https://github.com/mllam/neural-lam/blob/feature_calculate_forcings/create_forcings.py
All thanks to @observingClouds, I have attempted a first draft of the TOA radiation calculations in `mllam-data-prep` here: …

I have so far only tested it on the DANRA zarr files, and from this I have some observations:
The TOA radiation calculation depends on `lat`, `lon`, and `time`, all of which are coordinates. Since coordinates are read in eagerly and not lazily in `xarray`, we cannot just calculate the radiation lazily. However, since the DANRA data is 30 years of data, if we try to just do the calculations using `xarray` we run into memory issues, as it would try to allocate ~300 GB. Therefore, Hauke's solution was to create a new dataset using `dask` and chunking, and in that way we could do the calculations (a rough sketch of this idea follows the list below). The implemented solution is just a first draft and there is some hard-coding included as well. A few points to discuss/iterate over would be:
- The chunking is currently hard-coded, and this is not optimal. Either we find a way to make this chunking more flexible (using the `chunking` key from the config file?) or we find a smarter way to do this. I have tested this approach for 1 year of DANRA data and it works with time chunks of 10. This is a comparison of the timings (based on the log output) when adding the TOA radiation as forcing compared to not adding it: …
- I have extended `config.py` to include an additional attribute on the `InputDataset` class: `derive_variables` (default is `False`). This way I can add an extra section in the input config file for the forcings to be derived. This then requires the forcing variables to be specified in separate section(s). This is just a quick implementation to see that it works.
- I have not looked at the other forcings from `create_forcings.py` so far, but those and perhaps others would also need to be implemented.
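A rough sketch of the chunked approach mentioned above (the variable names and the exact construction are assumptions for illustration; `calculate_toa_radiation` is the draft function discussed in this issue):

```python
import xarray as xr

# coordinates load eagerly in xarray, so re-wrap them as chunked,
# dask-backed DataArrays before computing the TOA radiation lazily
time = xr.DataArray(ds.time.values, dims=["time"]).chunk({"time": 10})
lat = xr.DataArray(ds.lat.values, dims=ds.lat.dims).chunk()
lon = xr.DataArray(ds.lon.values, dims=ds.lon.dims).chunk()

da_toa = calculate_toa_radiation(time=time, lat=lat, lon=lon)  # stays lazy
da_toa = da_toa.compute()  # or write to zarr to evaluate chunk by chunk
```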
An aspect that I think has been discussed previously is whether the forcings should be calculated as part of `mllam-data-prep` or whether they should be part of a separate package, which `mllam-data-prep` could then call. I would like to start a discussion here about what would be the best approach (the implementation I have is just to see if it works) and to get input from you guys - @joeloskarsson @sadamov @leifdenby @khintz