-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make mllam-data-prep
aware of geographical coordinates (by introducing projection info)
#33
Comments
This sounds to me like a good and useful proposal! This functionality will also be of interest for the derived variable section (#29) which might rely on something like |
My first thought would probably be to just:
But if there is interest to build this functionality into mdp then I think that is great! If the most common situation is that you have your LAM data sitting without lat-lons then you would anyhow have to do this conversion, so if mdp can do it for you that is really nice. One thing to keep in mind is that neural-lam will eventually need this projection information explicitly in the config: https://github.com/mllam/neural-lam/blob/7112013f24ad36d8d9d19b4b5f853b11a2bbebf4/neural_lam/data_config.yaml#L59-L64 (or something similar after mllam/neural-lam#66 is merged). So in that sense the user never gets away from specifying the projection, even in the case that it is correctly stored in the input dataset. But this seems a bit unnecessary, so maybe a better option would be to have mdp always store the projection information in the output zarr and let neural-lam just use that? (more a question for neural-lam really, but becomes relevant for this) |
@observingClouds just adding a few links here that I found it looking more into how to best define projection information. The key thing I learnt is that links:
This makes me feel like that instead of using I would suggest we then add a convenience function so that if someone has a projection as a |
From a CF-convention standpoint:
From pyproj view:
Preliminary conclusions:
|
Based on brainstorming with @SimonKamuk the introduction of projection information raises a few sub-questions:
A bit of background information to answer these questions:
Implementation questions/suggestions:
NoteAs one can see, adding cf-compliant projection information is not as straightforward as maybe initially thought. Nevertheless, thanks to the great work done in the external packages an implementation seems doable. It should be noted though that the libraries and specifications are still far from perfect and things will fail. I created a small repository to test the different packages. My observations are:
|
@SimonKamuk feel free to add/edit in case I missed something |
Thanks for this @observingClouds and @SimonKamuk! Really great overview. In terms of the questions you ask I would suggest we try and stage the implementation.
How about the following plan:
One thing I am unsure about:
|
The plan sounds great!
Maybe I return this question and ask if we want to store this information in general. If so, then the CF-convention are a great solution even if we break them per variable (e.g. because they are auxiliary coords) but use the convention in the crs variable. |
Yes, sorry. I meant to say whether it is actually possible, not whether it makes sense. Maybe you could start there to see if you simple set the correct attributes in the output from |
@observingClouds I didn't manage to comment on your config suggestion change. If we assume that all inputs use the same projection could we start with the following: inputs:
danra_surface:
path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
variables:
u:
altitude:
values: [100,]
projection:
dims: [x, y]
crs_wkt: "GEODCRS['WGS 84',DATUM['World Geodetic System 1984', ELLIPSOID['WGS 84',6378137,..." ? I appreciate that you would like to have |
Well, I really would like to get the |
@leifdenby #38 follows now mostly the above mentioned structure, with just two minor adjustments:
In summary:
|
Because we do not only want to read and save complete CF-conform projection information, but also be able to plot the data on a given projection this issue and connected PR depend on upstream issues: SciTools/cartopy#2479 |
After some deeper dive, it turns out that by providing a valid BBOX entry to the WKT strings (e.g. do not include the poles for cylindrical projections), plotable cartopy Projection objects can be created from WKT strings. |
The problem
Currently,
mllam-data-prep
has no understanding of projection information, which means that it doesn't know how to define the latitude and longitude coordinate values for a given input dataset or how to define the same in zarr datasets produced bymllam-data-prep
. We could of course require that latitude and longitude values are given for example as variables with the specific names (lat
andlon
for example) in each input dataset, but that seems limiting to me since deriving latitude and longitude coordinate values from the coordinate values actually used in a dataset is exactly what the projection information is for. If a given input dataset is in fact given on a lat/lon grid I vote we make that explicit (by requiring that the projection be set ascartopy.crs.PlateCarree
) - give example configs below.To my mind being able to define the latitude and longitude coordinate values is necessary for:
neural-lam
these are required so thatlat
/lon
can be returned by a datastore and the projection used when plottingProposed solution
I propose to provide two ways of deriving projection information in
mllam-data-prep
:projection
information config parameter available onInputDataset
's in the config, where the type of projection (given as acartopy.crs.Projection
ClassName
, e.g.LambertConformal
)The new
projection
section to theinput
section would have three values to set the dimension names for the dimensions used in the projection (calleddims
), thecartopy.crs.Projection
-derived projection class name (class_name
) and the kwargs passed to thisProjectionClass
to create an instance of the projection object. In the case where we can parse the projection from from CF-compliant attributes and the user has also defined projection in the config we warn the user that they are overwriting the projection information.I will give two examples of what this would look like:
a) For DANRA (which is given on a Lambert Conformal grid) this would look like the following:
b) for ERA5 (which is stored on a lat/lon grid) stored in Google's S3 bucket:
Having this new config option will then make it possible to implement a function (say
get_input_dataset_lat_lons
) that can be used to get the lat/lon values for each grid-point. The output of this function can then be used when implementing th dataset merging that @TomasLandelius is working on.This would have the following limitation which could be handled later: All input datasets would have to be given on the same projection. This limitation primarily stems from the fact that
mllam-data-prep
requires that all input datasets share the exact same grid coordinates. We could in future implement reprojection and interpolation inmllam-data-prep
, but I was going to aim for that in this piece of work. This would only be for adding "how to extract the lat lon coordinates for a given input dataset" and "given that all input datasets share the same projection and coordinates how to write this projection information to the output dataset".As I mentioned in #31 if we have a new
__common__
section in the config then the user would only have to write this projection info once in the config.I would greatly appreciate your input @joeloskarsson, @khintz, @observingClouds, @sadamov, @SimonKamuk
The text was updated successfully, but these errors were encountered: