What makes a dimension a grid? #127

Open
btupper opened this issue Aug 30, 2024 · 20 comments

@btupper

btupper commented Aug 30, 2024

The lead example on the README shows that two dimensions, lon and lat, appear under the listing of grids. They happen to be the active dimensions, but is that a coincidence? It doesn't quite fit my paradigm, which is more like this roadmap, to have the dimensions show up under the grid list. Clearly variables and grids are not exactly the same thing in this new paradigm.

Here is another example in which case all of the dimensions are included in grids.

url = "http://psl.noaa.gov/thredds/dodsC/Datasets/noaa.oisst.v2.highres/sst.day.mean.2023.nc"
tidync::tidync(url)
#> not a file: 
#> ' http://psl.noaa.gov/thredds/dodsC/Datasets/noaa.oisst.v2.highres/sst.day.mean.2023.nc '
#> 
#> ... attempting remote connection
#> Connection succeeded.
#> 
#> Data Source (1): sst.day.mean.2023.nc ...
#> 
#> Grids (4) <dimension family> : <associated variables> 
#> 
#> [1]   D2,D1,D0 : sst    **ACTIVE GRID** ( 378432000  values per variable)
#> [2]   D0       : time
#> [3]   D1       : lat
#> [4]   D2       : lon
#> 
#> Dimensions 3 (all active): 
#>   
#>   dim   name  length      min    max start count     dmin   dmax unlim coord_dim 
#>   <chr> <chr>  <dbl>    <dbl>  <dbl> <int> <int>    <dbl>  <dbl> <lgl> <lgl>     
#> 1 D0    time     365  8.14e+4 8.18e4     1   365  8.14e+4 8.18e4 TRUE  TRUE      
#> 2 D1    lat      720 -8.99e+1 8.99e1     1   720 -8.99e+1 8.99e1 FALSE TRUE      
#> 3 D2    lon     1440  1.25e-1 3.60e2     1  1440  1.25e-1 3.60e2 FALSE TRUE

Created on 2024-08-30 with reprex v2.1.1

Thanks!

@mdsumner
Collaborator

mdsumner commented Aug 30, 2024

I made up 'grid'; I didn't find an existing name for it. My key insight was that netcdf declares dimensions upfront but then scatters them about in a long list of variables. To me, there are variables that belong together (like columns in a table) because they use the same dimensions.

So, a grid in my weird take is instances of dimensions grouped together. (I'm using 'group' divorced from the hdf/netcdf meaning here).

When a variable uses only one dimension, that makes that a group of one (a grid with one dimension, that might have many variables associated with it). Anything in a "grid" can be slurped up together because they have the same shape. I found that a natural way to "group" variables, something entirely lost in the ncdump -h jungle.

Does that make sense? I did find a word eventually that maybe would have been better.
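
To make the grouping above concrete, here is a minimal Python sketch. It is not tidync's actual implementation, just one way to express the idea: variables that share the same tuple of dimensions land in the same "grid". The variable-to-dimension mapping is transcribed from the sst.day.mean.2023.nc listing earlier in the thread; the function name `group_by_shape` is made up for illustration.

```python
from collections import defaultdict

# Variable -> dimensions, transcribed from the sst.day.mean.2023.nc listing above.
variables = {
    "sst": ("lon", "lat", "time"),
    "time": ("time",),
    "lat": ("lat",),
    "lon": ("lon",),
}


def group_by_shape(var_dims):
    """Group variable names by the tuple of dimensions they are defined on.

    Hypothetical helper for illustration; not part of tidync or xarray.
    """
    grids = defaultdict(list)
    for name, dims in var_dims.items():
        grids[tuple(dims)].append(name)
    return dict(grids)


grids = group_by_shape(variables)
for dims, names in grids.items():
    print(",".join(dims), ":", ", ".join(names))
```

A file with more variables on the same dimensions would simply get longer lists under the same key, which is what the "associated variables" column in the tidync printout shows.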

@mdsumner
Collaborator

mdsumner commented Aug 30, 2024

Maybe "shape" is what it should have been, I've become used to that term in python. There are 4 shapes in that file above, and it's possible that the "time" shape could be used by multiple variables, but it isn't here (when "time" is used in sst it's part of the D0,D1,D2 shape, an unfortunate label but I couldn't think of anything better).

@btupper
Author

btupper commented Aug 30, 2024

Hmmm. You're stretching my brain here like a rubber band - I hope it doesn't snap!

So, that makes the data (sst) a shape?

@mdsumner
Collaborator

No, it has a shape. Those files could have more variables; in fact the daily ones do: sst, anom, ice, err all exist on the same grid/shape.

@btupper
Author

btupper commented Aug 30, 2024

Ah, the old "is a" versus "has a" is new again! I think I have caught just enough of this to be able to lurch forward. I'll try to say back what I think you are saying...

Grid section
This section shows the various data elements grouped by shape (aka family of dimensions). In a given row, one or more data elements sharing a particular dimensionality are shown.

Dimension section

The dimension section shows the details of each dimension (of which one or more might be "active"). Active dimensions show additional info (start, count, dmin and dmax) including the user-specified filtering bounds. By default, the entirety of each active dimension is selected, but the user may modify that using hyper_filter().
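
As a rough illustration of how value bounds on a dimension can map to the start and count columns, here is a small Python sketch. It is an assumption about the general idea, not a description of how hyper_filter() is implemented; the helper `index_window` and the quarter-degree latitude values are hypothetical.

```python
def index_window(coords, lower, upper):
    """Return (start, count) for the coordinate values inside [lower, upper].

    start is 1-based, matching the tidync printout. Assumes the coordinate
    values are monotonic, so the selected indices are contiguous.
    Hypothetical helper for illustration only.
    """
    hits = [i for i, value in enumerate(coords) if lower <= value <= upper]
    if not hits:
        return None
    return hits[0] + 1, len(hits)


# Hypothetical quarter-degree latitude centres: 720 values, -89.875 to 89.875.
lat = [-89.875 + 0.25 * i for i in range(720)]

print(index_window(lat, -90, 90))  # no filtering: the whole dimension
print(index_window(lat, 0, 10))    # a narrower window on the same dimension
```

With no filter the window covers the full length of the dimension, which matches the default shown in the printout (start 1, count equal to length).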

Am I getting closer?

@mdsumner
Collaborator

Yes 🙌

@mdsumner
Collaborator

and so this one has more sets of grids

# wget https://thredds.nci.org.au/thredds/fileServer/cj50/access-om2/cf-compliant/access-om2/v20171212/1deg_jra55v13_ryf0304_RCP45/output150/ice/iceh.0301-01.nc


Data Source (1): iceh.0301-01.nc ...

Grids (6) <dimension family> : <associated variables>

[1]   D1,D2,D3,D7 : aicen_m, vicen_m, fsurfn_ai_m, fcondtopn_ai_m, fmelttn_ai_m, flatn_ai_m    **ACTIVE GRID** ( 540000  values per variable)
[2]   D1,D2,D7    : hi_m, hs_m, Tsfc_m, aice_m, uvel_m, vvel_m, uatm_m, vatm_m, sice_m, fswdn_m, flwdn_m, snow_ai_m, rain_ai_m, sst_m, sss_m, uocn_m, vocn_m, frzmlt_m, fswfac_m, fswabs_ai_m, albsni_m, alvdf_m, alidf_m, albice_m, albsno_m, flat_ai_m, fsens_ai_m, flwup_ai_m, evap_ai_m, Tair_m, congel_m, frazil_m, snoice_m, meltt_m, melts_m, meltb_m, meltl_m, fresh_ai_m, fsalt_ai_m, fhocn_ai_m, fswthru_ai_m, strairx_m, strairy_m, strtltx_m, strtlty_m, strcorx_m, strcory_m, strocnx_m, strocny_m, strintx_m, strinty_m, strength_m, divu_m, shear_m, dvidtt_m, dvidtd_m, daidtt_m, daidtd_m, mlt_onset_m, frz_onset_m, trsig_m, ice_present_m, fcondtop_ai_m
[3]   D0,D7       : time_bounds
[4]   D1,D2       : TLON, TLAT, ULON, ULAT, tmask, blkmask, tarea, uarea, dxt, dyt, dxu, dyu, HTN, HTE, ANGLE, ANGLET
[5]   D3          : NCAT
[6]   D7          : time

Dimensions 9 (4 active):

  dim   name  length    min    max start count   dmin   dmax unlim coord_dim
  <chr> <chr>  <dbl>  <dbl>  <dbl> <int> <int>  <dbl>  <dbl> <lgl> <lgl>
1 D1    ni       360      1    360     1   360      1    360 FALSE FALSE
2 D2    nj       300      1    300     1   300      1    300 FALSE FALSE
3 D3    nc         5      1      5     1     5      1      5 FALSE FALSE
4 D7    time       1 109531 109531     1     1 109531 109531 TRUE  TRUE

Inactive dimensions:

  dim   name      length   min   max unlim coord_dim
  <chr> <chr>      <dbl> <dbl> <dbl> <lgl> <lgl>
1 D0    d2             2     1     2 FALSE FALSE
2 D4    nkice          4    NA    NA FALSE FALSE
3 D5    nksnow         1    NA    NA FALSE FALSE
4 D6    nkbio          9    NA    NA FALSE FALSE
5 D8    nvertices      4    NA    NA FALSE FALSE

I said that a 1D "grid" might have multiple variables, but actually I think that probably never happens. Above, the "time" dimension (D7) occurs in 4 grids: 'D1,D2,D3,D7', 'D1,D2,D7', 'D0,D7', and 'D7'.

If I activate the first, then hyper_array/hyper_tibble will emit values from a 4D variable, but if I activate("D7") I'll just get the values from the "time" var, the only variable that is defined on that grid.
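
A minimal Python sketch of that selection logic, using a subset of the iceh.0301-01.nc grids listed above. The functions `grids_using` and `activate` here are hypothetical stand-ins written for illustration; they only mimic the idea of tidync's activate() and are not its implementation.

```python
# A subset of the iceh.0301-01.nc listing above: dimension family -> variables.
grids = {
    ("D1", "D2", "D3", "D7"): ["aicen_m", "vicen_m"],
    ("D1", "D2", "D7"): ["hi_m", "hs_m"],
    ("D0", "D7"): ["time_bounds"],
    ("D1", "D2"): ["TLON", "TLAT"],
    ("D3",): ["NCAT"],
    ("D7",): ["time"],
}


def grids_using(grids, dim):
    """List every dimension family that includes the given dimension."""
    return [family for family in grids if dim in family]


def activate(grids, family):
    """Return the variables defined on exactly this family, e.g. "D1,D2,D7"."""
    return grids[tuple(family.split(","))]


print(grids_using(grids, "D7"))
print(activate(grids, "D7"))
print(activate(grids, "D1,D2,D3,D7"))
```

The point of the sketch is that a dimension appearing in several families does not make those families interchangeable: activating "D7" selects only the variables whose full shape is that single dimension.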

It allows the "activate" scheme to work on any set of variables and not treat any of them as special. This was all done before I knew anything about xarray of course, and I'd defer to concepts there going forward. (Be nice if they can transcend the degenerate rectilinear model from netcdf/matlab though).

@mdsumner
Collaborator

mdsumner commented Aug 30, 2024

here's a more complex one, where the D8 and D9 grids do have multiple 1D vars associated with them. In ncdump -h output you have to scan the variables to see which ones go together, which is why I organized it this way.

Data Source (1): ocean_avg_0538.nc ...

Grids (17) <dimension family> : <associated variables>

[1]   D0,D4,D8,D12 : temp, salt
[2]   D0,D4,D9,D12 : w
[3]   D1,D5,D8,D12 : u    **ACTIVE GRID** ( 258690350  values per variable)
[4]   D2,D6,D8,D12 : v
[5]   D0,D4,D12    : zeta, m
[6]   D1,D5,D12    : ubar
[7]   D2,D6,D12    : vbar
[8]   D10,D11      : Tobc_in, Tobc_out
[9]   D0,D4        : h, zice, f, pm, pn, x_rho, y_rho, angle, mask_rho
[10]   D1,D5        : x_u, y_u, mask_u
[11]   D2,D6        : x_v, y_v, mask_v
[12]   D3,D7        : x_psi, y_psi, mask_psi
[13]   D10          : nl_tnu2, LtracerSponge, Akt_bak, Tnudg, LtracerSrc, LtracerCLM, LnudgeTCLM
[14]   D11          : FSobc_in, FSobc_out, M2obc_in, M2obc_out, M3obc_in, M3obc_out
[15]   D12          : ocean_time
[16]   D8           : s_rho, Cs_r
[17]   D9           : s_w, Cs_w

Dimensions 13 (4 active):

  dim   name       length       min      max start count     dmin     dmax unlim
  <chr> <chr>       <dbl>     <dbl>    <dbl> <int> <int>    <dbl>    <dbl> <lgl>
1 D1    xi_u         3149   1   e+0  3.15e+3     1  3149  1   e+0  3.15e+3 FALSE
2 D5    eta_u        2650   1   e+0  2.65e+3     1  2650  1   e+0  2.65e+3 FALSE
3 D8    s_rho          31  -9.84e-1 -1.61e-2     1    31 -9.84e-1 -1.61e-2 FALSE
4 D12   ocean_time      1   2.32e+8  2.32e+8     1     1  2.32e+8  2.32e+8 TRUE
# ℹ 1 more variable: coord_dim <lgl>

Inactive dimensions:

  dim   name     length   min   max unlim coord_dim
  <chr> <chr>     <dbl> <dbl> <dbl> <lgl> <lgl>
1 D0    xi_rho     3150     1  3150 FALSE FALSE
2 D2    xi_v       3150     1  3150 FALSE FALSE
3 D3    xi_psi     3149     1  3149 FALSE FALSE
4 D4    eta_rho    2650     1  2650 FALSE FALSE
5 D6    eta_v      2649     1  2649 FALSE FALSE
6 D7    eta_psi    2649     1  2649 FALSE FALSE
7 D9    s_w          32    -1     0 FALSE TRUE
8 D10   tracer        2     1     2 FALSE FALSE
9 D11   boundary      4     1     4 FALSE FALSE

@mdsumner
Collaborator

xarray at least lists them together so it's easier to scan

<xarray.Dataset> Size: 7GB
Dimensions:        (tracer: 2, boundary: 4, s_rho: 31, s_w: 32, eta_rho: 2650,
                    xi_rho: 3150, eta_u: 2650, xi_u: 3149, eta_v: 2649,
                    xi_v: 3150, eta_psi: 2649, xi_psi: 3149, ocean_time: 1)
Coordinates:
  * s_rho          (s_rho) float64 248B -0.9839 -0.9516 ... -0.04839 -0.01613
  * s_w            (s_w) float64 256B -1.0 -0.9677 -0.9355 ... -0.03226 0.0
    x_rho          (eta_rho, xi_rho) float64 67MB ...
    y_rho          (eta_rho, xi_rho) float64 67MB ...
    x_u            (eta_u, xi_u) float64 67MB ...
    y_u            (eta_u, xi_u) float64 67MB ...
    x_v            (eta_v, xi_v) float64 67MB ...
    y_v            (eta_v, xi_v) float64 67MB ...
    x_psi          (eta_psi, xi_psi) float64 67MB ...
    y_psi          (eta_psi, xi_psi) float64 67MB ...
  * ocean_time     (ocean_time) datetime64[ns] 8B 2014-05-11T12:00:00
Dimensions without coordinates: tracer, boundary, eta_rho, xi_rho, eta_u, xi_u,
                                eta_v, xi_v, eta_psi, xi_psi
Data variables:
    ntimes         int32 4B ...
    ndtfast        int32 4B ...
    dt             float64 8B ...
    dtfast         float64 8B ...
    dstart         datetime64[ns] 8B ...
    nHIS           int32 4B ...
    ndefHIS        int32 4B ...
    nRST           int32 4B ...
    ntsAVG         int32 4B ...
    nAVG           int32 4B ...
    ndefAVG        int32 4B ...
    Falpha         float64 8B ...
    Fbeta          float64 8B ...
    Fgamma         float64 8B ...
    nl_tnu2        (tracer) float64 16B ...
    nl_visc2       float64 8B ...
    LuvSponge      int32 4B ...
    LtracerSponge  (tracer) int32 8B ...
    Akt_bak        (tracer) float64 16B ...
    Akv_bak        float64 8B ...
    rdrg           float64 8B ...
    rdrg2          float64 8B ...
    Zob            float64 8B ...
    Zos            float64 8B ...
    Znudg          float64 8B ...
    M2nudg         float64 8B ...
    M3nudg         float64 8B ...
    Tnudg          (tracer) float64 16B ...
    FSobc_in       (boundary) float64 32B ...
    FSobc_out      (boundary) float64 32B ...
    M2obc_in       (boundary) float64 32B ...
    M2obc_out      (boundary) float64 32B ...
    Tobc_in        (boundary, tracer) float64 64B ...
    Tobc_out       (boundary, tracer) float64 64B ...
    M3obc_in       (boundary) float64 32B ...
    M3obc_out      (boundary) float64 32B ...
    rho0           float64 8B ...
    gamma2         float64 8B ...
    LuvSrc         int32 4B ...
    LwSrc          int32 4B ...
    LtracerSrc     (tracer) int32 8B ...
    LsshCLM        int32 4B ...
    Lm2CLM         int32 4B ...
    Lm3CLM         int32 4B ...
    LtracerCLM     (tracer) int32 8B ...
    LnudgeM2CLM    int32 4B ...
    LnudgeM3CLM    int32 4B ...
    LnudgeTCLM     (tracer) int32 8B ...
    spherical      int32 4B ...
    xl             float64 8B ...
    el             float64 8B ...
    Vtransform     int32 4B ...
    Vstretching    int32 4B ...
    theta_s        float64 8B ...
    theta_b        float64 8B ...
    Tcline         float64 8B ...
    hc             float64 8B ...
    Cs_r           (s_rho) float64 248B ...
    Cs_w           (s_w) float64 256B ...
    h              (eta_rho, xi_rho) float64 67MB ...
    zice           (eta_rho, xi_rho) float64 67MB ...
    f              (eta_rho, xi_rho) float64 67MB ...
    pm             (eta_rho, xi_rho) float64 67MB ...
    pn             (eta_rho, xi_rho) float64 67MB ...
    angle          (eta_rho, xi_rho) float64 67MB ...
    mask_rho       (eta_rho, xi_rho) float64 67MB ...
    mask_u         (eta_u, xi_u) float64 67MB ...
    mask_v         (eta_v, xi_v) float64 67MB ...
    mask_psi       (eta_psi, xi_psi) float64 67MB ...
    zeta           (ocean_time, eta_rho, xi_rho) float32 33MB ...
    m              (ocean_time, eta_rho, xi_rho) float32 33MB ...
    ubar           (ocean_time, eta_u, xi_u) float32 33MB ...
    vbar           (ocean_time, eta_v, xi_v) float32 33MB ...
    u              (ocean_time, s_rho, eta_u, xi_u) float32 1GB ...
    v              (ocean_time, s_rho, eta_v, xi_v) float32 1GB ...
    w              (ocean_time, s_w, eta_rho, xi_rho) float32 1GB ...
    temp           (ocean_time, s_rho, eta_rho, xi_rho) float32 1GB ...
    salt           (ocean_time, s_rho, eta_rho, xi_rho) float32 1GB ...
Attributes:
    file:              ocean_avg_0538.nc
    format:            netCDF-3 64bit offset file
    Conventions:       CF-1.4
    type:              ROMS/TOMS nonlinear model averages file
    title:             Whole Antarctic and Ocean Application, 2 km resolution
    rst_file:          ocean_rst.nc
    his_base:          ocean_his
    avg_base:          ocean_avg
    grd_file:          /g/data2/gh9/oxr581/waom_frc/waom2_grd.nc
    ini_file:          ocean_rst.nc
    frc_file_01:       /g/data2/gh9/oxr581/waom_frc/waom2_tds.nc
    frc_file_02:       /g/data2/gh9/oxr581/waom_frc/waom2_shflux.nc

....

@btupper
Author

btupper commented Sep 3, 2024

Hmmm. Not to get hung up on semantics, but in xarray what is the difference between dimensions and coordinates?

@mdsumner
Collaborator

mdsumner commented Sep 3, 2024

The same difference as in netcdf, MATLAB, and R.

@btupper
Author

btupper commented Sep 3, 2024

Oh, well now that's embarrassing - after 20 something years of mucking about in these things I have never distinguished between the two. I guess I have some homework.

@mdsumner
Collaborator

mdsumner commented Sep 3, 2024

I'm not sure what else to say. R arrays are dimensioned but have no coordinates; image(x, y, z) uses the same coordinate model as most netcdf files (option to use centre OR edge) and xarray formalizes that in n-dimensions. Sadly image(z) puts the space in 0,1 but I think that's a mistake - though not the only unconventional choice, cf. rasterImage() - and note that terra and stars now use 0,ncol/nrow (also note GDAL defaults to +y which flips an image, though there's a difference between most imagery and netcdf in that regard).

The index of a dimension is the coordinates by default, but it's not the dimension. (Terminology is tough here, dimension and resolution get conflated but I think dimension is clear in the R context). 🙏
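
One way to picture that distinction is the small Python sketch below. It is only an illustration of the idea, not how netcdf, R, or xarray store things: the `Dimension` class is hypothetical, and the latitude values are made-up examples. A dimension is a name and a length; coordinates are values attached to positions along it, and when none are stored the 1-based index stands in for them (as with ni and nc in the iceh listing, where min and max are 1 and the length).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Dimension:
    """A named axis with a length and, optionally, stored coordinate values."""

    name: str
    length: int
    coords: Optional[Sequence[float]] = None  # coordinate variable values, if any

    def coordinates(self):
        """Coordinate values, falling back to the 1-based index when none exist."""
        if self.coords is None:
            return list(range(1, self.length + 1))
        return list(self.coords)


nc = Dimension("nc", 5)  # no coordinate variable: the index is all there is
lat = Dimension("lat", 4, [-89.875, -89.625, -89.375, -89.125])  # hypothetical values

print(nc.coordinates())
print(lat.coordinates())
```

Both objects are dimensions in the same sense; only one of them carries coordinates that differ from its index.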

xarray calls coordinates labelling as well, which I find weird - but a bit like R's dimnames which were never really leveraged.

@btupper
Author

btupper commented Sep 3, 2024

Yeah, I think I have just been lazy in my thinking (in fact, I'm an expert in laziness at this point), and I haven't had to think about it until now.

@btupper
Author

btupper commented Sep 3, 2024

BTW - I just found your blog post (pre-pandemic!), and this section really gets at the nut of it very well.

@mdsumner
Collaborator

mdsumner commented Sep 3, 2024

ah indeed, glad that helps

@btupper
Author

btupper commented Sep 23, 2024

I think I am getting the hang of it thanks to your many many examples. Of course my first foray had to be curvilinear grids where the lon/lat transforms are stored in a separate NetCDF from the NetCDF with the data. Doh! Trial by fire!

But a question surfaces for me (which I can move to a separate issue if preferred). Why is the CF timestamp stored as character? Perhaps that is because you want tidync to be generic, and you do not want to add too many bells-and-whistles? But if that is the case, then what role does the CFtime package really play?

@mdsumner
Collaborator

I don't actually know, @pvanlaake contributed the CFtime support. To me it seemed "too hard" (== I never trusted any automatic way to do it) so I didn't do anything with metadata or units before.

Curvilinear grids are best dealt with using the GDAL warper API (as long as they are mass properties, not directional ones). But, and I've seen this before, that latitude spacing is really weird, whereas the rectilinear noise in longitude is just noise. Why do they do this ... The curve in the latitude is potentially Mercator stretch (aviso used to do that).

image

I actually think it means you aren't supposed to care about data north of a particular latitude.

@pvanlaake
Contributor

On the type of the timestamp that tidync reports:
This comes from package CFtime. That package supports all 9 defined calendars from the CF Metadata Conventions. Several of those calendars are not compatible with POSIXt and using the built-in date-formatting routines will produce NAs (such as for 2024-02-30 which is valid in a 360_day calendar). This is why the default timestamp formatting used by CFtime is a string. If you are using the standard, gregorian or proleptic_gregorian calendar you can safely use as.POSIXct() or as.Date(), as you are doing in your cefi_time() function. Package CFtime offers a similar functionality through CFtime::as_timestamp(x, asPOSIX = TRUE) (with x being a CFtime instance) but unfortunately the CFtime instance is not (yet?) exposed through tidync.
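
The 2024-02-30 case can be reproduced with a short Python sketch. This is not CFtime's code; the function `timestamp_360_day` is a hypothetical helper that assumes whole-day offsets from January 1 of an origin year and twelve 30-day months. It shows why a string is the safe default: the 360_day calendar yields a date that a Gregorian date type refuses to hold.

```python
from datetime import date


def timestamp_360_day(offset_days, origin_year=2024):
    """Format a whole-day offset from <origin_year>-01-01 in a 360_day calendar.

    Hypothetical helper: every month has 30 days, every year 360 days.
    """
    year, day_of_year = divmod(offset_days, 360)
    month, day = divmod(day_of_year, 30)
    return f"{origin_year + year:04d}-{month + 1:02d}-{day + 1:02d}"


stamp = timestamp_360_day(59)  # the 60th day of the 360_day year
print(stamp)

# A Gregorian date type cannot represent this timestamp.
try:
    date.fromisoformat(stamp)
    representable = True
except ValueError:
    representable = False
print(representable)
```

In R the analogous failure shows up as NA rather than an exception, which is the behaviour described above.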

On the issue of dimensions, axes, grids, shapes, coordinates, variables and the like:
This all originates from the good-ol' netcdf library, as you will have picked up from the netCDF user's guide that you linked to. Do keep in mind, however, that a lot of environmental data on climate and ocean dynamics follows the CF Metadata Conventions (see link for the calendars, above) and that adds a new layer of terminology and complexity (auxiliary coordinate variable, anyone?). You are well-advised to have a look at those (even if your data does not report adherence to the CF Conventions).

@btupper
Author

btupper commented Sep 24, 2024

That makes perfect sense - thanks for the explanation and the heads up.

Your point is well taken - the CEFI historical and forecast data are explicit about the calendars used.
