
2024‐05‐01

Teagan King edited this page May 1, 2024 · 5 revisions

May 1, 2024

Agenda:

  1. General CUPiD updates
    • Current Status
    • PRs in progress
    • Future Plans
    • Anything else to share?
  2. How Might CUPiD Tie-in to CMORization?
  3. CMORization discussion
    • Overview of Current Status
    • Background and context of CESM and CMIP
    • Moving forward for CMIP7
    • Open discussion: Determine a workflow for managing CESM model output data given upcoming CMIP7 effort
      • What are the ideal tools for this workflow?
      • Is it feasible to use existing PyConform and PyReshaper tools?
      • What database requirements do we have?
      • Should we consider zarr, kerchunk, NCzarr formats?
      • Determine whether we will update the existing workflow, create a new tool, or find a different existing tool

Slides

Notes

  • Want to be able to run CMORization tool from CUPiD (and dev environment)
  • Notebooks should work on CMORized data
  • CMORization beyond the scope of CUPiD
  • Bring in an outside tool similarly to timeseries
  • CMORization Current Status
    • Background of CMIP7: 4th participation in CMIP; over time, has gotten larger & more complicated
    • CMORization: an exercise in translation (e.g., file names are updated; the mappings live in translation tables); lots of metadata is added (some details are not yet determined at time of file creation)
    • CESM rundb was used for CMIP6
    • Scope of translation: ~1000 CMIP6 vars mapped to CESM vars: ~500 unchanged, ~230 unit changes, ~75 fields interpolated to pressure levels, ~200 mathematical operations; all of these are read by PyConform
    • Variable translations probably won't change in CMIP7 (though some fields dropped or added)
    • DB: CMIP variable names were defined in sea ice data, and a flag then turned the CMIP variable names on or off. Is it worth the effort to add this into the model? GS: it would be ideal if the model could output MIP-compliant data directly, but that is very non-trivial work that would use resources without helping the science.
    • ML: MOM6 tries to use CMIP-compliant data; is this useful? There is possibly roughly 40% ocn, 30% atm, etc...
    • DL: In any case, we'll still have to do some translation even if we had a lot of MIP-compliant data. We also have scripts that work with CESM data here at NCAR.
    • Overall workflow & set of existing tools:
      • PyReshaper (timeseries generation)
      • then PyConform (CMORization); both were developed at CISL by Kevin Paul and haven't been maintained for 4 yrs
      • Some other resources may be available in CISL, but without knowledge of those tools or CESM
    • Timeseries tool that's existing in CUPiD is fine for small volumes, not super fast (PyReshaper is faster, but has some edge cases that don't work correctly)
      • ADF doesn't currently do compression & currently runs on monthly means (IS) -- Dani is working on concatenating time dimension if unlimited; can also run on daily or 6-hourly, etc
      • Would be good to understand exactly what the timing is like
      • ADF timeseries was not intended to be stand alone tool
      • JN: ADF function is somewhat parallel (but still can't scale to the extent PyReshaper does)
      • DL: Tool for ADF isn't really well suited for this; we may want to resurrect PyReshaper.
      • BD: PyReshaper will get us limping through slightly better
    • NR: reiterate that CMIP7 workflow effort is a massive undertaking that should not be taken on as a side-project under CUPiD-- we need someone fully committed to this project
    • CMIP6 used cylc; S2S now uses ECFLOW; JE recommends ECFLOW; no longer have database programmer & need someone to support this; want to use workflow tool to have less human intervention
    • Data storage: zarr, kerchunk, nczarr are not really on the table at this point, but should be considered for a future model
    • Timeseries generation still up in the air, as well as how to do the translation
    • E3SM is going to be doing a similar process for CMIP7 and uses a similar dycore - possibly there could be some collaboration on this
    • SE dycore: PyReshaper and PyConform should be agnostic to the grid; do variables need to be regridded? (Will be a MIP decision-- ie SE fields may be on other grids)
      • bilinear remapping is not entirely working at the moment, open PR on that
      • Don't know yet what grid the data request will be for
      • NR: Can we leverage existing PyConform and PyReshaper? We will be getting a new person hopefully this summer; may be able to explore whether CMOR from DOE or PyConform is better?
      • Sense of how broken PyConform & PyReshaper are?
        • Gary is running a CMIP6 run through PyConform.
        • PyReshaper seems to be working ok in a container.
        • Some limitations, but we know how to work around these.
        • Some issues with HR data; probably because can't do partial runs
          • Have to leave some processors idle so it doesn't run out of memory
        • We have found these workarounds during smaller scale production runs; there is concern for finding more bugs during larger-scale runs
      • DL: Would be nice to test PyConform & PyReshaper. Would be nice to get an optimistic & conservative estimate of each of these tools & describe how critical each piece is and how likely to break. Wants to take this information to Jon and CISL once we have a more full description of the steps necessary & big picture.
      • JE: ECFLOW: don't have anything written in it yet, but this isn't extremely difficult to do
      • DL: Run Database may be one of the larger problems?
        • Rundb was defunct
        • xml files instead may be a solution
        • crosswalk is a large step; also need mechanism for resolving any mistakes that may come up
      • Need someone who is aware of all of the steps
      • GS: Encourages not starting over from scratch; this has been done a few times already. If tools are left alone for a long time, restarting may also be a significant effort
      • KF: PyReshaper is dependent on tools that have not been maintained for years.
        • Are the broken pieces irreplaceable? Not trivial to replace, but possible & would require resources. Anderson may have worked on this briefly & that work was not completed.
        • At some point, we need more maintainable versions.
        • If we were supporting tools in the community, we would have bug fixes, but don't have resources for that. We could try to identify where the code is brittle. Maybe we could add testing to the code to help us keep this up to date.
      • There are some other MIPs and large ensembles-- if these tools were easier to use, it might be useful more often than just for CMIP.
      • CUPiD will be a place where you can easily run outside packages including CMORization
      • KF: MPI for python is not heavily used anymore, as there has been a shift to dask; a few other things in PyConform were mostly supported by Kevin Paul. Maybe taking a look at Anderson's more recent work would be helpful. Could also use ESMF regridding and command line tools.
      • E3SM, PCMDI, others working on this problem-- can we use existing resources? Jim has brought over DOE's tool & built it here; still issues about its parallelization. Could be a potential starting point, just have to decide which is the best starting point.
      • Direction forward: Will get info back from Jim's discussions & Gary's testing. We also want to write down resource needs.
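
Illustrative sketch (not from the meeting): the two-stage workflow discussed above, timeseries generation followed by translation, can be outlined in plain Python. Everything here is an assumption made for illustration: the function names `to_timeseries` and `translate`, the `TRANSLATIONS` table layout, and the sample values are hypothetical and do not reflect the PyReshaper or PyConform APIs or the actual CMIP translation tables. The two entries show an unchanged field and a mathematical operation combined with a unit change.

```python
from collections import defaultdict

# Hypothetical time-slice "history files": one record per time step,
# each holding every variable for that step (plain floats for brevity).
time_slices = [
    {"time": 1, "TS": 289.0, "PRECC": 2.0e-8, "PRECL": 2.0e-8},
    {"time": 0, "TS": 288.0, "PRECC": 1.0e-8, "PRECL": 2.0e-8},
]


def to_timeseries(slices):
    """Stage 1: regroup per-step records into one time-ordered series per variable."""
    series = defaultdict(list)
    for record in sorted(slices, key=lambda r: r["time"]):
        for name, value in record.items():
            if name != "time":
                series[name].append(value)
    return dict(series)


# Hypothetical translation table (assumed layout):
# output name -> (input variable names, operation, output units)
TRANSLATIONS = {
    # Unchanged field: same values, new name.
    "ts": (["TS"], lambda ts: ts, "K"),
    # Mathematical operation plus unit change: sum two precipitation
    # rates in m/s and convert to kg m-2 s-1 (water density 1000 kg/m3).
    "pr": (["PRECC", "PRECL"], lambda c, l: (c + l) * 1000.0, "kg m-2 s-1"),
}


def translate(series, table):
    """Stage 2: apply each table entry element-wise to its input series."""
    out = {}
    for out_name, (inputs, op, units) in table.items():
        columns = [series[name] for name in inputs]
        values = [op(*step_values) for step_values in zip(*columns)]
        out[out_name] = {"units": units, "values": values}
    return out


series = to_timeseries(time_slices)
result = translate(series, TRANSLATIONS)
print(result["ts"]["values"], result["pr"]["units"])
```

Keeping the mappings in a table rather than in code mirrors the note that the translation information lives in translation tables; a real workflow would also need the metadata, pressure-level interpolation, and regridding steps that this sketch leaves out.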