Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remaining corpus of idx mapping functionality #523

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

emfdavid
Copy link
Contributor

This is the remainder of the idx mapping code - primarily reinflate_grib_store and supporting methods that allow creating datasets from the idx files.

This PR is pretty large. I think the notebook example should go to https://github.com/ProjectPythia/kerchunk-cookbook ?
But I wanted you to be able to see the rest and how it works. Happy to break it up however you want. I know the methods need better doc strings, but I wanted you to give me a little more feedback on how to proceed first.

There are also pretty extensive tests and I will move those too but they require a bit more work to convert from unittest to pytest.

@rsignell
Copy link

rsignell commented Oct 28, 2024

@martindurant , just to make sure I understand, this is awaiting your review?
And when merged, should allow ProjectPythia/kerchunk-cookbook#64 to work?

@martindurant
Copy link
Member

Yes, but it's only 12hr old :) @emfdavid is asking for feedback above, so if you are in a place to do that, we would appreciate that too.

@rsignell
Copy link

@martindurant, sorry, didn't mean to suggest impatience! 🙃
I will try giving it a test right now!

@emfdavid
Copy link
Contributor Author

Recently meet Taylor who has a different take on the same concept... his approach is more in the on-the-fly category more akin to hypergrib.

@emfdavid
Copy link
Contributor Author

@rsignell I am impatient 😆 did you kick the tires yet?

@rsignell
Copy link

@emfdavid , heh heh, sorry, got distracted with another burning fire, but I just tried it out. Running into some xarray/datatree issue, perhaps because the latest release (2024.10.0) of xarray has datatree built in now and there is some conflict if you try to not use the built in version?

ImportError: cannot import name 'HybridMappingProxy' from 'xarray.core.utils' (/home/rsignell/miniforge3/envs/grib-idx/lib/python3.13/site-packages/xarray/core/utils.py)

@martindurant
Copy link
Member

The CI is showing that too, so yes, probably need to pin xarray or not-install tree

@rsignell
Copy link

Ideally this code would use the now-built-in xarray datatree -- would that be an easy fix for you @emfdavid ?

@rsignell
Copy link

rsignell commented Oct 28, 2024

So I did manage to run the test notebook using xarray=2024.7.0 but had to change ulwrf/avg/nominalTop to sulwrf/avg/nominalTop (there was no ulwrf).

Do the extracted data at the end of the rendered notebook (Cell [21]) look okay? Seem a bit strange...

@emfdavid
Copy link
Contributor Author

I updated the code to use the now built in datatree 🎉
Excited that is merged into xarray now!

I can't run your notebook tonight but those nan values do look funny. Building the axes is really finicky... I can try to take a look tomorrow.

I did run down the issue with the variable names dswrf -> sdswrf, ulwrf -> sulwrf
It looks like eccodes updated their definitions.

Is that something we can ask them to undo?
I will be inclined to pin cfgrib/eccodes till that is confirmed... but the mapping based approach works just fine either way!

@rsignell
Copy link

rsignell commented Oct 29, 2024

I was able to run a slightly-modified version of the notebook using the latest xarray (2024.10.0) and your latest fixes to #523 !

When I extracted some data values from the virtual dataset, however, they seem unlikely to be correct.

@emfdavid, @Anu-Ra-g, or @martindurant, is there a simple alternate method by which we could check to see if the data values in the last cell are correct?

Perhaps we should hold off merging until we decide whether the data extraction is working properly?

@emfdavid
Copy link
Contributor Author

I will see if I can debug your notebook tonight. If you have a chance to try the one in this PR I would appreciate it.

As far as when to merge... do you (@martindurant) want the tests added to this already large PR?
https://github.com/asascience-open/nextgen-dmac/blob/main/grib_index_aggregation/test_dynamic_zarr_store.py
That will take a bit more work to convert from unittest to pytest but should be straight forward enough?

@emfdavid
Copy link
Contributor Author

Here is the state of the art on huggingface... I would be really excited to see an update here!
Screenshot 2024-10-29 at 9 16 57 AM

@rsignell
Copy link

rsignell commented Oct 29, 2024

@emfdavid okay, I'm testing the notebook in the repo now...

And it took awhile (like 10 minutes), but it worked: https://nbviewer.org/gist/rsignell/64234851f1eafeab261ed8a774aea5ca

I used this conda environment file to create the environment to run it.

@Anu-Ra-g
Copy link
Contributor

Anu-Ra-g commented Oct 29, 2024

@rsignell In this notebook, I was able to reinflate the generated indexes but I forgot to look at the actual data in those indexes.

@Anu-Ra-g
Copy link
Contributor

I was able to run a slightly-modified version of the notebook using the latest xarray (2024.10.0) and your latest fixes to #523 !

When I extracted some data values from the virtual dataset, however, they seem unlikely to be correct.

@emfdavid, @Anu-Ra-g, or @martindurant, is there a simple alternate method by which we could check to see if the data values in the last cell are correct?

Perhaps we should hold off merging until we decide whether the data extraction is working properly?

The data extraction is working properly as there are minimal code changes and refactors from the original code. But I used GEFS grib data for making that notebook. Maybe that could be a reason!

As per the original demonstrations by @emfdavid with the GFS data, the data extraction works fine.

If you want, I could change it from GEFS to some other model's data.

@rsignell
Copy link

I'd like to make sure it's actually working as expected on the GEFS data - we don't want to sidestep a potential bug, right? If those NaN values are correct, then yes, perhaps switching to GFS would make a more pleasing notebooks.

@emfdavid
Copy link
Contributor Author

@rsignell There is a white space error in the double for loop. The notebook is only processing the 18z forecast for each day which is why there are three nan value between each real value.

Screenshot 2024-10-29 at 10 32 26 PM

I also noticed there are a lot of duplicate attrs in that much older 2017 file. The GEFS model changed significantly around 2020-09-25. I think these are really issues with the NOAA Grib file and its compatibility with cfgrib.
Screenshot 2024-10-29 at 10 07 54 PM
But these UGRD variables all reading out on the same grib level=0.0 are going to be a problem.

I don't see this on the more recent GEFS data, but I have only really looked at the geavg files so far.

@emfdavid
Copy link
Contributor Author

@martindurant I tried to fix the build issue, with ImportError "HybridMappingProxy" for just the 3.10 build but I think I made it worse with d40cca4
Any suggestions on what to do here? I think maybe conda doesn't have a xarray 2024.10 for python 3.10?

We can definitely rebase away the notebooks before merging - those are not to be checked in here like this.

@rsignell
Copy link

Whoa, great find on that white space indent error @emfdavid ! The notebook indeed works fine now: https://nbviewer.org/gist/rsignell/c3fd58368ed9d0ae50c26807b6a51678

@martindurant
Copy link
Member

maybe conda doesn't have a xarray 2024.10 for python 3.10

It does. Actually, xarray is noarch, but it has compiled deps. It does install into a fresh env, though.

'>2024.10.0'

should be

'>=2024.10.0'

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants