remaining corpus of idx mapping functionality #523

emfdavid · 2024-10-28T01:57:31Z

This is the remainder of the idx mapping code - primarily reinflate_grib_store and supporting methods that allow creating datasets from the idx files.

This PR is pretty large. I think the notebook example should go to https://github.com/ProjectPythia/kerchunk-cookbook ?
But I wanted you to be able to see the rest and how it works. Happy to break it up however you want. I know the methods need better doc strings, but I wanted you to give me a little more feedback on how to proceed first.

There are also pretty extensive tests and I will move those too but they require a bit more work to convert from unittest to pytest.
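For readers unfamiliar with the conversion mentioned here, the sketch below shows the same check written both ways. The `scale` function and both test names are hypothetical stand-ins and are not taken from the kerchunk test suite.

```python
import unittest


def scale(values, factor):
    """Hypothetical function under test: multiply each value by factor."""
    return [v * factor for v in values]


# unittest style: a TestCase subclass using assert* methods.
class TestScaleUnittest(unittest.TestCase):
    def test_scale(self):
        self.assertEqual(scale([1, 2], 2), [2, 4])


# pytest style: a plain function using a bare assert.
def test_scale():
    assert scale([1, 2], 2) == [2, 4]
```

pytest can also collect `unittest.TestCase` classes unchanged, so a conversion like this can usually be done one test at a time.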

rsignell · 2024-10-28T13:58:41Z

@martindurant , just to make sure I understand, this is awaiting your review?
And when merged, should allow ProjectPythia/kerchunk-cookbook#64 to work?

martindurant · 2024-10-28T14:00:19Z

Yes, but it's only 12hr old :) @emfdavid is asking for feedback above, so if you are in a place to do that, we would appreciate that too.

rsignell · 2024-10-28T14:01:54Z

@martindurant, sorry, didn't mean to suggest impatience! 🙃
I will try giving it a test right now!

emfdavid · 2024-10-28T15:39:21Z

I recently met Taylor, who has a different take on the same concept... his approach is in the on-the-fly category, more akin to hypergrib.

emfdavid · 2024-10-28T18:33:15Z

@rsignell I am impatient 😆 did you kick the tires yet?

rsignell · 2024-10-28T20:03:50Z

@emfdavid , heh heh, sorry, got distracted with another burning fire, but I just tried it out. Running into some xarray/datatree issue, perhaps because the latest release (2024.10.0) of xarray has datatree built in now and there is some conflict if you try to not use the built in version?

ImportError: cannot import name 'HybridMappingProxy' from 'xarray.core.utils' (/home/rsignell/miniforge3/envs/grib-idx/lib/python3.13/site-packages/xarray/core/utils.py)

martindurant · 2024-10-28T20:05:08Z

The CI is showing that too, so yes, we probably need to pin xarray or not install datatree.

rsignell · 2024-10-28T20:06:15Z

Ideally this code would use the now-built-in xarray datatree -- would that be an easy fix for you @emfdavid ?
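As a rough illustration of the compatibility question above, the sketch below prefers the `DataTree` class built into xarray (available from 2024.10.0) and falls back to the standalone `datatree` package only if xarray does not provide one. The helper name `load_datatree_class` is hypothetical; this is not the change that was made in the PR.

```python
import importlib


def load_datatree_class():
    """Return a DataTree class, preferring the one built into xarray.

    Returns None when neither xarray nor the standalone datatree
    package provides one in the current environment.
    """
    # (module name, attribute) pairs, tried in order of preference.
    candidates = [("xarray", "DataTree"), ("datatree", "DataTree")]
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Not installed, or (as in the traceback above) the old
            # standalone package fails to import against a newer xarray.
            continue
        cls = getattr(module, attr, None)
        if cls is not None:
            return cls
    return None


DataTree = load_datatree_class()
print("DataTree source:", DataTree.__module__ if DataTree else "not available")
```

Trying xarray first means the standalone package is never imported when the built-in class exists, which avoids the `HybridMappingProxy` import failure reported above.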

rsignell · 2024-10-28T21:08:51Z

So I did manage to run the test notebook using xarray=2024.7.0 but had to change ulwrf/avg/nominalTop to sulwrf/avg/nominalTop (there was no ulwrf).

Do the extracted data at the end of the rendered notebook (Cell [21]) look okay? They seem a bit strange...

emfdavid · 2024-10-29T01:57:56Z

I updated the code to use the now built in datatree 🎉
Excited that is merged into xarray now!

I can't run your notebook tonight but those nan values do look funny. Building the axes is really finicky... I can try to take a look tomorrow.

I did run down the issue with the variable names dswrf -> sdswrf, ulwrf -> sulwrf
It looks like eccodes updated their definitions.

Is that something we can ask them to undo?
I will be inclined to pin cfgrib/eccodes till that is confirmed... but the mapping based approach works just fine either way!
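One way a notebook could tolerate the rename without pinning is to look a variable up under either spelling. This is only a sketch: the helper `first_present` is hypothetical, and the only aliases it knows are the two pairs named in this comment.

```python
# Old and new short names reported in this thread (old name first).
ALIASES = {
    "dswrf": ["dswrf", "sdswrf"],
    "ulwrf": ["ulwrf", "sulwrf"],
}


def first_present(available, wanted):
    """Return whichever known spelling of `wanted` exists in `available`."""
    for name in ALIASES.get(wanted, [wanted]):
        if name in available:
            return name
    raise KeyError(f"none of {ALIASES.get(wanted, [wanted])} found")


# Example with a made-up set of variable names from a newer eccodes.
variables = {"sulwrf", "sdswrf", "t2m"}
print(first_present(variables, "ulwrf"))
```

The `available` argument only needs to support `in`, so a set of names, a dict, or the variable mapping of an opened dataset would all work.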

rsignell · 2024-10-29T12:16:58Z

I was able to run a slightly-modified version of the notebook using the latest xarray (2024.10.0) and your latest fixes to #523 !

When I extracted some data values from the virtual dataset, however, they seem unlikely to be correct.

@emfdavid, @Anu-Ra-g, or @martindurant, is there a simple alternate method by which we could check to see if the data values in the last cell are correct?

Perhaps we should hold off merging until we decide whether the data extraction is working properly?

emfdavid · 2024-10-29T13:07:12Z

I will see if I can debug your notebook tonight. If you have a chance to try the one in this PR I would appreciate it.

As far as when to merge... do you (@martindurant) want the tests added to this already large PR?
https://github.com/asascience-open/nextgen-dmac/blob/main/grib_index_aggregation/test_dynamic_zarr_store.py
That will take a bit more work to convert from unittest to pytest but should be straightforward enough?

emfdavid · 2024-10-29T13:18:10Z

Here is the state of the art on huggingface... I would be really excited to see an update here!

rsignell · 2024-10-29T15:24:01Z

@emfdavid okay, I'm testing the notebook in the repo now...

And it took a while (like 10 minutes), but it worked: https://nbviewer.org/gist/rsignell/64234851f1eafeab261ed8a774aea5ca

I used this conda environment file to create the environment to run it.

Anu-Ra-g · 2024-10-29T20:20:07Z

@rsignell In this notebook, I was able to reinflate the generated indexes but I forgot to look at the actual data in those indexes.

Anu-Ra-g · 2024-10-29T20:28:53Z

> I was able to run a slightly-modified version of the notebook using the latest xarray (2024.10.0) and your latest fixes to #523 !
>
> When I extracted some data values from the virtual dataset, however, they seem unlikely to be correct.
>
> @emfdavid, @Anu-Ra-g, or @martindurant, is there a simple alternate method by which we could check to see if the data values in the last cell are correct?
>
> Perhaps we should hold off merging until we decide whether the data extraction is working properly?

The data extraction is working properly as there are minimal code changes and refactors from the original code. But I used GEFS grib data for making that notebook. Maybe that could be a reason!

As per the original demonstrations by @emfdavid with the GFS data, the data extraction works fine.

If you want, I could change it from GEFS to some other model's data.

rsignell · 2024-10-29T20:47:56Z

I'd like to make sure it's actually working as expected on the GEFS data - we don't want to sidestep a potential bug, right? If those NaN values are correct, then yes, perhaps switching to GFS would make a more pleasing notebook.

emfdavid · 2024-10-30T02:37:47Z

@rsignell There is a whitespace error in the double for loop. The notebook is only processing the 18z forecast for each day, which is why there are three NaN values between each real value.

I also noticed there are a lot of duplicate attrs in that much older 2017 file. The GEFS model changed significantly around 2020-09-25. I think these are really issues with the NOAA Grib file and its compatibility with cfgrib.

But these UGRD variables all reading out on the same grib level=0.0 are going to be a problem.

I don't see this on the more recent GEFS data, but I have only really looked at the geavg files so far.
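The kind of indentation slip described at the top of this comment can be reproduced with a small sketch. The dates, cycle list, and function names below are hypothetical and are not copied from the notebook; they only show how a dedented line leaves just the last cycle (18z) of each day.

```python
days = ["2017-01-01", "2017-01-02"]
cycles = ["00", "06", "12", "18"]


def collect_buggy():
    selected = []
    for day in days:
        for cycle in cycles:
            key = f"{day}T{cycle}z"
        # Bug: dedented out of the inner loop, so only the last
        # cycle of each day is ever appended.
        selected.append(key)
    return selected


def collect_fixed():
    selected = []
    for day in days:
        for cycle in cycles:
            key = f"{day}T{cycle}z"
            # Inside the inner loop: every cycle is appended.
            selected.append(key)
    return selected


print(collect_buggy())
print(len(collect_fixed()))
```

With four cycles per day, the buggy version keeps one entry in four, which matches the pattern of three missing values between each real one.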

emfdavid · 2024-10-30T02:41:37Z

@martindurant I tried to fix the build issue, with ImportError "HybridMappingProxy" for just the 3.10 build but I think I made it worse with d40cca4
Any suggestions on what to do here? I think maybe conda doesn't have an xarray 2024.10 for python 3.10?

We can definitely rebase away the notebooks before merging - those are not to be checked in here like this.

rsignell · 2024-10-30T11:47:45Z

Whoa, great find on that white space indent error @emfdavid ! The notebook indeed works fine now: https://nbviewer.org/gist/rsignell/c3fd58368ed9d0ae50c26807b6a51678

martindurant · 2024-10-30T13:05:59Z

> maybe conda doesn't have an xarray 2024.10 for python 3.10

It does. Actually, xarray is noarch, but it has compiled deps. It does install into a fresh env, though.

'>2024.10.0'

should be

'>=2024.10.0'

?
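To see why the strict bound matters, the sketch below compares release numbers as integer tuples. The `parse_version` helper is a simplified, hypothetical stand-in for a real version parser and only handles plain dotted numeric versions like the one discussed here.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2024.10.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


pin = parse_version("2024.10.0")
installed = parse_version("2024.10.0")

# '>2024.10.0' excludes the 2024.10.0 release itself.
print(installed > pin)   # False

# '>=2024.10.0' accepts it.
print(installed >= pin)  # True
```

Since 2024.10.0 was described earlier in the thread as the latest xarray release, a strict `>` constraint would have had no release to satisfy it.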

martindurant · 2024-11-05T20:27:16Z

OK, I am happy to push this in as is - it's fine to include the example notebook, and we can always iterate later. OK, everyone?

emfdavid · 2024-11-06T17:48:57Z

Sure - your call on the notebook - it is a bit chunky to have in the git tree at ~12 MB...
Let me know and I can remove it or you can merge as is.

I will work on the tests shortly. There will be some moderate size test fixture files for that.
Does this structure look okay?

martindurant · 2024-11-06T20:23:32Z

Can we put in the un-evaluated notebook?

emfdavid · 2024-11-06T20:36:53Z

Pushed the change - can you squash merge?
If not, I can rebase the old commits away.

remaining corpus of idx mapping functionality

d89bb90

martindurant mentioned this pull request Oct 28, 2024

added the reinflate api #499

Closed

Update to use xarray 2024.10.0 which includes datatree

5323073

rsignell mentioned this pull request Oct 29, 2024

Were there any blog posts or final report written on this work? Anu-Ra-g/GSoC2024_Kerchunk#1

Open

David Stuebe added 2 commits October 29, 2024 21:56

Fix test env

d40cca4

Add second test notebook

062b0f9

David Stuebe added 2 commits October 30, 2024 10:12

Fix xarray version

514cf12

Fix grib test using builtin datatree

c9ac3ff

Clear notebook output

a40fde9

martindurant merged commit 79b7051 into fsspec:main Nov 7, 2024
5 checks passed

emfdavid deleted the more_grib_idx branch November 7, 2024 15:48

emfdavid mentioned this pull request Nov 27, 2024

Add tests for grib idx & reinflate #528

Merged

3 tasks
