
the way to calculate time_means in script get_stats.py is wrong #3

Open
veya2ztn opened this issue Aug 16, 2022 · 3 comments

Comments

@veya2ztn

Please see: https://github.com/NVlabs/FourCastNet/blob/master/data_process/get_stats.py

```python
time_means = np.zeros((1, 21, 721, 1440))   # <-- initialized to zeros...

for ii, year in enumerate(years):

    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/' + str(year) + '.h5', 'r') as f:

        rnd_idx = np.random.randint(0, 1460-500)
        global_means += np.mean(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis=(0, 2, 3))
        global_stds += np.var(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis=(0, 2, 3))

global_means = global_means/len(years)
global_stds = np.sqrt(global_stds/len(years))
time_means = time_means/len(years)          # <-- ...but never updated in the loop
```

Following this script, time_means stays constant zero.
What is the correct definition of this value?
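
For reference, here is a minimal sketch of one way `time_means` could be accumulated so that it ends up non-zero. The paths, shapes, and year range follow the snippet above; whether this matches the intended definition is exactly the open question:

```python
import h5py
import numpy as np

# A minimal sketch (my guess, not the authors' code): accumulate each
# year's time average and divide by the number of years, so time_means
# ends up non-zero.
years = range(1979, 2016)                  # hypothetical training years
time_means = np.zeros((1, 21, 721, 1440))

for year in years:
    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/' + str(year) + '.h5', 'r') as f:
        ds = f['fields']                   # shape (1460, 21, 721, 1440)
        year_sum = np.zeros((1, 21, 721, 1440))
        for t0 in range(0, ds.shape[0], 100):   # stream in chunks to bound memory
            year_sum += ds[t0:t0 + 100].sum(axis=0, keepdims=True)
        time_means += year_sum / ds.shape[0]

time_means = time_means / len(years)       # mean over all training samples
```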

BTW, may I know how you calculated the time_means_daily.h5 file?
From its size (127G) I can only guess it is a $(1460, 21, 720, 1440)$ tensor.
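
If that guess is right, one plausible construction (purely my assumption, not confirmed by the authors) is a per-time-step climatology: average the field at each of the 1460 six-hourly steps across all training years. I use the files' native 721 latitudes here; the 720 in the size guess may just reflect cropping:

```python
import h5py
import numpy as np

# Hypothetical reconstruction (not confirmed by the authors): for each of
# the 1460 six-hourly steps in a year, average that step across all
# training years. Note this array is ~127 GB in float32, so in practice
# it would need a memmap or out-of-core accumulation.
years = range(1979, 2016)                  # hypothetical training years
daily_means = np.zeros((1460, 21, 721, 1440), dtype=np.float32)

for year in years:
    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/' + str(year) + '.h5', 'r') as f:
        for t0 in range(0, 1460, 100):     # stream in chunks to bound memory
            daily_means[t0:t0 + 100] += f['fields'][t0:t0 + 100]

daily_means /= len(years)
```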

@YueZhou-oh

Hey, do the training and test .h5 files, e.g. train/2015.h5, have a similar data shape (4D data)?

@phrasenmaeher

I am also wondering about that. Did you find any solution so far?
In their paper they write

we use a time-averaged climatology in this work, motivated by [Rasp et al., 2020]

which is defined in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2020MS002405 just above Eq. A1, so that seems to be the correct way 🤷🏼
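
In that paper the climatology is simply the mean of each variable over all times in the training period at every grid point. A small sketch of that definition (the function and names are mine, not from either repo):

```python
import numpy as np

def climatology(fields):
    """Time-averaged climatology in the sense of Rasp et al., 2020:
    the mean over all training time steps at every grid point.

    fields: array of shape (time, channels, lat, lon)
    returns: array of shape (1, channels, lat, lon)
    """
    return fields.mean(axis=0, keepdims=True)

# Anomalies relative to the climatology, e.g. for the ACC metric:
# anomaly = fields - climatology(training_fields)
```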

@phrasenmaeher

phrasenmaeher commented Sep 8, 2023

Digging further into this, I found in the appendix this description:

long-term-mean-subtracted value of predicted (/true) variable $v$ at the location denoted by the grid co-ordinates $(m, n)$ at the forecast time-step $l$. The long-term mean of a variable is simply the mean value of that variable over a large number of historical samples in the training dataset. The long-term-mean-subtracted variables $\tilde{X}_{\mathrm{pred/true}}$ represent the anomalies of those variables that are not captured by the long-term mean values

which reads as: we subtract their mean from the variables (which we do during data loading), and the mean is correctly computed over the long term (in get_stats.py).

Edit: However, the variables are also scaled by their standard deviation, so it is not only the mean that is removed.
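
So, schematically, the data-loading transform would be a z-score normalization. This sketch assumes the per-channel statistics files written by get_stats.py; it is not the repo's actual loader code:

```python
import numpy as np

# Per-channel statistics produced by get_stats.py, shape (1, 21, 1, 1).
# File names are an assumption; adjust to your local paths.
global_means = np.load('global_means.npy')
global_stds = np.load('global_stds.npy')

def normalize(x):
    # Remove the mean AND divide by the standard deviation, matching the
    # edit above: it is not only the mean that is removed.
    return (x - global_means) / global_stds

def denormalize(x):
    return x * global_stds + global_means
```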
