
25 normalize data on gpu #39

Open · wants to merge 6 commits into main

Conversation

@sadamov (Collaborator) commented May 25, 2024

Summary

This PR introduces on_after_batch_transfer logic so that data is normalized on the GPU after transfer, rather than on the CPU beforehand.

Rationale

Normalization is faster on the GPU than on the CPU. In the current code, the data is normalized in the PyTorch dataset class's __getitem__ method, which can slow down training, especially on systems with fewer CPU cores.

Changes

  • Introduces on_after_batch_transfer in the ar_model.py script (a minimal sketch follows this list)
  • Removes everything related to "standardization" from the remaining scripts
  • Adapts the create_parameter_weights.py script to work with the new changes (no longer reloading a standardized dataset)
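
For illustration, a minimal sketch of what such a hook can look like, assuming the batch is the (init_states, target_states, forcing_features) tuple produced by WeatherDataset and that the statistics are registered as buffers. This is a simplified sketch, not the exact ar_model.py code:

```python
import pytorch_lightning as pl


class ARModelSketch(pl.LightningModule):
    """Simplified sketch of GPU-side normalization via a Lightning hook."""

    def __init__(self, data_mean, data_std, flux_mean, flux_std):
        super().__init__()
        # Buffers follow the model to whatever device the batch lives on
        self.register_buffer("data_mean", data_mean)
        self.register_buffer("data_std", data_std)
        self.register_buffer("flux_mean", flux_mean)
        self.register_buffer("flux_std", flux_std)

    def on_after_batch_transfer(self, batch, dataloader_idx):
        # Called after the batch has been moved to the GPU, so the
        # normalization below runs on the GPU instead of in __getitem__
        init_states, target_states, forcing_features = batch
        init_states = (init_states - self.data_mean) / self.data_std
        target_states = (target_states - self.data_mean) / self.data_std
        forcing_features = (forcing_features - self.flux_mean) / self.flux_std
        return init_states, target_states, forcing_features
```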

Testing

Both training and evaluation were successful. The training loss of 3.230 on the meps_example is identical to before the changes. The create_parameter_weights script was executed and successfully generated the stats.

Not-In-Scope

The normalization stats and other static features will all become zarr archives in the future, with their paths defined in the data_config.yaml file.

@sadamov sadamov added the enhancement New feature or request label May 25, 2024
@sadamov sadamov self-assigned this May 25, 2024
@sadamov sadamov linked an issue May 25, 2024 that may be closed by this pull request
@sadamov sadamov mentioned this pull request May 25, 2024
@leifdenby (Member)

This is looking good @sadamov! Can I suggest we merge #38 first, then you merge that into your branch, and then I do a review? That way we can ensure everything keeps working :)

@joeloskarsson (Collaborator) left a comment

This is a lot cleaner, and it is also good to move computations from the CPU to the GPU. There is a question mark about the forcing standardization that should be double-checked. In the future we might want to handle standardization the same way for all forcing variables, or have a more robust way to control this (i.e., which forcing should be standardized or not), but let's leave that until then.

init_states, target_states, forcing_features = batch
init_states = (init_states - self.data_mean) / self.data_std
target_states = (target_states - self.data_mean) / self.data_std
forcing_features = (forcing_features - self.flux_mean) / self.flux_std
Collaborator (inline comment on the lines above):

Now all forcing seems to be normalized with the flux statistics, but there is more forcing than just the fluxes. Note how this was only applied to the flux in WeatherDataset before.
Have you tested that this gives exactly the same tensors as before? (e.g. save the first batch to disk on main, check out this branch, save the first batch, and compare).

@sadamov (Collaborator, Author) replied Jun 9, 2024:

True, the forcings are now handled differently. I suggest implementing new logic to handle forcings in #54: the user can define combined_vars that share statistics and also define vars that should not be normalized.
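
Purely as an illustration of that idea (the key and variable names below are hypothetical placeholders; the actual design is to be settled in #54), such a configuration could look roughly like:

```python
# Hypothetical sketch only -- keys and variable names are placeholders,
# not the actual neural-lam configuration format.
forcing_normalization = {
    # groups of variables that share one set of statistics
    "combined_vars": [
        ["flux_var_a", "flux_var_b"],
    ],
    # variables that should be passed through without normalization
    "skip_normalization": ["forcing_var_c", "forcing_var_d"],
}
```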

neural_lam/models/ar_model.py (outdated review thread, resolved)
@sadamov (Collaborator, Author) commented Jun 9, 2024

I merged the latest updates from main @leifdenby and commented on your suggestions @joeloskarsson. In the following I want to show that the output tensor was not affected by this change. As you suggested, I stored the last batch for both GPU and CPU normalization with deterministic=True and a fixed seed:

[screenshot of the saved batch files not reproduced]

Then I compared the tensors like this:

[screenshot from 2024-06-07 not reproduced]
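
Since the screenshot is not reproduced here, a minimal sketch of that kind of comparison, assuming each batch was saved with torch.save as an (init_states, target_states, forcing_features) tuple (file names are illustrative):

```python
import torch

# Batches saved from the two code paths (paths are illustrative)
batch_cpu = torch.load("batch_cpu_normalized.pt")
batch_gpu = torch.load("batch_gpu_normalized.pt")

for name, t_cpu, t_gpu in zip(
    ["init_states", "target_states", "forcing_features"], batch_cpu, batch_gpu
):
    # Move the GPU-normalized tensor to CPU before comparing
    print(name, torch.allclose(t_cpu, t_gpu.cpu(), atol=1e-6))
```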

So, except for the forcing tensor, which is handled differently, the two approaches produce identical output.

@joeloskarsson (Collaborator)

How do you think we should progress with this @sadamov? If the forcing is handled differently in #54, would it make more sense to try to merge this after that? (I guess baking this change into #54 would just make it even bigger.) Or should we merge this first so that #54 can build on it and be adapted to use it?

I would not be happy to merge this without a fix so that the forcing tensors also match the previous implementation. But we could just do a quick fix for now so that standardization is only applied to the flux dimensions of the forcing tensor?
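
For reference, such a quick fix could look roughly like the sketch below, assuming the flux variables occupy a known slice of the last (feature) dimension of the forcing tensor; the slice used here is purely illustrative:

```python
import torch


def standardize_flux_only(forcing, flux_mean, flux_std, flux_slice=slice(0, 1)):
    """Standardize only the flux channels of the forcing tensor.

    flux_slice is illustrative; the real indices depend on how the
    forcing features are ordered in WeatherDataset.
    """
    forcing = forcing.clone()  # avoid modifying the input in place
    forcing[..., flux_slice] = (
        forcing[..., flux_slice] - flux_mean
    ) / flux_std
    return forcing
```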

@sadamov (Collaborator, Author) commented Jun 12, 2024

I propose merging #54 first and leaving this open until then. We now know that on_after_batch_transfer works as expected, and we can implement this PR shortly after.

@joeloskarsson (Collaborator)

As we look at this after #66, we should think about more options for rescaling the different variables. In #66 all variables (including state/forcing/static) are standardized. There are benefits to allowing for all three of:

  1. Standardize to $\mu=0, \sigma=1$
  2. Normalize to $[0,1]$
  3. Perform no rescaling

for different variables. We do, however, need a way to specify what should be used for each variable, as well as to compute the needed data statistics. Some of this might have to be done in mllam_data_prep, but at least the final computation should be in scope for this PR, as it is what the code has to do on the GPU.
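
As a rough illustration of the GPU-side part of that (the mode names, argument layout, and stats format below are assumptions, not the actual design):

```python
import torch


def rescale_variables(tensor, modes, stats):
    """Rescale each variable (last dimension) according to its own mode.

    modes: one of "standardize" | "normalize" | "none" per variable.
    stats: dict of per-variable "mean", "std", "min", "max" tensors.
    All names here are illustrative, not an existing neural-lam API.
    """
    out = tensor.clone()
    for i, mode in enumerate(modes):
        if mode == "standardize":  # mu = 0, sigma = 1
            out[..., i] = (out[..., i] - stats["mean"][i]) / stats["std"][i]
        elif mode == "normalize":  # rescale to [0, 1]
            out[..., i] = (out[..., i] - stats["min"][i]) / (
                stats["max"][i] - stats["min"][i]
            )
        elif mode != "none":  # "none" leaves the variable untouched
            raise ValueError(f"Unknown rescaling mode: {mode}")
    return out
```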

@joeloskarsson joeloskarsson added this to the v0.5.0 milestone Nov 20, 2024
Labels: enhancement (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Normalize Data on GPU
3 participants