Replies: 4 comments 7 replies
-
I should also add that I am using the stress ModelBuilder, if that is an issue in this case.
-
@tgmaxson can you post the actual OOM error? You may not be running OOM on model evaluation but elsewhere…
-
For Allegro, due to its locality, this is possible, but it is much more complicated and not something we support. I've been writing some helpers for unit tests that you might be able to adapt to do this with CPU offloading, but it's unfortunately not a priority on my end… let me know if you'd want to try.
Our approach to multi-GPU is data-parallel, so if a single example won't fit on one GPU, it will unfortunately not help. Practically, and for the moment, have you checked whether you can get away with a smaller …
-
OK, I found it @tgmaxson !! 🎉 🎉 It's … Check out the Caveats:
(If you really feel like it, or have issues, you could benchmark a comparison with …)
-
I have a 512-atom system that runs out of memory even with a batch size of 1 on A100 GPUs. Is there any way to split the structure into multiple parts for evaluation, so that I effectively do a 0.5 batch? My understanding is that the NN is evaluated with respect to each atom rather than the structure as a whole, so this seems possible in theory, but maybe it is not implemented? Is there anything else I can do in this case other than evaluating on the CPU?
Solutions involving multiple GPUs could work as well, but I think that is not available right now.
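The per-atom-evaluation idea above can be sketched for any strictly local (finite-cutoff) model: partition the atoms into chunks, carry along a cutoff-width buffer of neighbors for each chunk, and sum per-atom energies only over the owned atoms. This is a minimal toy sketch (a hypothetical pairwise `peratom_energy`, not Allegro's actual architecture or API) showing that chunked evaluation reproduces the full-system energy:

```python
import numpy as np

def peratom_energy(i, pos, cutoff=3.0):
    # Toy local model: the energy of atom i depends only on
    # neighbors strictly within the cutoff radius.
    d = np.linalg.norm(pos - pos[i], axis=1)
    mask = (d > 0.0) & (d < cutoff)
    return float(np.sum(np.exp(-d[mask])))

def total_energy(pos, cutoff=3.0):
    # Full-system evaluation: sum of per-atom energies.
    return sum(peratom_energy(i, pos, cutoff) for i in range(len(pos)))

def chunked_energy(pos, n_chunks=4, cutoff=3.0):
    # Evaluate the structure in pieces. Each pass only needs the chunk's
    # "owned" atoms plus a buffer of atoms within the cutoff, so each pass
    # could in principle run with a smaller device memory footprint.
    total = 0.0
    for owned in np.array_split(np.arange(len(pos)), n_chunks):
        # Buffer: keep every atom within the cutoff of an owned atom so
        # that owned atoms see their complete neighbor environments.
        dists = np.linalg.norm(pos[:, None] - pos[owned][None], axis=2)
        keep = np.where(dists.min(axis=1) < cutoff)[0]
        sub = pos[keep]
        # Map global indices into the subsystem, evaluate owned atoms only.
        local = {g: l for l, g in enumerate(keep)}
        total += sum(peratom_energy(local[i], sub, cutoff) for i in owned)
    return total
```

Forces would need the same care at the chunk boundaries (contributions to buffer atoms must be scattered back), which is part of why this is more complicated in practice than it looks for energies alone.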