ERROR: CUDA_tensor_histogram failed to get free global memory when using `nequip-train` #402

hmcezar · 2024-01-16T10:42:31Z

I'm trying to train a model on LUMI, which has AMD GPUs.

Starting from rocm/pytorch Docker images, I successfully created a container containing everything I need with:

bootstrap: docker
from: rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1

%post
    # Install software
    apt-get update
    apt-get install -y file g++ gcc gfortran make gdb strace wget ca-certificates git --no-install-recommends

    # Clone and install NequIP and Allegro
    pip install wandb
    git clone --depth 1 https://github.com/mir-group/nequip.git
    cd nequip
    sed -i 's/"torch>=1.10.0,<1.13,!=1.9.0",/"torch>=1.10.0",/g' setup.py
    pip install .
    cd ..
    rm -rf nequip

Since I used the rocm/pytorch image as the starting point, I'm pretty sure PyTorch is correctly installed (version 1.13.1 in this case, but I tried 2.0.1 as well).
I also tried the develop branch of NequIP, but I get the same error.

Using this container, I can run the minimal.yaml example on the GPU without a problem.
However, if I try to run the example.yaml example, I get:

Torch device: cuda
Downloading http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip
Processing dataset...
Loaded data: Batch(batch=[15000], cell=[1000, 3, 3], edge_cell_shift=[154352, 3], edge_index=[2, 154352], forces=[15000, 3], pbc=[1000, 3], pos=[15000, 3], ptr=[1001], total_energy=[1000, 1])
    processed data size: ~4.63 MB
Cached processed data to disk
Done!
Successfully loaded the data set of type NpzDataset(1000)...
/opt/conda/lib/python3.9/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "
Replace string dataset_forces_rms to 30.621034622192383
Replace string dataset_per_atom_total_energy_mean to -11319.556640625
Atomic outputs are scaled by: [H, C: 30.621035], shifted by [H, C: -11319.556641].
Replace string dataset_forces_rms to 30.621034622192383
Initially outputs are globally scaled by: 30.621034622192383, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 363096
Number of trainable weights: 363096
! Starting training ...
Traceback (most recent call last):
  File "/opt/conda/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/nequip/scripts/train.py", line 78, in main
    trainer.train()
  File "/opt/conda/lib/python3.9/site-packages/nequip/train/trainer.py", line 778, in train
    self.epoch_step()
  File "/opt/conda/lib/python3.9/site-packages/nequip/train/trainer.py", line 916, in epoch_step
    self.batch_step(
  File "/opt/conda/lib/python3.9/site-packages/nequip/train/trainer.py", line 855, in batch_step
    loss, loss_contrib = self.loss(pred=scaled_out, ref=_data_unscaled)
  File "/opt/conda/lib/python3.9/site-packages/nequip/train/loss.py", line 104, in __call__
    _loss = self.funcs[key](
  File "/opt/conda/lib/python3.9/site-packages/nequip/train/_loss.py", line 74, in __call__
    N = torch.bincount(ref[AtomicDataDict.BATCH_KEY])
RuntimeError: hipGetLastError() == hipSuccess INTERNAL ASSERT FAILED at "/var/lib/jenkins/pytorch/aten/src/ATen/native/hip/SummaryOps.hip":202, please report a bug to PyTorch. CUDA_tensor_histogram failed to get free global memory
srun: error: nid005018: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=5735026.0

On CPU, the exact same example runs.
Any idea why I'm getting this?

Thanks!

Linux-cpp-lisp · 2024-01-16T22:09:16Z

Maintainer

Hi @hmcezar ,

Thanks for your interest in our codes. This is a bit odd. The relevant difference would seem to be the larger default batch size in example.yaml (5 vs 1). Can you try example.yaml with batch_size: 1 and minimal.yaml with batch_size: 5 to check whether this is the cause?

(Also CCing some of those who might have insight or find this relevant: @anjohan @svandenhaute @johkl and linking mir-group/pair_allegro#23)
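
A complementary check, sketched here under assumptions rather than taken from the thread, is to call torch.bincount directly on a batch-index vector shaped like the `batch` field in the log above (15 atoms per toluene frame). If this fails on the GPU for both sizes, the problem is likely below NequIP, in PyTorch or the ROCm stack. The helper names `expected_counts` and `bincount_counts` are hypothetical.

```python
import importlib.util


def expected_counts(n_frames, atoms_per_frame):
    """Atoms-per-frame counts that bincount should return for a well-formed batch."""
    return [atoms_per_frame] * n_frames


def bincount_counts(device, n_frames, atoms_per_frame):
    """Build a batch-index vector and count atoms per frame on `device`."""
    import torch  # imported lazily so expected_counts works without PyTorch

    # e.g. n_frames=2, atoms_per_frame=3 -> [0, 0, 0, 1, 1, 1]
    batch = torch.arange(n_frames, device=device).repeat_interleave(atoms_per_frame)
    return torch.bincount(batch).cpu().tolist()


if __name__ == "__main__":
    if importlib.util.find_spec("torch") is None:
        print("PyTorch is not installed; nothing to test.")
    else:
        import torch

        # ROCm builds of PyTorch expose AMD GPUs under the "cuda" device name.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        for n_frames in (1, 5):  # batch sizes of minimal.yaml and example.yaml
            counts = bincount_counts(device, n_frames, atoms_per_frame=15)
            ok = counts == expected_counts(n_frames, 15)
            print(f"device={device} batch_size={n_frames} ok={ok}")
```

Running the same script once on a CPU-only node and once inside the container on a GPU node gives a direct comparison that does not depend on the NequIP configuration.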

3 replies

hmcezar Jan 17, 2024
Author

Thanks for your answer @Linux-cpp-lisp !

It turns out the problem was incompatible versions of ROCm on the host and in the container.
I reported it on the PyTorch repo and they helped me figure it out: pytorch/pytorch#117545

I'll run more tests and let you know if I have any problems, but thank you for your help so far!
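
For anyone hitting the same error, a rough sketch of how to compare the two ROCm versions follows. It reads the version PyTorch was built against from `torch.version.hip` (which is None on non-ROCm builds) and compares it with a host version supplied by the user. The environment variable `HOST_ROCM_VERSION` and the helper names are hypothetical, and treating equal major.minor as "matching" is an assumption. The thread does not state the exact compatibility rule.

```python
import importlib.util
import os
import re


def major_minor(version):
    """Extract (major, minor) from a string like '5.7.31921-d1770ee1b'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    return (int(match.group(1)), int(match.group(2))) if match else None


def versions_match(host, container):
    """Assumed rule: versions match when major and minor are both equal."""
    h, c = major_minor(host), major_minor(container)
    return h is not None and h == c


if __name__ == "__main__":
    # Hypothetical variable: set it by hand to the host's ROCm version.
    host = os.environ.get("HOST_ROCM_VERSION")
    container = None
    if importlib.util.find_spec("torch") is not None:
        import torch

        container = getattr(torch.version, "hip", None)
    print(f"host ROCm: {host}, PyTorch built against ROCm/HIP: {container}")
    if host and container:
        print("match" if versions_match(host, container) else "MISMATCH")
```

A mismatch reported here is only a hint of where to look. The linked PyTorch issue is the place to check for what was actually incompatible in this case.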

svandenhaute Jan 17, 2024

Just FYI, they recently added a rocm/5.6.1 module! I've found it difficult to work with versions <5.3 in the past because some names/locations of shared libraries have changed between 5.2 and 5.3, I think.

EDIT: lol nvm, you tried 5.6 as well. I would have also expected a 5.7 binary to be compatible with a 5.6 host!

Linux-cpp-lisp Jan 17, 2024
Maintainer

Ah good to know, thanks for documenting. So it works now with the right ROCm inside the container?
