training runs on only one of requested multiple GPUs #419

beidouamg · 2024-04-13T14:53:15Z

beidouamg
Apr 13, 2024

Hi NequIP developers,

Thanks for developing this code, it helps a lot and gives better machine learning potential than others.
I have some troubles with training (maybe it will turn out to be a noob question), the essential problem is nequip-train use always only one of GPUs in the requested GPU nodes (one GPU node has 4 GPUs). Some details are below:

To use NequIP, I installed PyTorch 1.11.0 and cudatoolkit 11.3 with Python 3.10.10; nequip-train works correctly on one GPU (I requested only 1 GPU) when I run the tutorial.
Then I run for my own work on NERSC Perlmutter GPU nodes. Each GPU node contains a single socket with an AMD EPYC 7763 processor and 4 NVIDIA A100 GPUs. An EPYC 7763 processor has 64 physical cores with 2 hardware threads (logical CPUs) per core. I requested 2 GPU nodes and 8 GPUs should be available; However, from wandb, I found only 1 CPU is used when preprocessing the extxyz structure file and only 1 GPU is used when training, and only 4 GPUs are shown. I guess this problem results in slow training, 71 epochs for 24h with ~30000 structures (318M in size).
It is not bad outcome because it gives good energy/force errors after two days training.
At last, I tried to train with ~100000 structures (2.5G in size), but it always fails in preprocessing the extxyz file, I found only 1 CPU is used and CPU utilization is almost 0. I found there are discussions about training with very large extxyz, but it would be great if i can figure out the root cause of low CPU/GPU utilization, before trying solutions in those discussions.
I pasted my jobscript here in case this problem related to the machines settings:
#!/bin/bash #SBATCH --constraint=gpu #SBATCH --exclusive #SBATCH --qos=debug #SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 #SBATCH --cpus-per-task=32 #SBATCH --gpus=8 #SBATCH --time=00:30:00 export OMP_PLACES=threads export OMP_PROC_BIND=spread export SLURM_CPU_BIND="cores" #export NEQUIP_NUM_TASKS=8 module load cudatoolkit/11.7 conda activate nequip nequip-train full.yaml

I have tried to change setting in the jobscript, but it doesn't work. It would be great if you can help to figure out the causes to make full use of available CPU/GPU in the node. Please let me know if you need more info.
Thanks for your time on this problem!

Linux-cpp-lisp · 2024-04-19T16:43:18Z

Linux-cpp-lisp
Apr 19, 2024
Maintainer

Hi @beidouamg ,

Glad to hear!

Regarding multi-GPU training, this is not yet fully supported. You can see discussion of this here: #210.

The dataset preprocessing should, however, be parallelized over CPU cores: you can control this manually with the NEQUIP_NUM_TASKS environment variable.

At present, nequip-train can only run on one MPI task, potentially with multiple threads/cores/CPUs. So --nodes 2 and likely also --ntasks-per-node=4 (instead of 1) are likely inappropriate.

Regarding a very large dataset, you can pre-pre-process your data on a CPU node by running nequip-benchmark myyaml.yaml, potentially running a job with more CPU cores / memory available. The dataset will then be cached at the root and can be used by later nequip-train runs in separate jobs.

2 replies

beidouamg Apr 24, 2024
Author

Hi @Linux-cpp-lisp ,

Thanks a lot for your suggestion. I have got nequip started successfully.
However, a new issue raises, the training runs slowly, ~1.5h to finish 1 epoch due to the large dataset.
At this speed, I think it will take a long long time before I get a good MLP.

One possible workaround is mentioned here in Issues 210. I would like to try the ddp branch to utilize multiple-GPU feature. At the same time, I will keep single-GPU train running to see if there will be any difference.

I would listen to your suggestion if you have better solutions. Thanks!

Linux-cpp-lisp May 1, 2024
Maintainer

You can also try a smaller model (fewer features), which may be sufficient for your application and yield a speed-up, but this depends of course on many factors.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

training runs on only one of requested multiple GPUs #419

{{title}}

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

training runs on only one of requested multiple GPUs #419

beidouamg Apr 13, 2024

Replies: 1 comment · 2 replies

Linux-cpp-lisp Apr 19, 2024 Maintainer

beidouamg Apr 24, 2024 Author

Linux-cpp-lisp May 1, 2024 Maintainer

beidouamg
Apr 13, 2024

Replies: 1 comment 2 replies

Linux-cpp-lisp
Apr 19, 2024
Maintainer

beidouamg Apr 24, 2024
Author

Linux-cpp-lisp May 1, 2024
Maintainer