- Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs
- Hardware Troubleshooting - what to do when one runs into hardware problems
- NCCL Debug and Performance - notes for debugging NCCL-based software and tuning it up for peak performance (a minimal logging setup is sketched after this list)
- torch-distributed-gpu-test.py - this is a `torch.distributed` diagnostics script that checks that all GPUs in the cluster (one or many nodes) can talk to each other and allocate GPU memory (see the sketch after this list)
- NicerTrace - this is an improved `trace` Python module with multiple additional flags added to the constructor and more useful output (a baseline `trace` example follows below)
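As a starting point for the NCCL debugging notes, here is a minimal sketch of turning on NCCL's own logging from Python. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; setting them inside the script rather than in the launcher is just one illustrative choice, and they must be set before NCCL is initialized.

```python
# A minimal sketch: enable NCCL debug logging before the process group
# (and thus NCCL) is initialized.
import os

os.environ["NCCL_DEBUG"] = "INFO"             # log NCCL init, topology, and errors
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # limit logging to chosen subsystems

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL now emits INFO-level logs to stderr
```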
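The real diagnostics script is torch-distributed-gpu-test.py above; the sketch below shows the same idea in miniature, assuming a launch via `torchrun` (so `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are set by the launcher). The actual script does more than this.

```python
# Minimal sketch of a torch.distributed connectivity check, assuming launch via
# torchrun, e.g.: torchrun --nproc_per_node=8 check.py
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)        # bind this process to its GPU

dist.init_process_group(backend="nccl")  # rendezvous across all ranks

# Allocate GPU memory and run a collective: if any rank cannot talk to the
# others, all_reduce will hang or raise instead of completing.
x = torch.ones(1, device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)
assert x.item() == dist.get_world_size()

dist.barrier()                           # make sure every rank got here
if dist.get_rank() == 0:
    print(f"OK: {dist.get_world_size()} GPUs can all-reduce and allocate memory")

dist.destroy_process_group()
```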
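For context on what NicerTrace improves upon, this is the stock standard-library `trace` module it extends. The `trace.Trace` constructor and its `trace`/`count` flags are stdlib API; NicerTrace's exact additional flags are best checked in its own source.

```python
# Baseline usage of the stdlib `trace` module that NicerTrace builds on.
import trace

def work():
    total = 0
    for i in range(3):
        total += i
    return total

tracer = trace.Trace(trace=1, count=0)  # print each executed line, skip counting
tracer.run("work()")                    # trace the call; output goes to stdout
```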