The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.

- Fault tolerance
  - Detects hung ranks.
  - Restarts training in-job, without reallocating SLURM nodes.

- In-process restart
  - Detects failures and enables quick recovery.

- Asynchronous checkpointing
  - Provides an efficient framework for asynchronous checkpointing.

- Local checkpointing
  - Provides an efficient framework for local checkpointing.

- Straggler detection
  - Monitors GPU and CPU performance of ranks.
  - Identifies slower ranks that may impede overall training efficiency.

- PyTorch Lightning integration
  - Facilitates seamless NVRx integration with PyTorch Lightning.

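
To make the idea of "identifying slower ranks" concrete, here is a minimal, library-independent sketch. It is not NVRx's API or its actual detection method: the function name `find_stragglers`, the `threshold` parameter, and the median-based rule are all assumptions chosen for illustration.

```python
from statistics import median


def find_stragglers(step_times, threshold=1.5):
    """Return the ranks whose step time exceeds `threshold` times the median.

    `step_times` maps rank -> seconds per training step. Both the name and
    the median-based rule are hypothetical, not taken from NVRx.
    """
    if not step_times:
        return []
    typical = median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > threshold * typical)


# Hypothetical per-rank timings: rank 3 is roughly twice as slow as the rest.
timings = {0: 1.02, 1: 0.98, 2: 1.01, 3: 1.95}
print(find_stragglers(timings))  # prints [3]
```

Comparing against the median rather than the mean keeps a single very slow rank from shifting the baseline it is measured against.
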
To install from source:

    git clone https://github.com/NVIDIA/nvidia-resiliency-ext
    cd nvidia-resiliency-ext
    pip install .

To install from PyPI:

    pip install nvidia-resiliency-ext


| Category | Supported Versions / Requirements |
|---|---|
| Architecture | x86_64, arm64 |
| Operating System | Ubuntu 22.04, 24.04 |
| Python Version | >= 3.10, < 3.13 |
| PyTorch Version | >= 2.3.1 (injob & chkpt), 2.5.1 & 2.6.0 (inprocess) |
| CUDA & CUDA Toolkit | >= 12.5 (12.8 required for GPU health check) |
| NVML Driver | >= 535 (570 required for GPU health check) |
| NCCL Version | >= 2.21.5 (injob & chkpt), >= 2.21.5 and <= 2.22.3 or 2.26.2 (inprocess) |

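
Some of the requirements above can be checked before installing. The following is a rough sketch using only the Python standard library. It covers just the architecture, Python, and minimum PyTorch (injob & chkpt) rows. The helper names are hypothetical, and treating `aarch64` as arm64 is an assumption about how the platform reports itself.

```python
import platform
import re
import sys
from importlib.metadata import PackageNotFoundError, version

# Linux usually reports arm64 as "aarch64"; accepting both names is an assumption.
SUPPORTED_MACHINES = {"x86_64", "aarch64", "arm64"}


def version_tuple(text):
    """Turn '2.3.1+cu121' into (2, 3, 1), dropping local and pre-release suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def python_supported(major_minor=None):
    """Check the table's Python range: >= 3.10 and < 3.13."""
    major_minor = tuple(major_minor or sys.version_info[:2])
    return (3, 10) <= major_minor < (3, 13)


def torch_meets_minimum(torch_version):
    """Check the table's PyTorch minimum for injob & chkpt: >= 2.3.1."""
    return version_tuple(torch_version) >= (2, 3, 1)


if __name__ == "__main__":
    print("architecture ok:", platform.machine() in SUPPORTED_MACHINES)
    print("python ok:", python_supported())
    try:
        # Read the installed version from package metadata without importing torch.
        print("torch ok (injob & chkpt):", torch_meets_minimum(version("torch")))
    except PackageNotFoundError:
        print("torch is not installed")
```

The CUDA, driver, and NCCL rows are left out because checking them depends on the local GPU stack.
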
For detailed documentation and usage information about each component, see the NVRx documentation at https://nvidia.github.io/nvidia-resiliency-ext/.