NVIDIA Resiliency Extension is a Python package that framework developers and users can use to implement fault-tolerant features. It increases effective training time by minimizing downtime caused by failures and interruptions.

NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.

Figure highlighting core NVRx features including automatic restart, hierarchical checkpointing, fault detection and health checks

Core Components and Capabilities

Installation

From sources

  • git clone https://github.com/NVIDIA/nvidia-resiliency-ext
  • cd nvidia-resiliency-ext
  • pip install .

From PyPI wheel

  • pip install nvidia-resiliency-ext
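
After either install method, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. The following is a minimal sketch using only the Python standard library. The helper name installed_version is hypothetical and is not part of NVRx. It assumes the distribution is registered under the PyPI name shown above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="nvidia-resiliency-ext"):
    """Return the installed version string of a distribution, or None.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import or exercise the package itself.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None
```

Calling installed_version() returns a version string when the install succeeded and None otherwise.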

Platform Support

Category              Supported Versions / Requirements
Architecture          x86_64, arm64
Operating System      Ubuntu 22.04, 24.04
Python Version        >= 3.10, < 3.13
PyTorch Version       >= 2.3.1 (injob & chkpt), 2.5.1 & 2.6.0 (inprocess)
CUDA & CUDA Toolkit   >= 12.5 (12.8 required for GPU health check)
NVML Driver           >= 535 (570 required for GPU health check)
NCCL Version          >= 2.21.5 (injob & chkpt), >= 2.21.5 and <= 2.22.3 or 2.26.2 (inprocess)
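
The Python and architecture rows above can be checked before installing. The following is a minimal sketch using only the standard library. The function and constant names are hypothetical and not part of NVRx. Accepting "aarch64" as an alias for arm64 is an assumption, based on how Linux typically reports that architecture. The PyTorch, CUDA, driver, and NCCL rows are not checked here.

```python
import platform
import sys

# Values taken from the Platform Support table; "aarch64" is an assumed
# alias for arm64 as commonly reported by platform.machine() on Linux.
SUPPORTED_ARCHS = {"x86_64", "arm64", "aarch64"}
MIN_PYTHON = (3, 10)            # inclusive lower bound
MAX_PYTHON_EXCLUSIVE = (3, 13)  # exclusive upper bound


def python_supported(version=None):
    """Return True if (major, minor) falls in the range >= 3.10, < 3.13."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


def arch_supported(machine=None):
    """Return True if the machine string names a listed architecture."""
    return (machine or platform.machine()).lower() in SUPPORTED_ARCHS
```

Called with no arguments, both functions inspect the running interpreter and host. Passing explicit values, such as python_supported((3, 12)), makes it possible to check a target environment from elsewhere.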

Usage

For detailed documentation and usage information about each component, see the documentation at https://nvidia.github.io/nvidia-resiliency-ext/.
