This repository contains examples of parallel computing techniques for SLURM-based High-Performance Computing (HPC) servers, using PyTorch Distributed Data Parallel (DDP) and Ray. These examples demonstrate how to distribute work across nodes and GPUs to improve computational performance.
High-performance computing (HPC) allows complex calculations to be processed at high speed. This repository showcases parallel computing on SLURM-based HPC servers using two popular libraries: PyTorch Distributed Data Parallel (DDP) and Ray. All code is set up to run on the Cannon cluster.
- PyTorch DDP: Demonstrates distributed training of deep learning models using PyTorch.
- Ray: Showcases parallel task execution and distributed computing using Ray for hyperparameter tuning of PyTorch models.
To use the examples in this repository, follow these steps:
- Clone the repository:

  ```
  git clone https://github.com/nswood/HPC_Parallel_Computing.git
  cd HPC_Parallel_Computing
  ```
- Create a new Conda environment from `environment.yml` and activate it:

  ```
  conda env create -f environment.yml
  conda activate hpc_env
  ```
Navigate to the `DDP` directory and run `sbatch test.sh`. All modifications to the requested compute should be handled through the `test.sh` SLURM configuration file. Models are wrapped in `Trainer.py` to enable simple tracking of GPU allocations during parallelization. Extend `Trainer.py` to provide the necessary functionality for your model and to save any needed metrics during training or evaluation.
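For orientation, below is a minimal, generic sketch of the DDP pattern that a wrapper like `Trainer.py` builds on: each process launched on the allocated nodes picks up its assigned GPU and wraps the model so gradients are synchronized across processes. This is an illustration of the general technique, not the repository's actual `Trainer.py` implementation.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model_for_ddp(model: torch.nn.Module) -> DDP:
    # LOCAL_RANK is set by the launcher (e.g. torchrun) and identifies
    # which GPU on the node belongs to this process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the standard backend for multi-GPU training.
    dist.init_process_group(backend="nccl")

    # Move the model to its assigned GPU and wrap it so gradients are
    # averaged across all processes during backward().
    model = model.to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # Simple visibility into which GPU each rank received.
    print(f"global rank {dist.get_rank()} -> cuda:{local_rank}")
    return ddp_model
```

Logging each rank's assigned device in this way is the sort of GPU-allocation tracking that `Trainer.py` is meant to simplify.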
Navigate to the `Ray` directory and run `sbatch ray_slurm_template.sh`. All modifications to the requested compute should be handled through the `ray_slurm_template.sh` SLURM configuration file. Ray is not natively compatible with SLURM-based servers, so Ray instances must be started manually on each compute node allocated through SLURM. Each Ray instance should be allocated all of the requested GPUs and CPUs. Allocations for individual training runs are handled through Ray, while the total compute allocation is handled through SLURM.
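As a quick sanity check (a minimal sketch, assuming the SLURM script has already started the Ray head and worker processes on the allocated nodes), you can connect to the running cluster from Python and confirm that Ray sees the full SLURM allocation:

```python
import ray

# Connect to the Ray cluster that the SLURM script already started on the
# allocated nodes; "auto" works when this script runs inside that cluster.
ray.init(address="auto")

# Every CPU and GPU requested through SLURM should show up here. If the
# counts come up short, the per-node `ray start` commands are not receiving
# the full allocation.
print(ray.cluster_resources())
```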
To hyperparameter-tune your own model, modify the `ray_hyp_tune.py` file to incorporate your custom structure.
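For orientation, the sketch below shows the general Ray Train/Tune structure that a script like `ray_hyp_tune.py` follows when tuning a PyTorch training loop, written against the Ray 2.x API. The training function, hyperparameters, worker counts, and paths are illustrative placeholders rather than the repository's actual code:

```python
import ray
from ray import train, tune
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# Connect to the Ray cluster that the SLURM script started.
ray.init(address="auto")


def train_loop_per_worker(config):
    # Placeholder training loop: build your real model/optimizer from
    # `config` here and report metrics back to Tune each epoch.
    for epoch in range(config["epochs"]):
        loss = config["lr"] / (epoch + 1)  # stand-in for a real loss value
        train.report({"loss": loss})


# Per-trial resources are requested from Ray; the total pool is whatever
# SLURM allocated to the Ray cluster.
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    run_config=RunConfig(
        storage_path="/path/to/ray_results",  # placeholder storage_path
        name="hyp_tune_demo",
    ),
)

tuner = tune.Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "lr": tune.loguniform(1e-4, 1e-1),
            "epochs": 5,
        }
    },
    tune_config=tune.TuneConfig(num_samples=4, metric="loss", mode="min"),
)
results = tuner.fit()
```

Each trial's output lands under the `storage_path`, which is also where the `TorchTrainer_...` directory used for TensorBoard monitoring (see below) is written.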
There are many ways to monitor Ray processes through a user interface. The Ray documentation suggests using Prometheus to scrape metrics from the logs and Grafana to display the results. However, these tools can be difficult to use on SLURM-based servers, especially ones that require 2FA for SSH, such as Cannon.
The solution is to use TensorBoard's UI to track Ray. TensorBoard both scrapes data from the log directory and displays a UI showing current progress. Once the Ray servers are initialized, you can track progress with TensorBoard. In my experience, Ray will print out a command with an incorrect path:
To visualize your results with TensorBoard, run: `tensorboard --logdir {Insert path to log file}`
All you have to do is insert the path to the output for the Ray instance as the `logdir`. This path should point to a folder named `TorchTrainer_{some time/date info}` inside the `storage_path` you supplied in the `RunConfig`. TensorBoard will serve its UI on a port on the server (assumed to be `6007` here). You must forward this port to your local machine using a command such as:

```
ssh -L 6007:localhost:6007 your-username@your-server
```

You can then access the TensorBoard UI for your Ray process at `http://localhost:6007`.
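Because the exact `TorchTrainer_...` folder name includes a timestamp, a small helper like the one below can locate the most recent run directory under your `storage_path`. This is a hypothetical convenience script, not part of the repository; the `storage_path` value is a placeholder you would replace with your own.

```python
from pathlib import Path

# Hypothetical helper: find the newest TorchTrainer_* run directory under
# the storage_path that was supplied to RunConfig.
storage_path = Path("/path/to/ray_results")  # placeholder: your RunConfig storage_path

runs = sorted(
    storage_path.glob("**/TorchTrainer_*"),
    key=lambda p: p.stat().st_mtime,
)
if runs:
    print(f"Point tensorboard --logdir at: {runs[-1]}")
else:
    print(f"No TorchTrainer_* directories found under {storage_path}")
```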