Solutions to the exercises of the course Advanced High Performance Computing (2024). This repository focuses on implementing and optimizing distributed parallel algorithms for two key computational problems.
The repository contains implementations of:
- Jacobi Method - A parallel implementation of the iterative Jacobi method for solving systems of linear equations
- Matrix Multiplication - Distributed implementations of matrix multiplication algorithms
The Jacobi method is an iterative algorithm for determining the solutions of a diagonally dominant system of linear equations. In this assignment, we parallelize and optimize the algorithm using:
- MPI for distributed memory parallelism
- OpenMP for shared memory parallelism
- CUDA for GPU acceleration
- One-sided MPI communication as an optimization strategy
To compile the program:

```bash
cd Jacobi
bash jobs/compile.sh [cpu|gpu|oneside]
```

where `cpu`, `gpu`, or `oneside` selects the version to compile.
To run a scaling study:

```bash
bash jobs/scal.sh [MATRIX_SIZE] [ITERATIONS] [cpu|gpu|oneside]
```

Parameters:
- `MATRIX_SIZE`: size of the matrix (N×N)
- `ITERATIONS`: number of Jacobi iterations to perform
- `cpu|gpu|oneside`: implementation to use
This assignment implements and compares different approaches to distributed matrix multiplication:
- Naive implementation (basic distributed algorithm)
- CBLAS implementation (CPU-optimized, using a tuned BLAS library)
- CUBLAS implementation (GPU-accelerated using NVIDIA's linear algebra library)
All versions distribute computation across multiple nodes while optimizing for performance.
To compile:

```bash
cd Matrix_Multiplication
bash jobs/compile.sh
```

To run a scaling study:

```bash
bash jobs/scal.sh [MATRIX_SIZE] [cpu|gpu]
```

For the CPU implementation, specify an additional argument:

```bash
bash jobs/scal.sh [MATRIX_SIZE] cpu [0|1]
```

Parameters:
- `MATRIX_SIZE`: size of the matrices to multiply
- `cpu|gpu`: platform to use
- `0|1`: for CPU runs, selects the Naive (0) or CBLAS (1) implementation
- `Jacobi` - Jacobi method implementations (CPU, GPU, One-sided)
- `Matrix_Multiplication` - Matrix multiplication implementations
- `report` - Performance analysis and documentation
Both assignments include performance analysis with:
- Strong scaling measurements
- Communication vs. computation time breakdown
- Performance comparison across implementations
- Efficiency metrics at different scales
The code is designed to run on HPC clusters with:
- MPI implementation (for distributed computing)
- CUDA toolkit (for GPU implementations)
- BLAS libraries (for optimized CPU matrix operations)
Detailed performance analysis, scalability charts, and implementation explanations are available in the `report` directory.