Optimizing Attention Layers for efficient Transformer implementation

The repository implements the following components of a transformer layer:

  1. A tensor module that handles matrix processing on the CPU as well as the GPU.
  2. A comprehensive library of activation and processing functions.
  3. Notes on the various optimizations.
  4. A Makefile for building the naive implementation, the AVX-optimized build, and the CUDA implementation.

Running the project:

  1. Clone the repository.
  2. Make sure you have CUDA installed.
  3. Run make normal, make avx, or make cuda to build the respective implementation.
  4. Run the program using ./run_transformer_{normal/avx/cuda} <num_cols> <num_heads>.

Implementation specifics

Normal

This is a naive implementation written in plain C++ with no parallelization.
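For reference, the core of such a naive implementation is a triple-nested matrix multiplication loop. The sketch below is purely illustrative; the function and variable names are not taken from the repository.

```cpp
// Illustrative naive matrix multiplication (not the repository's actual code):
// C = A * B, with A of size n x k and B of size k x m, stored row-major.
#include <vector>

void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n, int k, int m) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * m + j];
            }
            C[i * m + j] = sum;
        }
    }
}
```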

AVX

AVX intrinsics were used for SIMD processing, specifically in the matrix multiplication, and OpenMP was used for thread-level parallelism. Another optimization explored was switching the parallelization target between the matrix multiplication and the individual attention heads, depending on the problem size.
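The sketch below shows the general shape of an AVX2 + OpenMP matrix multiplication of the kind described above. It is an assumed, simplified version rather than the repository's actual kernel: it assumes row-major storage, a column count that is a multiple of 8 floats, and compilation with -mavx2 -mfma -fopenmp.

```cpp
// Illustrative AVX2 + OpenMP matrix multiplication sketch (not the
// repository's exact code): C = A * B, row-major, m assumed to be a
// multiple of 8.
#include <immintrin.h>

void matmul_avx(const float* A, const float* B, float* C,
                int n, int k, int m) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; j += 8) {
            __m256 acc = _mm256_setzero_ps();
            for (int p = 0; p < k; ++p) {
                // Broadcast A[i][p] and multiply it with 8 consecutive
                // elements of row p of B, accumulating into acc.
                __m256 a = _mm256_set1_ps(A[i * k + p]);
                __m256 b = _mm256_loadu_ps(&B[p * m + j]);
                acc = _mm256_fmadd_ps(a, b, acc);
            }
            _mm256_storeu_ps(&C[i * m + j], acc);
        }
    }
}
```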

CUDA

The codebase was ported to CUDA, with kernels for the various matrix-processing and activation functions. The main optimizations target the matrix multiplication, primarily by improving memory access patterns through coalescing and block tiling. Warp-level optimizations could be explored for further gains.
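The kernel below is a simplified sketch of a block-tiled, shared-memory matrix multiplication with coalesced loads, illustrating the kind of optimization described above. It is not the repository's exact kernel and, for brevity, assumes matrix dimensions that are multiples of the tile size.

```cuda
// Illustrative block-tiled matrix multiplication kernel (not the
// repository's exact kernel): C = A * B, row-major, n/k/m assumed to
// be multiples of TILE.
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int n, int k, int m) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Sweep shared-memory tiles across the K dimension.
    for (int t = 0; t < k / TILE; ++t) {
        // Coalesced loads: consecutive threads in a warp read
        // consecutive addresses of A and B.
        As[threadIdx.y][threadIdx.x] = A[row * k + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * m + col];
        __syncthreads();

        for (int p = 0; p < TILE; ++p) {
            acc += As[threadIdx.y][p] * Bs[p][threadIdx.x];
        }
        __syncthreads();
    }
    C[row * m + col] = acc;
}

// Launch example: dim3 block(TILE, TILE); dim3 grid(m / TILE, n / TILE);
// matmul_tiled<<<grid, block>>>(dA, dB, dC, n, k, m);
```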

Results

Max Speedup for AVX: ~45x
Max Speedup for CUDA: ~160x

Acknowledgement

This project was completed as part of the course ITCS 5182: High Performance Computing at UNC Charlotte, under the guidance of Dr. Erik Saule.
