GitHub - Wrench-Git/CUDA_gemm: A simple high performance CUDA GEMM implementation.

Introduction

Simple, high-performance CUDA implementations of GEMM, block-sparse GEMM, and non-uniform quantized GEMM.

C = alpha * A * B + beta * C
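As a point of reference, the formula above maps onto a one-thread-per-element kernel roughly as follows. This is a minimal sketch, not the repository's code: the names `gemm_naive` and `gemm_naive_host` are hypothetical, row-major storage is assumed, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel (not from the repository). Assumes row-major storage:
// A is M x K, B is K x N, C is M x N.
__global__ void gemm_naive(int M, int N, int K, float alpha,
                           const float* A, const float* B,
                           float beta, float* C) {
    // threadIdx.x walks along a row of C, so neighbouring threads read
    // neighbouring elements of B and write neighbouring elements of C.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];

    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Hypothetical host helper: copies the inputs to the device, launches the
// kernel, and copies C back. CUDA return codes are not checked here.
void gemm_naive_host(int M, int N, int K, float alpha,
                     const std::vector<float>& A, const std::vector<float>& B,
                     float beta, std::vector<float>& C) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

A sketch like this is mainly useful as a correctness baseline when checking the output of faster kernels.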

Kernels are located in src/cuda/:

MatrixMulCUDA
- each thread computes one element of C
- coalesced global memory loads of B
MatrixMulCUDA1
- loads through texture memory
MatrixMulCUDA2
- each thread computes one 4 x 4 tile of C
MatrixMulCUDA3
- vectorized loads of A and B
MatrixMulCUDA4
- vectorized stores to C
MatrixMulCUDA5
- block-sparse version
MatrixMulCUDA6
- coalesced, vectorized loads of A and B
MatrixMulCUDA7
- warp shuffle so that stores to C are coalesced
MatrixMulCUDAQuantize8bit
- 8-bit non-uniform quantized matrix multiplication
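To make the 4 x 4 tile and vectorized-access items in the list above more concrete, here is a minimal sketch of how they can fit together. It is not the repository's code: the names `gemm_tile4x4` and `gemm_tile4x4_host` are hypothetical, row-major storage is assumed, M and N are assumed to be multiples of 4, and only B and C are accessed as `float4` (A is read element by element). Shared-memory tiling and error checking are left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel (not from the repository). Assumptions: row-major
// storage, M and N are multiples of 4, and B and C are 16-byte aligned
// (which holds for buffers returned by cudaMalloc).
__global__ void gemm_tile4x4(int M, int N, int K, float alpha,
                             const float* A, const float* B,
                             float beta, float* C) {
    // Each thread owns the 4 x 4 tile of C whose top-left corner is (row, col).
    int col = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    int row = 4 * (blockIdx.y * blockDim.y + threadIdx.y);
    if (row >= M || col >= N) return;

    float acc[4][4] = {};  // accumulators, intended to stay in registers
    for (int k = 0; k < K; ++k) {
        // One vectorized 16-byte load fetches four consecutive elements of B.
        float4 b = *reinterpret_cast<const float4*>(&B[k * N + col]);
        float bv[4] = {b.x, b.y, b.z, b.w};
        #pragma unroll
        for (int i = 0; i < 4; ++i) {
            float a = A[(row + i) * K + k];
            #pragma unroll
            for (int j = 0; j < 4; ++j) acc[i][j] += a * bv[j];
        }
    }

    // Vectorized read-modify-write of C, one float4 per tile row.
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        float4* out = reinterpret_cast<float4*>(&C[(row + i) * N + col]);
        float4 c = *out;
        c.x = alpha * acc[i][0] + beta * c.x;
        c.y = alpha * acc[i][1] + beta * c.y;
        c.z = alpha * acc[i][2] + beta * c.z;
        c.w = alpha * acc[i][3] + beta * c.w;
        *out = c;
    }
}

// Hypothetical host helper; CUDA return codes are not checked here.
void gemm_tile4x4_host(int M, int N, int K, float alpha,
                       const std::vector<float>& A, const std::vector<float>& B,
                       float beta, std::vector<float>& C) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // each block covers a 64 x 64 region of C
    dim3 grid((N / 4 + block.x - 1) / block.x, (M / 4 + block.y - 1) / block.y);
    gemm_tile4x4<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

Giving each thread a 4 x 4 tile means every loaded value of A is reused four times and every loaded value of B four times, which raises the ratio of arithmetic to memory traffic compared with one element per thread.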

Benchmarks are located in benchmark/:

benchmark_dense
- compares this GEMM with cuBLAS
benchmark_sparse
- compares this block-sparse GEMM with cuSPARSE
benchmark_quantization_8bit
- compares this GEMM with cuBLAS
benchmark_quantization
- compares this GEMM with this non-uniform quantized 8-bit GEMM
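For readers who want to reproduce a cuBLAS baseline like the one the dense benchmark compares against, the following is a minimal timing sketch. It is not the repository's benchmark code: the function names `time_cublas_sgemm` and `gflops` are hypothetical, row-major inputs already resident on the device are assumed, and error checking is omitted.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the repository). Returns the average time in
// milliseconds of one cublasSgemm call computing the row-major product
// C = alpha * A * B + beta * C, where dA (M x K), dB (K x N) and dC (M x N)
// are device pointers. With beta != 0 the repeated calls keep accumulating
// into dC, so the contents of dC are only meaningful for timing purposes.
float time_cublas_sgemm(cublasHandle_t handle, int M, int N, int K,
                        float alpha, const float* dA, const float* dB,
                        float beta, float* dC, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // cuBLAS is column-major. A row-major M x N matrix C has the same memory
    // layout as the column-major N x M matrix C^T, and C^T = B^T * A^T, so B
    // is passed first and the M and N dimensions are swapped.
    // One warm-up call keeps one-time setup cost out of the measurement.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, dB, N, dA, K, &beta, dC, N);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                    &alpha, dB, N, dA, K, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// A GEMM performs 2 * M * N * K floating-point operations
// (one multiply and one add per inner-product term).
double gflops(int M, int N, int K, float ms) {
    return 2.0 * M * N * K / (ms * 1e6);
}
```

A file containing this sketch needs to be linked against cuBLAS, for example with `nvcc file.cu -lcublas`. Timing a custom kernel with the same event pattern gives numbers that can be compared directly.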

(MatrixMulCUDA7) write back to the C matrix using warp shuffle so that global memory stores are coalesced
(MatrixMulCUDA8) double buffering

mkdir builds
make benchmark_[experiment name]
bash scripts/benchmark_[experiment name].sh
