Motivation for GPUs in Deep Learning
A gentle introduction to CUDA
PMPP Book Access
NVIDIA GPU Glossary
Aalto University's Course on GPU Programming
Simon's Blog on SGEMM (Kernels 1-5 are the most relevant for the assignment)
How to use the NCU profiler
Roofline Models
A sequel to Simon's Blog on HGEMM
Bruce's Blog on HGEMM
Spatter's Blog on HGEMM
NVIDIA's Presentation on A100 Tensor Cores
Primer on Inline PTX Assembly
CUTLASS GEMM Documentation
NVIDIA PTX ISA Documentation (Chapter 9.7 is most relevant)
Primer on Parallel Reduction
Warp-level Primitives
Vectorization
Efficient Softmax Kernel
Online Softmax Paper
Flash Attention V1 Paper
Aleksa Gordic's Flash Attention Blog

