A collection of GPU kernels implemented one day at a time, progressing from basic to advanced concepts.
- NVIDIA GPU with CUDA support
- CUDA Toolkit installed
- Python 3.11+
- PyTorch
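
A quick way to confirm the software prerequisites above is a small standard-library script. This is only a sketch: the function name `check_prerequisites` is made up for illustration, it assumes the CUDA Toolkit exposes `nvcc` on your `PATH`, and it does not verify that a CUDA-capable GPU is actually present.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Return a dict mapping each prerequisite name to True/False.

    Hypothetical helper: only checks that the tools are discoverable,
    not that they work or that a GPU is attached.
    """
    return {
        # Interpreter version must be at least min_python (3.11 by default).
        "python": sys.version_info >= min_python,
        # Assumption: the CUDA Toolkit is installed if nvcc is on PATH.
        "nvcc": shutil.which("nvcc") is not None,
        # PyTorch is considered present if the package can be located.
        "torch": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If `torch` is found, running `torch.cuda.is_available()` in a Python session is the usual next step to confirm that PyTorch can see the GPU.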
- Day 1 - Basic Vector Addition in CUDA
- Day 2 - Vector Addition with Python/PyTorch Bindings
- Day 3 - RGB to Grayscale Conversion
- Day 4 - RGB to Blurred Image Conversion
- Day 5 - Simple Matrix Multiplication
- Day 6 - Coalesced Matrix Multiplication
- Day 7 - GELU Activation function
- Day 8 - Naive Batch Normalisation
- Day 9 - Sigmoid Activation function
- Day 10 - Tanh Activation function and Tiled Matrix Multiplication
- Day 11 - Dynamic Tiled Matrix Multiplication
- Day 12 - Layer Normalisation using Shared Memory
- Day 13 - Matrix Transpose
- Day 14 - Softmax using shared memory
- Day 15 - GELU Forward and Backward Kernels
- Day 16 - Querying GPU Properties
- Day 17 - Custom NF4 Quantization Implementation
- Day 18 - Custom Double NF4 Quantization (QLoRA style) Implementation
- Day 19 - Transformers Self Attention Implementation
- Day 20 - Triton Basics
- Day 21 - Dropout in Triton
- Day 22 - Batch Norm in Triton
- Day 23 - [Not working] Chunked Cross Entropy Loss in Triton
- Day 24 - Device Properties using PyTorch (will use later)
- Day 25 - Sigmoid in Triton
- Day 26 - Blur Kernel in Triton
- Day 27 - GELU Kernel in Triton
- Day 28 - Tanh Kernel in Triton
- Day 29 - Transpose Kernel in Triton
- Day 30 - Layer Norm Kernel in Triton
- Day 31 - Tiled Matmul Corner Turning Kernel in Triton
- Day 32 - Partial Dequantise Kernel in Triton
- Day 33 - Animated Color Patterns in CUDA
- Day 34 - SiLU in Triton
- Day 35 - RMSNorm in Triton
- Day 36 - DyT in CUDA (Transformers without Normalization)
- Day 37 - L2 Norm in CUDA
- Day 38 - L1 Norm in CUDA
- Day 39 - Thread Coarsening Tiled Matmul in CUDA
- Day 40 - Using Nsight Compute to profile a candidate kernel and generate a report
- Day 41 - Swish Activation function in CUDA
- Day 42 - Swapping elements in CUDA
- Day 43 - Flash Attention in CUDA
- Day 44 - GEGLU activation in CUDA
- Day 45 - Tinkered with Numba CUDA
- Day 46 - RoPE Embedding Kernel in CUDA
- Day 47 - Simple BLAS Ops with cuBLAS in CUDA
- Day 48 - Matmul using MMA (Tensor Cores) in CUDA
- Day 49 - RoPE Backward Pass in CUDA
- Day 50 - Self-Attention Backward Pass in CUDA
- Day 51 - Lightning Attention Forward Pass in CUDA
- Day 52 - Optimising ResNet18 using a custom fused add + ReLU CUDA kernel
- Day 53 - Optimising Kokoro using custom fused CUDA kernels (over the next few days I will optimise different ops in the model, and all of those kernels will be applied in the Kokoro pipeline)
- Day 54 - Triplet Loss in CUDA (for contrastive training)
- Day 55 - MSE in CUDA
- Day 56 - AdaIN with Snake Activation in CUDA
- Day 57 - Mish activation in CUDA
- Day 58 - Cosine Similarity in CUDA
- Day 59 - Hinge Loss in CUDA
- Day 60 - KL Divergence in CUDA
- Day 61 - GEMM + Bias + ReLU in CUDA
- Day 62 - ELU kernel in CUDA optimised with vectorized loads, `__ldg`, `__restrict__` and reduced warp divergence
- Day 63 - Layer Normalization kernel in CUDA with shared memory, using parallel block reduction, vectorized loads, `__ldg`, `__restrict__` and an optimised mean and variance calculation
- Day 64 - ReLU Kernel optimised using 2D indexing and float4 vectorization
- Day 65 - SELU Kernel optimised using 2D indexing and float4 vectorization
- Day 66 - Sigmoid Kernel optimised using 2D indexing and float4 vectorization
- Day 67 - Tanh Kernel optimised using 2D indexing and float4 vectorization
- Day 68 - Transpose kernel using the CUTLASS CuTe framework
- Day 69 - Huber Loss float8 in CUDA
- Day 70 - Optimised Transpose kernel using the CUTLASS CuTe framework
- Day 71 - Shared Memory Transpose kernel using the CUTLASS CuTe framework
- Day 72 - Frobenius Norm in CUDA
- Day 73 - Max Pool in CUDA
- Day 74 - Avg Pool in CUDA
- Day 75 - Softplus in CUDA using an inline function
- Day 76 - Softplus Backward in CUDA using an inline function
- Day 77 - Hard Sigmoid in CUDA
- Day 78 - Hard Sigmoid Backward in CUDA
- Day 79 - Max Pool Backward in CUDA
- Day 80 - Warp-Level intrinsics in CUDA
- Day 81 - Row Sum backward in CUDA
- Day 82 - Cosine Similarity using warp reduction in CUDA
- Day 83 - Jensen-Shannon Distance in CUDA
- Day 84 - Batched Embedding Lookup in CUDA
- Day 85 - Lower Triangular Matrix in CUDA
- Day 86 - GEMM Kernel using CUTLASS in CUDA
- Day 87 - GEMM Kernel using Shared Memory and Tiled CUTLASS in CUDA
- Day 88 - Tensor Parallelism in CUDA
- Day 89 - ReLU FP16x8 in CUDA
- Day 90 - HardShrink FP16x8 in CUDA
- Day 91 - Adam Optimizer implementation in PyTorch and CUDA
- Day 92 - Binary Neural Network in CUDA
- Day 93 - Trying the cp_async prefetch PTX instruction in CUDA
- Day 94 - Learned how to use CUDA Graphs and got a speedup in PyTorch
- Day 95 - Optimizing ResNet50 using CUDA Graphs
- Day 96 - KV Cache in CUDA
- Day 97 - Trying a Fused Swish MLP in CUDA
- Day 98 - Trying a LoRA Kernel in CUDA
- Day 99 - Trying a GQA Kernel in CUDA
- Day 100 - Split-K GEMM Kernel in CUDA
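
For readers new to the early entries, the thread-indexing pattern behind a basic vector addition (Day 1) can be emulated in plain Python. This is a rough sketch, not the repository's kernel: the function name `vector_add_emulated` and the default `block_dim` are assumptions, and the two loops stand in for blocks and threads that a GPU would run in parallel.

```python
def vector_add_emulated(a, b, block_dim=256):
    """Emulate a 1D CUDA launch for element-wise addition of two lists.

    Hypothetical helper: the loops run sequentially here, whereas on a GPU
    every (block_idx, thread_idx) pair would execute concurrently.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    out = [0.0] * n
    # Ceil-division so the grid covers every element, like (n + B - 1) / B.
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Global index, analogous to blockIdx.x * blockDim.x + threadIdx.x.
            i = block_idx * block_dim + thread_idx
            # Bounds guard: the last block may have threads past the end.
            if i < n:
                out[i] = a[i] + b[i]
    return out
```

The bounds guard matters whenever the vector length is not a multiple of the block size, which is the common case.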
- Flash Attention Implementation - A more memory-efficient attention mechanism
- Flash Attention Backward Pass Implementation
- Rotary Position Embedding (RoPE) - Key component in modern transformers
- Fused Multi-Head Attention - Combining operations for better performance
- KV Cache Implementation - Essential for inference optimization
- Grouped Query Attention (GQA) - Used in models like Llama 3
- PagedAttention - Memory-efficient attention algorithm from vLLM
- Quantization-Aware Training (QAT) - Training with simulated quantization
- Weight Update with Adam Optimizer - Implement the optimizer directly in CUDA
- FP8 or FP16 Training Loop - Explore lower precision training
- Tensor Parallelism - Split computation across multiple GPUs
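
As an illustration of the tensor parallelism idea, here is a minimal pure-Python sketch of a column-parallel matrix multiply. It is one common way to split the work, not necessarily the approach used in this repository: the helper names are hypothetical, and each "shard" is just a slice of a nested list standing in for a weight partition that would live on its own GPU.

```python
def matmul(x, w):
    """Plain matrix product of x (m x k) and w (k x n) as nested lists."""
    return [[sum(xi * wj for xi, wj in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, num_shards):
    """Split w column-wise into num_shards equal slices (one per device)."""
    n = len(w[0])
    if n % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = n // num_shards
    return [[row[s * width:(s + 1) * width] for row in w]
            for s in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Compute x @ w by multiplying x with each column shard separately.

    Each partial product would be computed on a different GPU; joining the
    partial results along the column axis plays the role of an all-gather.
    """
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
assert column_parallel_matmul(x, w, num_shards=2) == matmul(x, w)
```

Splitting by columns means every device needs the full input `x` but only its own slice of `w`, so no communication is needed until the outputs are gathered.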