mananchawla2005/gpukernels

GPU Programming Learning Journey

A collection of GPU kernels implemented one day at a time, progressing from basic to advanced concepts.

Prerequisites

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit installed
  • Python 3.11+
  • PyTorch
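
As a quick sanity check of the prerequisites above, a small script along these lines may help. It is only a sketch: the function name `check_prerequisites` and the report keys are made up for illustration, and it assumes the CUDA Toolkit is detectable by looking for `nvcc` on the `PATH`, which may not hold for every installation.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Report which of the listed prerequisites look satisfied on this machine."""
    report = {
        # Python 3.11+ is the version named in the prerequisites list.
        "python_ok": sys.version_info[:2] >= min_python,
        # Assumption: an installed CUDA Toolkit exposes nvcc on the PATH.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        # Imported lazily so the script still runs without PyTorch.
        import torch

        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {ok}")
```

If `cuda_available` is reported as `False` while `torch_installed` is `True`, the likely causes are a CPU-only PyTorch build or a driver problem.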

Directory Structure

  • Day 1 - Basic Vector Addition in CUDA
  • Day 2 - Vector Addition with Python/PyTorch Bindings
  • Day 3 - RGB to Grayscale Conversion
  • Day 4 - RGB to Blurred Image Conversion
  • Day 5 - Simple Matrix Multiplication
  • Day 6 - Coalesced Matrix Multiplication
  • Day 7 - GELU Activation function
  • Day 8 - Naive Batch Normalisation
  • Day 9 - Sigmoid Activation function
  • Day 10 - Tanh Activation function and Tiled Matrix Multiplication
  • Day 11 - Dynamic Tiled Matrix Multiplication
  • Day 12 - Layer Normalisation using Shared Memory
  • Day 13 - Matrix Transpose
  • Day 14 - Softmax using shared memory
  • Day 15 - GELU Forward and Backward Kernels
  • Day 16 - Querying GPU Properties
  • Day 17 - Custom NF4 Quantization Implementation
  • Day 18 - Custom Double NF4 Quantization (QLoRA style) Implementation
  • Day 19 - Transformers Self Attention Implementation
  • Day 20 - Triton Basics
  • Day 21 - Dropout in Triton
  • Day 22 - Batch Norm in Triton
  • Day 23 - [Not working] Chunked Cross Entropy Loss in Triton
  • Day 24 - Device Properties using PyTorch (will use later)
  • Day 25 - Sigmoid in Triton
  • Day 26 - Blur Kernel in Triton
  • Day 27 - GELU Kernel in Triton
  • Day 28 - Tanh Kernel in Triton
  • Day 29 - Transpose Kernel in Triton
  • Day 30 - Layer Norm Kernel in Triton
  • Day 31 - Tiled Matmul Corner Turning Kernel in Triton
  • Day 32 - Partial Dequantise Kernel in Triton
  • Day 33 - Animated Color Patterns in CUDA
  • Day 34 - SiLU in Triton
  • Day 35 - RMSNorm in Triton
  • Day 36 - DyT in CUDA (Transformers without normalisation)
  • Day 37 - L2 Norm in CUDA
  • Day 38 - L1 Norm in CUDA
  • Day 39 - Thread Coarsening Tiled Matrix Multiplication in CUDA
  • Day 40 - Using Nsight Compute to profile a candidate kernel and generate a report
  • Day 41 - Swish Activation function in CUDA
  • Day 42 - Swapping elements in CUDA
  • Day 43 - Flash Attention in CUDA
  • Day 44 - GeGelu activation in CUDA
  • Day 45 - Tinkered with Numba CUDA
  • Day 46 - RoPE Embedding Kernel in CUDA
  • Day 47 - Simple BLAS Ops with cuBLAS in CUDA
  • Day 48 - Matmul using MMA (tensor cores) in CUDA
  • Day 49 - RoPE Backward Pass in CUDA
  • Day 50 - Self Attention Backward Pass in CUDA
  • Day 51 - Lightning Attention Forward Pass in CUDA
  • Day 52 - Optimising ResNet-18 using a custom fused add + ReLU CUDA kernel
  • Day 53 - Optimising kokoro using custom fused CUDA kernels (over the next few days I will be optimising different ops in the model, and all of the kernels will be applied here in the kokoro pipeline)
  • Day 54 - Triplet Loss in CUDA (for contrastive training)
  • Day 55 - MSE in CUDA
  • Day 56 - AdaIN with Snake Activation in CUDA
  • Day 57 - Mish activation in CUDA
  • Day 58 - Cosine Similarity in CUDA
  • Day 59 - Hinge Loss in CUDA
  • Day 60 - KL Divergence in CUDA
  • Day 61 - GEMM Bias ReLU in CUDA
  • Day 62 - ELU kernel in CUDA optimised with vectorised loads, __ldg, __restrict__ and reduced warp divergence
  • Day 63 - Layer Normalization kernel in CUDA with shared memory, using parallel block reduction, vectorised loads, __ldg, __restrict__ and an optimised mean and variance calculation
  • Day 64 - ReLU Kernel optimised using 2D indexing and float4 vectorization
  • Day 65 - SELU Kernel optimised using 2D indexing and float4 vectorization
  • Day 66 - Sigmoid Kernel optimised using 2D indexing and float4 vectorization
  • Day 67 - Tanh Kernel optimised using 2D indexing and float4 vectorization
  • Day 68 - Transpose kernel using the CUTLASS CuTe framework
  • Day 69 - Huber Loss float8 in CUDA
  • Day 70 - Optimised Transpose kernel using the CUTLASS CuTe framework
  • Day 71 - Shared Memory Transpose kernel using the CUTLASS CuTe framework
  • Day 72 - Frobenius Norm in CUDA
  • Day 73 - Max Pool in CUDA
  • Day 74 - Avg Pool in CUDA
  • Day 75 - Softplus in CUDA using an inline function
  • Day 76 - Softplus Backward in CUDA using an inline function
  • Day 77 - Hard Sigmoid in CUDA
  • Day 78 - Hard Sigmoid Backward in CUDA
  • Day 79 - Max Pool Backward in CUDA
  • Day 80 - Warp-Level intrinsics in CUDA
  • Day 81 - Row Sum backward in CUDA
  • Day 82 - Cosine Similarity using warp reduction in CUDA
  • Day 83 - Jensen Shannon Distance in CUDA
  • Day 84 - Batched Embedding Lookup in CUDA
  • Day 85 - Lower Triangular Matrix in CUDA
  • Day 86 - GEMM Kernel using CUTLASS in CUDA
  • Day 87 - GEMM Kernel using Shared Memory and Tiled CUTLASS in CUDA
  • Day 88 - Tensor Parallelism in CUDA
  • Day 89 - ReLU FP16x8 in CUDA
  • Day 90 - HardShrink FP16x8 in CUDA
  • Day 91 - Adam Optimizer implementation in PyTorch and CUDA
  • Day 92 - Binary Neural Network in CUDA
  • Day 93 - Trying the cp_async prefetch PTX instruction in CUDA
  • Day 94 - Learned how to use CUDA Graphs and got a speedup in PyTorch
  • Day 95 - Optimizing ResNet-50 using CUDA Graphs
  • Day 96 - KV Cache in CUDA
  • Day 97 - Trying a Fused Swish MLP in CUDA
  • Day 98 - Trying a LoRA Kernel in CUDA
  • Day 99 - Trying a GQA Kernel in CUDA
  • Day 100 - Split-K GEMM Kernel in CUDA

Future Ideas

  • Flash Attention Implementation - A more memory-efficient attention mechanism
  • Flash Attention Backward Pass Implementation
  • Rotary Position Embedding (RoPE) - Key component in modern transformers
  • Fused Multi-Head Attention - Combining operations for better performance
  • KV Cache Implementation - Essential for inference optimization
  • Grouped Query Attention (GQA) - Used in models like Llama 3
  • PagedAttention - Memory-efficient attention algorithm from vLLM
  • Quantization-Aware Training (QAT) - Training with simulated quantization
  • Weight Update with Adam Optimizer - Implement the optimizer directly in CUDA
  • FP8 or FP16 Training Loop - Explore lower precision training
  • Tensor Parallelism - Split computation across multiple GPUs

About

This repository adds one GPU kernel each day :)
