gpu-mode/lectures

Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Lecture 2: Recap Ch. 1-3 from the PMPP book

Lecture 3: Getting Started With CUDA

Lecture 4: Intro to Compute and Memory Architecture

Lecture 5: Going Further with CUDA for Python Programmers

Lecture 6: Optimizing PyTorch Optimizers

Lecture 7: Advanced Quantization

Lecture 8: CUDA Performance Checklist

Lecture 9: Reductions

Lecture 10: Build a Prod Ready CUDA Library

Lecture 11: Sparsity

Lecture 12: Flash Attention

Lecture 13: Ring Attention

Lecture 14: Practitioner's Guide to Triton

Lecture 15: CUTLASS

Lecture 16: Hands-On Profiling

Bonus Lecture: CUDA C++ llm.cpp

Lecture 17: GPU Collective Communication (NCCL)

Lecture 18: Fused Kernels

Lecture 19: Data Processing on GPUs

Lecture 20: Scan Algorithm

Lecture 21: Scan Algorithm Part 2

Lecture 22: Hacker's Guide to Speculative Decoding in VLLM

Lecture 23: Tensor Cores

  • Speakers: Vijay Thakkar & Pradeep Ramani
  • Slides

Lecture 24: Scan at the Speed of Light

  • Speakers: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

  • Speaker: Haocong Wang
  • Slides

Lecture 26: SYCL MODE (Intel GPU)

Lecture 27: gpu.cpp

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Lecture 30: Quantized training

Lecture 31: Beginners Guide to Metal Kernels

Lecture 32: Unsloth - LLM Systems Engineering

Lecture 33: BitBLAS

Lecture 34: Low Bit Triton Kernels

Lecture 35: SGLang Performance Optimization

Lecture 36: CUTLASS and Flash Attention 3

Lecture 37: Introduction to SASS & GPU Microarchitecture

Lecture 38: Lowbit kernels for ARM CPU

Lecture 39: TorchTitan

  • Speakers: Mark Saroufim and Tianyu Liu

Lecture 40: FlashInfer

Lecture 41: CUDA Docs for Humans

Lecture 42: Mosaic GPU

Lecture 43:

  • Speaker: Erik Schultheis
  • Slides