Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Speaker: Mark Saroufim
Notebook and slides in the lecture_001 folder

Lecture 2: Recap Ch. 1-3 from the PMPP book

Speaker: Andreas Koepf
Slides: The PowerPoint file is at lecture_002/cuda_mode_lecture2.pptx, relative to the root directory of this repository. It is also available as a Google Docs presentation.

Lecture 3: Getting Started With CUDA

Speaker: Jeremy Howard
Notebook: See the lecture_003 folder, or run the Colab version

Lecture 4: Intro to Compute and Memory Architecture

Speaker: Thomas Viehmann
Notebook and slides in the lecture_004 folder.

Lecture 5: Going Further with CUDA for Python Programmers

Speaker: Jeremy Howard
Notebook in the lecture_005 folder.

Lecture 6: Optimizing PyTorch Optimizers

Speaker: Jane Xu
Slides

Lecture 7: Advanced Quantization

Speaker: Charles Hernandez
Slides

Lecture 8: CUDA Performance Checklist

Speaker: Mark Saroufim
Code in the lecture_008 folder
Slides

Lecture 9: Reductions

Speaker: Mark Saroufim
Code in the lecture_009 folder
Slides

Lecture 10: Build a Prod Ready CUDA Library

Speaker: Oscar Amoros Huguet
Slides

Lecture 11: Sparsity

Speaker: Jesse Cai
Slides

Lecture 12: Flash Attention

Speaker: Thomas Viehmann

Lecture 13: Ring Attention

Speaker: Andreas Koepf
Slides

Lecture 14: Practitioner's Guide to Triton

Date: 2024-04-13, Speaker: Umer Adil
Notebook

Lecture 15: CUTLASS

Speaker: Eric Auld

Lecture 16: On Hands profiling

Speaker: Taylor Robie

Bonus Lecture: CUDA C++ llm.cpp

Speakers: Jake Hemstad & Georgii Evtushenko
Slides

Lecture 17: GPU Collective Communication (NCCL)

Speaker: Dan Johnson
Code in the lecture_017 folder

Lecture 18: Fused Kernels

Speaker: Kapil Sharma
Code in the lecture_018 folder

Lecture 19: Data Processing on GPUs

Speaker: Devavret Makkar

Lecture 20: Scan Algorithm

Speaker: Izzat El Hajj
Slides

Lecture 21: Scan Algorithm Part 2

Speaker: Izzat El Hajj
Slides

Lecture 22: Hacker's Guide to Speculative Decoding in vLLM

Speaker: Cade Daniel
Slides

Lecture 23: Tensor Cores

Speakers: Vijay Thakkar & Pradeep Ramani
Slides

Lecture 24: Scan at the Speed of Light

Speakers: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

Speaker: Haocong Wang
Slides

Lecture 26: SYCL MODE (Intel GPU)

Speaker: Patric Zhao
Slides

Lecture 27: gpu.cpp

Speaker: Austin Huang
Slides

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Speaker: Kapil Sharma
Code/presentation in the lecture_029 folder

Lecture 30: Quantized training

Speaker: Thien Tran
Code/presentation in the lecture_030 folder

Lecture 31: Beginners Guide to Metal Kernels

Speaker: Nikita Shulga
Code/presentation in the lecture_031 folder

Lecture 32: Unsloth - LLM Systems Engineering

Speaker: Daniel Han
Slides

Lecture 33: BitBLAS

Speaker: Wang Lei
Code/presentation in the lecture_033 folder

Lecture 34: Low Bit Triton Kernels

Speaker: Hicham Badri
Slides

Lecture 35: SGLang Performance Optimization

Speaker: Yineng Zhang
Slides

Lecture 36: CUTLASS and Flash Attention 3

Speaker: Jay Shah
Slides

Lecture 37: Introduction to SASS & GPU Microarchitecture

Speaker: Arun Demeure
Slides

Lecture 38: Lowbit kernels for ARM CPU

Speaker: Scott Roy
Slides

Lecture 39: TorchTitan

Speakers: Mark Saroufim and Tianyu Liu

Lecture 40: FlashInfer

Speaker: Zihao Ye

Lecture 41: CUDA Docs for Humans

Speaker: Charles Frye
Slides

Lecture 42: Mosaic GPU

Speaker: Adam Paszke

Lecture 43:

Speaker: Erik Schultheis
Slides
