git clone https://github.com/pytorch-labs/tritonbench.git
git submodule update --init --recursive
python install.py
python run.py --op gemm
To run the tests and benchmarks, Tritonbench also needs to be installed as a library:
pip install -e .
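A quick sanity check (not from the upstream docs) that the editable install is importable as a package; the module path below matches the tracebacks later in these notes:
python -c "from tritonbench.operators.op import load_opbench_by_name; print('tritonbench import OK')"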
Caveats:
- ~49.3 GB of disk space
- ~44 min of build time
# From root git folder
cd docker
docker build . --file tritonbench-nightly.dockerfile -t tritonbench
docker run --gpus all \
--shm-size 32g \
--network=host \
-v <local_HF_HOME>:<container_HF_HOME> \
--name tritonbench_workload \
-it \
--rm \
--ipc=host \
tritonbench bash
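Once inside the container, benchmarks can be launched the same way as on the host. A minimal smoke test, assuming the image's working directory is the tritonbench checkout:
python run.py --op gemm --metrics latency,tflops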
There is no official documentation, so these notes serve as our own docs.
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
- Custom kernels:
  - (CUDA, HIP) kernels
  - (CUDA, HIP) generative-recommenders
  - (CUDA, HIP) Liger-Kernel
  - (CUDA) xformers
  - (CUDA) flash-attention
  - (CUDA) FBGEMM
  - (CUDA) ThunderKittens
  - (CUDA) cutlass-kernels
- Folder structure:
  - operators: the different operators to choose from
    addmm, bf16xint16_gemm, cross_entropy, decoding_attention, embedding, flash_attention, fp8_attention, fp8_fused_quant_gemm_rowwise, fp8_gemm, fp8_gemm_blockwise, fp8_gemm_rowwise, fused_linear_cross_entropy, fused_linear_jsd, gather_gemv, geglu, gemm, grouped_gemm, int4_gemm, jagged_layer_norm, jagged_mean, jagged_softmax, jagged_sum, jsd, kl_div, launch_latency, layer_norm, low_mem_dropout, mixed_gemm, op.py, op_task.py, ragged_attention, rms_norm, rope, softmax, sum, swiglu, template_attention, test_op, vector_add, welford
  - kernels: Triton implementation of Flash Attention v2
  - benchmarks: simple benchmarks reporting TFLOPS
    compile_time, flash_attention_bench, gemm_bench, nightly
  - tests:
- Arguments (run.py):
  - --op: Operators to benchmark. Split with commas if multiple.
  - --op-collection: Operator collections to benchmark. Conflicts with --op. Choices: default, liger, all.
  - --mode: Test mode. Choices: fwd, bwd, fwd_bwd, fwd_no_grad. Default: fwd.
  - --bwd: Run backward pass.
  - --fwd-bwd: Run both forward and backward passes.
  - --fwd-no-grad: Run forward pass without gradients.
  - --precision, --dtype: Operator input precision. Default: bypass. Choices: {AVAILABLE_PRECISIONS}.
    # tritonbench/tritonbench/utils/env_utils.py
    AVAILABLE_PRECISIONS = ["bypass", "fp32", "tf32", "fp16", "amp", "fx_int8", "bf16", "amp_fp16", "amp_bf16", "fp8"]
  - --device: Device to benchmark. Default: cuda.
  - --warmup: Number of warmup runs per benchmark. Default: DEFAULT_WARMUP = 25.
  - --iter: Number of iterations per benchmark. Default: DEFAULT_RUN_ITERS = 100.
  - --csv: Print results as CSV.
  - --output-dir: Write the result CSV to the specified directory.
  - --output: Write the result CSV to a file.
  - --output-json: Write the result JSON to a file.
  - --skip-print: Skip printing results.
  - --plot: Plot the results.
  - --ci: Run in CI mode.
  - --metrics: Metrics to collect, separated by commas. Example: latency,tflops,speedup.
  - --metrics-gpu-backend: Backend for GPU metrics collection. Choices: torch, nvml. Default: torch.
  - --only: Specify kernel implementations to run.
  - --skip: Specify kernel implementations to skip.
  - --baseline: Override the default baseline.
  - --num-inputs: Number of example inputs.
  - --keep-going: Continue execution even if errors occur.
  - --input-id: Start input ID to run.
  - --test-only: Run in test mode, skipping expensive steps like autotuning.
  - --dump-ir: Dump Triton IR.
  - --gpu-lockdown: Lock GPU frequency and clocks.
  - --operator-loader: Benchmark ATen operators in tritonbench/operator_loader.
  - --cudagraph: Benchmark with CUDA Graphs.
  - --isolate: Run each operator in a separate process.
  - --bypass-fail: Continue execution even if an operator fails.
  - --shuffle-shapes: Randomly shuffle inputs before benchmarking.
  - --compile-cold-start: Include cold start time in compilation metrics.
- Examples:
python run.py --m 4096 \
    --n 4096 \
    --k 4096 \
    --precision fp16 \
    --only triton_tutorial_matmul \
    --metrics tflops
python run.py --op fp8_gemm --mode fwd --device cuda --metrics tflops
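Another sketch (not a validated command) that only combines flags documented above: benchmark a handful of softmax inputs and dump CSV results to a directory:
python run.py --op softmax \
    --mode fwd \
    --num-inputs 4 \
    --metrics latency,speedup \
    --csv \
    --output-dir ./results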
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
# CARE: using cu126 instead of cu124
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
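# Optional sanity check (not part of the original recipe) that the nightly wheel
# and its CUDA build are the ones actually picked up
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"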
# Download the tokenizer
python torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --tokenizer_path "original"
# HF_HOME location
export HF_HOME=<HF_HOME_LOCATION>
# Example workload
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
Caveats:
- HF_TOKEN needs to be set on the host
- WORKDIR is set for the torchtitan commands
- Example commands are included as comments at the end of the Dockerfile
- Mount <local_HF_HOME>:<container_HF_HOME> so that models are loaded from the local NVMe
# Use the official PyTorch 2.6.0 + CUDA 12.6 base image
FROM pytorch/pytorch:2.6.0-cuda12.6-cudnn9-runtime
#FROM ubuntu:22.04
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}
# Make sure the shell used for subsequent RUN commands is Bash (/bin/bash)
SHELL ["/bin/bash", "-c"]
# Install dependencies
# llamacpp gcc compilation tools
RUN apt-get update && apt-get install -y \
build-essential \
fzf \
ripgrep \
nvtop \
sudo \
kmod \
wget \
vim \
git \
curl \
bzip2 \
ca-certificates \
libglib2.0-0 \
libxext6 \
libsm6 \
libxrender1 \
libssl-dev
# Cleanup command to remove the apt cache and reduce the image size.
# IMPORTANT: it forces running `sudo apt update` when entering the container.
#&& rm -rf /var/lib/apt/lists/*
# Cloning the repo
RUN git clone https://github.com/pytorch/torchtitan
RUN pip install -r torchtitan/requirements.txt
# CARE: using cu126 instead of cu124
RUN pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
# Change to the repo directory using WORKDIR
WORKDIR /workspace/torchtitan
# Download the tokenizer
RUN python torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --tokenizer_path "original"
# docker build . --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda126_torch26
# docker run --gpus all --shm-size 32g --network=host -v <local_HF_HOME>:<container_HF_HOME> --name torchtitan_workload -it --rm --ipc=host torchtitan_cuda126_torch26 bash -c 'CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh'
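Instead of baking HF_TOKEN into the image via a build arg, it can also be injected at run time with docker's standard -e flag; a sketch reusing the image name and placeholders above:
docker run --gpus all --shm-size 32g --network=host --ipc=host --rm -it \
    -e HF_TOKEN="$HF_TOKEN" \
    -v <local_HF_HOME>:<container_HF_HOME> \
    torchtitan_cuda126_torch26 bash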
GPU metrics:
- torchtitan: loss, GPU memory, tokens per second per device (tps), and MFU
- tritonbench: TFLOPS, latency, and speedup (relative to other kernels)
CPU metrics:
- to be researched if the GPU metrics turn out to be affected
- Workload: train_configs/llama3_8b.toml with 100 steps
# torchtitan Config.toml
# NOTE: this toml config is a preset for 64 A100 GPUs.

[job]
dump_folder = "./outputs"
description = "Llama 3 8B training"

[profiling]
enable_profiling = true
save_traces_folder = "profile_trace"
profile_freq = 100

[metrics]
log_freq = 10
enable_tensorboard = true
save_tb_folder = "tb"

[model]
name = "llama3"
flavor = "8B"
norm_type = "rmsnorm"  # layernorm / np_layernorm / rmsnorm
tokenizer_path = "./torchtitan/datasets/tokenizer/original/tokenizer.model"

[optimizer]
name = "AdamW"
lr = 3e-4

[training]
batch_size = 1
seq_len = 8192
warmup_steps = 200  # lr scheduler warm up
max_norm = 1.0  # grad norm clipping
steps = 100
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
tensor_parallel_degree = 1
compile = false
dataset = "c4"

[experimental]
context_parallel_degree = 1
pipeline_parallel_degree = 1

[checkpoint]
enable_checkpoint = false
folder = "checkpoint"
interval_type = "steps"
interval = 500
model_weights_only = false
export_dtype = "float32"
async_mode = "disabled"  # ["disabled", "async", "async_with_pinned_mem"]

[activation_checkpoint]
mode = 'selective'  # ['none', 'selective', 'full']
selective_ac_option = 'op'  # 'int' = ac every positive int layer or 'op', ac based on ops policy

[float8]
enable_float8_linear = false
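For quick experiments the toml does not necessarily need to be edited: torchtitan's config manager accepts dotted CLI overrides, and run_llama_train.sh appears to forward extra arguments (see the overrides= variable in the log below). The keys here are taken from the config above, but the override syntax is an assumption:
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh \
    --training.steps 100 \
    --profiling.profile_freq 50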
- Results
nohup: ignoring input
+ NGPU=8
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/llama3_8b.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama3_8b.toml
W0213 11:33:04.996000 14767 site-packages/torch/distributed/run.py:792]
W0213 11:33:04.996000 14767 site-packages/torch/distributed/run.py:792] *****************************************
W0213 11:33:04.996000 14767 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0213 11:33:04.996000 14767 site-packages/torch/distributed/run.py:792] *****************************************
[rank0]:2025-02-13 11:33:11,198 - root - INFO - Starting job: Llama 3 8B training
[rank0]:2025-02-13 11:33:12,134 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2025-02-13 11:33:12,138 - root - INFO - CUDA capacity: NVIDIA H200 with 139.83GiB memory
[rank0]:2025-02-13 11:33:12,140 - root - WARNING - Error running lspci: [Errno 2] No such file or directory: 'lspci', fallback to use device_name
[rank0]:2025-02-13 11:33:12,140 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:2025-02-13 11:33:12,140 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
[rank0]:2025-02-13 11:33:18,527 - root - INFO - Building tiktoken tokenizer locally from ./torchtitan/datasets/tokenizer/original/tokenizer.model
[rank0]:2025-02-13 11:33:18,746 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2025-02-13 11:33:18,746 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2025-02-13 11:33:28,973 - root - INFO - Building llama3 8B with ModelArgs(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, norm_type='rmsnorm')
[rank0]:2025-02-13 11:33:29,156 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:2025-02-13 11:33:29,157 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2025-02-13 11:33:29,239 - root - INFO - Applied FSDP to the model
[rank0]:2025-02-13 11:33:29,478 - root - INFO - CUDA memory usage for model: 3.77GiB(2.70%)
[rank0]:2025-02-13 11:33:29,482 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250213-1133
[rank0]:2025-02-13 11:33:29,482 - root - INFO - Training starts at step 1, with local batch size 1, global batch size 8, sequence length 8192, total steps 100 (warmup 200)
[rank0]:2025-02-13 11:33:29,482 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:2025-02-13 11:33:35,026 - root - INFO - step:   1  loss: 12.2417  memory: 42.08GiB(30.10%)  tps: 1,478  mfu: 8.65%
[rank0]:2025-02-13 11:33:35,027 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2025-02-13 11:33:45,521 - root - INFO - step:  10  loss: 9.8821  memory: 49.59GiB(35.46%)  tps: 7,026  mfu: 41.15%
[rank0]:2025-02-13 11:33:57,093 - root - INFO - step:  20  loss: 8.3854  memory: 49.59GiB(35.46%)  tps: 7,080  mfu: 41.46%
[rank0]:2025-02-13 11:34:08,663 - root - INFO - step:  30  loss: 7.6991  memory: 49.59GiB(35.46%)  tps: 7,082  mfu: 41.47%
[rank0]:2025-02-13 11:34:20,249 - root - INFO - step:  40  loss: 7.3925  memory: 49.59GiB(35.46%)  tps: 7,072  mfu: 41.41%
[rank0]:2025-02-13 11:34:31,859 - root - INFO - step:  50  loss: 7.0563  memory: 49.59GiB(35.46%)  tps: 7,057  mfu: 41.32%
[rank0]:2025-02-13 11:34:43,479 - root - INFO - step:  60  loss: 6.8581  memory: 49.59GiB(35.46%)  tps: 7,051  mfu: 41.29%
[rank0]:2025-02-13 11:34:55,115 - root - INFO - step:  70  loss: 6.9885  memory: 49.59GiB(35.46%)  tps: 7,042  mfu: 41.24%
[rank0]:2025-02-13 11:35:06,760 - root - INFO - step:  80  loss: 6.6429  memory: 49.59GiB(35.46%)  tps: 7,036  mfu: 41.20%
[rank0]:2025-02-13 11:35:18,410 - root - INFO - step:  90  loss: 6.7054  memory: 49.59GiB(35.46%)  tps: 7,033  mfu: 41.19%
[rank0]:2025-02-13 11:35:30,360 - root - INFO - step: 100  loss: 6.4719  memory: 49.59GiB(35.46%)  tps: 6,856  mfu: 40.15%
[rank0]:2025-02-13 11:35:30,986 - root - INFO - Dumping profiler traces at step 100
[rank0]:2025-02-13 11:35:31,218 - root - INFO - Finished dumping profiler traces in 0.23 seconds
[rank0]:2025-02-13 11:35:31,219 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:2025-02-13 11:35:33,220 - root - INFO - Training completed
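As a sanity check on the reported MFU (this assumes torchtitan's usual formula, flops_per_token ≈ 6 * N_non_embedding + 12 * n_layers * n_heads * head_dim * seq_len, which the log itself does not confirm): N_non_embedding ≈ 8,030,261,248 - 128256 * 4096 ≈ 7.50e9, so flops_per_token ≈ 6 * 7.50e9 + 12 * 32 * 32 * 128 * 8192 ≈ 5.79e10. At ~7,080 tokens/s per device that is ≈ 4.1e14 FLOPS; dividing by the logged peak of 9.890e+14 gives ≈ 41%, consistent with the reported mfu of ~41.4%.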
- Workloads: gemm, decoding_attention, flash_attention, layer_norm, rms_norm, softmax.
  Open question: any specific matrix dimensions to target?
gemm:
python run.py --op gemm --mode fwd --device cuda --metrics latency,tflops,speedup --output-json "baseline_gemm.jsonl"
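On the matrix-dimension question above: the per-op shape flags from the earlier matmul example (--m/--n/--k) can pin a single shape instead of sweeping the default inputs. A sketch, assuming those flags apply to --op gemm:
python run.py --op gemm --mode fwd --device cuda --m 8192 --n 8192 --k 8192 --metrics latency,tflops,speedup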
decoding_attention:
python run.py --op decoding_attention --mode fwd --device cuda --metrics latency,tflops,speedup --output-json "baseline_decoding_attention.jsonl"
- ERROR: NameError: name 'fmha' is not defined
Traceback (most recent call last):
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 125, in <module>
    run()
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 121, in run
    _run(args, extra_args)
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 35, in _run
    Opbench = load_opbench_by_name(args.op)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/op.py", line 70, in load_opbench_by_name
    module = importlib.import_module(f"..{op_pkg}", package=__name__)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/miniconda3/envs/tritonbench/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/decoding_attention/__init__.py", line 1, in <module>
    from .operator import Operator
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/decoding_attention/operator.py", line 132, in <module>
    fmha.attn_bias.BlockDiagonalCausalWithOffsetPaddedKeysMask,
    ^^^^
NameError: name 'fmha' is not defined

(tritonbench) rodri@cluster-h200-01-f2:~/tritonbench$ python run.py --op decoding_attentio --device cuda --metrics tflops speedup --output-json "baseline_decoding_attention.jsonl"
Traceback (most recent call last):
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 125, in <module>
    run()
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 121, in run
    _run(args, extra_args)
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 35, in _run
    Opbench = load_opbench_by_name(args.op)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/op.py", line 60, in load_opbench_by_name
    raise RuntimeError(f"{op_name} is not found in the Tritonbench operator list.")
RuntimeError: decoding_attentio is not found in the Tritonbench operator list.

(tritonbench) rodri@cluster-h200-01-f2:~/tritonbench$ python run.py --op decoding_attention --device cuda --metrics tflops speedup --output-json "baseline_decoding_attention.jsonl"
Traceback (most recent call last):
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 125, in <module>
    run()
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 121, in run
    _run(args, extra_args)
  File "/mnt/co-research/home/rodri/tritonbench/run.py", line 35, in _run
    Opbench = load_opbench_by_name(args.op)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/op.py", line 70, in load_opbench_by_name
    module = importlib.import_module(f"..{op_pkg}", package=__name__)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/miniconda3/envs/tritonbench/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/decoding_attention/__init__.py", line 1, in <module>
    from .operator import Operator
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/decoding_attention/operator.py", line 132, in <module>
    fmha.attn_bias.BlockDiagonalCausalWithOffsetPaddedKeysMask,
    ^^^^
NameError: name 'fmha' is not defined
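fmha here is xformers' fused-attention module, so the NameError suggests the optional xformers dependency failed to import in this environment. An unverified workaround is to install it and re-run:
pip install xformers
python run.py --op decoding_attention --mode fwd --device cuda --metrics latency,tflops,speedup --output-json "baseline_decoding_attention.jsonl"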
flash_attn:
python run.py --op flash_attention --mode fwd --device cuda --metrics latency,tflops,speedup --output-json "baseline_flash.jsonl"
- ERROR: NameError: name 'make_packed_qkv' is not defined
TMA benchmarks will be running with experimental grid constant TMA descriptor.
  0%|          | 0/8 [00:00<?, ?it/s]
Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 833, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
                                                  ^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 821, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
                   ^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 1076, in _do_bench
    fn = self._get_bm_func(fn_name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 716, in _get_bm_func
    fwd_fn = fwd_fn_lambda(*self.example_inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 532, in _inner
    return function(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/flash_attention/operator.py", line 254, in flash_v2
    qkv = make_packed_qkv(q, k, v)
          ^^^^^^^^^^^^^^^
NameError: name 'make_packed_qkv' is not defined
(Batch, Heads, SeqLen, Dhead)
-------------------------------
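make_packed_qkv looks like a helper that is only defined when the flash-attn package imports successfully, so the flash_v2 implementation fails without it. Two unverified workarounds: install flash-attn, or skip just that implementation with the --skip flag documented above:
pip install flash-attn --no-build-isolation
# or
python run.py --op flash_attention --mode fwd --device cuda --metrics latency,tflops,speedup --skip flash_v2 --output-json "baseline_flash.jsonl"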
layer_norm:
python run.py --op layer_norm --mode fwd --device cuda --metrics latency,tflops,speedup --output-json "baseline_layernorm.jsonl"
rms_norm:
python run.py --op rms_norm --mode fwd --device cuda --metrics latency,tflops,speedup --output-json "baseline_rmsnorm.jsonl"
- ERROR: TypeError: 'NoneType' object is not callable
Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 833, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
                                                  ^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 821, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
                   ^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 1076, in _do_bench
    fn = self._get_bm_func(fn_name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 716, in _get_bm_func
    fwd_fn = fwd_fn_lambda(*self.example_inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/utils/triton_op.py", line 532, in _inner
    return function(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/co-research/home/rodri/tritonbench/tritonbench/operators/rms_norm/operator.py", line 62, in liger_rms
    self.liger_rms_op = LigerRMSNorm(hidden_size=H, eps=self.eps).to(self.device)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
(M, H)
--------
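LigerRMSNorm resolving to None points to the Liger-Kernel dependency not being importable in this environment. Unverified workarounds: install it, or skip the liger_rms implementation:
pip install liger-kernel
# or
python run.py --op rms_norm --mode fwd --device cuda --metrics latency,tflops,speedup --skip liger_rms --output-json "baseline_rmsnorm.jsonl"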
softmax:
python run.py --op softmax --mode fwd --device cuda --metrics latency,tflops,speedup --output-json "baseline_softmax.jsonl"