Allow cluster sizes across m,n,k to be reported in cutlass profiler #2078

mandroid6 · 2025-02-04T23:46:00Z

Currently cutlass profiler lists down all the arguments to the benchmark but doesn't list down per kernel values for cluster_k, cluster_m and cluster_n.

This change updates the profiler report generation to include these arguments.

Before:

As we see below, the values for cluster_m,cluster_n,cluster_k are missing in the kernel result.

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,,,,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633

After:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,1,2,1,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633

Repro commands:

Build cutlass

git clone https://github.com/NVIDIA/cutlass
cd cutlass
mkdir build
cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=cutlass3x_sm90_tensorop_s*16gemm_bf16_bf16_f32_bf16_bf16_*tnn* -DCUTLASS_ENABLE_TESTS=OFF -GNinja -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL=9992 -DCUTLASS_LIBRARY_OPERATIONS=Gemm

Run profiler

 ./tools/profiler/cutlass_profiler --operation=Gemm --output=data --dist=gaussian,mean:0.0,stddev:1.0,scale:-1 --m=4352 --n=4096 --k=4096 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column

Currently cutlass profiler lists down all the arguments to the benchmark but doesn't list down per kernel values for cluster_k, cluster_m and cluster_n. This change updates the profiler report generation to include these arguments.

mandroid6 · 2025-02-05T00:03:44Z

@hwu36 @kerrmudgeon

hwu36 · 2025-02-05T12:04:35Z

@itramble , could you please review first?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow cluster sizes across m,n,k to be reported in cutlass profiler #2078

Allow cluster sizes across m,n,k to be reported in cutlass profiler #2078

mandroid6 commented Feb 4, 2025 •

edited

Loading

mandroid6 commented Feb 5, 2025

hwu36 commented Feb 5, 2025 •

edited

Loading

Allow cluster sizes across m,n,k to be reported in cutlass profiler #2078

Are you sure you want to change the base?

Allow cluster sizes across m,n,k to be reported in cutlass profiler #2078

Conversation

mandroid6 commented Feb 4, 2025 • edited Loading

Repro commands:

mandroid6 commented Feb 5, 2025

hwu36 commented Feb 5, 2025 • edited Loading

mandroid6 commented Feb 4, 2025 •

edited

Loading

hwu36 commented Feb 5, 2025 •

edited

Loading