Allow cluster sizes across m,n,k to be reported in cutlass profiler #2078

Open: 1 commit to merge into base branch main

@mandroid6 mandroid6 commented Feb 4, 2025

Currently, the CUTLASS profiler lists all the benchmark arguments in its report, but it leaves the per-kernel values for cluster_m, cluster_n, and cluster_k empty.

This change updates the profiler report generation to include these arguments.

Before:

As shown below, the values for cluster_m, cluster_n, and cluster_k are missing from the kernel result.

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,,,,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633

After:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,1,2,1,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633
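
The difference between the two reports can be checked mechanically. The following is a minimal sketch, not part of this PR: it looks up the three cluster columns by name and reports which ones are empty in each row. The helper name `empty_cluster_columns` is hypothetical, and the sample rows are trimmed to a few of the columns from the reports above to keep the example short. Because columns are looked up by header name, the same function should also work on the full report text.

```python
import csv
import io

CLUSTER_COLUMNS = ("cluster_m", "cluster_n", "cluster_k")


def empty_cluster_columns(csv_text):
    """Return, for each data row, the cluster columns whose value is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        [col for col in CLUSTER_COLUMNS if not (row.get(col) or "").strip()]
        for row in reader
    ]


# Trimmed copies of the rows above (only a subset of the real columns).
HEADER = "Problem,Provider,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages"
BEFORE_ROW = "1,CUTLASS,128,128,64,,,,7"
AFTER_ROW = "1,CUTLASS,128,128,64,1,2,1,7"

before = empty_cluster_columns(HEADER + "\n" + BEFORE_ROW)
after = empty_cluster_columns(HEADER + "\n" + AFTER_ROW)

print(before)  # [['cluster_m', 'cluster_n', 'cluster_k']]
print(after)   # [[]]
```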

Repro commands:

Build CUTLASS

git clone https://github.com/NVIDIA/cutlass
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=90a "-DCUTLASS_LIBRARY_KERNELS=cutlass3x_sm90_tensorop_s*16gemm_bf16_bf16_f32_bf16_bf16_*tnn*" -DCUTLASS_ENABLE_TESTS=OFF -GNinja -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL=9992 -DCUTLASS_LIBRARY_OPERATIONS=Gemm
cmake --build . --target cutlass_profiler

Run profiler

 ./tools/profiler/cutlass_profiler --operation=Gemm --output=data --dist=gaussian,mean:0.0,stddev:1.0,scale:-1 --m=4352 --n=4096 --k=4096 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column
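
Once the run has produced a report, the new columns can be cross-checked against the kernel name. The sketch below assumes that the operation name encodes the CTA tile shape followed by the cluster shape, as `_128x128x64_1x2x1_` appears to do in the example above. That reading is inferred from this single kernel name and its reported values, not from a documented naming rule, and the helper `shapes_from_kernel_name` is hypothetical.

```python
import re

# Assumed pattern: _<cta_m>x<cta_n>x<cta_k>_<cluster_m>x<cluster_n>x<cluster_k>_
SHAPE_PATTERN = re.compile(r"_(\d+)x(\d+)x(\d+)_(\d+)x(\d+)x(\d+)_")


def shapes_from_kernel_name(name):
    """Return ((cta_m, cta_n, cta_k), (cluster_m, cluster_n, cluster_k)) or None."""
    match = SHAPE_PATTERN.search(name)
    if match is None:
        return None
    values = tuple(int(group) for group in match.groups())
    return values[:3], values[3:]


name = (
    "cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_"
    "128x128x64_1x2x1_0_tnn_align8"
)
tile, cluster = shapes_from_kernel_name(name)

print(tile)     # (128, 128, 64)
print(cluster)  # (1, 2, 1)
```

For the kernel above, the extracted cluster shape agrees with the cluster_m, cluster_n, and cluster_k values in the "After" row.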

@mandroid6 (Author)

@hwu36 @kerrmudgeon

@hwu36 (Collaborator) commented Feb 5, 2025

@itramble, could you please review first?
