Nsight Compute fails when profiling kernel performance #50

weilinquan · 2024-09-24T13:07:50Z

I compiled my example with the following pass pipeline:
oec-opt --stencil-shape-inference --convert-stencil-to-std --cse --parallel-loop-tiling='parallel-loop-tile-sizes=128,1,1' --canonicalize --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --canonicalize --lower-affine --convert-scf-to-std --stencil-kernel-to-cubin ../test/Examples/test.mlir > temp.mlir
mlir-translate --mlir-to-llvmir temp.mlir > temp.bc
llc -O3 temp.bc -o temp.s
clang -c temp.s -o temp.o
nvcc --default-stream per-thread -allow-unsupported-compiler -ccbin clang main.cc temp.o -lcuda-runtime-wrappers -lcudart -lcuda
The attached zip contains main.cc and test.mlir. Is any step in my pipeline wrong? I want to use ncu to profile the kernel in more detail. Thank you for your help!
test.zip

gysit · 2024-09-24T16:34:34Z

I don't have a setup to reproduce this anymore. Does the code itself work? Note that this was tested on a much older CUDA version, so things may no longer work with current versions.

In this comment I discussed a somewhat more extended pass pipeline with more optimizations:
#46 (comment)

Maybe that pipeline works. However, if the code is functional, then I suspect there is some tool incompatibility.
