Creating many small copies while benchmarking slows AMDGPU.jl down considerably, which is not observed with CUDA.jl. The solution for this specific code is to avoid allocations altogether, but this is (maybe?) not possible for every kind of code. (I also remember having some issues with BenchmarkTools, but cannot reproduce them right now.) Sharing the code here for future reference:
using AMDGPU, BSON

n_values=(2 .^(1:14))
timings=zeros(2,length(n_values))

# Variant 1: allocate fresh copies of A and B in every iteration.
function mybelapsed(A, B)
    AMDGPU.rocBLAS.gemm('N','N',copy(A),copy(B))  # warm-up call
    t=0.0
    k=0
    # Sample until 1 s of measured time or 1e5 iterations.
    while (k<1e5 && t<1)
        Acpy=copy(A)
        Bcpy=copy(B)
        AMDGPU.synchronize()
        t+= @elapsed (AMDGPU.@sync AMDGPU.rocBLAS.gemm('N','N',Acpy,Bcpy))
        AMDGPU.synchronize()
        k+=1
    end
    return t/k
end

# Variant 2: allocate the copies once and reset them in place.
function mybelapsed2(A, B)
    AMDGPU.rocBLAS.gemm('N','N',copy(A),copy(B))  # warm-up call
    t=0.0
    k=0
    Acpy=copy(A)
    Bcpy=copy(B)
    # Same stopping rule as mybelapsed.
    while (k<1e5 && t<1)
        AMDGPU.synchronize()
        t+= @elapsed (AMDGPU.@sync AMDGPU.rocBLAS.gemm('N','N',Acpy,Bcpy))
        AMDGPU.synchronize()
        Acpy.=A  # in-place reset, no new allocation
        Bcpy.=B
        k+=1
    end
    return t/k
end

for (i,n) in enumerate(n_values)
    A=ROCArray(rand(Float32,n,n));
    B=ROCArray(rand(Float32,n,n));
    println(n)
    timings[1,i]=mybelapsed(A,B)   # row 1: with copies
    GC.gc()
    sleep(1)
    timings[2,i]=mybelapsed2(A,B)  # row 2: reused buffers
    GC.gc()
    sleep(1)
    BSON.@save "AMD_matmul_bench.bson" timings
end
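To put a number on the gap, the two rows of timings can be compared per matrix size. This is only a small sketch, assuming the layout used in the script above (row 1 with fresh copies, row 2 with reused buffers); slowdown_table is a hypothetical helper, not part of AMDGPU.jl or BSON.

```julia
# Hypothetical helper: ratio of per-call time with fresh copies (row 1)
# to per-call time with reused buffers (row 2), for each matrix size.
function slowdown_table(n_values, timings)
    return [(n = n,
             with_copies = timings[1, i],
             reused = timings[2, i],
             ratio = timings[1, i] / timings[2, i])
            for (i, n) in enumerate(n_values)]
end

# Usage, assuming n_values and timings from the script are still in scope:
# for row in slowdown_table(n_values, timings)
#     println(row.n, "\t", row.ratio)
# end
```

A ratio well above 1 at small n would correspond to the slowdown described here.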
Adding AMDGPU.unsafe_free! in every iteration does not solve the problem, and neither does turning the GC off and manually running GC.enable(true); AMDGPU.unsafe_free!(Acpy); AMDGPU.unsafe_free!(Bcpy); GC.gc(); sleep(0.001); GC.enable(false); between iterations. The same code with AMDGPU replaced by CUDA (and the rocBLAS gemm call by Acpy*Bcpy) shows barely any performance difference between the two versions (performance is even slightly better and more stable when using copies):
Thanks for reporting. It would be interesting to profile this further with rocprof and compare the trace with one from CUDA Nsight, to see where the slowdown occurs when copying. It seems the extra copies on AMDGPU somehow keep the device busy, preventing it from running the compute tasks at the expected performance for small arrays.
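As a rough first step before collecting a full trace, the copy phase and the compute phase could be timed separately to see which one grows. The sketch below is an assumption-laden illustration, not a replacement for rocprof: split_timings is a hypothetical helper, and the compute and sync keywords are placeholders so the same function runs on plain CPU arrays as well as on ROCArrays.

```julia
# Hypothetical helper: average time per iteration spent in copying
# versus in the compute call, measured separately.
function split_timings(A, B; compute = (X, Y) -> X * Y,
                       sync = () -> nothing, iters = 100)
    compute(copy(A), copy(B))  # warm-up, so compilation is not timed
    sync()
    t_copy = 0.0
    t_compute = 0.0
    for _ in 1:iters
        sync()
        # Time the two copies, including the wait for them to finish.
        t_copy += @elapsed begin
            Acpy = copy(A)
            Bcpy = copy(B)
            sync()
        end
        # Time the compute call on the fresh copies.
        t_compute += @elapsed begin
            compute(Acpy, Bcpy)
            sync()
        end
    end
    return (copy_s = t_copy / iters, compute_s = t_compute / iters)
end

# On the GPU this might be called as follows (assuming the same calls
# as in the script from the report):
# split_timings(A, B;
#     compute = (X, Y) -> AMDGPU.rocBLAS.gemm('N', 'N', X, Y),
#     sync = AMDGPU.synchronize)
```

If compute_s rises when copies are made each iteration while copy_s stays small, that would point at the copies interfering with the following kernel rather than at the copy cost itself.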
@jpsamaroo @vchuravy @pxl-th