It would be very nice to have high-level methods of norm and dot for JuliaGPU/GPUArrays.jl#66 and JuliaGPU/GPUArrays.jl#122. It seems to be possible to get a working version of such high-level functions by adding the following to highlevel.jl:
## NRM2
import Base.LinAlg.BLAS: nrm2
for (func, elty) in [(:clblasSnrm2, Float32), (:clblasDnrm2, Float64),
                     (:clblasCnrm2, CL_float2), (:clblasZnrm2, CL_double2)]
    # (the complex entry points may need to be clblasScnrm2/clblasDznrm2,
    # matching the clBLAS C API, rather than clblasCnrm2/clblasZnrm2)
    @eval function nrm2(n::Integer, x::CLArray{$elty}, incx::Integer;
                        queue=cl.queue(x))
        # need temporary buffers: clBLAS writes the result into norm2_buff
        # and requires a scratch buffer of at least 2*N elements
        ctx = cl.context(x)
        norm2_buff = cl.Buffer($elty, ctx, :w, 1)
        scratch_buff = cl.Buffer($elty, ctx, :rw, 2*length(x))
        $func(Csize_t(n), pointer(norm2_buff), Csize_t(0), pointer(x), Csize_t(0), Cint(incx),
              pointer(scratch_buff), [queue])
        # read the scalar result back to the host (blocking read)
        result = Vector{$elty}(1)
        cl.enqueue_read_buffer(queue, norm2_buff, result, Csize_t(0), nothing, true)
        @inbounds norm2 = result[1]
        return norm2
    end
end
However, this seems to be far from optimal. The corresponding clBLAS functions use a temporary buffer scratch_buff that has to be allocated anew for each call; one possible way to avoid that is sketched below.
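One option might be to cache the scratch buffer per context and element type and only grow it when a larger one is requested. The following is just a rough sketch of that idea; SCRATCH_CACHE and scratch_for are made-up names and not part of CLBLAS.jl or OpenCL.jl:

# Hypothetical sketch: keep one scratch buffer per (context, eltype) and
# re-allocate only when a call needs a larger one. Not thread-safe, and
# reuse while earlier kernels are still in flight is not handled here.
const SCRATCH_CACHE = Dict{Any,Any}()

function scratch_for(::Type{T}, ctx::cl.Context, n::Integer) where T
    key = (ctx, T)
    entry = get(SCRATCH_CACHE, key, nothing)
    if entry === nothing || entry[2] < n
        buf = cl.Buffer(T, ctx, :rw, n)   # allocate (or grow) the scratch buffer once
        SCRATCH_CACHE[key] = (buf, n)
        return buf
    end
    return entry[1]
end

The nrm2 wrapper above could then call scratch_for($elty, ctx, 2*length(x)) instead of allocating a fresh cl.Buffer on every invocation; whether a cached buffer can be reused safely while earlier kernels on the same queue are still running would still need to be checked.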
Here are some benchmarks using the implementation above (I did not make a PR since I think the implementation is too poor). I have used @time for the last tests, since I get the following error if I run @benchmark on LinAlg.BLAS.nrm2(length(dvl), GPUArrays.blasbuffer(dvl), 1):
julia> @benchmark LinAlg.BLAS.nrm2(length($dvl), GPUArrays.blasbuffer($dvl), 1)
ERROR: CLError(code=-4, CL_MEM_OBJECT_ALLOCATION_FAILURE)
Stacktrace:
[1] #clblasSnrm2#119(::Array{Ptr{Void},1}, ::Function, ::UInt64, ::Ptr{Void}, ::UInt64, ::Ptr{Void}, ::UInt64, ::Int32, ::Ptr{Void}, ::Array{OpenCL.cl.CmdQueue,1}) at /home/.../.julia/v0.6/CLBLAS/src/macros.jl:132
[2] #nrm2#451(::OpenCL.cl.CmdQueue, ::Function, ::Int64, ::OpenCL.cl.CLArray{Float32,1}, ::Int64) at /home/.../.julia/v0.6/CLBLAS/src/highlevel.jl:57
[3] ##core#743(::CLArrays.CLArray{Float32,1}, ::CLArrays.CLArray{Float32,1}) at /home/.../.julia/v0.6/BenchmarkTools/src/execution.jl:316
[4] ##sample#744(::BenchmarkTools.Parameters) at /home/.../.julia/v0.6/BenchmarkTools/src/execution.jl:324
[5] #_lineartrial#23(::Int64, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#742")}, ::BenchmarkTools.Parameters) at /home/.../.julia/v0.6/BenchmarkTools/src/execution.jl:92
[6] _lineartrial(::BenchmarkTools.Benchmark{Symbol("##benchmark#742")}, ::BenchmarkTools.Parameters) at /home/.../.julia/v0.6/BenchmarkTools/src/execution.jl:84
[7] #lineartrial#20(::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#742")}, ::BenchmarkTools.Parameters) at /home/.../.julia/v0.6/BenchmarkTools/src/execution.jl:47
[8] #tune!#26(::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol("##benchmark#742")}, ::BenchmarkTools.Parameters) at /home/.../.julia/v0.6/BenchmarkTools/src/execution.jl:156
[9] tune!(::BenchmarkTools.Benchmark{Symbol("##benchmark#742")}) at /home/.../.julia/v0.6/BenchmarkTools/src/execution.jl:155
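For reference, the kind of call that works when timed once with @time but fails under @benchmark looks roughly like the following; the array construction is only an assumed example, not the actual test vectors and sizes used above:

using CLArrays, GPUArrays, CLBLAS, BenchmarkTools

dvl = CLArray(rand(Float32, 10^6))   # assumed example vector, not the actual test data

# a single call, timed with @time, completes
@time LinAlg.BLAS.nrm2(length(dvl), GPUArrays.blasbuffer(dvl), 1)

# @benchmark repeats the call many times; every call allocates a fresh
# norm2_buff and scratch_buff on the device, which presumably is what
# ends in the CL_MEM_OBJECT_ALLOCATION_FAILURE shown above
@benchmark LinAlg.BLAS.nrm2(length($dvl), GPUArrays.blasbuffer($dvl), 1)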