Implement reverse lookup (Ptr->Tuple) for CUDNN descriptors. #1948
Conversation
Here's a small script that demonstrates the loading and saving:

```julia
import Pkg
Pkg.activate("@CUDAPerfCachingTest")
# Setup:
# I don't think you can link to the cuDNN.jl module within CUDA.jl directly, so
# you'll have to clone github.com/romeov/CUDA.jl and then link
# Pkg.develop(path="<local>/romeov/CUDA.jl/lib/cudnn")
# Also
# Pkg.add("Flux")
# Pkg.add("JLD2")
#
# Execute e.g. with `julia caching_test.jl save` or `julia caching_test.jl load`,
# or just `julia caching_test.jl`.

using Flux, JLD2
import Flux.Zygote: gradient

function load_conv_caches!(; cudnn_mod::Module=Flux.cuDNN, filename="/tmp/conv_cache.jld2")
    @info "Loading conv_cache."
    conv_data_cache = JLD2.load(filename, "conv_data_cache")
    push!(cudnn_mod.cudnnConvolutionBwdDataAlgoPerfCache, conv_data_cache...)
    conv_filter_cache = JLD2.load(filename, "conv_filter_cache")
    push!(cudnn_mod.cudnnConvolutionBwdFilterAlgoPerfCache, conv_filter_cache...)
end

function save_conv_caches(; cudnn_mod::Module=Flux.cuDNN, filename="/tmp/conv_cache.jld2")
    @info "Storing conv_cache."
    JLD2.save(filename,
              "conv_data_cache", cudnn_mod.cudnnConvolutionBwdDataAlgoPerfCache,
              "conv_filter_cache", cudnn_mod.cudnnConvolutionBwdFilterAlgoPerfCache)
end

if "load" in ARGS
    load_conv_caches!()
end

model = Chain(Conv((3, 3), 3 => 64, relu; pad=SamePad()),
              Conv((3, 3), 64 => 32, relu),
              GlobalMeanPool(),
              Flux.flatten,
              Dense(32 => 1))
x = rand(Float32, 32, 32, 3, 7);

let x = gpu(x), model = gpu(model), ps = Flux.params(model)
    t0 = time()
    gradient(ps) do
        model(x) |> sum
    end
    @info "done in $(time() - t0) seconds :)"
end;

("load" in ARGS) && @show length(Flux.cuDNN.cudnnConvolutionBwdDataAlgoPerfCache)

if "save" in ARGS
    save_conv_caches()
end
```
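Run `julia caching_test.jl save` once to populate `/tmp/conv_cache.jld2`, then `julia caching_test.jl load` to seed the caches from disk; the timing printed at the end lets you compare a cold run against a pre-populated one.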
lib/cudnn/src/convolution.jl
```julia
# Helper fct to recover cudnn descriptor tuples from cudnn descriptor pointers
# so that we can cache algorithms based on data descriptors.
# Actually just reverses the cache dict and returns the descriptor as a tuple.
map_cudnn_ptr_to_jl_tuple(cache_dict, desc_ptr) = Dict(zip(values(cache_dict),
                                                           keys(cache_dict)))[desc_ptr]
```
Instead of recreating the cache in reversed form and searching it every time (expensive!), CUDA.jl provides functions for pulling out the info from a descriptor (cheap!). See
CUDA.jl/lib/cudnn/src/tensor.jl, lines 49–72 at commit 864ec5e:
```julia
function cudnnGetTensorDescriptor(d::cudnnTensorDescriptor)
    nbDimsRequested = CUDNN_DIM_MAX
    dataType = Ref{cudnnDataType_t}(CUDNN_DATA_FLOAT)
    nbDims = Ref{Cint}(0)
    dimA = Array{Cint}(undef, CUDNN_DIM_MAX)
    strideA = Array{Cint}(undef, CUDNN_DIM_MAX)
    cudnnGetTensorNdDescriptor(d, nbDimsRequested, dataType, nbDims, dimA, strideA)
    T = juliaDataType(dataType[])
    D = (dimA[nbDims[]:-1:1]...,)
    S = (strideA[nbDims[]:-1:1]...,)
    return T, D, S
end

function cudnnGetFilterDescriptor(d::cudnnFilterDescriptor)
    nbDimsRequested = CUDNN_DIM_MAX
    dataType = Ref{cudnnDataType_t}(CUDNN_DATA_FLOAT)
    format = Ref{cudnnTensorFormat_t}(CUDNN_TENSOR_NCHW)
    nbDims = Ref{Cint}(0)
    dimA = Array{Cint}(undef, CUDNN_DIM_MAX)
    cudnnGetFilterNdDescriptor(d, nbDimsRequested, dataType, format, nbDims, dimA)
    T = juliaDataType(dataType[])
    D = (dimA[nbDims[]:-1:1]...,)
    return T, D, format[]
end
```
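Concretely, applying those getters means building the cache key from the descriptor contents rather than reversing the dict. A minimal sketch of that lookup-key construction (using `cudnnGetConvolutionDescriptor`, the analogous getter that appears in the diff further below):

```julia
# Query each descriptor once and use the resulting plain-Julia tuples as the
# cache key, instead of reversing the cache dict on every lookup.
xDesc_native    = cudnnGetTensorDescriptor(xDesc)
dyDesc_native   = cudnnGetTensorDescriptor(dyDesc)
convDesc_native = cudnnGetConvolutionDescriptor(convDesc)
key = (xDesc_native, dyDesc_native, convDesc_native)
```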
Done.
The descriptors as they are still have some Cenum types in them, which we could convert to Julia Ints or something if we run into serialization trouble.
There are already `cudnnGetTensorDescriptor` and `cudnnGetFilterDescriptor`, so now we have everything we need to cache algorithm performances. However, there are still a few `CUDNN_xyz_t` datatypes, which are Cenums. We could still map those to Julia integers if serialization is difficult otherwise.
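A minimal sketch of that fallback, assuming the cached keys are tuples whose fields may be Cenum values (the helper names here are hypothetical, not part of the PR):

```julia
using CEnum: Cenum

# Hypothetical helper: replace any Cenum (e.g. a CUDNN_xyz_t value) in a
# descriptor tuple with its plain integer code before serializing, so the
# saved file doesn't depend on CUDA.jl's enum definitions.
serializable(t::Tuple) = map(x -> x isa Cenum ? Int(x) : x, t)

# Restoring needs the concrete enum type per field; e.g. the filter
# descriptor tuple (T, dims, format) ends in a cudnnTensorFormat_t, so:
# restore_filter(t) = (t[1], t[2], cudnnTensorFormat_t(t[3]))
```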
Force-pushed from a765151 to 9da9e11.
This is still marked WIP; anything to do here @RomeoV @ToucheSir?
From my end no; I didn't even notice the PR title still had WIP.
```julia
dyDesc_native = cudnnGetTensorDescriptor(dyDesc)
convDesc_native = cudnnGetConvolutionDescriptor(convDesc)

key = (xDesc_native, dyDesc_native, convDesc_native)
val = lock(cudnnConvolutionBwdFilterAlgoPerfCacheLock) do
    get(cudnnConvolutionBwdFilterAlgoPerfCache, (xDesc, dyDesc, convDesc), nothing)
```
@RomeoV whoops, I think I missed this line. It should be `get(cudnnConvolutionBwdFilterAlgoPerfCache, key, nothing)`, right?
Good catch, thanks. Opened another PR with that one-line change.
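For reference, the one-line change makes the lookup use the native-descriptor `key` built just above it:

```julia
# Corrected lookup: key the cache on the native descriptor tuples rather than
# the pointer-backed descriptor objects themselves.
val = lock(cudnnConvolutionBwdFilterAlgoPerfCacheLock) do
    get(cudnnConvolutionBwdFilterAlgoPerfCache, key, nothing)
end
```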
This is a follow-up to #1948.
This fixes #1947.