Installing and launching NSight is a bit much for most users, so we need something convenient à la `Base.@profile`.
Here's a CUPTI demo that traces API calls and kernel executions, which should be a good start for a profiler.
```julia
using CUDA
using CUDA.CUPTI

#XXX: (activity) callbacks are called from within CUDA; does that imply we cannot yield
#     and/or expect to be able to reach a safepoint (in order to trigger GC)?
#     Ref: https://github.com/JuliaLang/julia/issues/48147
#XXX: either CUPTI memorizes the callback pointers, or @cfunction isn't #265,
#     but we seem to need @invokelatest to make callbacks revisable.


# callback API

callback(userdata, domain, callback_id, data_ptr) =
    @invokelatest _callback(userdata, domain, callback_id, data_ptr)
function _callback(userdata, domain, callback_id, data_ptr)
    #XXX: see top; can we even perform I/O here? it doesn't seem to hang...
    try
        if domain == CUPTI.CUPTI_CB_DOMAIN_DRIVER_API
            callback_data = unsafe_load(convert(Ptr{CUPTI.CUpti_CallbackData}, data_ptr))
            site = callback_data.callbackSite == CUPTI.CUPTI_API_ENTER ? "enter" : "exit"
            @info "Callback $site: driver API call (id=$(callback_data.correlationId)) to $(unsafe_string(callback_data.functionName)) at $(time())"
        else
            @warn "Unsupported callback domain $domain"
        end
    catch err
        @error "Error during callback handling" exception=(err, catch_backtrace())
    end
    return
end


# activity API

const available_buffers = Vector{UInt8}[]
const active_buffers = Dict{Ptr{UInt8},Vector{UInt8}}()
const pending_completions = []

request_buffer(dest_ptr, sz_ptr, max_num_records_ptr) =
    @invokelatest _request_buffer(dest_ptr, sz_ptr, max_num_records_ptr)
complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz) =
    @invokelatest _complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz)

function _request_buffer(dest_ptr, sz_ptr, max_num_records_ptr)
    dest = Base.unsafe_wrap(Array, dest_ptr, 1)
    sz = Base.unsafe_wrap(Array, sz_ptr, 1)
    max_num_records = Base.unsafe_wrap(Array, max_num_records_ptr, 1)

    #XXX: do we need to use locks here, or will CUPTI only ever call us from one thread?
    #     if we need locks; see top message.

    # "For typical workloads, it's suggested to choose a size between 1 and 10 MB."
    buf = if isempty(available_buffers)
        #XXX: see top; is it safe to allocate here?
        Array{UInt8}(undef, 8*1024*1024)    # 8 MB
    else
        pop!(available_buffers)
    end
    ptr = pointer(buf)
    active_buffers[ptr] = buf

    dest[] = pointer(buf)
    sz[] = sizeof(buf)
    max_num_records[] = 0
    return
end

function _complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz)
    #XXX: this function *has* been observed to hang when doing I/O (see message at the top)
    #     so we defer the actual work to the main thread by using an async condition.
    push!(pending_completions, (ctx_handle, stream_id, buf_ptr, sz, valid_sz))
    @ccall uv_async_send(async_complete_buffer.handle::Ptr{Nothing})::Cint
    return
end

const async_complete_buffer = Base.AsyncCondition() do async_cond
    local_pending_completions = copy(pending_completions)
    empty!(pending_completions)
    for data in local_pending_completions
        try
            @invokelatest _async_complete_buffer(data...)
        catch err
            # we really can't fail here, or the async condition won't fire again
            @error "Error during asynchronous buffer completion" exception=(err, catch_backtrace())
        end
    end
    return
end

function _async_complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz)
    @info "Received CUPTI activity buffer (using $(Base.format_bytes(valid_sz))/$(Base.format_bytes(sz)))"
    buf = active_buffers[buf_ptr]
    delete!(active_buffers, buf_ptr)
    ctx = CUDA._CuContext(ctx_handle)

    # extract activity records
    record_ptr = Ref{Ptr{CUPTI.CUpti_Activity}}(C_NULL)
    while true
        try
            CUPTI.cuptiActivityGetNextRecord(buf_ptr, valid_sz, record_ptr)
            record = unsafe_load(record_ptr[])

            function typed_record(kind)
                typed_ptr = convert(Ptr{kind}, record_ptr[])
                unsafe_load(typed_ptr)
            end

            # driver API calls
            if record.kind == CUPTI.CUPTI_ACTIVITY_KIND_DRIVER
                api_record = typed_record(CUPTI.CUpti_ActivityAPI)
                name = Ref{Cstring}(C_NULL)
                CUPTI.cuptiGetCallbackName(CUPTI.CUPTI_CB_DOMAIN_DRIVER_API, api_record.cbid, name)
                t0, t1 = api_record.start/1e9, api_record._end/1e9
                println(" - host-side driver API (id=$(api_record.correlationId)): $(unsafe_string(name[])) $(format_time(t0, t1))")

            # kernel execution
            elseif record.kind in [CUPTI.CUPTI_ACTIVITY_KIND_KERNEL,
                                   CUPTI.CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL]
                kernel_record = typed_record(CUPTI.CUpti_ActivityKernel8)
                t0, t1 = kernel_record.start/1e9, kernel_record._end/1e9
                println(" - device-side kernel (id=$(kernel_record.correlationId)): $(unsafe_string(kernel_record.name)) $(format_time(t0, t1)), $(kernel_record.blockX)×$(kernel_record.blockY)×$(kernel_record.blockZ) blocks on a $(kernel_record.gridX)×$(kernel_record.gridY)×$(kernel_record.gridZ) grid, using $(Int(kernel_record.registersPerThread)) registers/thread")
            else
                @warn "Unsupported activity record $(record.kind)"
            end
        catch err
            if isa(err, CUPTIError) && err.code == CUPTI.CUPTI_ERROR_MAX_LIMIT_REACHED
                break
            end
            rethrow()
        end
    end

    push!(available_buffers, buf)
    return
end

function format_time(t0, t1)
    delta = t1 - t0

    io = IOBuffer()
    print(io, "at $t0 ")
    print(io, "taking ")
    if delta < 1e-6         # less than 1 microsecond
        print(io, round(delta * 1e9, digits=2), " ns")
    elseif delta < 1e-3     # less than 1 millisecond
        print(io, round(delta * 1e6, digits=2), " µs")
    elseif delta < 1        # less than 1 second
        print(io, round(delta * 1e3, digits=2), " ms")
    else
        print(io, round(delta, digits=2), " s")
    end
    return String(take!(io))
end

function main()
    ctx = context()

    callback_ptr = @cfunction(callback, Cvoid,
                              (Ptr{Cvoid}, CUPTI.CUpti_CallbackDomain,
                               CUPTI.CUpti_CallbackId, Ptr{Cvoid}))
    request_buffer_ptr = @cfunction(request_buffer, Cvoid,
                                    (Ptr{Ptr{UInt8}}, Ptr{Csize_t}, Ptr{Csize_t}))
    complete_buffer_ptr = @cfunction(complete_buffer, Cvoid,
                                     (CUDA.CUcontext, UInt32, Ptr{UInt8}, Csize_t, Csize_t))

    # NOTE: we only need the subscriber for the callback API
    subscriber_ref = Ref{CUPTI.CUpti_SubscriberHandle}()
    res = CUPTI.cuptiSubscribe(subscriber_ref, callback_ptr, C_NULL)
    subscriber = subscriber_ref[]

    activity_kinds = [
        CUPTI.CUPTI_ACTIVITY_KIND_DRIVER,
        CUPTI.CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL,
    ]

    try
        CUPTI.cuptiEnableDomain(1, subscriber, CUPTI.CUPTI_CB_DOMAIN_DRIVER_API)
        for activity_kind in activity_kinds
            CUPTI.cuptiActivityEnableContext(ctx, activity_kind)
        end
        #XXX: CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL instruments the kernel at module load
        #     time, so we either need to ensure we profile before the first compilation,
        #     or teach CUDA.jl to shard the module cache by whether CUPTI was enabled.
        CUPTI.cuptiActivityRegisterCallbacks(request_buffer_ptr, complete_buffer_ptr)

        actual_main()
        nothing
    finally
        CUPTI.cuptiUnsubscribe(subscriber)
        for activity_kind in activity_kinds
            CUPTI.cuptiActivityDisableContext(ctx, activity_kind)
        end
    end

    # flush all activity records, even incomplete ones
    CUPTI.cuptiActivityFlushAll(CUPTI.CUPTI_ACTIVITY_FLAG_FLUSH_FORCED)

    # wait for our asynchronous processing to complete
    while !isempty(active_buffers)
        yield()
    end
end

function actual_main()
    @cuda identity(nothing)
    synchronize()
end

isinteractive() || main()
```
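To reproduce the trace below, the script can also be `include`d in a REPL session and driven manually (the filename here is hypothetical; when run non-interactively, the trailing `isinteractive() || main()` takes care of this):

```julia-repl
julia> include("cupti_demo.jl")   # hypothetical filename; defines main() et al.

julia> main()
```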
This gives us the following data:
```
julia> main()
[ Info: Callback enter: driver API call (id=1149) to cuCtxGetCurrent at 1.690985659233663e9
[ Info: Callback exit: driver API call (id=1149) to cuCtxGetCurrent at 1.690985659233739e9
[ Info: Callback enter: driver API call (id=1150) to cuCtxGetCurrent at 1.690985659233794e9
[ Info: Callback exit: driver API call (id=1150) to cuCtxGetCurrent at 1.690985659234145e9
[ Info: Callback enter: driver API call (id=1151) to cuCtxGetCurrent at 1.690985659234211e9
[ Info: Callback exit: driver API call (id=1151) to cuCtxGetCurrent at 1.690985659234263e9
[ Info: Callback enter: driver API call (id=1152) to cuCtxGetCurrent at 1.690985659234329e9
[ Info: Callback exit: driver API call (id=1152) to cuCtxGetCurrent at 1.690985659234373e9
[ Info: Callback enter: driver API call (id=1153) to cuLaunchKernel at 1.690985659234421e9
[ Info: Callback exit: driver API call (id=1153) to cuLaunchKernel at 1.690985659332615e9
[ Info: Callback enter: driver API call (id=1154) to cuCtxGetCurrent at 1.690985659332642e9
[ Info: Callback exit: driver API call (id=1154) to cuCtxGetCurrent at 1.690985659332669e9
[ Info: Callback enter: driver API call (id=1155) to cuStreamQuery at 1.690985659332701e9
[ Info: Callback exit: driver API call (id=1155) to cuStreamQuery at 1.690985659332739e9
[ Info: Callback enter: driver API call (id=1156) to cuCtxGetCurrent at 1.690985659332769e9
[ Info: Callback exit: driver API call (id=1156) to cuCtxGetCurrent at 1.690985659332797e9
[ Info: Callback enter: driver API call (id=1157) to cuStreamSynchronize at 1.690985659332825e9
[ Info: Callback exit: driver API call (id=1157) to cuStreamSynchronize at 1.690985659332854e9
[ Info: Callback enter: driver API call (id=1158) to cuCtxGetCurrent at 1.690985659332883e9
[ Info: Callback exit: driver API call (id=1158) to cuCtxGetCurrent at 1.690985659332911e9
[ Info: Received CUPTI activity buffer (using 608 bytes/8.000 MiB)
 - host-side driver API (id=1150): cuCtxGetCurrent at 1.6909856592338278e9 taking 305.18 µs
 - device-side kernel (id=1153): _Z8identityv at 1.690985659332649e9 taking 715.26 ns, 1×1×1 blocks on a 1×1×1 grid, using 16 registers/thread
 - host-side driver API (id=1153): cuLaunchKernel at 1.69098565923445e9 taking 98.16 ms
 - host-side driver API (id=1154): cuCtxGetCurrent at 1.6909856593326595e9 taking 715.26 ns
 - host-side driver API (id=1155): cuStreamQuery at 1.690985659332727e9 taking 2.86 µs
 - host-side driver API (id=1156): cuCtxGetCurrent at 1.6909856593327875e9 taking 715.26 ns
 - host-side driver API (id=1157): cuStreamSynchronize at 1.6909856593328438e9 taking 1.43 µs
 - host-side driver API (id=1158): cuCtxGetCurrent at 1.690985659332901e9 taking 953.67 ns
 - host-side driver API (id=1159): cuCtxGetCurrent at 1.6909856593329332e9 taking 238.42 ns
```
The real question is how to present this to the user in a way that's useful. Maybe we should look back at what nvprof used to show: it had a summary mode, a GPU trace mode, and an API trace mode. See https://docs.nvidia.com/cuda/profiler-users-guide/index.html#profiling-modes for more details.
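As a sketch of what that summary mode could look like on top of the records collected above — hypothetical throughout: it assumes the record-processing loop is changed to `push!` `(name, duration)` pairs into a global `events` vector instead of `println`ing each record:

```julia
# hypothetical: populated by the activity-record loop with (name, seconds) pairs
const events = Tuple{String,Float64}[]

# nvprof-style summary: per-name total time share, call count, and average
function summarize(events)
    stats = Dict{String,Vector{Float64}}()
    for (name, duration) in events
        push!(get!(stats, name, Float64[]), duration)
    end
    total = sum(sum, values(stats))
    println(rpad("Name", 30), lpad("Time%", 8), lpad("Calls", 8), lpad("Avg", 12))
    for (name, times) in sort(collect(stats); by = kv -> sum(kv[2]), rev = true)
        println(rpad(name, 30),
                lpad(string(round(100 * sum(times) / total; digits=2), "%"), 8),
                lpad(length(times), 8),
                lpad(string(round(sum(times) / length(times) * 1e6; digits=2), " µs"), 12))
    end
end
```

Separating host-side API calls from device-side kernel records, and correlating the two via `correlationId`, would be the obvious refinement.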
This is nice! I do like nvprof output. Would it be possible to show performance counters as well? For example, I find it useful to see if there are any local loads and stores as they typically indicate poor performance for the kernels I am interested in.
> Would it be possible to show performance counters as well?
Over time, yes, but that changes the runtime characteristics significantly (potentially having to replay the execution multiple times). I'm going to focus on the simple things first, and IIRC @vchuravy was working on something portable for reporting metrics anyway.
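In the meantime, a crude proxy for spotting local loads/stores already exists outside CUPTI: CUDA.jl's compiler reflection can report a compiled kernel's static resource usage, where nonzero local memory usually indicates register spills. A quick sketch (the `vadd!` kernel is just a stand-in):

```julia
using CUDA

function vadd!(c, a, b)
    i = threadIdx().x
    @inbounds c[i] = a[i] + b[i]
    return
end

a = CUDA.rand(Float32, 32); b = CUDA.rand(Float32, 32); c = similar(a)

# compile without launching, then inspect the kernel object
kernel = @cuda launch=false vadd!(c, a, b)
CUDA.registers(kernel)   # registers used per thread
CUDA.memory(kernel)      # static (local = ..., shared = ..., constant = ...) usage
```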