Native profiler using CUPTI #2017

Closed · maleadt opened this issue Aug 2, 2023 · 3 comments · Fixed by #2024
Labels: enhancement (New feature or request)

maleadt (Member) commented Aug 2, 2023

Installing and launching Nsight is a bit much for most users, so we need something convenient à la Base.@profile.
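For the user-facing entry point, something along these lines would be ideal (a hypothetical macro, shown only to sketch the desired UX; the actual name and semantics are up for discussion):

CUDA.@profile begin
    @cuda identity(nothing)
    synchronize()
end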

Here's a CUPTI demo that traces API calls and kernel executions, which should be a good start for a profiler.
using CUDA
using CUDA.CUPTI

# XXX: (activity) callbacks are called from within CUDA; does that imply we cannot yield
#      and/or expect to be able to reach a safepoint (in order to trigger GC)?
# Ref: https://github.com/JuliaLang/julia/issues/48147

# XXX: either CUPTI memorizes the callback pointers, or @cfunction isn't #265,
#      but we seem to need @invokelatest to make callbacks revisable.

# callback API
callback(userdata, domain, callback_id, data_ptr) =
    @invokelatest _callback(userdata, domain, callback_id, data_ptr)
function _callback(userdata, domain, callback_id, data_ptr)
    # XXX: see top; can we even perform I/O here? it doesn't seem to hang...
    try
        if domain == CUPTI.CUPTI_CB_DOMAIN_DRIVER_API
            callback_data = unsafe_load(convert(Ptr{CUPTI.CUpti_CallbackData}, data_ptr))
            site = callback_data.callbackSite == CUPTI.CUPTI_API_ENTER ? "enter" : "exit"

            @info "Callback $site: driver API call (id=$(callback_data.correlationId)) to $(unsafe_string(callback_data.functionName)) at $(time())"
        else
            @warn "Unsupported callback domain $domain"
        end
    catch err
        @error "Error during callback handling" exception=(err, catch_backtrace())
    end
    return
end

# activity API
const available_buffers = Vector{UInt8}[]
const active_buffers = Dict{Ptr{UInt8},Vector{UInt8}}()
const pending_completions = []
request_buffer(dest_ptr, sz_ptr, max_num_records_ptr) =
    @invokelatest _request_buffer(dest_ptr, sz_ptr, max_num_records_ptr)
complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz) =
    @invokelatest _complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz)
function _request_buffer(dest_ptr, sz_ptr, max_num_records_ptr)
    dest = Base.unsafe_wrap(Array, dest_ptr, 1)
    sz = Base.unsafe_wrap(Array, sz_ptr, 1)
    max_num_records = Base.unsafe_wrap(Array, max_num_records_ptr, 1)

    # XXX: do we need to use locks here, or will CUPTI only ever call us from one thread?
    #      if we do need locks, see the message at the top.

    # "For typical workloads, it's suggested to choose a size between 1 and 10 MB."
    buf = if isempty(available_buffers)
        # XXX: see top; is it safe to allocate here?
        Array{UInt8}(undef, 8 * 1024 * 1024)  # 8 MB
    else
        pop!(available_buffers)
    end
    ptr = pointer(buf)
    active_buffers[ptr] = buf

    dest[] = pointer(buf)
    sz[] = sizeof(buf)
    max_num_records[] = 0

    return
end
function _complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz)
    # XXX: this function *has* been observed to hang when doing I/O (see message at the top)
    #      so we defer the actual work to the main thread by using an async condition.

    push!(pending_completions, (ctx_handle, stream_id, buf_ptr, sz, valid_sz))
    @ccall uv_async_send(async_complete_buffer.handle::Ptr{Nothing})::Cint

    return
end
const async_complete_buffer = Base.AsyncCondition() do async_cond
    local_pending_completions = copy(pending_completions)
    empty!(pending_completions)

    for data in local_pending_completions
        try
            @invokelatest _async_complete_buffer(data...)
        catch err
            # we really can't fail here, or the async condition won't fire again
            @error "Error during asynchronous buffer completion" exception=(err, catch_backtrace())
        end
    end

    return
end
function _async_complete_buffer(ctx_handle, stream_id, buf_ptr, sz, valid_sz)
    @info "Received CUPTI activity buffer (using $(Base.format_bytes(valid_sz))/$(Base.format_bytes(sz)))"

    buf = active_buffers[buf_ptr]
    delete!(active_buffers, buf_ptr)

    ctx = CUDA._CuContext(ctx_handle)

    # extract activity records
    record_ptr = Ref{Ptr{CUPTI.CUpti_Activity}}(C_NULL)
    while true
        try
            CUPTI.cuptiActivityGetNextRecord(buf_ptr, valid_sz, record_ptr)

            record = unsafe_load(record_ptr[])
            function typed_record(kind)
                typed_ptr = convert(Ptr{kind}, record_ptr[])
                unsafe_load(typed_ptr)
            end

            # driver API calls
            if record.kind == CUPTI.CUPTI_ACTIVITY_KIND_DRIVER
                api_record = typed_record(CUPTI.CUpti_ActivityAPI)

                name = Ref{Cstring}(C_NULL)
                CUPTI.cuptiGetCallbackName(CUPTI.CUPTI_CB_DOMAIN_DRIVER_API, api_record.cbid, name)

                t0, t1 = api_record.start/1e9, api_record._end/1e9

                println("  - host-side driver API (id=$(api_record.correlationId)): $(unsafe_string(name[])) $(format_time(t0, t1))")

            # kernel execution
            elseif record.kind in [CUPTI.CUPTI_ACTIVITY_KIND_KERNEL,
                                   CUPTI.CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL]
                kernel_record = typed_record(CUPTI.CUpti_ActivityKernel8)

                t0, t1 = kernel_record.start/1e9, kernel_record._end/1e9

                println("  - device-side kernel (id=$(kernel_record.correlationId)): $(unsafe_string(kernel_record.name)) $(format_time(t0, t1)), $(kernel_record.blockX)×$(kernel_record.blockY)×$(kernel_record.blockZ) blocks on a $(kernel_record.gridX)×$(kernel_record.gridY)×$(kernel_record.gridZ) grid, using $(Int(kernel_record.registersPerThread)) registers/thread")
            else
                @warn "Unsupported activity record $(record.kind)"
            end
        catch err
            if isa(err, CUPTI.CUPTIError) && err.code == CUPTI.CUPTI_ERROR_MAX_LIMIT_REACHED
                break
            end
            rethrow()
        end
    end

    push!(available_buffers, buf)

    return
end

function format_time(t0, t1)
    delta = t1 - t0
    io = IOBuffer()
    print(io, "at $t0 ")
    print(io, "taking ")
    if delta < 1e-6  # less than 1 microsecond
        print(io, round(delta * 1e9, digits=2), " ns")
    elseif delta < 1e-3  # less than 1 millisecond
        print(io, round(delta * 1e6, digits=2), " µs")
    elseif delta < 1  # less than 1 second
        print(io, round(delta * 1e3, digits=2), " ms")
    else
        print(io, round(delta, digits=2), " s")
    end
    return String(take!(io))
end

function main()
    ctx = context()
    callback_ptr = @cfunction(callback, Cvoid,
                              (Ptr{Cvoid}, CUPTI.CUpti_CallbackDomain,
                               CUPTI.CUpti_CallbackId, Ptr{Cvoid}))
    request_buffer_ptr = @cfunction(request_buffer, Cvoid,
                                    (Ptr{Ptr{UInt8}}, Ptr{Csize_t}, Ptr{Csize_t}))
    complete_buffer_ptr = @cfunction(complete_buffer, Cvoid,
                                     (CUDA.CUcontext, UInt32, Ptr{UInt8}, Csize_t, Csize_t))

    # NOTE: we only need the subscriber for the callback API
    subscriber_ref = Ref{CUPTI.CUpti_SubscriberHandle}()
    CUPTI.cuptiSubscribe(subscriber_ref, callback_ptr, C_NULL)
    subscriber = subscriber_ref[]

    activity_kinds = [
        CUPTI.CUPTI_ACTIVITY_KIND_DRIVER,
        CUPTI.CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL,
    ]

    try
        CUPTI.cuptiEnableDomain(1, subscriber, CUPTI.CUPTI_CB_DOMAIN_DRIVER_API)

        for activity_kind in activity_kinds
            CUPTI.cuptiActivityEnableContext(ctx, activity_kind)
        end

        # XXX: CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL instruments the kernel at module load
        #      time, so we either need to ensure we profile before the first compilation,
        #      or teach CUDA.jl to shard the module cache by whether CUPTI was enabled.

        CUPTI.cuptiActivityRegisterCallbacks(request_buffer_ptr, complete_buffer_ptr)

        actual_main()

        nothing
    finally
        CUPTI.cuptiUnsubscribe(subscriber)

        for activity_kind in activity_kinds
            CUPTI.cuptiActivityDisableContext(ctx, activity_kind)
        end
    end

    # flush all activity records, even incomplete ones
    CUPTI.cuptiActivityFlushAll(CUPTI.CUPTI_ACTIVITY_FLAG_FLUSH_FORCED)
    # wait for our asynchronous processing to complete
    while !isempty(active_buffers)
        yield()
    end
end

function actual_main()
    @cuda identity(nothing)
    synchronize()
end

isinteractive() || main()

This gives us the following data:

julia> main()
[ Info: Callback enter: driver API call (id=1149) to cuCtxGetCurrent at 1.690985659233663e9
[ Info: Callback exit: driver API call (id=1149) to cuCtxGetCurrent at 1.690985659233739e9
[ Info: Callback enter: driver API call (id=1150) to cuCtxGetCurrent at 1.690985659233794e9
[ Info: Callback exit: driver API call (id=1150) to cuCtxGetCurrent at 1.690985659234145e9
[ Info: Callback enter: driver API call (id=1151) to cuCtxGetCurrent at 1.690985659234211e9
[ Info: Callback exit: driver API call (id=1151) to cuCtxGetCurrent at 1.690985659234263e9
[ Info: Callback enter: driver API call (id=1152) to cuCtxGetCurrent at 1.690985659234329e9
[ Info: Callback exit: driver API call (id=1152) to cuCtxGetCurrent at 1.690985659234373e9
[ Info: Callback enter: driver API call (id=1153) to cuLaunchKernel at 1.690985659234421e9
[ Info: Callback exit: driver API call (id=1153) to cuLaunchKernel at 1.690985659332615e9
[ Info: Callback enter: driver API call (id=1154) to cuCtxGetCurrent at 1.690985659332642e9
[ Info: Callback exit: driver API call (id=1154) to cuCtxGetCurrent at 1.690985659332669e9
[ Info: Callback enter: driver API call (id=1155) to cuStreamQuery at 1.690985659332701e9
[ Info: Callback exit: driver API call (id=1155) to cuStreamQuery at 1.690985659332739e9
[ Info: Callback enter: driver API call (id=1156) to cuCtxGetCurrent at 1.690985659332769e9
[ Info: Callback exit: driver API call (id=1156) to cuCtxGetCurrent at 1.690985659332797e9
[ Info: Callback enter: driver API call (id=1157) to cuStreamSynchronize at 1.690985659332825e9
[ Info: Callback exit: driver API call (id=1157) to cuStreamSynchronize at 1.690985659332854e9
[ Info: Callback enter: driver API call (id=1158) to cuCtxGetCurrent at 1.690985659332883e9
[ Info: Callback exit: driver API call (id=1158) to cuCtxGetCurrent at 1.690985659332911e9

[ Info: Received CUPTI activity buffer (using 608 bytes/8.000 MiB)
  - host-side driver API (id=1150): cuCtxGetCurrent at 1.6909856592338278e9 taking 305.18 µs
  - device-side kernel (id=1153): _Z8identityv at 1.690985659332649e9 taking 715.26 ns, 1×1×1 blocks on a 1×1×1 grid, using 16 registers/thread
  - host-side driver API (id=1153): cuLaunchKernel at 1.69098565923445e9 taking 98.16 ms
  - host-side driver API (id=1154): cuCtxGetCurrent at 1.6909856593326595e9 taking 715.26 ns
  - host-side driver API (id=1155): cuStreamQuery at 1.690985659332727e9 taking 2.86 µs
  - host-side driver API (id=1156): cuCtxGetCurrent at 1.6909856593327875e9 taking 715.26 ns
  - host-side driver API (id=1157): cuStreamSynchronize at 1.6909856593328438e9 taking 1.43 µs
  - host-side driver API (id=1158): cuCtxGetCurrent at 1.690985659332901e9 taking 953.67 ns
  - host-side driver API (id=1159): cuCtxGetCurrent at 1.6909856593329332e9 taking 238.42 ns

The real question is how to present this to the user in a way that's useful. Maybe we should look back at what nvprof used to show. It had a summary mode:

$ nvprof matrixMul
[Matrix Multiply Using CUDA] - Starting...
==27694== NVPROF is profiling process 27694, command: matrixMul
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.35 GFlop/s, Time= 3.708 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27694== Profiling application: matrixMul
==27694== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.94%  1.11524s       301  3.7051ms  3.6928ms  3.7174ms  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.04%  406.30us         2  203.15us  136.13us  270.18us  [CUDA memcpy HtoD]
  0.02%  248.29us         1  248.29us  248.29us  248.29us  [CUDA memcpy DtoH]

==27964== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 49.81%  285.17ms         3  95.055ms  153.32us  284.86ms  cudaMalloc
 25.95%  148.57ms         1  148.57ms  148.57ms  148.57ms  cudaEventSynchronize
 22.23%  127.28ms         1  127.28ms  127.28ms  127.28ms  cudaDeviceReset
  1.33%  7.6314ms       301  25.353us  23.551us  143.98us  cudaLaunch
  0.25%  1.4343ms         3  478.09us  155.84us  984.38us  cudaMemcpy
  0.11%  601.45us         1  601.45us  601.45us  601.45us  cudaDeviceSynchronize
  0.10%  564.48us      1505     375ns     313ns  3.6790us  cudaSetupArgument
  0.09%  490.44us        76  6.4530us     307ns  221.93us  cuDeviceGetAttribute
  0.07%  406.61us         3  135.54us  115.07us  169.99us  cudaFree
  0.02%  143.00us       301     475ns     431ns  2.4370us  cudaConfigureCall
  0.01%  42.321us         1  42.321us  42.321us  42.321us  cuDeviceTotalMem
  0.01%  33.655us         1  33.655us  33.655us  33.655us  cudaGetDeviceProperties
  0.01%  31.900us         1  31.900us  31.900us  31.900us  cuDeviceGetName
  0.00%  21.874us         2  10.937us  8.5850us  13.289us  cudaEventRecord
  0.00%  16.513us         2  8.2560us  2.6240us  13.889us  cudaEventCreate
  0.00%  13.091us         1  13.091us  13.091us  13.091us  cudaEventElapsedTime
  0.00%  8.1410us         1  8.1410us  8.1410us  8.1410us  cudaGetDevice
  0.00%  2.6290us         2  1.3140us     509ns  2.1200us  cuDeviceGetCount
  0.00%  1.9970us         2     998ns     520ns  1.4770us  cuDeviceGet

... a GPU trace mode:

==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
133.81ms  135.78us                    -               -         -         -         -  409.60KB  3.0167GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.62ms  270.66us                    -               -         -         -         -  819.20KB  3.0267GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.90ms  3.7037ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [94]
138.71ms  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [105]
<...more output...>
1.24341s  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2191]
1.24711s  3.7046ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2198]
1.25089s  248.13us                    -               -         -         -         -  819.20KB  3.3015GB/s  GeForce GT 640M         1         2  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

... and an API trace mode:

==27722== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.35 GFlop/s, Time= 3.708 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27722== Profiling result:
   Start  Duration  Name
108.38ms  6.2130us  cuDeviceGetCount
108.42ms     840ns  cuDeviceGet
108.42ms  22.459us  cuDeviceGetName
108.45ms  11.782us  cuDeviceTotalMem
108.46ms     945ns  cuDeviceGetAttribute
149.37ms  23.737us  cudaLaunch (void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2198])
149.39ms  6.6290us  cudaEventRecord
149.40ms  1.10156s  cudaEventSynchronize
<...more output...>
1.25096s  21.543us  cudaEventElapsedTime
1.25103s  1.5462ms  cudaMemcpy
1.25467s  153.93us  cudaFree
1.25483s  75.373us  cudaFree
1.25491s  75.564us  cudaFree
1.25693s  10.901ms  cudaDeviceReset

See https://docs.nvidia.com/cuda/profiler-users-guide/index.html#profiling-modes for more details.
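The summary mode in particular looks easy to reproduce from the activity records we already collect. A minimal sketch, assuming the handlers push `name => duration` pairs into a vector instead of printing eagerly (the `summarize` helper and its formatting are hypothetical):

# aggregate `name => duration` pairs (durations in seconds) into an
# nvprof-style summary table, sorted by total time
function summarize(events::Vector{Pair{String,Float64}})
    stats = Dict{String,Vector{Float64}}()
    for (name, dt) in events
        push!(get!(stats, name, Float64[]), dt)
    end
    total = sum(dt for (_, dt) in events)
    fmt(dt) = string(round(dt * 1e3; digits=3), "ms")
    println(rpad("Time(%)", 9), rpad("Time", 11), rpad("Calls", 7),
            rpad("Avg", 11), rpad("Min", 11), rpad("Max", 11), "Name")
    for (name, ts) in sort!(collect(stats); by=x->sum(x.second), rev=true)
        println(rpad(string(round(100 * sum(ts) / total; digits=2), "%"), 9),
                rpad(fmt(sum(ts)), 11), rpad(length(ts), 7),
                rpad(fmt(sum(ts) / length(ts)), 11),
                rpad(fmt(minimum(ts)), 11), rpad(fmt(maximum(ts)), 11), name)
    end
end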

maleadt added the enhancement label on Aug 2, 2023
lcw (Contributor) commented Aug 2, 2023

This is nice! I do like nvprof output. Would it be possible to show performance counters as well? For example, I find it useful to see if there are any local loads and stores as they typically indicate poor performance for the kernels I am interested in.

maleadt (Member, Author) commented Aug 2, 2023

> Would it be possible to show performance counters as well?

Over time, yes, but that changes the runtime characteristics significantly (potentially having to replay the execution multiple times). I'm going to focus on the simple things first, and IIRC @vchuravy was working on something portable for reporting metrics anyway.
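In the meantime, the activity API could be extended to cover more record kinds, e.g. memory copies via CUPTI_ACTIVITY_KIND_MEMCPY. A sketch of the extra branch for _async_complete_buffer (the report_memcpy helper is ours, and it assumes the CUpti_ActivityMemcpy5 struct version, which varies across CUPTI releases):

# report a memcpy activity record; to be called from `_async_complete_buffer`
# when `record.kind == CUPTI.CUPTI_ACTIVITY_KIND_MEMCPY` (after also enabling
# that kind in `main`'s `activity_kinds`)
function report_memcpy(record_ptr::Ptr{CUPTI.CUpti_Activity})
    memcpy_record = unsafe_load(convert(Ptr{CUPTI.CUpti_ActivityMemcpy5}, record_ptr))
    t0, t1 = memcpy_record.start/1e9, memcpy_record._end/1e9
    println("  - memcpy (id=$(memcpy_record.correlationId)): ",
            Base.format_bytes(memcpy_record.bytes), " ", format_time(t0, t1))
end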

maleadt (Member, Author) commented Aug 8, 2023

Sneak peek:
[screenshot]
