
Add a benchmark script for IJFVH datalayout. #1963

Draft · wants to merge 5 commits into main
Conversation

sriharshakandala (Member)
Add a benchmark script for IJFVH datalayout.

  • Code follows the style guidelines OR N/A.
  • Unit tests are included OR N/A.
  • Code is exercised in an integration test OR N/A.
  • Documentation has been added/updated OR N/A.

@charleskawczynski (Member)

Hi @sriharshakandala,

Can we please perform multiple kernel launches (e.g., 50 or so) inside a single `CUDA.@sync`, and then just take the average time per call? Our real use case does not have syncs between every operation.
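
A minimal sketch of this timing pattern, with a placeholder kernel, array sizes, and launch configuration (none of these come from the PR itself):

```julia
using CUDA

# Placeholder kernel: sums three fields into one, just to have something to launch.
function sum_kernel!(dst, a, b, c)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = a[i] + b[i] + c[i]
    end
    return nothing
end

a, b, c = (CUDA.rand(Float32, 2^22) for _ in 1:3)
dst = similar(a)
threads = 256
blocks = cld(length(dst), threads)

# Warm up (compile) before timing.
CUDA.@sync @cuda threads=threads blocks=blocks sum_kernel!(dst, a, b, c)

# Queue many launches back-to-back and synchronize only once at the end,
# then report the average time per call.
nlaunches = 50
t_total_s = @elapsed CUDA.@sync for _ in 1:nlaunches
    @cuda threads=threads blocks=blocks sum_kernel!(dst, a, b, c)
end
@show t_total_s / nlaunches
```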

@charleskawczynski (Member)

Using:

function bandwidth_efficiency(;
    problem_size,                    # tuple of array dimensions
    float_type,
    device_bandwidth_GBs = 2_039,    # peak device bandwidth (GB/s)
    kernel_time_s,                   # measured kernel time (seconds)
    n_reads_writes                   # reads + writes per element
  )
  N = prod(problem_size)
  GB = N * n_reads_writes * sizeof(float_type) / 1024^3   # data moved per call
  achieved_bandwidth_GBs = GB / kernel_time_s
  percent_efficiency = achieved_bandwidth_GBs / device_bandwidth_GBs * 100
  @info "Bandwidth info" problem_size float_type achieved_bandwidth_GBs device_bandwidth_GBs percent_efficiency
end;
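
For reference, a hypothetical call (the problem size, time, and read/write count below are illustrative, not taken from the script):

```julia
bandwidth_efficiency(;
    problem_size = (4, 4, 3, 64, 5400),  # hypothetical (Ni, Nj, Nf, Nv, Nh)
    float_type = Float32,
    kernel_time_s = 62.799e-6,           # e.g. a BenchmarkTools minimum time
    n_reads_writes = 4,                  # e.g. 3 reads + 1 write for the sum kernel
)
```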

I get:

kernel: `sum_a[i, j, 1, v, bh] = a[i, j, 1, v, bh] + a[i, j, 2, v, bh] + a[i, j, 3, v, bh]`
julia> using Revise; include("benchmarks/scripts/benchmark_IJFVH.jl")
Cartesian (Float32): min = TrialEstimate(63.839 μs)
linear    (Float32): min = TrialEstimate(62.799 μs)
Cartesian (Float64): min = TrialEstimate(108.809 μs)
linear    (Float64): min = TrialEstimate(112.019 μs)
[ Info: Cartesian Float32 (full) percent_efficiency: 61.137879899746764
[ Info: Linear Float32 (full) percent_efficiency: 62.489236549846005
[ Info: Cartesian Float64 (full) percent_efficiency: 70.74930479387923
[ Info: Linear Float64 (full) percent_efficiency: 70.03856901394448

kernel: `sum_a[i, j, 1, v, bh] = a[i, j, 1, v, bh]`
Cartesian (Float32): min = TrialEstimate(48.109 μs)
linear    (Float32): min = TrialEstimate(44.579 μs)
Cartesian (Float64): min = TrialEstimate(64.009 μs)
linear    (Float64): min = TrialEstimate(63.609 μs)
[ Info: Cartesian Float32 (low utilization) percent_efficiency: 40.577282944956536
[ Info: Linear Float32 (low utilization) percent_efficiency: 43.79040591307374
[ Info: Cartesian Float64 (low utilization) percent_efficiency: 60.99556328637892
[ Info: Linear Float64 (low utilization) percent_efficiency: 61.3791289031085

This means that this configuration still does not perform well compared to the F-ending datalayouts. Performance may improve if we remove the launch latency; that would be a good update to this PR.

@charleskawczynski (Member) commented Aug 29, 2024

Also, from Tim Besard on slack:

> launch latency is generally measured on and expressed as time spent on the CPU
>
> on the GPU there's no real latency -- if there are kernels queued, they can execute one after another with virtually no latency in between (or they can even overlap execution)
>
> it's the CPU where things matter. if the latency to launch kernels is too high, you won't be able to saturate the GPU
>
> though note that the actual latency of launching operations in CUDA.jl (i.e. the time spent by @cuda) most of the time isn't going to dominate the total latency between GPU operations. it's more likely that some expensive operations in between prevent efficient use of the GPU (e.g., allocations, or synchronizations, or just irrelevant CPU computations)
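
To make the CPU/GPU distinction concrete, here is a rough sketch (placeholder kernel and sizes): `@elapsed` around a bare `@cuda` launch mostly measures the CPU-side cost of queueing the kernel, since the launch returns before the kernel finishes, while `CUDA.@elapsed` uses events to measure the GPU-side execution time.

```julia
using CUDA

# Placeholder copy kernel.
function copy_kernel!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = src[i]
    end
    return nothing
end

src = CUDA.rand(Float32, 2^22)
dst = similar(src)
threads = 256
blocks = cld(length(src), threads)
CUDA.@sync @cuda threads=threads blocks=blocks copy_kernel!(dst, src)  # warm up

# CPU-side launch latency: returns as soon as the launch is queued.
t_launch_s = @elapsed @cuda threads=threads blocks=blocks copy_kernel!(dst, src)

# GPU-side execution time, measured with CUDA events.
t_gpu_s = CUDA.@elapsed @cuda threads=threads blocks=blocks copy_kernel!(dst, src)

@info "Launch vs. execution" t_launch_s t_gpu_s
```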

@sriharshakandala (Member, Author)

> Also, from Tim Besard on slack:
>
> launch latency is generally measured on and expressed as time spent on the CPU
>
> on the GPU there's no real latency -- if there are kernels queued, they can execute one after another with virtually no latency in between (or they can even overlap execution)
>
> it's the CPU where things matter. if the latency to launch kernels is too high, you won't be able to saturate the GPU
>
> though note that the actual latency of launching operations in CUDA.jl (i.e. the time spent by @cuda) most of the time isn't going to dominate the total latency between GPU operations. it's more likely that some expensive operations in between prevent efficient use of the GPU (e.g., allocations, or synchronizations, or just irrelevant CPU computations)

I believe the blocks need to synchronize at the end to provide the correct result. It also depends on whether the output from one kernel is needed by the next kernel, in which case it has to wait for the execution to finish. I believe timing the entire function call is the right way to benchmark, but I also added the additional benchmark to match #1950 for comparison!

@charleskawczynski (Member)

> I believe the blocks need to synchronize at the end to provide the correct result. It also depends on whether the output from one kernel is needed by the next kernel, in which case it has to wait for the execution to finish. I believe timing the entire function call is the right way to benchmark, but I also added the additional benchmark to match #1950 for comparison!

I don't think that the sync is needed, but I think that this is still very helpful. Thanks for adding the extra benchmark!
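
For what it's worth, a small sketch of why the intermediate sync shouldn't be needed for correctness (placeholder kernels): launches on the same CUDA stream execute in submission order, so a kernel that consumes the previous kernel's output still sees the finished result, and a single sync at the end suffices.

```julia
using CUDA

# Placeholder kernel: scales an array in place.
function scale_kernel!(x, s)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= s
    end
    return nothing
end

x = CUDA.ones(Float32, 2^20)
threads = 256
blocks = cld(length(x), threads)

# Two dependent launches on the same (default) stream; no host sync in between.
CUDA.@sync begin
    @cuda threads=threads blocks=blocks scale_kernel!(x, 2.0f0)
    @cuda threads=threads blocks=blocks scale_kernel!(x, 3.0f0)
end
@assert all(Array(x) .== 6.0f0)  # both kernels applied, in order
```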
