
Add a benchmark script for IJFVH datalayout. #1963

Draft · wants to merge 5 commits into main
Conversation

sriharshakandala (Member)
Add a benchmark script for IJFVH datalayout.

  • Code follows the style guidelines OR N/A.
  • Unit tests are included OR N/A.
  • Code is exercised in an integration test OR N/A.
  • Documentation has been added/updated OR N/A.

@charleskawczynski (Member)

Hi @sriharshakandala,

Can we please perform multiple kernel launches (e.g., 50 or so) inside a single `CUDA.@sync`, and then just take the average time per call? Our real use case does not have syncs between every operation.
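
A minimal sketch of this timing pattern, with a placeholder kernel, array sizes, and launch configuration (none of these come from the PR itself):

```julia
using CUDA

# Placeholder kernel: sums three fields into one, just to have something to launch.
function sum_kernel!(dst, a, b, c)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = a[i] + b[i] + c[i]
    end
    return nothing
end

a, b, c = (CUDA.rand(Float32, 2^22) for _ in 1:3)
dst = similar(a)
threads = 256
blocks = cld(length(dst), threads)

# Warm up (compile) before timing.
CUDA.@sync @cuda threads=threads blocks=blocks sum_kernel!(dst, a, b, c)

# Queue many launches back-to-back and synchronize only once at the end,
# then report the average time per call.
nlaunches = 50
t_total_s = @elapsed CUDA.@sync for _ in 1:nlaunches
    @cuda threads=threads blocks=blocks sum_kernel!(dst, a, b, c)
end
@show t_total_s / nlaunches
```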

@charleskawczynski (Member)

Using:

function bandwidth_efficiency(;
    problem_size,                    # tuple of array dimensions
    float_type,
    device_bandwidth_GBs = 2_039,    # peak device bandwidth (GB/s)
    kernel_time_s,                   # measured kernel time (seconds)
    n_reads_writes                   # reads + writes per element
  )
  N = prod(problem_size)
  GB = N * n_reads_writes * sizeof(float_type) / 1024^3   # data moved per call
  achieved_bandwidth_GBs = GB / kernel_time_s
  percent_efficiency = achieved_bandwidth_GBs / device_bandwidth_GBs * 100
  @info "Bandwidth info" problem_size float_type achieved_bandwidth_GBs device_bandwidth_GBs percent_efficiency
end;
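
For reference, a hypothetical call (the problem size, time, and read/write count below are illustrative, not taken from the script):

```julia
bandwidth_efficiency(;
    problem_size = (4, 4, 3, 64, 5400),  # hypothetical (Ni, Nj, Nf, Nv, Nh)
    float_type = Float32,
    kernel_time_s = 62.799e-6,           # e.g. a BenchmarkTools minimum time
    n_reads_writes = 4,                  # e.g. 3 reads + 1 write for the sum kernel
)
```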

I get:

kernel: `sum_a[i, j, 1, v, bh] = a[i, j, 1, v, bh] + a[i, j, 2, v, bh] + a[i, j, 3, v, bh]`
julia> using Revise; include("benchmarks/scripts/benchmark_IJFVH.jl")
Cartesian (Float32): min = TrialEstimate(63.839 μs)
linear    (Float32): min = TrialEstimate(62.799 μs)
Cartesian (Float64): min = TrialEstimate(108.809 μs)
linear    (Float64): min = TrialEstimate(112.019 μs)
[ Info: Cartesian Float32 (full) percent_efficiency: 61.137879899746764
[ Info: Linear Float32 (full) percent_efficiency: 62.489236549846005
[ Info: Cartesian Float64 (full) percent_efficiency: 70.74930479387923
[ Info: Linear Float64 (full) percent_efficiency: 70.03856901394448

kernel: `sum_a[i, j, 1, v, bh] = a[i, j, 1, v, bh]`
Cartesian (Float32): min = TrialEstimate(48.109 μs)
linear    (Float32): min = TrialEstimate(44.579 μs)
Cartesian (Float64): min = TrialEstimate(64.009 μs)
linear    (Float64): min = TrialEstimate(63.609 μs)
[ Info: Cartesian Float32 (low utilization) percent_efficiency: 40.577282944956536
[ Info: Linear Float32 (low utilization) percent_efficiency: 43.79040591307374
[ Info: Cartesian Float64 (low utilization) percent_efficiency: 60.99556328637892
[ Info: Linear Float64 (low utilization) percent_efficiency: 61.3791289031085

This means that this configuration still does not perform well compared to the F-ending datalayouts. Performance may improve if we remove the launch latency; that would be a good update to this PR.

@charleskawczynski (Member) commented Aug 29, 2024

Also, from Tim Besard on slack:

> launch latency is generally measured on and expressed as time spent on the CPU
>
> on the GPU there's no real latency -- if there are kernels queued, they can execute one after another with virtually no latency in between (or they can even overlap execution)
>
> it's the CPU where things matter. if the latency to launch kernels is too high, you won't be able to saturate the GPU
>
> though note that the actual latency of launching operations in CUDA.jl (i.e. the time spent by @cuda) most of the time isn't going to dominate the total latency between GPU operations. it's more likely that some expensive operations in between prevent efficient use of the GPU (e.g., allocations, or synchronizations, or just irrelevant CPU computations)
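
To make the CPU/GPU distinction concrete, here is a rough sketch (placeholder kernel and sizes): `@elapsed` around a bare `@cuda` launch mostly measures the CPU-side cost of queueing the kernel, since the launch returns before the kernel finishes, while `CUDA.@elapsed` uses events to measure the GPU-side execution time.

```julia
using CUDA

# Placeholder copy kernel.
function copy_kernel!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = src[i]
    end
    return nothing
end

src = CUDA.rand(Float32, 2^22)
dst = similar(src)
threads = 256
blocks = cld(length(src), threads)
CUDA.@sync @cuda threads=threads blocks=blocks copy_kernel!(dst, src)  # warm up

# CPU-side launch latency: returns as soon as the launch is queued.
t_launch_s = @elapsed @cuda threads=threads blocks=blocks copy_kernel!(dst, src)

# GPU-side execution time, measured with CUDA events.
t_gpu_s = CUDA.@elapsed @cuda threads=threads blocks=blocks copy_kernel!(dst, src)

@info "Launch vs. execution" t_launch_s t_gpu_s
```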

@sriharshakandala (Member, Author)

> Also, from Tim Besard on slack:
>
> launch latency is generally measured on and expressed as time spent on the CPU
>
> on the GPU there's no real latency -- if there are kernels queued, they can execute one after another with virtually no latency in between (or they can even overlap execution)
>
> it's the CPU where things matter. if the latency to launch kernels is too high, you won't be able to saturate the GPU
>
> though note that the actual latency of launching operations in CUDA.jl (i.e. the time spent by @cuda) most of the time isn't going to dominate the total latency between GPU operations. it's more likely that some expensive operations in between prevent efficient use of the GPU (e.g., allocations, or synchronizations, or just irrelevant CPU computations)

I believe the blocks need to synchronize at the end to provide the correct result. It also depends on whether the output from one kernel is needed by the next kernel, in which case it has to wait for the execution to finish. I believe timing the entire function call is the right way to benchmark, but I also added the additional benchmark to match #1950 for comparison!

@charleskawczynski (Member)

> I believe the blocks need to synchronize at the end to provide the correct result. It also depends on whether the output from one kernel is needed by the next kernel, in which case it has to wait for the execution to finish. I believe timing the entire function call is the right way to benchmark, but I also added the additional benchmark to match #1950 for comparison!

I don't think that the sync is needed, but I think that this is still very helpful. Thanks for adding the extra benchmark!
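
For what it's worth, a small sketch of why the intermediate sync shouldn't be needed for correctness (placeholder kernels): launches on the same CUDA stream execute in submission order, so a kernel that consumes the previous kernel's output still sees the finished result, and a single sync at the end suffices.

```julia
using CUDA

# Placeholder kernel: scales an array in place.
function scale_kernel!(x, s)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= s
    end
    return nothing
end

x = CUDA.ones(Float32, 2^20)
threads = 256
blocks = cld(length(x), threads)

# Two dependent launches on the same (default) stream; no host sync in between.
CUDA.@sync begin
    @cuda threads=threads blocks=blocks scale_kernel!(x, 2.0f0)
    @cuda threads=threads blocks=blocks scale_kernel!(x, 3.0f0)
end
@assert all(Array(x) .== 6.0f0)  # both kernels applied, in order
```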
