Add a benchmark script for IJFVH datalayout. #1963
base: main
Conversation
Can we please perform multiple kernel launches (e.g., 50 or so) inside a single […]?
Using the function

```julia
function bandwidth_efficiency(;
    problem_size,
    float_type,
    device_bandwidth_GBs = 2_039,
    kernel_time_s,
    n_reads_writes,
)
    N = prod(problem_size)
    GB = N * n_reads_writes * sizeof(float_type) / 1024^3
    achieved_bandwidth_GBs = GB / kernel_time_s
    percent_efficiency = achieved_bandwidth_GBs / device_bandwidth_GBs * 100
    @info "Bandwidth info" problem_size float_type achieved_bandwidth_GBs device_bandwidth_GBs percent_efficiency
end
```

I get, for the kernel `sum_a[i, j, 1, v, bh] = a[i, j, 1, v, bh] + a[i, j, 2, v, bh] + a[i, j, 3, v, bh]`:
```
julia> using Revise; include("benchmarks/scripts/benchmark_IJFVH.jl")
Cartesian (Float32): min = TrialEstimate(63.839 μs)
linear (Float32): min = TrialEstimate(62.799 μs)
Cartesian (Float64): min = TrialEstimate(108.809 μs)
linear (Float64): min = TrialEstimate(112.019 μs)
[ Info: Cartesian Float32 (full) percent_efficiency: 61.137879899746764
[ Info: Linear Float32 (full) percent_efficiency: 62.489236549846005
[ Info: Cartesian Float64 (full) percent_efficiency: 70.74930479387923
[ Info: Linear Float64 (full) percent_efficiency: 70.03856901394448
```
And for the kernel `sum_a[i, j, 1, v, bh] = a[i, j, 1, v, bh]`:

```
Cartesian (Float32): min = TrialEstimate(48.109 μs)
linear (Float32): min = TrialEstimate(44.579 μs)
Cartesian (Float64): min = TrialEstimate(64.009 μs)
linear (Float64): min = TrialEstimate(63.609 μs)
[ Info: Cartesian Float32 (low utilization) percent_efficiency: 40.577282944956536
[ Info: Linear Float32 (low utilization) percent_efficiency: 43.79040591307374
[ Info: Cartesian Float64 (low utilization) percent_efficiency: 60.99556328637892
[ Info: Linear Float64 (low utilization) percent_efficiency: 61.3791289031085
```

This means that this configuration still does not perform well compared to the F-ending datalayouts. Maybe the performance will improve if we remove the launch latency; that would be a good update to this PR.
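For reference, a minimal sketch of how a `percent_efficiency` value like those above could be computed from a BenchmarkTools estimate. The `trial` and `a` variables and the `n_reads_writes` value are assumptions for illustration, not taken from the benchmark script:

```julia
using BenchmarkTools

# Hypothetical sketch: `trial` is assumed to be the result of `@benchmark`
# on the full-sum kernel, and `a` the IJFVH array it reads.
kernel_time_s = minimum(trial).time / 1e9   # BenchmarkTools reports times in ns

bandwidth_efficiency(;
    problem_size = size(a),   # assumed (I, J, F, V, H) extents
    float_type = eltype(a),
    kernel_time_s,
    n_reads_writes = 4,       # assumption: 3 reads of `a` + 1 write of `sum_a`
)
```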
Also, from Tim Besard on slack:

> […]
I believe the blocks need to synchronize at the end to provide the correct result. It also depends on whether the output from one kernel is needed by the next kernel, in which case it has to wait for the execution to finish. I believe timing the entire function call is the right way to benchmark, but I also added the additional benchmark to match #1950 for comparison!
I don't think that the sync is needed, but I think that this is still very helpful. Thanks for adding the extra benchmark!
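As a point of reference, here is a minimal CUDA.jl sketch of what amortizing launch latency could look like: many launches in one measured call, with a single synchronization at the end. The kernel name `sum_kernel!` and the launch count are hypothetical and not part of this PR:

```julia
using CUDA

# Hypothetical sketch: launch the same kernel many times and synchronize once
# at the end, so per-launch latency is amortized over `nlaunches` launches.
function batched_launches!(sum_a, a; nlaunches = 50)
    kernel = @cuda launch=false sum_kernel!(sum_a, a)   # `sum_kernel!` is assumed
    config = CUDA.launch_configuration(kernel.fun)
    for _ in 1:nlaunches
        kernel(sum_a, a; threads = config.threads, blocks = config.blocks)
    end
    CUDA.synchronize()   # one device sync after all launches
    return nothing
end
```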