simple batched dot kernel is ~1.7x slower with Const on Titan RTX #479
Can you post a profile (https://cuda.juliagpu.org/stable/development/profiling/#Integrated-profiler) so that we can determine whether the overhead is in the kernel or in the kernel launch?
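For reference, a minimal sketch of driving the integrated profiler; the kernel here is a trivial stand-in, not the batched dot kernel from this issue:

```julia
using CUDA, KernelAbstractions

# Trivial stand-in kernel; the issue's batched dot kernel would go here.
@kernel function axpy!(y, @Const(x), a)
    i = @index(Global)
    @inbounds y[i] += a * x[i]
end

x = CUDA.rand(Float32, 1024)
y = CUDA.rand(Float32, 1024)

backend = CUDABackend()
k = axpy!(backend)

# Integrated profiler: prints host- and device-side timing tables,
# separating kernel execution time from launch/API overhead.
CUDA.@profile begin
    k(y, x, 2f0; ndrange = length(y))
    KernelAbstractions.synchronize(backend)
end
```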
sure! i made the problem 8-fold bigger in both L and N to emphasize the difference and got:

…

so it's definitely the kernel, not the launch. thanks for the quick reply!
If you changed the problem size then you need to change the number of blocks.
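For illustration, a sketch of how the launch configuration scales with the problem size in plain CUDA.jl (the kernel and sizes are placeholders; KA derives this automatically from `ndrange`):

```julia
using CUDA

# Placeholder kernel: one thread per output element.
function scale!(y, a)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] *= a
    end
    return
end

y = CUDA.rand(Float32, 8 * 1024 * 1024)  # 8x bigger problem
threads = 256
blocks = cld(length(y), threads)         # blocks must grow with the problem
@cuda threads=threads blocks=blocks scale!(y, 2f0)
```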
hah, right, how's this:

…

launch times still about the same, with the KA kernel being ~1.5x slower.
Ok, that is still surprising to me. I expect some overhead, but nothing that should scale like that.
What is …? Running this locally on a … I get:

…

and the 8x bigger case:

…
i have a variety of other GPUs available to test on if that'd be informative.
Okay, that makes it even more curious: we are looking at the same generation of GPUs. Could you run the code through NVIDIA Nsight Compute (https://cuda.juliagpu.org/stable/development/profiling/#NVIDIA-Nsight-Compute, https://docs.nvidia.com/nsight-compute/NsightCompute/index.html), in particular a "Compute Workload Analysis"? Also for completeness, try without the `Const`.
curiously, …

i suppose it could be due to minor differences in the CUDA drivers too.
Yeah, `Const` ends up as read-only (`ld.global.nc`) loads. You can also verify this by using `Const` and CUDA directly, without KA getting involved.
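A minimal sketch of that check, using CUDA.jl's experimental `Const` wrapper on a stand-in kernel (not the exact kernel from this issue):

```julia
using CUDA

# Stand-in elementwise kernel; wrapping the read-only input in Const lets
# the compiler emit read-only (ld.global.nc) loads for it.
function mul_kernel!(y, x)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    xc = CUDA.Const(x)
    if i <= length(y)
        @inbounds y[i] = 2f0 * xc[i]
    end
    return
end

x = CUDA.rand(Float32, 1 << 20)
y = similar(x)

threads = 256
blocks = cld(length(y), threads)

# Inspect the generated PTX to see how the loads differ with/without Const.
@device_code_ptx @cuda threads=threads blocks=blocks mul_kernel!(y, x)
```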
can you please elaborate on why you expect some overhead with KA? in informal testing i now see near parity between KA and CUDA if the run times are long, but for small inputs KA becomes progressively slower in comparison. just curious why.
KA adds some additional integer operations for the index calculations and defaults to Int64. Reducing that overhead is a to-do, but I haven't found time for that. This overhead is more noticeable on AMD.
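To make the Int64 point concrete, a rough sketch (not KA's actual lowering) contrasting Int32 index arithmetic in a hand-written CUDA.jl kernel with the widened Int64 arithmetic a KA-style global index implies:

```julia
using CUDA

# Hand-written CUDA.jl kernel: the hardware indices are Int32 and the
# arithmetic can stay in 32-bit registers.
function idx32_kernel!(y)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x  # Int32
    if i <= length(y)
        @inbounds y[i] = Float32(i)
    end
    return
end

# Rough stand-in for a KA-style global index (not KA's actual lowering):
# widening to Int64 adds extra integer instructions per thread.
function idx64_kernel!(y)
    i = (Int64(blockIdx().x) - 1) * Int64(blockDim().x) + Int64(threadIdx().x)
    if i <= length(y)
        @inbounds y[i] = Float32(i)
    end
    return
end

y = CUDA.zeros(Float32, 1 << 20)
@cuda threads=256 blocks=cld(length(y), 256) idx32_kernel!(y)
@cuda threads=256 blocks=cld(length(y), 256) idx64_kernel!(y)
```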
indexing overhead should scale with the problem size (i.e. input arg dims), no? what i'm seeing seems more like overhead in the kernel launch: for small problems the difference in run times between KA and CUDA is large, whereas with large problems it is small.
Latency hiding becomes more effective at larger problem sizes: with more resident threads, the fixed per-thread overhead overlaps with memory stalls instead of showing up in the measured run time.

That's not infeasible. But the launch code is https://github.com/JuliaGPU/CUDA.jl/blob/e1e5be2b6bf17f03a367cebeb18c4645e593f80d/src/CUDAKernels.jl#L89, which itself is fairly minimal.
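One way to check that, sketched with BenchmarkTools on a placeholder kernel: benchmarking the asynchronous enqueue alone approximates the launch cost, while wrapping it in `CUDA.@sync` includes execution.

```julia
using CUDA, KernelAbstractions, BenchmarkTools

# Placeholder kernel for timing purposes.
@kernel function scale!(y, a)
    i = @index(Global)
    @inbounds y[i] *= a
end

y = CUDA.rand(Float32, 1024)  # small problem, where the gap is visible
k = scale!(CUDABackend())

# Asynchronous enqueue only: approximates the per-launch overhead.
@benchmark $k($y, 2f0; ndrange = length($y))

# Launch plus execution: synchronize so device time is included.
@benchmark CUDA.@sync $k($y, 2f0; ndrange = length($y))
```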
is the indexing overhead in …?
using static ranges mitigates some of the performance gap i see between KA and CUDA for small problems. see #470 |
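For reference, a sketch of pinning both sizes statically when instantiating a KA kernel, assuming KA's optional static workgroupsize/ndrange constructor arguments (kernel and sizes are placeholders):

```julia
using CUDA, KernelAbstractions

# Placeholder kernel.
@kernel function scale!(y, a)
    i = @index(Global)
    @inbounds y[i] *= a
end

y = CUDA.rand(Float32, 4096)

# Dynamic launch: workgroup size and ndrange are resolved at call time.
kdyn = scale!(CUDABackend())
kdyn(y, 2f0; ndrange = length(y))

# Static launch: both sizes are baked in at instantiation, removing some
# of the per-launch index bookkeeping for small problems.
kstat = scale!(CUDABackend(), 256, length(y))
kstat(y, 2f0)

KernelAbstractions.synchronize(CUDABackend())
```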
Yes. And it should be CSE'd. As you noted, constant ndranges can help as well.
could someone please help me understand why this should be the case? the PTX code is similar and the threads/blocks are identical.
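A hedged sketch of what a batched dot kernel of this shape might look like in KA and plain CUDA.jl; the names, the dimensions L and N, and the launch parameters are assumptions, not this issue's exact code:

```julia
using CUDA, KernelAbstractions

# KA version: one workitem per batch column n; x and y are L-by-N.
@kernel function ka_batched_dot!(out, @Const(x), @Const(y))
    n = @index(Global)
    acc = 0f0
    for l in axes(x, 1)
        @inbounds acc += x[l, n] * y[l, n]
    end
    @inbounds out[n] = acc
end

# Plain CUDA.jl version of the same loop.
function cu_batched_dot!(out, x, y)
    n = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if n <= length(out)
        acc = 0f0
        for l in axes(x, 1)
            @inbounds acc += x[l, n] * y[l, n]
        end
        @inbounds out[n] = acc
    end
    return
end

L, N = 32, 1 << 16   # assumed sizes, not the actual L and N
x = CUDA.rand(Float32, L, N)
y = CUDA.rand(Float32, L, N)
out = CUDA.zeros(Float32, N)

ka_batched_dot!(CUDABackend())(out, x, y; ndrange = N)
@cuda threads=256 blocks=cld(N, 256) cu_batched_dot!(out, x, y)
```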
the above yields the times below for KA and CUDA, respectively, so KA is ~1.7x slower:

…
and here is the PTX code:

KA:

…

CUDA:

…
vendor-agnostic code is really appealing but i'm not sure i'm willing to pay this much of a performance penalty for it. thanks!