The memory copy speed seems to exceed the hardware limit #1862
Closed
Yangyang-Tan
started this conversation in
General
Replies: 1 comment
-
I can't reproduce this here. Try running under Nsight Systems; it displays the memory throughput when you highlight the copy operation.
-
I'm trying to perform a memory copy on an RTX 4090 GPU. The measured bandwidth is about 2 TB/s, which clearly exceeds the 4090's theoretical memory bandwidth of 1008 GB/s.
To reproduce
The Minimal Working Example (MWE):
The factor 2 in
T_tot1 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it1
accounts for the memory read and the memory write.
Output
T_tot1 and T_tot2 give T_tot1=2452 and T_tot2=2281, while T_tot3 gives T_tot3=895.
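As a rough cross-check of these numbers, the formula can be evaluated by hand. The sketch below is not the original MWE. The helper names `bytes_moved`, `effective_bandwidth_gbs` and `implied_time_s` are hypothetical, and it assumes that `t_it1` is the per-iteration time in seconds and that `nx = ny = 2^11` (the Float32 matrix size mentioned under Additional context). It inverts the formula to show what per-copy time each reported figure would imply.

```julia
# Hypothetical helpers for illustration; not part of the original MWE.

# Bytes moved per copy: each element is read once and written once.
bytes_moved(nx, ny, T) = 2 * nx * ny * sizeof(T)

# Effective bandwidth in GB/s for a per-iteration time t_it in seconds
# (the same expression as T_tot1 in the post).
effective_bandwidth_gbs(nx, ny, T, t_it) = bytes_moved(nx, ny, T) / 1e9 / t_it

# Per-iteration time in seconds implied by a reported bandwidth in GB/s.
implied_time_s(nx, ny, T, gbs) = bytes_moved(nx, ny, T) / 1e9 / gbs

nx = ny = 2^11        # matrix size from the post (assumed for all three runs)
peak_gbs = 1008.0     # theoretical RTX 4090 bandwidth quoted in the post

println("bytes moved per copy: ", bytes_moved(nx, ny, Float32))
for (name, gbs) in (("T_tot1", 2452.0), ("T_tot2", 2281.0), ("T_tot3", 895.0))
    t_us = implied_time_s(nx, ny, Float32, gbs) * 1e6
    println(name, ": ", gbs, " GB/s implies ", round(t_us, digits=1),
            " us per copy, ", round(gbs / peak_gbs, digits=2), "x the quoted peak")
end
```

With these inputs each copy moves 33,554,432 bytes (16 MiB read plus 16 MiB written), so the reported bandwidth is very sensitive to how the few tens of microseconds per iteration are measured.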
Version info
Details on Julia:
Details on CUDA:
Additional context
I also tried the RTX 2080 Ti and the TITAN V, and on neither did the measured bandwidth exceed the theoretical limit. It seems that only the RTX 4090 shows this behaviour, and only with a 2^11 * 2^11 Float32 matrix (or a 2^10 * 2^10 Float64 matrix).