NVIDIA A100-SXM4-40GB (sm_80, 39.586 GiB)
Benchmarks were run on a DGX A100 at the Paderborn Center for Parallel Computing (PC2).
For comparison, see the official A100 datasheet by NVIDIA.
Peakflops
CUDA cores
julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 19.5

julia> theoretical_peakflops_gpu(; dtype=Float64, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 9.7
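
These theoretical numbers are simply core count × clock × 2, since one fused multiply-add (FMA) per core per cycle counts as two floating-point operations. A back-of-the-envelope sketch with the A100's values hardcoded (the core counts and 1410 MHz boost clock match gpuinfo() below; the FP64 core count is the published spec):

fp32_cores = 6912     # 108 SMs × 64 FP32 cores each
fp64_cores = 3456     # 108 SMs × 32 FP64 cores each (published A100 spec)
clock      = 1.41e9   # 1410 MHz boost clock

fp32_cores * clock * 2 / 1e12   # ≈ 19.5 TFLOP/s
fp64_cores * clock * 2 / 1e12   # ≈ 9.7 TFLOP/s
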
julia> peakflops_gpu(; dtype=Float32, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 19.1

julia> peakflops_gpu(; dtype=Float64, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 9.6

julia> peakflops_gpu(; dtype=Float16, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float16
 └ max: 12.8
Tensor cores
julia> theoretical_peakflops_gpu(; dtype=Int8, tensorcores=true);
Theoretical Peakflops (TOP/s):
 ├ tensorcores: true
 ├ dtype: Int8
 └ max: 623.7

julia> theoretical_peakflops_gpu(; dtype=Float16, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 311.9

julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float32
 └ max: 155.9
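
The same kind of arithmetic reproduces the tensor-core numbers: the A100 has 4 tensor cores per SM, each performing 256 dense FP16 FMAs per cycle (published spec, not part of the output above); Int8 runs at twice and TF32 at half the FP16 rate. A sketch with these values hardcoded:

sms, tc_per_sm, clock = 108, 4, 1.41e9
fma_per_tc = 256   # dense FP16 FMAs per tensor core per cycle (published spec)

fp16 = sms * tc_per_sm * fma_per_tc * 2 * clock / 1e12   # ≈ 311.9 TFLOP/s
fp16 * 2   # Int8          ≈ 623.7 TOP/s
fp16 / 2   # TensorFloat32 ≈ 155.9 TFLOP/s
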
julia> peakflops_gpu(; dtype=Int8, tensorcores=true); # as of writing, only works with CUDA.jl#master
Peakflops (TOP/s):
 ├ tensorcores: true
 ├ dtype: Int8
 └ max: 620.11

julia> peakflops_gpu(; dtype=Float16, tensorcores=true);
Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 311.2

julia> peakflops_gpu(; dtype=:TensorFloat32, tensorcores=true); # as of writing, only works with Julia >= 1.8.0 and CUDA.jl PR 1419
Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: TensorFloat32
 └ max: 155.55
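
You can get close to these rates yourself with a large Float16 GEMM through CUBLAS, which runs on the tensor cores; a GEMM performs 2n³ FLOPs. A minimal sketch (not GPUInspector's actual benchmark kernel):

using CUDA, LinearAlgebra

n = 8192
A = CuArray(rand(Float16, n, n))
B = CuArray(rand(Float16, n, n))
C = CUDA.zeros(Float16, n, n)
mul!(C, A, B)                       # warm-up (JIT compilation, CUBLAS init)
t = CUDA.@elapsed mul!(C, A, B)
println(2 * n^3 / t / 1e12, " TFLOP/s")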
Memory bandwidth
julia> theoretical_memory_bandwidth();
Theoretical Maximal Memory Bandwidth (GiB/s):
 └ max: 1448.4
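
This value follows from memory clock × 2 (double data rate) × bus width; a sketch using the 1215 MHz memory clock and 5120-bit bus that gpuinfo() reports below:

mem_clock = 1215e6      # Hz
bus_bytes = 5120 / 8    # bus width in bytes
mem_clock * 2 * bus_bytes / 2^30   # ≈ 1448.4 GiB/s (= 1555.2 GB/s)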

julia> memory_bandwidth();
Memory Bandwidth (GiB/s):
 └ max: 1220.7

julia> GiB(1220.7) |> change_base
~1310.72 GB

julia> memory_bandwidth_saxpy();
Memory Bandwidth (GiB/s):
 └ max: 1192.09
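
memory_bandwidth_saxpy derives the sustained bandwidth from a SAXPY, z = a*x + y, which moves 12 bytes per Float32 element (two reads, one write). A minimal broadcast-based sketch of the same idea (GPUInspector uses a dedicated kernel, so details differ):

using CUDA

len = 2^28                            # 3 × 1 GiB of traffic in total
x = CUDA.rand(Float32, len)
y = CUDA.rand(Float32, len)
z = CUDA.zeros(Float32, len)
a = 3.1415f0
z .= a .* x .+ y                      # warm-up
t = CUDA.@elapsed (z .= a .* x .+ y)
println(3 * sizeof(x) / t / 2^30, " GiB/s")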
Host-to-device bandwidth
julia> host2device_bandwidth()
Host <-> Device Bandwidth (GiB/s):
 └ max: 11.84

Host (pinned) <-> Device Bandwidth (GiB/s):
 └ max: 24.33
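
The factor of two between the rows comes from pinned (page-locked) host memory, which lets the GPU's copy engines DMA directly instead of staging through pageable buffers. A minimal hand-rolled sketch of the pageable measurement (not host2device_bandwidth itself):

using CUDA

A  = rand(Float32, 2^28)              # 1 GiB pageable host buffer
dA = CuArray{Float32}(undef, length(A))
copyto!(dA, A)                        # warm-up
t = CUDA.@elapsed copyto!(dA, A)
println(sizeof(A) / t / 2^30, " GiB/s")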
Peer-to-peer bandwidth
julia> p2p_bandwidth();
Bandwidth (GiB/s):
 ├ max: 247.32
 ├ min: 173.5
 ├ avg: 229.63
 └ std_dev: 31.67

julia> p2p_bandwidth_all()
8×8 Matrix{Union{Nothing, Float64}}:
   nothing 245.706 241.075 244.467 246.434 242.229 245.085 245.033
 239.046     nothing 241.776 243.853 241.626 245.136 244.467 240.379
 246.957 242.633     nothing 242.937 245.291 248.114 239.193 242.684
 244.724 241.375 244.211     nothing 245.861 238.117 245.085 242.28
 241.576 246.329 242.582 245.602     nothing 246.59 240.677 243.343
 247.114 240.18 245.965 244.006 236.616     nothing 242.28 244.673
 243.802 242.028 248.326 239.933 244.365 245.033     nothing 245.498
 245.136 246.904 239.488 243.343 244.057 240.627 243.445     nothing
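
Each off-diagonal entry is the bandwidth of a direct copy between two of the eight GPUs (the diagonal, a device to itself, is nothing). A rough single-pair sketch, assuming CUDA.jl's copyto! handles the cross-device transfer (GPUInspector's implementation differs in details):

using CUDA

device!(0)
src = CUDA.rand(Float32, 2^28)                # 1 GiB on GPU 0
device!(1)
dst = CuArray{Float32}(undef, length(src))    # 1 GiB on GPU 1
copyto!(dst, src)                             # warm-up
t = CUDA.@elapsed copyto!(dst, src)
println(sizeof(src) / t / 2^30, " GiB/s")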
GPU information
julia> CUDA.versioninfo()
CUDA toolkit 11.5, local installation
NVIDIA driver 495.29.5, for CUDA 11.5
CUDA driver 11.5

Libraries:
- CUBLAS: 11.7.3
- CURAND: 10.2.6
- CUFFT: 10.6.0
- CUSOLVER: 11.2.1
- CUSPARSE: 11.7.0
- CUPTI: 16.0.0
- NVML: 11.0.0+495.29.5
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  4: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  5: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  6: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  7: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)

julia> gpuinfo()
Device: NVIDIA A100-SXM4-40GB (CuDevice(0))
Total amount of global memory: 42.5 GB
Number of CUDA cores: 6912
Number of multiprocessors: 108 (64 CUDA cores each)
GPU max. clock rate: 1410 MHz
Memory clock rate: 1215 MHz
Memory bus width: 5120-bit
L2 cache size: 41.9 MB
Max. texture dimension sizes (1D): 131072
Max. texture dimension sizes (2D): 131072, 65536
Max. texture dimension sizes (3D): 16384, 16384, 16384
Max. layered 1D texture size: 32768 (2048 layers)
Max. layered 2D texture size: 32768, 32768 (2048 layers)
Total amount of constant memory: 65.5 kB
Total amount of shared memory per block: 49.2 kB
Total number of registers available per block: 65536
Warp size: 32
Max. number of threads per multiprocessor: 2048
Max. number of threads per block: 1024
Max. dimension size of a thread block (x,y,z): 1024, 1024, 64
Max. dimension size of a grid size (x,y,z): 2147483647, 65535, 65535
Texture alignment: 512.0 B
Maximum memory pitch: 2.1 GB
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing host memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for surfaces: Yes
Device has ECC support: Yes
Device supports Unified Addressing (UVA): Yes
Device supports managed memory: Yes
Device supports compute preemption: Yes
Supports cooperative kernel launch: Yes
Supports multi-device co-op kernel launch: Yes
Device PCI domain ID / bus ID / device ID: 0 / 7 / 0
Compute mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
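
Most of these properties can also be queried one at a time via CUDA.jl's device-attribute API, whose constants mirror the CUDA driver API; a small sketch:

using CUDA

dev = CuDevice(0)
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)      # 108
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)     # 1024
attribute(dev, CUDA.DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH)   # 5120
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE)         # in kHz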

julia> gpuinfo_p2p_access()
P2P Access Supported:
8×8 Matrix{Bool}:
 0 1 1 1 1 1 1 1
 1 0 1 1 1 1 1 1
 1 1 0 1 1 1 1 1
 1 1 1 0 1 1 1 1
 1 1 1 1 0 1 1 1
 1 1 1 1 1 0 1 1
 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 0

P2P Atomic Supported:
8×8 Matrix{Bool}:
 0 1 1 1 1 1 1 1
 1 0 1 1 1 1 1 1
 1 1 0 1 1 1 1 1
 1 1 1 0 1 1 1 1
 1 1 1 1 0 1 1 1
 1 1 1 1 1 0 1 1
 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 0