NVidia-Ampere.md

Examples

  • RTX 30xx
  • RTX Ax000
  • Orin
  • MX570

References

  1. Tuning CUDA Applications for NVIDIA Ampere GPU Architecture
  2. Ampere Architecture In-Depth
  3. Architecture Whitepaper, [backup]
  4. Compute Capability 8.7
  5. Vulkan features for RTX 3090
  6. Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis, [backup]
  7. Dissecting the Ampere GPU Architecture through Microbenchmarking, [backup]

Notes

  • The SM continues to support double-speed FP16 (HFMA) operations, as in Turing. [1]

  • Added BF16 type. [1]

  • FP32, FP16, and BF16 have the same peak TFLOPS. [1]

  • FP32 to INT32 peak throughput ratio is 2:1. [1]

  • Includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. [1]

  • GeForce RTX 3080 L1 bandwidth: 219 GB/sec

  • Work graphs.
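
The dual-datapath note above can be turned into per-SM numbers with a little arithmetic. This is a minimal sketch, not a vendor formula: the lane counts come from the note, while `PARTITIONS_PER_SM = 4` is an assumption (four partitions per SM, consistent with the `32 * 4` in the Specs section), and the function name is made up for illustration.

```python
# Lane counts per partition, taken from the datapath note above.
FP32_ONLY_LANES = 16   # datapath 1: FP32 only
SHARED_LANES = 16      # datapath 2: FP32 OR INT32 in a given clock, not both
PARTITIONS_PER_SM = 4  # assumption: four partitions per SM


def sm_ops_per_clock(shared_on_int32: bool) -> dict:
    """Peak ops per clock for one SM, depending on what datapath 2 executes."""
    fp32_lanes = FP32_ONLY_LANES + (0 if shared_on_int32 else SHARED_LANES)
    int32_lanes = SHARED_LANES if shared_on_int32 else 0
    return {
        "fp32": fp32_lanes * PARTITIONS_PER_SM,
        "int32": int32_lanes * PARTITIONS_PER_SM,
    }


pure_fp32 = sm_ops_per_clock(shared_on_int32=False)  # both datapaths do FP32
mixed = sm_ops_per_clock(shared_on_int32=True)       # datapath 2 does INT32

# Peak FP32 rate versus peak INT32 rate, i.e. the 2:1 ratio from the notes.
fp32_to_int32_ratio = pure_fp32["fp32"] / mixed["int32"]
print(pure_fp32, mixed, fp32_to_int32_ratio)
```

Under these assumptions the peak FP32 rate is only reached when no INT32 work is issued; any INT32 instruction takes a slot on the shared datapath.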

Specs

  • ops/clock per SM: [4]

    • TODO
  • RTX 3080 specs:

    • shaderSMCount: 68 [vk]
    • shaderWarpsPerSM: 48 [vk] (32?)
    • warp size: 32 [vk]
    • Shading Units: 8704 [specs] (68 * 32 * 4 ???)
    • total threads: 69 632 [calc] (68 * 32 * 32)
    • Clock: 1440 MHz / 1710 MHz [specs]
    • FP32 TFLOPS: 29.77 [specs], 37.6 [calc] {104.4K threads (68 * 48 * 32) * 1.44 GHz / (4 cycles (5?) for ADD/MUL/FMA)}
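
The RTX 3080 figures above can be cross-checked with a few lines of arithmetic. This is a hedged sketch: the inputs are the values listed in this section, and the "2 FLOP per FMA per lane per clock at boost clock" convention is an assumption about how the [specs] number was derived (it does reproduce 29.77). The variable names are illustrative only.

```python
# Inputs copied from the RTX 3080 spec list above.
SM_COUNT = 68
WARP_SIZE = 32
WARPS_PER_SM_VK = 48     # shaderWarpsPerSM as reported by [vk]
WARPS_PER_SM_ALT = 32    # the "(32?)" alternative from the note
FP32_LANES_PER_SM = 32 * 4
BASE_MHZ, BOOST_MHZ = 1440, 1710

shading_units = SM_COUNT * FP32_LANES_PER_SM             # 68 * 32 * 4
threads_48 = SM_COUNT * WARPS_PER_SM_VK * WARP_SIZE      # the "104.4K threads"
threads_32 = SM_COUNT * WARPS_PER_SM_ALT * WARP_SIZE     # the "total threads" line

# Assumed spec-sheet convention: each lane retires one FMA (2 FLOP) per clock.
spec_tflops_boost = shading_units * 2 * BOOST_MHZ / 1e6
spec_tflops_base = shading_units * 2 * BASE_MHZ / 1e6

# The [calc] estimate from the note: resident threads * base clock / 4 cycles.
calc_tflops = threads_48 * BASE_MHZ / 4 / 1e6

print(shading_units, threads_48, threads_32)
print(round(spec_tflops_boost, 2), round(spec_tflops_base, 2), round(calc_tflops, 2))
```

The two results count different things: the first counts execution lanes, the second counts resident threads, so they are not expected to agree.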