kmod-5.10-nvidia: move to R535 branch from R470 #181

yeazelm · 2024-10-05T01:43:22Z

Issue number:

Closes bottlerocket-os/bottlerocket#4220

Description of changes:
The R470 branch is end of life. In order to keep variants using the 5.10 kernel on a supported NVIDIA driver, this commit moves the kmod package for 5.10 to build the R535 branch and brings the driver in line with the other two kernel kmod packages in packaging style.

Note that the only difference between the spec and package configuration of kmod-5.10-nvidia and those of the other kmod-*-nvidia packages is an additional Provides: %{name}-tesla-470, so that current variants depending on the old name will get this R535 version instead. This change renames the package from kmod-5.10-nvidia-tesla-470 to kmod-5.10-nvidia-tesla-535, which is depended upon directly at https://github.com/bottlerocket-os/bottlerocket/blob/develop/variants/aws-ecs-1-nvidia/Cargo.toml#L33
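To illustrate why the extra Provides keeps existing dependents working, here is a minimal sketch of name resolution. It is a toy model written for this comment, not the RPM resolver or the Bottlerocket build tooling; the Package class and resolve function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """Hypothetical stand-in for a built package and its virtual names."""
    name: str
    provides: frozenset = frozenset()

    def satisfies(self, requirement):
        # A requirement is met by the package's own name or by any name it provides.
        return requirement == self.name or requirement in self.provides


def resolve(requirement, packages):
    """Return the first package that satisfies the requirement, or None."""
    for pkg in packages:
        if pkg.satisfies(requirement):
            return pkg
    return None


# The renamed package, carrying the old name as an additional Provides.
new_kmod = Package(
    name="kmod-5.10-nvidia-tesla-535",
    provides=frozenset({"kmod-5.10-nvidia-tesla-470"}),
)

# The same package without the alias, for comparison.
new_kmod_no_alias = Package(name="kmod-5.10-nvidia-tesla-535")
```

With the alias, a variant that still asks for kmod-5.10-nvidia-tesla-470 resolves to the R535 package; without it, the old name resolves to nothing and the dependent has to be updated.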

Testing done:
Built aws-k8s-1.23-nvidia with the changes and validated that the driver is using R535 on 5.10:

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.183.06  Wed Jun 26 06:46:07 UTC 2024
GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)

uname -a
Linux ip-192-168-69-122.us-west-2.compute.internal 5.10.225 #1 SMP Thu Sep 26 02:29:17 UTC 2024 x86_64 GNU/Linux

Built aws-ecs-1-nvidia with the changes and validated that the driver is using R535 and 5.10:

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.183.06  Wed Jun 26 06:46:07 UTC 2024
GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)

uname -a
Linux ip-10-0-6-76.us-west-2.compute.internal 5.10.226 #1 SMP Sat Oct 5 00:01:38 UTC 2024 x86_64 GNU/Linux
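For anyone who wants to repeat the driver check without reading the output by eye, a rough sketch follows. It assumes the NVRM line has the same shape as the output pasted above; the function names are hypothetical and nothing here is part of this change.

```python
import re


def nvidia_driver_version(text):
    """Extract the driver version from /proc/driver/nvidia/version content.

    Assumes the 'Kernel Module  <version>' layout seen in the output above.
    Returns None if no version is found.
    """
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def driver_branch(version):
    """Return the major release branch, e.g. 535 for '535.183.06'."""
    return int(version.split(".")[0])


# Sample copied from the validation output in this description.
sample = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.183.06  "
    "Wed Jun 26 06:46:07 UTC 2024\n"
    "GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)\n"
)

version = nvidia_driver_version(sample)
print(version, driver_branch(version))
```

On a host, the same functions could be fed the contents of /proc/driver/nvidia/version instead of the pasted sample.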

Ran gpu smoke tests to confirm it's working on aws-ecs-1-nvidia:

----------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                    message                                                     |
|---------------|----------------------------------------------------------------------------------------------------------------|
| 1728092131513 | =========================================                                                                      |
| 1728092131513 |   Running sample UnifiedMemoryPerf                                                                             |
| 1728092131513 | =========================================                                                                      |
| 1728092133444 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092153552 | Running ........................................................                                               |
| 1728092153552 | Overall Time For matrixMultiplyPerf                                                                            |
| 1728092153552 | Printing Average of 20 measurements in (ms)                                                                    |
| 1728092153552 | Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs                                        |
| 1728092153552 | 4   0.157   0.171   0.324   0.016   0.032   0.029   0.034   0.026                                              |
| 1728092153552 | 16   0.176   0.212   0.417   0.040   0.060   0.052   0.062   0.056                                             |
| 1728092153552 | 64   0.327   0.319   0.773   0.133   0.167   0.159   0.132   0.124                                             |
| 1728092153552 | 256   0.979   0.772   1.213   0.750   1.003   0.559   0.469   0.464                                            |
| 1728092153552 | 1024   3.101   3.257   3.801   5.005   2.463   2.245   1.889   1.876                                           |
| 1728092153552 | 4096  12.702  13.572  14.605  36.327  10.050   9.835   9.348   9.338                                           |
| 1728092153552 | 16384  57.512  59.709  66.287 309.715  48.587  48.354  46.022  45.981                                          |
| 1728092153552 | NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. |
| 1728092154035 | =========================================                                                                      |
| 1728092154035 |   Running sample deviceQuery                                                                                   |
| 1728092154035 | =========================================                                                                      |
| 1728092155896 | ./deviceQuery Starting...                                                                                      |
| 1728092155896 |  CUDA Device Query (Runtime API) version (CUDART static linking)                                               |
| 1728092155896 | Detected 1 CUDA Capable device(s)                                                                              |
| 1728092155896 | Device 0: "Tesla T4"                                                                                           |
| 1728092155896 |   CUDA Driver Version / Runtime Version          12.2 / 11.4                                                   |
| 1728092155896 |   CUDA Capability Major/Minor version number:    7.5                                                           |
| 1728092155896 |   Total amount of global memory:                 14931 MBytes (15655829504 bytes)                              |
| 1728092155896 |   (040) Multiprocessors, (064) CUDA Cores/MP:    2560 CUDA Cores                                               |
| 1728092155896 |   GPU Max Clock rate:                            1590 MHz (1.59 GHz)                                           |
| 1728092155896 |   Memory Clock rate:                             5001 Mhz                                                      |
| 1728092155896 |   Memory Bus Width:                              256-bit                                                       |
| 1728092155896 |   L2 Cache Size:                                 4194304 bytes                                                 |
| 1728092155896 |   Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)     |
| 1728092155896 |   Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers                                       |
| 1728092155896 |   Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers                                |
| 1728092155896 |   Total amount of constant memory:               65536 bytes                                                   |
| 1728092155896 |   Total amount of shared memory per block:       49152 bytes                                                   |
| 1728092155896 |   Total shared memory per multiprocessor:        65536 bytes                                                   |
| 1728092155896 |   Total number of registers available per block: 65536                                                         |
| 1728092155896 |   Warp size:                                     32                                                            |
| 1728092155896 |   Maximum number of threads per multiprocessor:  1024                                                          |
| 1728092155896 |   Maximum number of threads per block:           1024                                                          |
| 1728092155896 |   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)                                               |
| 1728092155896 |   Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)                                     |
| 1728092155896 |   Maximum memory pitch:                          2147483647 bytes                                              |
| 1728092155896 |   Texture alignment:                             512 bytes                                                     |
| 1728092155896 |   Concurrent copy and kernel execution:          Yes with 3 copy engine(s)                                     |
| 1728092155896 |   Run time limit on kernels:                     No                                                            |
| 1728092155896 |   Integrated GPU sharing Host Memory:            No                                                            |
| 1728092155896 |   Support host page-locked memory mapping:       Yes                                                           |
| 1728092155896 |   Alignment requirement for Surfaces:            Yes                                                           |
| 1728092155896 |   Device has ECC support:                        Enabled                                                       |
| 1728092155896 |   Device supports Unified Addressing (UVA):      Yes                                                           |
| 1728092155896 |   Device supports Managed Memory:                Yes                                                           |
| 1728092155896 |   Device supports Compute Preemption:            Yes                                                           |
| 1728092155896 |   Supports Cooperative Kernel Launch:            Yes                                                           |
| 1728092155896 |   Supports MultiDevice Co-op Kernel Launch:      Yes                                                           |
| 1728092155896 |   Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30                                                    |
| 1728092155896 |   Compute Mode:                                                                                                |
| 1728092155896 |      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >                  |
| 1728092155896 | deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 11.4, NumDevs = 1        |
| 1728092155896 | Result = PASS                                                                                                  |
| 1728092156341 | =========================================                                                                      |
| 1728092156341 |   Running sample globalToShmemAsyncCopy                                                                        |
| 1728092156341 | =========================================                                                                      |
| 1728092159778 | [globalToShmemAsyncCopy] - Starting...                                                                         |
| 1728092159778 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092159778 | MatrixA(1280,1280), MatrixB(1280,1280)                                                                         |
| 1728092159778 | Running kernel = 0 - AsyncCopyMultiStageLargeChunk                                                             |
| 1728092159778 | Computing result using CUDA Kernel...                                                                          |
| 1728092159778 | done                                                                                                           |
| 1728092159778 | Performance= 320.01 GFlop/s, Time= 13.107 msec, Size= 4194304000 Ops, WorkgroupSize= 256 threads/block         |
| 1728092159778 | Checking computed result for correctness: Result = PASS                                                        |
| 1728092159778 | NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. |
| 1728092160210 | =========================================                                                                      |
| 1728092160210 |   Running sample immaTensorCoreGemm                                                                            |
| 1728092160210 | =========================================                                                                      |
| 1728092163253 | Initializing...                                                                                                |
| 1728092163253 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092163253 | M: 4096 (16 x 256)                                                                                             |
| 1728092163253 | N: 4096 (16 x 256)                                                                                             |
| 1728092163253 | K: 4096 (16 x 256)                                                                                             |
| 1728092163253 | Preparing data for GPU...                                                                                      |
| 1728092163253 | Required shared memory size: 64 Kb                                                                             |
| 1728092163253 | Computing... using high performance kernel compute_gemm_imma                                                   |
| 1728092163253 | Time: 4.160704 ms                                                                                              |
| 1728092163253 | TOPS: 33.03                                                                                                    |
| 1728092163768 | =========================================                                                                      |
| 1728092163768 |   Running sample reductionMultiBlockCG                                                                         |
| 1728092163768 | =========================================                                                                      |
| 1728092167012 | reductionMultiBlockCG Starting...                                                                              |
| 1728092167012 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092167012 | 33554432 elements                                                                                              |
| 1728092167012 | numThreads: 1024                                                                                               |
| 1728092167012 | numBlocks: 40                                                                                                  |
| 1728092167012 | Launching SinglePass Multi Block Cooperative Groups kernel                                                     |
| 1728092167012 | Average time: 1.453631 ms                                                                                      |
| 1728092167012 | Bandwidth:    92.332764 GB/s                                                                                   |
| 1728092167012 | GPU result = 1.992401361465                                                                                    |
| 1728092167012 | CPU result = 1.992401361465                                                                                    |
| 1728092167530 | =========================================                                                                      |
| 1728092167530 |   Running sample shfl_scan                                                                                     |
| 1728092167530 | =========================================                                                                      |
| 1728092169607 | Starting shfl_scan                                                                                             |
| 1728092169607 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092169607 | > Detected Compute SM 7.5 hardware with 40 multi-processors                                                    |
| 1728092169607 | Starting shfl_scan                                                                                             |
| 1728092169607 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092169607 | > Detected Compute SM 7.5 hardware with 40 multi-processors                                                    |
| 1728092169607 | Computing Simple Sum test                                                                                      |
| 1728092169607 | ---------------------------------------------------                                                            |
| 1728092169607 | Initialize test data [1, 1, 1...]                                                                              |
| 1728092169607 | Scan summation for 65536 elements, 256 partial sums                                                            |
| 1728092169607 | Partial summing 256 elements with 1 blocks of size 256                                                         |
| 1728092169607 | Test Sum: 65536                                                                                                |
| 1728092169607 | Time (ms): 0.027360                                                                                            |
| 1728092169607 | 65536 elements scanned in 0.027360 ms -> 2395.321533 MegaElements/s                                            |
| 1728092169607 | CPU verify result diff (GPUvsCPU) = 0                                                                          |
| 1728092169607 | CPU sum (naive) took 0.030690 ms                                                                               |
| 1728092169607 | Computing Integral Image Test on size 1920 x 1080 synthetic data                                               |
| 1728092169607 | ---------------------------------------------------                                                            |
| 1728092169607 | Method: Fast  Time (GPU Timer): 0.050528 ms Diff = 0                                                           |
| 1728092169607 | Method: Vertical Scan  Time (GPU Timer): 0.134464 ms                                                           |
| 1728092169607 | CheckSum: 2073600, (expect 1920x1080=2073600)                                                                  |
| 1728092170135 | =========================================                                                                      |
| 1728092170135 |   Running sample simpleAWBarrier                                                                               |
| 1728092170135 | =========================================                                                                      |
| 1728092172261 | ./simpleAWBarrier starting...                                                                                  |
| 1728092172261 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092172261 | Launching normVecByDotProductAWBarrier kernel with numBlocks = 40 blockSize = 1024                             |
| 1728092172284 | Result = PASSED                                                                                                |
| 1728092172296 | ./simpleAWBarrier completed, returned OK                                                                       |
| 1728092172824 | =========================================                                                                      |
| 1728092172824 |   Running sample simpleAtomicIntrinsics                                                                        |
| 1728092172824 | =========================================                                                                      |
| 1728092174890 | simpleAtomicIntrinsics starting...                                                                             |
| 1728092174890 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092174890 | Processing time: 174.744003 (ms)                                                                               |
| 1728092174890 | simpleAtomicIntrinsics completed, returned OK                                                                  |
| 1728092175385 | =========================================                                                                      |
| 1728092175385 |   Running sample simpleVoteIntrinsics                                                                          |
| 1728092175385 | =========================================                                                                      |
| 1728092177454 | [simpleVoteIntrinsics]                                                                                         |
| 1728092177454 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092177454 | > GPU device has 40 Multi-Processors, SM 7.5 compute capabilities                                              |
| 1728092177454 | [VOTE Kernel Test 1/3]                                                                                         |
| 1728092177454 |  Running <<Vote.Any>> kernel1 ...                                                                              |
| 1728092177454 |  OK                                                                                                            |
| 1728092177454 | [VOTE Kernel Test 2/3]                                                                                         |
| 1728092177454 |  Running <<Vote.All>> kernel2 ...                                                                              |
| 1728092177454 |  OK                                                                                                            |
| 1728092177454 | [VOTE Kernel Test 3/3]                                                                                         |
| 1728092177454 |  Running <<Vote.Any>> kernel3 ...                                                                              |
| 1728092177454 |  OK                                                                                                            |
| 1728092177454 |  Shutting down...                                                                                              |
| 1728092177986 | =========================================                                                                      |
| 1728092177986 |   Running sample vectorAdd                                                                                     |
| 1728092177986 | =========================================                                                                      |
| 1728092180034 | [Vector addition of 50000 elements]                                                                            |
| 1728092180034 | Copy input data from the host memory to the CUDA device                                                        |
| 1728092180034 | CUDA kernel launch with 196 blocks of 256 threads                                                              |
| 1728092180034 | Copy output data from the CUDA device to the host memory                                                       |
| 1728092180034 | Test PASSED                                                                                                    |
| 1728092180034 | Done                                                                                                           |
| 1728092180552 | =========================================                                                                      |
| 1728092180552 |   Running sample warpAggregatedAtomicsCG                                                                       |
| 1728092180552 | =========================================                                                                      |
| 1728092183087 | GPU Device 0: "Turing" with compute capability 7.5                                                             |
| 1728092183087 | CPU max matches GPU max                                                                                        |
| 1728092183087 | Warp Aggregated Atomics PASSED                                                                                 |
----------------------------------------------------------------------------------------------------------------------------------
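As a quick way to confirm which samples actually ran, a small sketch that pulls the sample names out of a log table shaped like the one above. The row layout (a numeric timestamp column followed by a message column) is taken from this output; the helper name is hypothetical.

```python
import re

# Matches data rows such as "| 1728092177986 |   Running sample vectorAdd   |".
# The header and separator rows do not start with a numeric timestamp, so they are skipped.
ROW = re.compile(r"^\|\s*(\d+)\s*\|(.*)\|$")

MARKER = "Running sample "


def samples_run(table_text):
    """Return the sample names from 'Running sample <name>' rows, in order."""
    names = []
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        message = match.group(2).strip()
        if message.startswith(MARKER):
            names.append(message[len(MARKER):])
    return names


# Short excerpt in the same layout as the table above.
excerpt = """\
|   timestamp   |                 message                 |
|---------------|-----------------------------------------|
| 1728092177986 |   Running sample vectorAdd              |
| 1728092180034 | Test PASSED                             |
| 1728092180552 |   Running sample warpAggregatedAtomicsCG |
"""

print(samples_run(excerpt))
```

Not every sample prints an explicit PASS line, so this only lists what ran; pass or fail still has to be read from each sample's own output.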

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

yeazelm · 2024-10-09T18:42:43Z

^ push to rebase on develop. Validated that gpu tests pass on aws-ecs-1-nvidia and persistenced.

yeazelm · 2024-10-10T04:07:02Z

^ removes the additional Provides which will be handled in a major version bump due to the breaking change for 5.10 kernel variants using this kmod.

The R470 branch is end of life. In order to keep variants using the 5.10 kernel on a supported NVIDIA driver, this commit moves the kmod package for 5.10 to build the R535 branch and brings the driver in line with the other two kernel kmod packages in packaging style.

Signed-off-by: Matthew Yeazel <[email protected]>

yeazelm · 2024-10-10T18:22:22Z

^ Remove nvidia-tesla-tmpfiles.conf.in, which is no longer needed for this package.

yeazelm requested review from bcressey and arnaldo2792 October 5, 2024 01:43

yeazelm force-pushed the 470_to_535 branch 2 times, most recently from 1c4c81f to b9e9915 Compare October 9, 2024 18:41

yeazelm force-pushed the 470_to_535 branch from b9e9915 to d25046c Compare October 9, 2024 23:38

yeazelm force-pushed the 470_to_535 branch from d25046c to d36b035 Compare October 10, 2024 18:19

bcressey approved these changes Oct 11, 2024

arnaldo2792 approved these changes Oct 11, 2024

yeazelm merged commit e781c61 into bottlerocket-os:develop Oct 11, 2024
2 checks passed
