Automatically use Open GPU drivers when desired #114
Conversation
sources/ghostdog/src/main.rs
Outdated
```rust
/// Given a PCI ID, search the Open GPU Supported Devices File to determine if the Open GPU Driver should be used
fn find_preferred_driver(pci_id: String) -> Result<String> {
    // This corresponds to T4 devices which are Turing based and supported, but currently do not show up in the supported-devices file
    if pci_id == "10DE:1EB8" {
```
I noticed that the g5g instances also don't match up on the supported list https://github.com/NVIDIA/open-gpu-kernel-modules?tab=readme-ov-file#compatible-gpus but are the Turing generation of devices that should support it. I'll do some more testing to see if the driver works on these instances too.
I was able to get light confirmation from NVIDIA that these devices should use the proprietary driver so I've removed this if statement.
```ini
[Install]
WantedBy=load-kernel-modules.service
```
This is very racy - systemd will always queue the job immediately, and if it happens to start before ghostdog finishes creating the link, then the condition check won't pass and it will fail to start. It looks like you're working around that by pushing the job after systemd-tmpfiles-setup.service, but that's a hack.
Zooming out a bit, it seems like you've created two synchronization problems for yourself:
- the need for the correct config file to be in place before either `driverdog link` or `driverdog load` can run
- the need to copy the .ko files into place before `driverdog load` runs, in the open GPU case

You can eliminate the second coordination problem here by making it a `driverdog` config problem instead - having a `copy-source` field instead of `link-objects` to tell `driverdog` where to find the `.ko`.
I know I pushed back on more `driverdog` functionality here but I can be wrong sometimes. It would let you delete this unit (and its sync problem) and focus only on the "correct config for driverdog" problem.
I've resolved this by moving away from `udev` entirely. We can determine what decision to make at the time the unit file runs. It costs very little to run `ghostdog match-nvidia-driver` for these invocations, so keeping track of these decisions somewhere else doesn't look to be worth it. Testing on a g3.4xlarge, which has to crawl the entire file without making a match (worst-case runtime), is still insignificant compared to the time it takes to link/load the modules:
```console
bash-5.1# time ghostdog match-nvidia-driver tesla

real	0m0.005s
user	0m0.002s
sys	0m0.003s
bash-5.1# time ghostdog match-nvidia-driver open-gpu
open-gpu is not preferred driver: tesla

real	0m0.005s
user	0m0.005s
sys	0m0.000s
```
I can be more precise in the measurement if needed, but it looks like `ghostdog` is very fast at providing this decision.
Force-pushed from e712775 to 8b4aea4
^ Just pushed a bunch of changes; this should "reset" the PR to the new approach and remove the code already reviewed.
```diff
 After=link-tesla-kernel-modules.service
 Requires=link-tesla-kernel-modules.service
 # Disable manual restarts to prevent loading kernel modules
 # that weren't linked by the running system
 RefuseManualStart=true
 RefuseManualStop=true

 [Service]
 Type=oneshot
-ExecStart=/usr/bin/driverdog load-modules
+ExecCondition=/usr/bin/ghostdog match-nvidia-driver tesla
+ExecStart=/usr/bin/driverdog --modules-set nvidia-tesla load-modules
```
Now that these units are so specific to NVIDIA, they don't really fit with the general purpose code in `os`.

I would either move them to the `kmod-*-nvidia` packages, or else turn these into unit templates and adjust the logic a bit so we can do something like:

```ini
[Unit]
After=link-kernel-modules@%i.service
Requires=link-kernel-modules@%i.service
...

[Service]
Type=oneshot
ExecCondition=/usr/bin/ghostdog match-driver %I
ExecStart=/usr/bin/driverdog --modules-set %I load-modules
```

Then the `kmod-*-nvidia` packages could just add symlinks to instantiate the units they supported.
This makes sense, I'll move them into the kmod packages. I had issues getting templates to block boot, so we'll probably opt for explicit files here to keep things straightforward, even if it's a bit of copying the same things over.
sources/ghostdog/src/main.rs
Outdated
```rust
/// The GPU Device Data contains various features of the device. Only Name, Device ID, and Features are required
/// a particular device
```
nit: missing word, guessing ...?

```diff
 /// The GPU Device Data contains various features of the device. Only Name, Device ID, and Features are required
-/// a particular device
+/// for a particular device
```
sources/ghostdog/src/main.rs
Outdated
```rust
for input_device in present_devices.iter() {
    match open_gpu_devices {
        SupportedDevicesConfiguration::OpenGpu(ref device_list) => {
            let formatted_device_id = format!("0x{}", input_device.device());
            for supported_device in device_list.iter() {
                if supported_device.device_id == formatted_device_id {
                    return Ok("open-gpu".to_string());
                }
            }
        }
    }
}
```
In a system with a large number of GPUs that weren't supported by the open driver, and a JSON file with a large number of different devices, we could end up doing a lot of extra work.
It would be more efficient to transform the open GPU device data into a hash set of device IDs. Then we would guarantee one pass through the list of supported devices, and one lookup per device.
I tried putting these into `HashSet`s to ensure we didn't have negative performance even in cases where there are 16 devices; the runs all took between 5-6ms, both with the Vec implementation as it stands and with `HashSet::contains()`. It stands to reason that HashSets would be better, so I can convert, but I don't think we gain a lot here with the ~700 devices listed in the current file, even in the worst case.
```rust
// Bogus PCI devices but "miss" to ensure there will be no match
let present_devices: Vec<ListDevicesOutput> = vec![ListDevicesOutput {
    pci_slot: "00:00.0".to_string(),
    class: "0600".to_string(),
    vendor: "8086".to_string(),
    device: "1234".to_string(),
    program_interface: Some("00".to_string()),
    subsystem_vendor: Some("1d0f".to_string()),
    subsystem_device: Some("1237".to_string()),
    ..Default::default()
}; 16];

// reduce down to unique ids so we only iterate on uniqueness (down from 16 to 1 here)
let unique_ids: HashSet<String> = present_devices
    .iter()
    .map(|x| format!("0x{}", x.device()))
    .collect();

match open_gpu_devices {
    SupportedDevicesConfiguration::OpenGpu(ref device_list) => {
        let open_devices_list: HashSet<String> =
            device_list.iter().map(|x| x.device_id.clone()).collect();
        for input_device in unique_ids.iter() {
            if open_devices_list.contains(input_device) {
                return Ok("open-gpu".to_string());
            }
        }
    }
}
```
The compiler might be tricking me here and optimizing as well (I did try creating the Vec of unique objects and still saw the same times for input devices), so I can test a bit more. But even so, 5ms runs when parsing the entire file is good, considering the kernel module loads take seconds. We might save more by saving the decision to a file to be checked before rerunning the deserialization and search, since this will be called several times by systemd.
Force-pushed from 8b4aea4 to d2e111a
^ Updated to reflect comments. This now holds back open GPU use in some cases where it is known to cause issues. Also moved the kernel module services to the kmod packages.
Force-pushed from d2e111a to c9d1de9
^ Pushed a new set of commits; this renames the systemd services to be clearer and fixes the unique id bug. Validated on a g5.2xlarge, which used the proprietary driver, and a g4d.24xlarge, which chose the open GPU drivers.
ghostdog can now look up PCI devices and confirm which NVIDIA driver should be used. The match-nvidia-driver subcommand takes one argument, the type of driver such as tesla or open-gpu, and exits 0 if that matches the PCI devices currently present. If those devices do not match the provided driver, it exits 1. This can be used to set ExecCondition for things like linking or loading drivers. Signed-off-by: Matthew Yeazel <[email protected]>
This adds upon the present logic to build the open-gpu driver to provide configuration that driverdog can use to work with the open-gpu drivers. Signed-off-by: Matthew Yeazel <[email protected]>
This adds upon the present logic to build the open-gpu driver to provide configuration that driverdog can use to work with the open-gpu drivers. Signed-off-by: Matthew Yeazel <[email protected]>
The os package should not be concerned with specifics to the NVIDIA kmod module preferences. os will provide driverdog, but the configuration files it reads will be provided by the package providing the modules to keep that specific logic local to their domain. Signed-off-by: Matthew Yeazel <[email protected]>
The os package doesn't need to concern itself with NVIDIA specific loading behavior. It will provide driverdog, but the configurations read by driverdog will be included with the specific kernel modules package that provides the drivers described in the configuration. Signed-off-by: Matthew Yeazel <[email protected]>
The os package doesn't need to concern itself with NVIDIA specific loading behavior. It will provide driverdog, but the configurations read by driverdog will be included with the specific kernel modules package that provides the drivers described in the configuration. This moves the tesla and open-gpu services into the kmod-5.15-nvidia package instead. Signed-off-by: Matthew Yeazel <[email protected]>
The os package doesn't need to concern itself with NVIDIA specific loading behavior. It will provide driverdog, but the configurations read by driverdog will be included with the specific kernel modules package that provides the drivers described in the configuration. This moves the tesla and open-gpu services into the kmod-6.1-nvidia package instead. Signed-off-by: Matthew Yeazel <[email protected]>
c9d1de9
to
b44e4c7
Compare
^ Rebase on develop
```rust
let present_devices =
    pciclient::list_devices(list_input).context(error::ListPciDevicesSnafu)?;

// If there a multiple devices with the same ID, dedup them to minimize iterations
```
nit:

```diff
-// If there a multiple devices with the same ID, dedup them to minimize iterations
+// If there are multiple devices with the same ID, dedup them to minimize iterations
```
Description of changes:
This adds the ability to select the open driver to the kmod packages in addition to the proprietary drivers at runtime. This should enable newer instance types that support the open GPU driver to use it automatically while keeping the older instance types on the proprietary driver.
This now uses `ExecCondition` to call `ghostdog match-nvidia-driver`, which checks which devices are present and then either stops the link/load for the non-desired driver or continues if the present devices match the requested driver.

It is best reviewed in commit order. `ghostdog` can be reviewed separately, the two kmod packages are mirrors of each other, and then the `os` package commit glues it all together.

Additional notes
I added an additional "copy only" configuration to copy the modules but not load them for `nvidia-drm` and `nvidia-peermem`. This is purely optional and we can choose to still not provide them, but since we are building them and they are present, it might be useful for some use cases to make them loadable at runtime. I haven't been able to find a way to exercise them though, so we can remove that configuration file from the PR and they will remain in the staging directory until we decide to pull them in.

Testing done:
Tested on g5.2xlarge and g3.4xlarge, on both k8s 1.30 and 1.25, to confirm that the correct driver is loaded, and ran the smoke tests to ensure the drivers still worked on each node.
g3.4xlarge
g4d.24xlarge
Example output from the gpu test:
Output for the g4d.24xlarge:
Output for the g3.4xlarge:
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.