Automatically use Open GPU drivers when desired #114

Merged
7 commits merged into bottlerocket-os:develop on Sep 23, 2024

Conversation


@yeazelm yeazelm commented Aug 29, 2024

Description of changes:
This adds the ability for the kmod packages to select the open GPU driver at runtime, in addition to the proprietary driver. This should enable newer instance types that support the open GPU driver to use it automatically while keeping the older instance types on the proprietary driver.

This now uses ExecCondition to call ghostdog match-nvidia-driver, which checks which devices are present and then either stops the link/load for the non-desired driver, or continues if the detected devices match the requested driver.
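
For context, the ExecCondition mechanism is just an exit-code check. Below is a minimal, hypothetical Rust sketch of that contract — the function name and the hard-coded preferred driver are illustrative only; the real ghostdog derives the preferred driver by scanning the PCI devices that are present:

use std::process::ExitCode;

// Simplified sketch of the ExecCondition contract: exit 0 lets systemd run
// the unit's ExecStart; an exit code in 1..=254 makes systemd skip the unit
// ("Skipped due to 'exec-condition'" in the journal output below).
fn match_nvidia_driver(requested: &str, preferred: &str) -> ExitCode {
    if requested == preferred {
        ExitCode::SUCCESS
    } else {
        // Mirrors journal lines like "tesla is not preferred driver: open-gpu"
        eprintln!("{requested} is not preferred driver: {preferred}");
        ExitCode::from(1)
    }
}

fn main() -> ExitCode {
    // A unit would carry: ExecCondition=/usr/bin/ghostdog match-nvidia-driver tesla
    let requested = std::env::args().nth(1).unwrap_or_default();
    let preferred = "open-gpu"; // placeholder; really derived from the PCI scan
    match_nvidia_driver(&requested, preferred)
}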

It is best reviewed in commit order. ghostdog can be reviewed separately, the two kmod packages are mirrors of each other, and then the os package commit glues it all together.

Additional notes

I added an additional "copy only" configuration that copies the nvidia-drm and nvidia-peermem modules but does not load them. This is purely optional and we can choose to still not provide them, but since we are building them and they are present, it might be useful for some use cases to make them loadable at runtime. I haven't been able to find a way to exercise them, though, so we can remove that configuration file from the PR and they will remain in the staging directory until we decide to pull them in.

Testing done:
Tested on g5.2xlarge and g3.4xlarge, on both k8s 1.30 and 1.25, to confirm that the correct driver is loaded, and ran the smoke tests to ensure the drivers still worked on each node.

g3.4xlarge

bash-5.1# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.183.06  Wed Jun 26 06:46:07 UTC 2024
GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)

g4d.24xlarge

bash-5.1# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  535.183.06  Release Build  (dvs-builder@U16-I3-A14-4-1)  Wed Jun 26 06:59:29 UTC 2024
GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)

Example output from the gpu test:

[root@gpu-tests-77lfz /]# ./run.sh

=========================================
  Running sample UnifiedMemoryPerf
=========================================

GPU Device 0: "Ampere" with compute capability 8.6
....

[simpleVoteIntrinsics]
GPU Device 0: "Ampere" with compute capability 8.6

> GPU device has 80 Multi-Processors, SM 8.6 compute capabilities

[VOTE Kernel Test 1/3]
        Running <<Vote.Any>> kernel1 ...
        OK

[VOTE Kernel Test 2/3]
        Running <<Vote.All>> kernel2 ...
        OK

[VOTE Kernel Test 3/3]
        Running <<Vote.Any>> kernel3 ...
        OK
        Shutting down...

=========================================
  Running sample vectorAdd
=========================================

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

=========================================
  Running sample warpAggregatedAtomicsCG
=========================================

GPU Device 0: "Ampere" with compute capability 8.6

CPU max matches GPU max

Warp Aggregated Atomics PASSED

Output for the g4d.24xlarge:

bash-5.1# systemctl | grep kernel
  sys-kernel-config.mount                                                                                                                       loaded active mounted Kernel Configuration File System
  sys-kernel-debug.mount                                                                                                                        loaded active mounted Kernel Debug File System
  sys-kernel-tracing.mount                                                                                                                      loaded active mounted Kernel Trace File System
  var-lib-kernel\x2ddevel-.overlay-lower.mount                                                                                                  loaded active mounted Kernel Development Sources (Read-Only)
  x86_64\x2dbottlerocket\x2dlinux\x2dgnu-sys\x2droot-usr-src-kernels.mount                                                                      loaded active mounted Kernel Development Sources (Read-Write)
  copy-open-gpu-kernel-modules.service                                                                                                          loaded active exited  Link additional kernel modules
  load-open-gpu-kernel-modules.service                                                                                                          loaded active exited  Load additional kernel modules
  systemd-udevd-kernel.socket                                                                                                                   loaded active running udev Kernel Socket

bash-5.1# journalctl -u copy-open-gpu-kernel-modules
Sep 15 22:17:15 localhost systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:18 localhost driverdog[1489]: 22:17:18 [INFO] Copied nvidia.ko
Sep 15 22:17:19 ip-192-168-81-80.us-west-2.compute.internal driverdog[1489]: 22:17:19 [INFO] Copied nvidia-modeset.ko
Sep 15 22:17:25 ip-192-168-81-80.us-west-2.compute.internal driverdog[1489]: 22:17:25 [INFO] Copied nvidia-uvm.ko
Sep 15 22:17:25 ip-192-168-81-80.us-west-2.compute.internal driverdog[1742]: 22:17:25 [INFO] Copied nvidia-drm.ko
Sep 15 22:17:25 ip-192-168-81-80.us-west-2.compute.internal driverdog[1742]: 22:17:25 [INFO] Copied nvidia-peermem.ko
Sep 15 22:17:25 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Finished Link additional kernel modules.

bash-5.1# journalctl -u link-tesla-kernel-modules
Sep 15 22:17:15 localhost systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:15 localhost ghostdog[1415]: tesla is not preferred driver: open-gpu
Sep 15 22:17:15 localhost systemd[1]: link-tesla-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:15 localhost systemd[1]: Condition check resulted in Link additional kernel modules being skipped.
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal ghostdog[1786]: tesla is not preferred driver: open-gpu
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: link-tesla-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Condition check resulted in Link additional kernel modules being skipped.
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal ghostdog[1819]: tesla is not preferred driver: open-gpu
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: link-tesla-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Condition check resulted in Link additional kernel modules being skipped.

bash-5.1# journalctl -u load-tesla-kernel-modules
Sep 15 22:17:15 localhost systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:15 localhost ghostdog[1490]: tesla is not preferred driver: open-gpu
Sep 15 22:17:15 localhost systemd[1]: load-tesla-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:15 localhost systemd[1]: Condition check resulted in Load additional kernel modules being skipped.
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal ghostdog[1790]: tesla is not preferred driver: open-gpu
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: load-tesla-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Condition check resulted in Load additional kernel modules being skipped.
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal ghostdog[1836]: tesla is not preferred driver: open-gpu
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: load-tesla-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:27 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Condition check resulted in Load additional kernel modules being skipped.

bash-5.1# journalctl -u load-open-gpu-kernel-modules
Sep 15 22:17:25 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal driverdog[1746]: 22:17:26 [INFO] Updated modules dependencies
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal driverdog[1746]: 22:17:26 [INFO] Loaded kernel modules
Sep 15 22:17:26 ip-192-168-81-80.us-west-2.compute.internal systemd[1]: Finished Load additional kernel modules.

Output for the g3.4xlarge:

bash-5.1# systemctl | grep kernel
  sys-kernel-config.mount                                                                                                                       loaded active mounted Kernel Configuration File System
  sys-kernel-debug.mount                                                                                                                        loaded active mounted Kernel Debug File System
  sys-kernel-tracing.mount                                                                                                                      loaded active mounted Kernel Trace File System
  var-lib-kernel\x2ddevel-.overlay-lower.mount                                                                                                  loaded active mounted Kernel Development Sources (Read-Only)
  x86_64\x2dbottlerocket\x2dlinux\x2dgnu-sys\x2droot-usr-src-kernels.mount                                                                      loaded active mounted Kernel Development Sources (Read-Write)
  link-tesla-kernel-modules.service                                                                                                             loaded active exited  Link additional kernel modules
  load-tesla-kernel-modules.service                                                                                                             loaded active exited  Load additional kernel modules
  systemd-udevd-kernel.socket                                                                                                                   loaded active running udev Kernel Socket

bash-5.1# journalctl -u link-tesla-kernel-modules
Sep 15 22:17:18 localhost systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:24 ip-192-168-24-94.us-west-2.compute.internal driverdog[2516]: 22:17:24 [INFO] Linked object 'nvidia.o'
Sep 15 22:17:24 ip-192-168-24-94.us-west-2.compute.internal driverdog[2516]: 22:17:24 [INFO] Stripped object 'nvidia.o'
Sep 15 22:17:24 ip-192-168-24-94.us-west-2.compute.internal driverdog[2516]: 22:17:24 [INFO] Linked object 'nvidia-modeset.o'
Sep 15 22:17:24 ip-192-168-24-94.us-west-2.compute.internal driverdog[2516]: 22:17:24 [INFO] Stripped object 'nvidia-modeset.o'
Sep 15 22:17:24 ip-192-168-24-94.us-west-2.compute.internal driverdog[2516]: 22:17:24 [INFO] Linked nvidia-uvm.ko
Sep 15 22:17:25 ip-192-168-24-94.us-west-2.compute.internal driverdog[2516]: 22:17:25 [INFO] Linked nvidia.ko
Sep 15 22:17:25 ip-192-168-24-94.us-west-2.compute.internal driverdog[2516]: 22:17:25 [INFO] Linked nvidia-modeset.ko
Sep 15 22:17:25 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Finished Link additional kernel modules.

bash-5.1# journalctl -u load-tesla-kernel-modules
Sep 15 22:17:25 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:26 ip-192-168-24-94.us-west-2.compute.internal driverdog[2984]: 22:17:26 [INFO] Updated modules dependencies
Sep 15 22:17:26 ip-192-168-24-94.us-west-2.compute.internal driverdog[2984]: 22:17:26 [INFO] Loaded kernel modules
Sep 15 22:17:26 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Finished Load additional kernel modules.

bash-5.1# journalctl -u copy-open-gpu-kernel-modules
Sep 15 22:17:18 localhost systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:18 localhost ghostdog[2217]: open-gpu is not preferred driver: tesla
Sep 15 22:17:18 localhost systemd[1]: copy-open-gpu-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:18 localhost systemd[1]: Condition check resulted in Link additional kernel modules being skipped.
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal ghostdog[3029]: open-gpu is not preferred driver: tesla
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: copy-open-gpu-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Condition check resulted in Link additional kernel modules being skipped.
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Starting Link additional kernel modules...
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal ghostdog[3056]: open-gpu is not preferred driver: tesla
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: copy-open-gpu-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Condition check resulted in Link additional kernel modules being skipped.

bash-5.1# journalctl -u load-open-gpu-kernel-modules
Sep 15 22:17:18 localhost systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:18 localhost ghostdog[2517]: open-gpu is not preferred driver: tesla
Sep 15 22:17:18 localhost systemd[1]: load-open-gpu-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:18 localhost systemd[1]: Condition check resulted in Load additional kernel modules being skipped.
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal ghostdog[3034]: open-gpu is not preferred driver: tesla
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: load-open-gpu-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:27 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Condition check resulted in Load additional kernel modules being skipped.
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Starting Load additional kernel modules...
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal ghostdog[3097]: open-gpu is not preferred driver: tesla
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: load-open-gpu-kernel-modules.service: Skipped due to 'exec-condition'.
Sep 15 22:17:28 ip-192-168-24-94.us-west-2.compute.internal systemd[1]: Condition check resulted in Load additional kernel modules being skipped.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

/// Given a PCI ID, search the Open GPU Supported Devices File to determine if the Open GPU Driver should be used
fn find_preferred_driver(pci_id: String) -> Result<String> {
// This corresponds to T4 devices which are Turing based and supported, but currently do not show up in the supported-devices file
if pci_id == "10DE:1EB8" {
Contributor Author

I noticed that the g5g instances also don't match up on the supported list https://github.com/NVIDIA/open-gpu-kernel-modules?tab=readme-ov-file#compatible-gpus but are the Turing generation of devices that should support it. I'll do some more testing to see if the driver works on these instances too.

Contributor Author

I was able to get light confirmation from NVIDIA that these devices should use the proprietary driver so I've removed this if statement.

sources/ghostdog/src/main.rs (outdated review thread, resolved)
Comment on lines 21 to 22
[Install]
WantedBy=load-kernel-modules.service
Contributor

This is very racy - systemd will always queue the job immediately, and if it happens to start before ghostdog finishes creating the link, then the condition check won't pass and it will fail to start.

It looks like you're working around that by pushing the job after systemd-tmpfiles-setup.service, but that's a hack.

Zooming out a bit, it seems like you've created two synchronization problems for yourself:

  • the need for the correct config file to be in place before either driverdog link or driverdog load can run
  • the need to copy the .ko files into place before driverdog load runs, in the open GPU case

You can eliminate the second coordination problem here by making it a driverdog config problem instead - having a copy-source field instead of link-objects to tell driverdog where to find the .ko.

I know I pushed back on more driverdog functionality here but I can be wrong sometimes. It would let you delete this unit (and its sync problem) and focus only on the "correct config for driverdog" problem.
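
For illustration only, a rough serde sketch of how a driverdog module-set config might model that suggestion — everything here besides the copy-source and link-objects fields named above is hypothetical, not driverdog's actual schema:

use serde::Deserialize;
use std::collections::HashMap;

// Hypothetical shapes, not driverdog's real schema: a module either lists
// object files to link into a .ko, or names a prebuilt .ko to copy into
// place, so module loading can resolve both cases from config alone.
#[derive(Deserialize)]
#[serde(untagged)]
enum ModuleSource {
    Link {
        #[serde(rename = "link-objects")]
        link_objects: Vec<String>,
    },
    Copy {
        #[serde(rename = "copy-source")]
        copy_source: String,
    },
}

#[derive(Deserialize)]
struct ModulesSet {
    modules: HashMap<String, ModuleSource>,
}

With something along those lines, the separate copy unit (and its synchronization problem) could go away: loading would consult the config and either link or copy before loading.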

Contributor Author

I've resolved this by moving away from udev entirely. We can determine what decision to make at the time the unit file runs. It costs very little to run ghostdog match-nvidia-driver for these invocations, so keeping track of these decisions somewhere else doesn't look to be worth it. Testing on a g3.4xlarge, which should have to crawl the entire file and not make a match (worst-case runtime), is still insignificant compared to the time it takes to link/load the modules:

bash-5.1# time ghostdog match-nvidia-driver tesla

real    0m0.005s
user    0m0.002s
sys     0m0.003s
bash-5.1# time ghostdog match-nvidia-driver open-gpu
open-gpu is not preferred driver: tesla

real    0m0.005s
user    0m0.005s
sys     0m0.000s

I can be more precise in the measurement if needed, but it looks like ghostdog is very fast at providing this decision.

packages/kmod-6.1-nvidia/open-gpu-depmod.conf.in (outdated review thread, resolved)
sources/ghostdog/src/main.rs (outdated review thread, resolved)
@yeazelm yeazelm left a comment

I've added comments here but also started breaking out the changes into smaller PRs since this one is a lot to review. See #118 for the NVIDIA kmod packaging and #119 for driverdog changes. Once those get merged, I'll rebase this PR on top of those and make this significantly smaller to review.

packages/kmod-6.1-nvidia/kmod-6.1-nvidia.spec (outdated review thread, resolved)
sources/ghostdog/src/main.rs (3 outdated review threads, resolved)
sources/driverdog/src/main.rs (2 outdated review threads, resolved)
@yeazelm yeazelm changed the title from "Add Open GPU drivers to kmod-6.1-nvidia and kmod-5.15-nvidia" to "Automatically use Open GPU drivers to kmod-6.1-nvidia and kmod-5.15-nvidia when desired." on Sep 15, 2024

yeazelm commented Sep 15, 2024

^ Just pushed a bunch of changes; this should "reset" the PR to the new approach and remove the code already reviewed.

@yeazelm yeazelm changed the title from "Automatically use Open GPU drivers to kmod-6.1-nvidia and kmod-5.15-nvidia when desired." to "Automatically use Open GPU drivers when desired" on Sep 16, 2024
Comment on lines 4 to 14
After=link-tesla-kernel-modules.service
Requires=link-tesla-kernel-modules.service
# Disable manual restarts to prevent loading kernel modules
# that weren't linked by the running system
RefuseManualStart=true
RefuseManualStop=true

[Service]
Type=oneshot
ExecStart=/usr/bin/driverdog load-modules
ExecCondition=/usr/bin/ghostdog match-nvidia-driver tesla
ExecStart=/usr/bin/driverdog --modules-set nvidia-tesla load-modules
Contributor

Now that these units are so specific to NVIDIA, they don't really fit with the general purpose code in os.

I would either move them to the kmod-*-nvidia packages, or else turn these into unit templates and adjust the logic a bit so we can do something like:

[Unit]
After=link-kernel-modules@%i.service
Requires=link-kernel-modules@%i.service

...

[Service]
Type=oneshot
ExecCondition=/usr/bin/ghostdog match-driver %I
ExecStart=/usr/bin/driverdog --modules-set %I load-modules

Then the kmod-*-nvidia packages could just add symlinks to instantiate the units they supported.

Contributor Author

This makes sense, I'll move them into the kmod packages. I had issues getting templates to block boot, so we will probably opt for explicit files here to keep things straightforward, even if it's a bit of copying the same things over.

Comment on lines 72 to 73
/// The GPU Device Data contains various features of the device. Only Name, Device ID, and Features are required
/// a particular device
Contributor

nit: missing word, guessing ...?

Suggested change
/// The GPU Device Data contains various features of the device. Only Name, Device ID, and Features are required
/// a particular device
/// The GPU Device Data contains various features of the device. Only Name, Device ID, and Features are required
/// for a particular device

Comment on lines 194 to 230
for input_device in present_devices.iter() {
    match open_gpu_devices {
        SupportedDevicesConfiguration::OpenGpu(ref device_list) => {
            let formatted_device_id = format!("0x{}", input_device.device());
            for supported_device in device_list.iter() {
                if supported_device.device_id == formatted_device_id {
                    return Ok("open-gpu".to_string());
                }
            }
        }
    }
}
Contributor

In a system with a large number of GPUs that weren't supported by the open driver, and a JSON file with a large number of different devices, we could end up doing a lot of extra work.

It would be more efficient to transform the open GPU device data into a hash set of device IDs. Then we would guarantee one pass through the list of supported devices, and one lookup per device.

Contributor Author

I tried putting these into HashSets to make sure we don't see negative performance even in cases where there are 16 devices, and the runs were all taking between 5-6ms, both with the Vec implementation as it stands and with HashSet contains(). It stands to reason that HashSets would be better, so I can convert, but I don't think we gain a lot here with the ~700 devices listed in the current file, even in the worst case.

    use std::collections::HashSet;

    // Bogus PCI devices that "miss" to ensure there will be no match
    let present_devices: Vec<ListDevicesOutput> = vec![
        ListDevicesOutput {
            pci_slot: "00:00.0".to_string(),
            class: "0600".to_string(),
            vendor: "8086".to_string(),
            device: "1234".to_string(),
            program_interface: Some("00".to_string()),
            subsystem_vendor: Some("1d0f".to_string()),
            subsystem_device: Some("1237".to_string()),
            ..Default::default()
        };
        16
    ];

    // Reduce down to unique IDs so we only iterate on uniqueness (down from 16 to 1 here)
    let unique_ids: HashSet<String> = present_devices
        .iter()
        .map(|x| format!("0x{}", x.device()))
        .collect();
    match open_gpu_devices {
        SupportedDevicesConfiguration::OpenGpu(ref device_list) => {
            let open_devices_list: HashSet<String> =
                device_list.iter().map(|x| x.device_id.clone()).collect();
            for input_device in unique_ids.iter() {
                if open_devices_list.contains(input_device) {
                    return Ok("open-gpu".to_string());
                }
            }
        }
    }

The compiler might be tricking me here and optimizing as well (I did try creating the Vec of unique objects and still saw the same times for input devices) so I can test a bit more, but even so, 5ms runs for this when parsing the entire file is good considering the kernel module loads will take seconds. We might save more by saving the decision to a file to be checked before rerunning the deserialization and searching since this will be called several times by systemd.
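
If we do end up saving the decision to a file, here is a sketch of that idea, assuming a hypothetical cache path under /run (tmpfs, so the decision is recomputed each boot) — compute_preferred_driver is a stand-in for the PCI scan and supported-devices lookup described in this thread:

use std::fs;
use std::io::ErrorKind;
use std::path::Path;

// Hypothetical cache location; /run is cleared on reboot, so the decision
// is recomputed once per boot and reused across the systemd invocations.
const CACHE: &str = "/run/ghostdog/preferred-driver";

fn preferred_driver() -> std::io::Result<String> {
    match fs::read_to_string(CACHE) {
        Ok(cached) => Ok(cached),
        Err(e) if e.kind() == ErrorKind::NotFound => {
            // First caller this boot: do the full scan, then persist it.
            let decision = compute_preferred_driver();
            if let Some(dir) = Path::new(CACHE).parent() {
                fs::create_dir_all(dir)?;
            }
            fs::write(CACHE, &decision)?;
            Ok(decision)
        }
        Err(e) => Err(e),
    }
}

fn compute_preferred_driver() -> String {
    // Placeholder for the real lookup (list PCI devices, check the
    // open-gpu supported-devices file).
    "tesla".to_string()
}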


yeazelm commented Sep 19, 2024

^ updated to reflect comments. This now holds back open gpu use in some cases where it is known to cause issues. Also moved the kernel module services to the kmod packages.

sources/ghostdog/src/main.rs (3 outdated review threads, resolved)
packages/kmod-5.15-nvidia/kmod-5.15-nvidia.spec (outdated review thread, resolved)
sources/ghostdog/src/main.rs (2 outdated review threads, resolved)

yeazelm commented Sep 23, 2024

^ Pushed a new set of commits; this renames the systemd services to be clearer and fixes the unique-id bug. Validated on a g5.2xlarge, which used the proprietary driver, and a g4d.24xlarge, which chose the open GPU drivers.

ghostdog can now look up PCI devices and confirm which NVIDIA driver
should be used. match-nvidia-driver takes one argument, the type of
driver such as tesla or open-gpu, and exits 0 if that matches the PCI
devices currently present. If those devices do not match the provided
driver, it exits 1. This can be used to set ExecCondition for things
like linking or loading drivers.

Signed-off-by: Matthew Yeazel <[email protected]>
This builds on the present logic for building the open-gpu driver to
provide configuration that driverdog can use to work with the open-gpu drivers.

Signed-off-by: Matthew Yeazel <[email protected]>
This builds on the present logic for building the open-gpu driver to
provide configuration that driverdog can use to work with the open-gpu drivers.

Signed-off-by: Matthew Yeazel <[email protected]>
The os package should not be concerned with the specifics of the NVIDIA
kmod module preferences. os will provide driverdog, but the configuration
files it reads will be provided by the package providing the modules, to
keep that specific logic local to its domain.

Signed-off-by: Matthew Yeazel <[email protected]>
The os package doesn't need to concern itself with NVIDIA specific
loading behavior. It will provide driverdog, but the configurations read
by driverdog will be included with the specific kernel modules package
that provides the drivers described in the configuration.

Signed-off-by: Matthew Yeazel <[email protected]>
The os package doesn't need to concern itself with NVIDIA specific
loading behavior. It will provide driverdog, but the configurations read
by driverdog will be included with the specific kernel modules package
that provides the drivers described in the configuration. This moves the
tesla and open-gpu services into the kmod-5.15-nvidia package instead.

Signed-off-by: Matthew Yeazel <[email protected]>
The os package doesn't need to concern itself with NVIDIA specific
loading behavior. It will provide driverdog, but the configurations read
by driverdog will be included with the specific kernel modules package
that provides the drivers described in the configuration. This moves the
tesla and open-gpu services into the kmod-6.1-nvidia package instead.

Signed-off-by: Matthew Yeazel <[email protected]>

yeazelm commented Sep 23, 2024

^ rebase on develop

let present_devices =
pciclient::list_devices(list_input).context(error::ListPciDevicesSnafu)?;

// If there a multiple devices with the same ID, dedup them to minimize iterations
Contributor

nit:

Suggested change
// If there a multiple devices with the same ID, dedup them to minimize iterations
// If there are multiple devices with the same ID, dedup them to minimize iterations

@yeazelm yeazelm merged commit fdf32c2 into bottlerocket-os:develop Sep 23, 2024
2 checks passed