Add support for nvidia MIG #4306

Merged


piyush-jena
Contributor

@piyush-jena piyush-jena commented Nov 19, 2024

Issue number:

Closes #

Description of changes:

  • Added migrations for the following settings and configuration-files:
      • settings.kubelet-device-plugins.nvidia.device-partitioning-strategy
      • settings.kubelet-device-plugins.nvidia.mig
      • configuration-files.nvidia-k8s-device-plugin-mig-conf
  • Rebased with core-kit version bump
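
For readers unfamiliar with how a migration for newly added settings behaves across an upgrade and a rollback, here is a minimal conceptual sketch in Python. It is not the migration code in this PR. It assumes the data store can be modeled as a flat dict of dotted keys, that the upgrade direction leaves the store untouched (the new version's defaults are assumed to fill in the new keys afterwards), and that the downgrade direction removes the new keys and prefixes so the older version never sees settings it does not understand. The names migrate_forward, migrate_backward, NEW_SETTINGS and NEW_PREFIXES are hypothetical.

```python
# Conceptual model only: the flat-dict data store and all names here are
# hypothetical and do not mirror the real migration implementation.
NEW_SETTINGS = [
    "settings.kubelet-device-plugins.nvidia.device-partitioning-strategy",
]
NEW_PREFIXES = [
    "settings.kubelet-device-plugins.nvidia.mig",
    "configuration-files.nvidia-k8s-device-plugin-mig-conf",
]


def migrate_forward(store):
    """Upgrade: return an unchanged copy; new defaults are assumed to be applied later."""
    return dict(store)


def migrate_backward(store):
    """Downgrade: drop every key that the older version does not know about."""
    def is_new(key):
        if key in NEW_SETTINGS:
            return True
        # A prefix matches the key itself and anything nested beneath it.
        return any(key == p or key.startswith(p + ".") for p in NEW_PREFIXES)

    return {k: v for k, v in store.items() if not is_new(k)}


# Keys and values below are taken from the test logs in this PR.
upgraded = {
    "settings.kubelet-device-plugins.nvidia.device-id-strategy": "index",
    "settings.kubelet-device-plugins.nvidia.device-partitioning-strategy": "none",
    "configuration-files.nvidia-k8s-device-plugin-mig-conf.path":
        "/etc/nvidia-migmanager/nvidia-migmanager.toml",
}
downgraded = migrate_backward(upgraded)
print(sorted(downgraded))
```

In this model only the pre-existing device-id-strategy key survives the downgrade, which is the behavior the rollback test in this PR checks for.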

Testing done:

Migration test between v1.32.0 and v1.33.0

v1.32.0

[ssm-user@control]$ apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.32.0 (aws-k8s-1.29-nvidia)"
PRETTY_NAME="Bottlerocket OS 1.32.0 (aws-k8s-1.29-nvidia)"
VARIANT_ID=aws-k8s-1.29-nvidia
VERSION_ID=1.32.0
BUILD_ID=cacc4ce9
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
DOCUMENTATION_URL="https://bottlerocket.dev"
bash-5.1# updog check-update -a --json
[
  {
    "variant": "aws-k8s-1.29-nvidia",
    "arch": "x86_64",
    "version": "1.33.0",
    "max_version": "1.33.0",
    "waves": {
      "0": "2025-02-07T12:26:28.495108532Z",
      "20": "2025-02-07T15:26:28.495108532Z",
      "102": "2025-02-08T11:26:28.495108532Z",
      "307": "2025-02-09T11:26:28.495108532Z",
      "819": "2025-02-11T11:26:28.495108532Z",
      "1228": "2025-02-12T11:26:28.495108532Z",
      "1843": "2025-02-13T11:26:28.495108532Z"
    },
    "images": {
      "boot": "bottlerocket-aws-k8s-1.29-nvidia-x86_64-1.33.0-1fb8b819-dirty-boot.ext4.lz4",
      "root": "bottlerocket-aws-k8s-1.29-nvidia-x86_64-1.33.0-1fb8b819-dirty-root.ext4.lz4",
      "hash": "bottlerocket-aws-k8s-1.29-nvidia-x86_64-1.33.0-1fb8b819-dirty-root.verity.lz4"
    }
  }
]
bash-5.1# updog update -i 1.33.0 -r -n
Starting update to 1.33.0
Reboot scheduled for Fri 2025-02-07 11:37:00 UTC, use 'shutdown -c' to cancel.
Update applied: aws-k8s-1.29-nvidia 1.33.0

After upgrading to v1.33.0 (unrelated lines in the apiclient get configuration-files output are omitted for brevity):


[ssm-user@control]$ apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
[ssm-user@control]$ apiclient get configuration-files
    ****
    "nvidia-k8s-device-plugin-mig-conf": {
      "path": "/etc/nvidia-migmanager/nvidia-migmanager.toml",
      "template-path": "/usr/share/templates/nvidia-k8s-device-plugin-mig-conf"
    },
    ****
bash-5.1# journalctl -u nvidia-migmanager
Feb 07 11:36:39 ip-172-31-0-165.us-west-2.compute.internal systemd[1]: Starting NVIDIA MIG manager service...
Feb 07 11:36:39 ip-172-31-0-165.us-west-2.compute.internal nvidia-migmanager[1394]: 11:36:39 [INFO] nvidia-migmanager started
Feb 07 11:36:39 ip-172-31-0-165.us-west-2.compute.internal nvidia-migmanager[1394]: 11:36:39 [INFO] Fetching GPU devices data ...
Feb 07 11:36:39 ip-172-31-0-165.us-west-2.compute.internal nvidia-migmanager[1394]: 11:36:39 [WARN] Found NVIDIA Device but couldn't confirm variant.
Feb 07 11:36:39 ip-172-31-0-165.us-west-2.compute.internal systemd[1]: Finished NVIDIA MIG manager service.
bash-5.1# signpost rollback-to-inactive

After rolling back to v1.32.0 (unrelated lines in the apiclient get configuration-files output are omitted for brevity):


[ssm-user@control]$ apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
[ssm-user@control]$ apiclient get configuration-files
    ****  
    ****
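
The three apiclient get settings.kubelet-device-plugins.nvidia outputs above can also be compared mechanically rather than by eye. The following is a minimal sketch, assuming each JSON output has been captured as a string; the helper name nvidia_settings is hypothetical, and the values are copied from the logs above.

```python
import json


def nvidia_settings(raw):
    """Extract the nvidia device-plugin settings from apiclient JSON output."""
    return json.loads(raw)["settings"]["kubelet-device-plugins"]["nvidia"]


# Output captured on v1.32.0 (identical before the upgrade and after the rollback).
V1_32_0 = """
{"settings": {"kubelet-device-plugins": {"nvidia": {
  "device-id-strategy": "index",
  "device-list-strategy": "volume-mounts",
  "device-sharing-strategy": "none",
  "pass-device-specs": true
}}}}
"""

# Output captured after the upgrade to v1.33.0.
V1_33_0 = """
{"settings": {"kubelet-device-plugins": {"nvidia": {
  "device-id-strategy": "index",
  "device-list-strategy": "volume-mounts",
  "device-partitioning-strategy": "none",
  "device-sharing-strategy": "none",
  "pass-device-specs": true
}}}}
"""

before = nvidia_settings(V1_32_0)
after = nvidia_settings(V1_33_0)
rolled_back = nvidia_settings(V1_32_0)

added = set(after) - set(before)      # keys introduced by the upgrade
removed = set(before) - set(after)    # keys lost by the upgrade
changed = {k for k in set(before) & set(after) if before[k] != after[k]}

print("added:", sorted(added))
print("removed:", sorted(removed))
print("changed:", sorted(changed))
print("rollback restores original:", rolled_back == before)
```

A check like this makes it explicit that the upgrade only adds device-partitioning-strategy and that the rollback returns the settings to their original state.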

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@piyush-jena piyush-jena marked this pull request as draft November 19, 2024 13:11
@piyush-jena piyush-jena force-pushed the mig-feature-migrations branch 2 times, most recently from c881771 to 8a8bbe6 Compare January 15, 2025 21:04
@piyush-jena piyush-jena force-pushed the mig-feature-migrations branch 3 times, most recently from 43e460b to 45a18af Compare January 21, 2025 22:45
Contributor

@bcressey bcressey left a comment


New defaults and migrations look good to me.

@piyush-jena piyush-jena force-pushed the mig-feature-migrations branch from 45a18af to 7fb5ff2 Compare February 7, 2025 10:30
@piyush-jena piyush-jena marked this pull request as ready for review February 7, 2025 11:49
@piyush-jena piyush-jena force-pushed the mig-feature-migrations branch 2 times, most recently from 5e1c0e0 to d634d70 Compare February 7, 2025 18:55
@piyush-jena piyush-jena force-pushed the mig-feature-migrations branch from d634d70 to 4e8383e Compare February 7, 2025 22:00
@piyush-jena piyush-jena merged commit 7a4c242 into bottlerocket-os:develop Feb 7, 2025
2 checks passed