Kit changes for Nvidia settings APIs #48

Merged on Aug 1, 2024 (1 commit)

@monirul monirul commented Jul 23, 2024

Issue number:

Closes #

Description of changes:
This PR contains the changes required to support the Nvidia settings APIs: settings.nvidia-container-runtime and settings.kubernetes.nvidia.

Testing done:
Yes.

bash-5.1# apiclient set settings.kubernetes.device-plugins.nvidia.pass-device-specs=true
bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: volume-mounts
    deviceIDStrategy: index
bash-5.1# apiclient set settings.kubernetes.device-plugins.nvidia.device-list-strategy=envvar
bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: envvar
    deviceIDStrategy: index
bash-5.1# apiclient set settings.kubernetes.device-plugins.nvidia.device-id-strategy=uuid
bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

bash-5.1# cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
bash-5.1# apiclient set settings.nvidia-container-runtime.visible-devices-as-volume-mounts=false
bash-5.1# apiclient set settings.nvidia-container-runtime.visible-devices-envvar-when-unprivileged=true
bash-5.1# cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
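
The runtime settings exercised above appear to map onto the top-level keys of config.toml by prefixing the setting name with accept-nvidia-. A minimal Python sketch of that mapping follows, for illustration only. The function name render_runtime_toml and the dict-based input are hypothetical, the defaults are assumed from the config.toml contents shown before any apiclient call, and the real file is produced by the template rendering system rather than by code like this.

```python
# Illustrative model only; this is not Bottlerocket code.
# Assumed defaults: the values shown in config.toml before any `apiclient set`.
ASSUMED_DEFAULTS = {
    "visible-devices-as-volume-mounts": True,
    "visible-devices-envvar-when-unprivileged": False,
}


def render_runtime_toml(settings=None):
    """Return the top-level config.toml lines for a settings dict (hypothetical helper)."""
    merged = {**ASSUMED_DEFAULTS, **(settings or {})}
    lines = []
    for name in ASSUMED_DEFAULTS:  # iterate the defaults to keep a stable key order
        value = "true" if merged[name] else "false"  # TOML booleans are lowercase
        lines.append(f"accept-nvidia-{name} = {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Mirrors the two `apiclient set` calls from the test log above.
    print(render_runtime_toml({
        "visible-devices-as-volume-mounts": False,
        "visible-devices-envvar-when-unprivileged": True,
    }))
```

Calling render_runtime_toml() with no argument models the state before any setting is changed.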

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

  plugin:
    passDeviceSpecs: {{default true settings.kubernetes.nvidia.device-plugin.pass-device-specs}}
    deviceListStrategy: "{{default "volume-mounts" settings.kubernetes.nvidia.device-plugin.device-list-strategy}}"
    deviceIDStrategy: "{{default "index" settings.kubernetes.nvidia.device-plugin.device-id-strategy}}"
Contributor

Will this result in e.g. "volume-mounts" being double-quoted if the setting is unset?

Contributor Author

yes.
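
To make the quoting question concrete, here is a small Python model of the three template lines under review. It is a sketch under stated assumptions: the default function below is a hypothetical stand-in for the template helper and is assumed to emit the bare value with no quotes of its own, so the only quotes in the output are the ones written in the template. Whether the real helper behaves this way is not established in this thread.

```python
def default(fallback, value):
    """Hypothetical stand-in for the template `default` helper: fall back when unset."""
    return fallback if value is None else value


def render_plugin_flags(settings):
    """Model the `plugin:` block of settings.yaml for a dict of optional settings."""
    pass_specs = default(True, settings.get("pass-device-specs"))
    list_strategy = default("volume-mounts", settings.get("device-list-strategy"))
    id_strategy = default("index", settings.get("device-id-strategy"))
    return "\n".join([
        "  plugin:",
        # YAML booleans are written lowercase and unquoted.
        f"    passDeviceSpecs: {'true' if pass_specs else 'false'}",
        # The quotes here come from the template text, not from the helper.
        f'    deviceListStrategy: "{list_strategy}"',
        f'    deviceIDStrategy: "{id_strategy}"',
    ])


if __name__ == "__main__":
    print(render_plugin_flags({}))  # every setting unset
    print(render_plugin_flags({"device-list-strategy": "envvar"}))
```

Under that assumption an unset value renders with a single pair of quotes, which YAML reads as the same string as the unquoted form.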

Comment on lines 9 to 7
ExecStart=/usr/bin/nvidia-device-plugin --device-list-strategy volume-mounts --device-id-strategy index --pass-device-specs=true
ExecStart=/usr/bin/nvidia-device-plugin --config-file=/etc/nvidia-k8s-device-plugin/settings.yaml
Contributor

You mentioned when chatting that we put a default configuration file in place to keep compatibility with downstream variants that won't add the device plugin settings. That makes sense.

An alternative to overwriting a config tempfile could be to place a systemd drop-in that overrides ExecStart based on settings changes. This file would remain the same, but a template could be rendered to /etc/systemd/system/nvidia-k8s-device-plugin.service.d/exec-start, adding the alternative line with a reference to the rendered config file.

I know we do something like this for kubelet, though I'm not sure if there are compelling reasons to prefer one to the other. I suppose anything that gets rendered to /etc consumes system memory, so perhaps it's slightly nicer to avoid doing that if we can.

Contributor Author

changed as suggested

@monirul monirul force-pushed the nvidia-api-kit-changes branch 5 times, most recently from d958a6d to 076025b, on July 29, 2024 22:52
@monirul monirul changed the title from "[DRAFT] Kit changes for Nvidia settings APIs" to "Kit changes for Nvidia settings APIs" on Jul 29, 2024
@@ -1 +1 @@
-C /etc/nvidia-container-runtime/config.toml - - - - /usr/share/factory/nvidia-container-runtime/nvidia-container-toolkit-config-k8s.toml
+d /etc/nvidia-container-runtime - - - - -
Contributor

This shouldn't need to be created by tmpfiles.d because the template rendering system will make the directory on our behalf.

Contributor Author

removed the tmpfiles as suggested.

Comment on lines +4 to +5
[Service]
{{#if settings.kubernetes.device-plugins.nvidia}}
Contributor

Did you test that this works correctly on instances without an nvidia device plugin set?

Contributor Author

Yes, I tested it. If settings.kubernetes.device-plugins.nvidia is not defined, this file is rendered as

[Service]

which is effectively ignored, since it contains no overriding settings. It works as expected.

Contributor

This feels weird: to someone just seeing this file on the system, it would appear to be a failure. Can we add a comment in an else branch saying it was deliberately left this way, or something similar? Would it not be better to remove the [Service] header as well?
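
The two outcomes discussed in this thread can be modeled with a short Python sketch. It is illustrative only: render_dropin is a hypothetical function, the explanatory comment text in the unset branch is invented to show the reviewer's suggestion, and the empty ExecStart= line is an assumption about what the drop-in would need, since systemd requires clearing the inherited ExecStart before a drop-in can replace it for a non-oneshot service. The actual template in the PR may differ.

```python
# Path taken from the test output in the PR description.
CONFIG_PATH = "/etc/nvidia-k8s-device-plugin/settings.yaml"


def render_dropin(nvidia_settings_defined):
    """Model the rendered drop-in for the two cases (hypothetical helper)."""
    lines = ["[Service]"]
    if nvidia_settings_defined:
        # An empty assignment resets the ExecStart inherited from the main unit.
        lines.append("ExecStart=")
        lines.append(
            f"ExecStart=/usr/bin/nvidia-device-plugin --config-file={CONFIG_PATH}"
        )
    else:
        # Invented wording: explain why the section is empty so it does not
        # look like a rendering failure.
        lines.append(
            "# Intentionally empty: no NVIDIA device plugin settings are defined,"
        )
        lines.append("# so the ExecStart from the main unit file is used.")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dropin(True))
    print(render_dropin(False))
```

systemd ignores comment lines in unit files, so the second form behaves the same as a bare [Service] header while documenting the intent.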

@@ -0,0 +1,2 @@
d /etc/nvidia-k8s-device-plugin - - - - -
Contributor

I believe you could similarly drop the tmpfiles requirement here since both files are templates.

Contributor Author

deleted as suggested.

@monirul monirul merged commit e48f8b8 into bottlerocket-os:develop Aug 1, 2024
2 checks passed
@monirul monirul mentioned this pull request Aug 1, 2024