Update logic to set environment for calls out to nvidia-smi
Previously, we were setting the envvars needed for nvidia-smi in the plugin's
environment itself. This caused errors, however, when shelling out to other
binaries that didn't need these envvars set.

This change pushes the setting of the envvars into the actual exec call for
nvidia-smi instead of setting them in the plugin's environment itself.

Closes NVIDIA#106

Signed-off-by: Kevin Klues <[email protected]>

Bump golangci/golangci-lint-action from 4 to 6

Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 4 to 6.
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v4...v6)

---
updated-dependencies:
- dependency-name: golangci/golangci-lint-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

Add a demo svg file showing the basic DRA use

Update demo

Adjust demo svg format

Update svg format

Update the description of demo.

update

Signed-off-by: Yuan Chen <[email protected]>
Signed-off-by: Yuan Chen <[email protected]>

Update the demo svg description

Signed-off-by: Yuan Chen <[email protected]>

Update the svg demo

Signed-off-by: Yuan Chen <[email protected]>

Remove duplicated info.

Signed-off-by: Yuan Chen <[email protected]>

Clean up

Signed-off-by: Yuan Chen <[email protected]>

Add hostPID to MPS daemon template

Without this, the MPS server was not able to find its own PID via /proc/self
and was failing to start. It's unclear why this wasn't needed previously, but
it makes sense why adding hostPID would solve this.

Signed-off-by: Kevin Klues <[email protected]>

Add basic examples for Linux workstations

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README

Signed-off-by: Yuan Chen <[email protected]>

Update README.md

Signed-off-by: Yuan Chen <[email protected]>

Remove the timeslicing example

Add restartPolicy to examples

Signed-off-by: Yuan Chen <[email protected]>

Update demo files

Signed-off-by: Yuan Chen <[email protected]>
klueska authored and yuanchen8911 committed May 15, 2024
1 parent 917e1ce commit 655f0c8
Showing 15 changed files with 986 additions and 22 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/golang.yaml
@@ -40,7 +40,7 @@ jobs:
         with:
           go-version: ${{ env.GOLANG_VERSION }}
       - name: Lint
-        uses: golangci/golangci-lint-action@v4
+        uses: golangci/golangci-lint-action@v6
         with:
           version: latest
           args: -v --timeout 5m
16 changes: 12 additions & 4 deletions README.md
@@ -35,7 +35,7 @@ First since we'll launch kind with GPU support, ensure that the following prereq
 sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
 ```

-1. Show the current set of GPUs on the machine
+1. Show the current set of GPUs on the machine:
 ```console
 nvidia-smi -L
 ```
@@ -53,6 +53,15 @@ cd k8s-dra-driver
 ```

 ### Setting up the infrastructure
+
+Here's a demo showing how to install and configure DRA, and run a pod in a `kind` cluster on a Linux workstation.
+
+<p align="center">
+<img width="800" src="./demo/specs/quickstart/basic-demo.svg">
+</p>
+
+Below are the detailed, step-by-step instructions.
+
 First, create a `kind` cluster to run the demo:
 ```console
 ./demo/clusters/kind/create-cluster.sh
@@ -88,7 +97,7 @@ The `README` in that directory shows the full script of the demo you can walk th
 cat demo/specs/quickstart/README.md
 ```

-Deploy the example pods in the demo directory.
+Deploy the example pods in the demo directory:
 ```console
 kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
 ```
@@ -130,11 +139,10 @@ GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)

 ### Cleaning up the environment

-Running
+Remove the cluster created in the preceding steps:
 ```console
 ./demo/clusters/kind/delete-cluster.sh
 ```
-will remove the cluster created in the preceding steps.

 <!--
 TODO: This README should be extended with additional content including:
51 changes: 34 additions & 17 deletions cmd/nvidia-dra-plugin/nvlib.go
@@ -31,8 +31,9 @@ import (

 type deviceLib struct {
 	nvdev.Interface
-	nvmllib       nvml.Interface
-	nvidiaSMIPath string
+	nvmllib           nvml.Interface
+	driverLibraryPath string
+	nvidiaSMIPath     string
 }

 func newDeviceLib(driverRoot root) (*deviceLib, error) {
@@ -46,32 +47,40 @@ func newDeviceLib(driverRoot root) (*deviceLib, error) {
 		return nil, fmt.Errorf("failed to locate nvidia-smi: %w", err)
 	}

-	// In order for nvidia-smi to run, we need to set the PATH to the parent of
-	// the nvidia-smi executable and update LD_PRELOAD to include the path to
-	// libnvidia-ml.so.1
-	updatePathListEnvvar("LD_PRELOAD", driverLibraryPath)
-	updatePathListEnvvar("PATH", filepath.Dir(nvidiaSMIPath))
-
 	// We construct an NVML library specifying the path to libnvidia-ml.so.1
 	// explicitly so that we don't have to rely on the library path.
 	nvmllib := nvml.New(
 		nvml.WithLibraryPath(driverLibraryPath),
 	)
 	d := deviceLib{
-		Interface:     nvdev.New(nvdev.WithNvml(nvmllib)),
-		nvmllib:       nvmllib,
-		nvidiaSMIPath: nvidiaSMIPath,
+		Interface:         nvdev.New(nvdev.WithNvml(nvmllib)),
+		nvmllib:           nvmllib,
+		driverLibraryPath: driverLibraryPath,
+		nvidiaSMIPath:     nvidiaSMIPath,
 	}
 	return &d, nil
 }

-// updatePathListEnvvar prepends a specified list of strings to a specified envvar.
-func updatePathListEnvvar(envvar string, prepend ...string) {
+// prependPathListEnvvar prepends a specified list of strings to a specified envvar and returns its value.
+func prependPathListEnvvar(envvar string, prepend ...string) string {
 	if len(prepend) == 0 {
-		return
+		return os.Getenv(envvar)
 	}
 	current := filepath.SplitList(os.Getenv(envvar))
-	os.Setenv(envvar, strings.Join(append(prepend, current...), string(filepath.ListSeparator)))
+	return strings.Join(append(prepend, current...), string(filepath.ListSeparator))
 }
+
+// setOrOverrideEnvvar adds or updates an envvar in the list of specified envvars and returns the updated list.
+func setOrOverrideEnvvar(envvars []string, key, value string) []string {
+	var updated []string
+	for _, envvar := range envvars {
+		pair := strings.SplitN(envvar, "=", 2)
+		if pair[0] == key {
+			continue
+		}
+		updated = append(updated, envvar)
+	}
+	return append(updated, fmt.Sprintf("%s=%s", key, value))
+}

 func (l deviceLib) Init() error {
@@ -481,10 +490,14 @@ func walkMigDevices(d nvml.Device, f func(i int, d nvml.Device) error) error {
 func (l deviceLib) setTimeSlice(uuids []string, timeSlice int) error {
 	for _, uuid := range uuids {
 		cmd := exec.Command(
-			"nvidia-smi",
+			l.nvidiaSMIPath,
 			"compute-policy",
 			"-i", uuid,
 			"--set-timeslice", fmt.Sprintf("%d", timeSlice))
+
+		// In order for nvidia-smi to run, we need to update LD_PRELOAD to include the path to libnvidia-ml.so.1.
+		cmd.Env = setOrOverrideEnvvar(os.Environ(), "LD_PRELOAD", prependPathListEnvvar("LD_PRELOAD", l.driverLibraryPath))
+
 		output, err := cmd.CombinedOutput()
 		if err != nil {
 			klog.Errorf("\n%v", string(output))
@@ -497,9 +510,13 @@ func (l deviceLib) setTimeSlice(uuids []string, timeSlice int) error {
 func (l deviceLib) setComputeMode(uuids []string, mode string) error {
 	for _, uuid := range uuids {
 		cmd := exec.Command(
-			"nvidia-smi",
+			l.nvidiaSMIPath,
 			"-i", uuid,
 			"-c", mode)
+
+		// In order for nvidia-smi to run, we need to update LD_PRELOAD to include the path to libnvidia-ml.so.1.
+		cmd.Env = setOrOverrideEnvvar(os.Environ(), "LD_PRELOAD", prependPathListEnvvar("LD_PRELOAD", l.driverLibraryPath))
+
 		output, err := cmd.CombinedOutput()
 		if err != nil {
 			klog.Errorf("\n%v", string(output))
2 changes: 2 additions & 0 deletions demo/specs/quickstart/README.md
@@ -1,3 +1,5 @@
+You can run basic examples on a Linux desktop by following the instructions in the [desktop folder](desktop/README.md) as well.
+
 #### Show current state of the cluster
 ```console
 kubectl get pod -A
1 change: 1 addition & 0 deletions demo/specs/quickstart/basic-demo.svg