-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU sharing on cuda compute capability >=7.5 #231
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Swati Gupta <[email protected]>
58f6bfa
to
86de1cb
Compare
Thanks @guptaNswati. I will need to check how this differs from #58? |
if deviceType.Gpu != nil { | ||
cudaCCv := "v" + strings.TrimPrefix(deviceType.Gpu.cudaComputeCapability, "v") | ||
gpuUUID := deviceType.Gpu.UUID | ||
if semver.Compare(semver.Canonical(cudaCCv), semver.Canonical("v7.5")) >= 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guptaNswati where does the v7.5
threshold come from? In #58 we check for >= v7.0
and for MPS specifically, v3.5
is mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I picked it from our device-plugin code checking if its Volta https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/device.go#L51
// allow devices only with cuda compute compatility >= 7.5 as time slicing and MPS does not work with old arch | ||
shareableAllocatableDevices := make(AllocatableDevices) | ||
for device, deviceType := range allocatableDevices { | ||
if deviceType.Gpu != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we don't timeslice MIG devices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, does it make sense to factor these checks into a function where we can better test the various combinations of options?
@@ -413,7 +431,8 @@ func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.S | |||
if err != nil { | |||
return nil, fmt.Errorf("error getting MPS configuration: %w", err) | |||
} | |||
mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), allocatableDevices) | |||
|
|||
mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), shareableAllocatableDevices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we distinguish between timeslicing-sharable and MPS-sharable devices?
This is to add a check on allowing GPU sharing only when its a CUDA compute capability of 7.5 and higher. It skips both timeslicing and MPS. Referencing these 2 issues and related MR
#41
https://github.com/NVIDIA/cloud-native-team/issues/97
https://github.com/NVIDIA/cloud-native-team/issues/96
Tested on Geforce 980 and Titan