Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support skip old architecture version GPU settings time slice #58

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wawa0210
Copy link

@wawa0210 wawa0210 commented Jan 24, 2024

/kind feature

fix #41

Flexible judgment based on the GPU architecture series. If the model does not support time slices, ignore the settings to ensure normal operation of the function.

Test Results

Tesla P4 can run demo1 normally
https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/specs/quickstart/gpu-test1.yaml

➜  k8s-dra-driver git:(main) ✗ kubectl get all -n kubectl -n gpu-test1
NAME       READY   STATUS    RESTARTS   AGE
pod/pod2   1/1     Running   0          14m
➜  k8s-dra-driver git:(main) ✗ kubectl -n gpu-test1 exec -it pod2 nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Wed Jan 24 15:58:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:13:00.0 Off |                    0 |
| N/A   27C    P8               7W /  75W |      0MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+


➜  k8s-dra-driver git:(main) ✗ kubectl get ResourceClaim -n gpu-test1 pod2-gpu -o yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  creationTimestamp: "2024-01-24T15:43:31Z"
  finalizers:
  - gpu.resource.nvidia.com/deletion-protection
  name: pod2-gpu
  namespace: gpu-test1
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: pod2
    uid: 09c9eae5-44b2-44ed-ba92-a4212c8edfff
  resourceVersion: "53101"
  uid: 0e2c7b7c-7357-4742-8eb1-101b0c673fd1
spec:
  allocationMode: WaitForFirstConsumer
  resourceClassName: gpu.nvidia.com
status:
  allocation:
    availableOnNodes:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - 172-30-40-100
    shareable: true
  driverName: gpu.resource.nvidia.com
  reservedFor:
  - name: pod2
    resource: pods
    uid: 09c9eae5-44b2-44ed-ba92-a4212c8edfff

@wawa0210 wawa0210 changed the title skip old architecture version GPU settings time slice support skip old architecture version GPU settings time slice Jan 24, 2024
@wawa0210
Copy link
Author

wawa0210 commented Jan 24, 2024

@klueska @elezar

friendly ping
Looking forward to review, especially detactSupportTimeSliceByArch needs to add more architecture and needs a complete support matrix

Can you help me update this form?

arch support timeslice
Fermi unknow
Kepler unknow
Maxwell unknow
Pascal false
Volta unknow
Turing true
Ampere true
Hopper true
Ada true

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

Instead of making this change in:

func (l deviceLib) setTimeSlice(uuids []string, timeSlice int) error {

We could perform the checks in:

func (t *TimeSlicingManager) SetTimeSlice(devices *PreparedDevices, config *nascrd.TimeSlicingConfig) error {

This already checks for MIG devices and adding a loop over the devices to filter out those that do not support timeslicing would be simpler here instead of rediscovering the available devices.

Furthermore, PreparedDevices already stores the brand and other relevant information.

Another quesiton that I have is what we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it? Do we expect an error to be raised, or do we want to continue assuming blocking operation if the GPU is shared?

cmd/nvidia-dra-plugin/nvlib.go Outdated Show resolved Hide resolved
cmd/nvidia-dra-plugin/nvlib.go Outdated Show resolved Hide resolved
@elezar elezar self-assigned this Jan 29, 2024
@klueska
Copy link
Collaborator

klueska commented Jan 29, 2024

Another quesiton that I have is what we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it? Do we expect an error to be raised, or do we want to continue assuming blocking operation if the GPU is shared?

I think we should error out and not start the plugin.

@wawa0210 wawa0210 force-pushed the main branch 2 times, most recently from f08aeb2 to 0dc4244 Compare January 29, 2024 15:17
@wawa0210
Copy link
Author

Another quesiton that I have is what we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it? Do we expect an error to be raised, or do we want to continue assuming blocking operation if the GPU is shared?

I think we should error out and not start the plugin.

Do you mean that a gpu card that does not support settings time slice will not be able to use this function and will error out ?

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes.

It's definitely cleaner to do this at the higher level.

Also note, that we could consider just existing early if timeslicing is not supported by at least one device.

cmd/nvidia-dra-plugin/sharing.go Outdated Show resolved Hide resolved
cmd/nvidia-dra-plugin/sharing.go Outdated Show resolved Hide resolved
cmd/nvidia-dra-plugin/sharing.go Outdated Show resolved Hide resolved
cmd/nvidia-dra-plugin/sharing.go Outdated Show resolved Hide resolved
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the modifications.

It would be good to get @klueska's thoughts on my suggestions too.

cmd/nvidia-dra-plugin/sharing.go Outdated Show resolved Hide resolved
@wawa0210
Copy link
Author

wawa0210 commented Mar 1, 2024

@elezar @klueska friendly ping

@elezar
Copy link
Member

elezar commented Mar 1, 2024

@elezar @klueska friendly ping

@wawa0210 as already requested, it would be better if we return an error if timeslicing is configured on a GPU that doesn't support it. This will align with what is done for MIG.

I understand that crashing the plugin at this stage may not be desireable, but that is a separate issue that we would need to address.

@wawa0210
Copy link
Author

wawa0210 commented Mar 5, 2024

@elezar @klueska friendly ping

@wawa0210 as already requested, it would be better if we return an error if timeslicing is configured on a GPU that doesn't support it. This will align with what is done for MIG.

I understand that crashing the plugin at this stage may not be desireable, but that is a separate issue that we would need to address.

Agree with you, updated,maybe we should discuss this scenario in a separate issue --> #81

@wawa0210
Copy link
Author

wawa0210 commented Apr 3, 2024

@elezar PTAL

@klueska
Copy link
Collaborator

klueska commented Sep 13, 2024

The code base has undergone a major overhaul to accommodate the API changes in Kubernetes v1.31. I imagine this code needs a rebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DRA does not support Tesla P4 model GPUs because it does not support setting time slices by nvidia-smi
3 participants