Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document deploying DRA to OpenShift #82

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

empovit
Copy link
Contributor

@empovit empovit commented Mar 7, 2024

  • Document the differences on OpenShift
  • Include useful setup scripts

@empovit empovit marked this pull request as ready for review March 7, 2024 13:13
@elezar elezar self-requested a review March 7, 2024 13:20
@elezar elezar self-assigned this Mar 7, 2024
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @empovit. I have some minor comments / questions here.

demo/clusters/openshift/README.md Outdated Show resolved Hide resolved
demo/clusters/openshift/README.md Outdated Show resolved Hide resolved
OpenShift generally requires more stringent security settings than Kubernetes. If you see a warning about security context constraints when deploying the DRA plugin, pass the following to the Helm chart, either via an in-line variable or a values file:

```yaml
kubeletPlugin:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of scope: As a matter of interest, does it make sense to make something like this the default on Openshift?

Copy link
Contributor Author

@empovit empovit Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but this clause is currently in values.yaml. I'm not sure it's possible to use conditional statements there.

demo/clusters/openshift/README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we link these scripts in the README?

Copy link
Contributor Author

@empovit empovit Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Actually I wasn't sure if the scripts were needed at all. Let's leave it as it is, anybody trying OpenShift will notice them, but it's important that they understand what the scripts do and why.

@elezar elezar requested a review from cdesiniotis March 7, 2024 13:28
@elezar
Copy link
Member

elezar commented Mar 7, 2024

/cc @cdesiniotis since he has more OpenShift experience than me.

@empovit empovit force-pushed the openshift-doc branch 2 times, most recently from 6bf7bac to c588614 Compare March 7, 2024 14:30
@empovit empovit marked this pull request as draft June 20, 2024 14:59
@empovit empovit force-pushed the openshift-doc branch 2 times, most recently from d5feb81 to 2293f96 Compare June 20, 2024 15:44
@empovit empovit marked this pull request as ready for review July 3, 2024 14:23
@empovit empovit marked this pull request as draft July 11, 2024 10:02
@kannon92
Copy link

kannon92 commented Sep 6, 2024

@klueska @elezar @cdesiniotis How can we go forward with this PR?

@klueska
Copy link
Collaborator

klueska commented Sep 6, 2024

Things are changing drastically for 1.31 (just as they did for the example driver). Once we have main updated with all of the changes for 1.31, we can revisit this.

@zhouhao3
Copy link

zhouhao3 commented Nov 7, 2024

@klueska Hello, I noticed that this pull request aims to add documentation for deploying DRA on Openshift. I am currently attempting to deploy DRA on an RHOCP cluster but have encountered some challenges due to the lack of available documentation. But this PR seems to be on hold.
If there are any known issues or limitations with deploying DRA on RHOCP at the moment? Should I switch to a certain version or wait for something to be fixed? Any guidance or updates on this documentation would be greatly appreciated. Thanks.

@zhouhao3
Copy link

zhouhao3 commented Dec 3, 2024

@elezar @klueska Hi, I wanted to kindly follow up on my previous question regarding this PR. If you have any updates or need further clarification from my side, please let me know. Your feedback would be greatly appreciated.

@empovit
Copy link
Contributor Author

empovit commented Dec 4, 2024

@zhouhao3 I believe the answer is in this comment

Things are changing drastically for 1.31 (just as they did for the example driver). Once we have main updated with all of the changes for 1.31, we can revisit this.

Although the procedure in this PR may still work on older versions of OpenShift with the old "classic DRA" version of the NVIDA DRA driver, it needs revisiting after the new DRA API makes it into OpenShift.

Also, we have discovered some corner cases that must be addressed in this document, notably the fact that applying nvidia.com/mig.config=all-enabled may lead to undesirable wiping out of MIG partitions created by the DRA plugin - if the MIG manager pod crashes and is then re-created, for example.

@zhouhao3
Copy link

zhouhao3 commented Dec 5, 2024

@empovit
Thank you so much for your detailed response . I truly appreciate the time and effort you've put into addressing my queries.

We are currently in the process of setting up an environment with rhocp + DRA structured parameters, and your insights have been incredibly helpful. I wanted to ask if there is a timeline or any plans for when this PR might be merged.

@empovit
Copy link
Contributor Author

empovit commented Dec 5, 2024

@zhouhao3
First, I don't think you need this PR merged to follow the steps. I mean if they work for you, just use them.

Second, an updated version of the PR will not specifically target DRA structured parameters, but the beta version of DRA.

Third, for this to happen, we depend on the following factors:

  • The upstream DRA 1.32 API is merged into OpenShift
  • The NVIDIA DRA driver fully supports DRA 1.32 API
  • The updated NVIDIA DRA driver is tested on an OpenShift version with DRA 1.32

Unfortunately I don't have an ETA for that.

As far as I can tell from my other work related to NVIDIA MIG and DRA, in addition to the documented steps I would recommend setting migManager.env.MIG_PARTED_MODE_CHANGE_ONLY to true:

  migManager:
    ...
    env:
      - name: WITH_REBOOT
        value: 'true'
      - name: MIG_PARTED_MODE_CHANGE_ONLY
        value: 'true'
    ...

Assuming you are using DRA with MIG (an older version of the driver), this will make sure that the MIG manager only takes care of the MIG mode and never tries to delete partitions allocated by the NVIDIA DRA driver. Keep in mind though that it makes cleaning up any existing MIG partitions your responsibility (prior to configuring the DRA driver). E.g. you can apply nvidia.com/mig.config=all-enabled, wait for it to succeed, and then add the MIG_PARTED_MODE_CHANGE_ONLY env variable to the cluster policy.

@zhouhao3
Copy link

zhouhao3 commented Dec 6, 2024

@empovit
Thank you for your patience and professional explanations. I realize that I may have had some misunderstandings and unclear expressions earlier.

First, I don't think you need this PR merged to follow the steps. I mean if they work for you, just use them.

The code has been updated from classic DRA to structured parameter DRA. I am looking to deploy the structured parameter DRA in an RHOCP cluster. Initially, I thought the documentation was for deploying the classic DRA in OpenShift, and that deploying the structured parameter DRA might require an update to this document. That's why I mentioned waiting for the update and merge of this PR, which would mean that structured parameter DRA could be deployed in an OCP cluster.

Third, for this to happen, we depend on the following factors:

From your response, I understand that the current latest version of OpenShift does not yet support structured parameter DRA. It seems that support will only be available after the completion of the three points you mentioned. Is my understanding correct?

If there are any inaccuracies in my understanding, please let me know.

@empovit
Copy link
Contributor Author

empovit commented Dec 6, 2024

@zhouhao3
Let me try to clarify the DRA API versions in OpenShift:

OpenShift version K8s API version DRA version
4.16 1.29 v1alpha2
4.17 1.30 v1alpha2 with structured parameters
4.?? (probably 4.18) 1.31 v1alpha3
4.?? 1.32 v1beta1

See also What version of the Kubernetes API is included with each OpenShift 4.x release?

These instructions are expected to work with any recent version of OpenShift because they have nothing to do with the DRA version, but with having the NVIDIA GPU driver on the nodes, and setting them up for the NVIDIA DRA driver. Having said that, I am waiting for an OpenShift version with DRA v1beta1 and an NVIDIA DRA driver that supports that version to re-verify and merge this PR, as v1beta1 will be the first stable version of DRA.

@zhouhao3
Copy link

zhouhao3 commented Dec 9, 2024

@empovit
I understand now! Your help is greatly appreciated.

@empovit
Copy link
Contributor Author

empovit commented Dec 22, 2024

This PR will need updating according to #211 and #212 when they are merged.

driverRootCtrPath: /run/nvidia/driver
nvidiaCDIHookPath: /var/usrlocal/nvidia/toolkit/nvidia-cdi-hook

* Document the differences on OpenShift
* Include useful setup scripts

Signed-off-by: Vitaliy Emporopulo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants