Document deploying DRA to OpenShift #82
empovit commented on Mar 7, 2024
- Document the differences on OpenShift
- Include useful setup scripts
Thanks @empovit. I have some minor comments / questions here.
OpenShift generally requires more stringent security settings than Kubernetes. If you see a warning about security context constraints when deploying the DRA plugin, pass the following to the Helm chart, either as an inline value or in a values file:

```yaml
kubeletPlugin:
  ...
```
Out of scope: as a matter of interest, does it make sense to make something like this the default on OpenShift?
We could, but this clause is currently in values.yaml. I'm not sure it's possible to use conditional statements there.
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
Should we link these scripts in the README?
I'm not sure. Actually I wasn't sure if the scripts were needed at all. Let's leave it as it is, anybody trying OpenShift will notice them, but it's important that they understand what the scripts do and why.
/cc @cdesiniotis since he has more OpenShift experience than me.
Branch force-pushed from 6bf7bac to c588614.
Branch force-pushed from d5feb81 to 2293f96.
@klueska @elezar @cdesiniotis How can we go forward with this PR?
Things are changing drastically for 1.31 (just as they did for the example driver). Once we have ...
@klueska Hello, I noticed that this pull request adds documentation for deploying DRA on OpenShift. I am currently trying to deploy DRA on an RHOCP cluster but have run into some challenges because documentation is lacking. However, this PR seems to be on hold.
@zhouhao3 I believe the answer is in this comment
Although the procedure in this PR may still work on older versions of OpenShift with the old "classic DRA" version of the NVIDIA DRA driver, it needs revisiting once the new DRA API makes it into OpenShift. We have also discovered some corner cases that this document must address, notably the fact that applying ...
@empovit We are currently setting up an environment with RHOCP and DRA structured parameters, and your insights have been incredibly helpful. Is there a timeline or any plan for when this PR might be merged?
@zhouhao3 Second, an updated version of the PR will target the beta version of DRA rather than DRA structured parameters specifically. Third, for this to happen, we depend on the following factors:
Unfortunately I don't have an ETA for that. Based on my other work related to NVIDIA MIG and DRA, in addition to the documented steps I would recommend setting:

    migManager:
      ...
      env:
        - name: WITH_REBOOT
          value: 'true'
        - name: MIG_PARTED_MODE_CHANGE_ONLY
          value: 'true'
      ...

Assuming you are using DRA with MIG (an older version of the driver), this ensures that the MIG manager only manages the MIG mode and never tries to delete partitions allocated by the NVIDIA DRA driver. Keep in mind, though, that this makes cleaning up any existing MIG partitions your responsibility (before configuring the DRA driver). For example, you can apply ...
@empovit
The code has been updated from classic DRA to structured parameter DRA, and I am looking to deploy structured parameter DRA in an RHOCP cluster. I initially assumed this documentation was for deploying classic DRA on OpenShift and that deploying structured parameter DRA would require an update to it. That is why I mentioned waiting for this PR to be updated and merged, which would mean that structured parameter DRA could be deployed in an OCP cluster.
From your response, I understand that the latest version of OpenShift does not yet support structured parameter DRA, and that support will only be available once the three points you mentioned are completed. Is my understanding correct? If anything is inaccurate, please let me know.
@zhouhao3
See also "What version of the Kubernetes API is included with each OpenShift 4.x release?" These instructions are expected to work with any recent version of OpenShift because they do not depend on the DRA version. Instead, they cover installing the NVIDIA GPU driver on the nodes and preparing the nodes for the NVIDIA DRA driver. Having said that, I am waiting for an OpenShift version with DRA ...
@empovit
Branch force-pushed from 2293f96 to 0c32dfa.
* Document the differences on OpenShift
* Include useful setup scripts

Signed-off-by: Vitaliy Emporopulo <[email protected]>
Branch force-pushed from 0c32dfa to 79ab75c.