gpu: study and implement a way to limit Pod's i915 count #1408
K8s supports limit ranges: https://kubernetes.io/docs/concepts/policy/limit-range/ @uniemimu thought that it should support extended resources in addition to core ones, but somebody needs to check (also) whether it actually works when one omits default and min values (i.e. whether it allows pods that do not request a GPU):
@uMartinXu can you try whether that does what you wanted?
Hi @eero-t, I tried LimitRange. It supports extended resources.
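For reference, the kind of namespaced LimitRange tested here could look roughly like the following sketch. `gpu.intel.com/i915` is the extended resource name advertised by the Intel GPU device plugin; the object and namespace names are hypothetical:

```yaml
# Sketch: cap the i915 extended resource at 1 per container
# in one namespace. Names here are illustrative only.
apiVersion: v1
kind: LimitRange
metadata:
  name: i915-limit          # hypothetical name
  namespace: gpu-workloads  # hypothetical namespace
spec:
  limits:
  - type: Container
    max:
      gpu.intel.com/i915: "1"
```

With only `max` set, Pods requesting more than one i915 in a container should be rejected at admission time in that namespace.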
Thanks for testing, good to hear that the (max) limiting part works! Will it still try to add GPU resource requests to non-GPU pods if you add a min value to the LimitRange?
Yes, I added a min limit. It still adds an i915 resource request and limit to non-GPU pods. I understand it's because of the default limit and default request, is that right? Irrespective of what default value is set, it adds that to non-GPU pods.
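The injection behavior described here comes from LimitRange's `default` (default limit) and `defaultRequest` fields, which the admission controller applies to containers that don't specify the resource themselves. A hedged sketch of a LimitRange combining them with `min`/`max` (values illustrative):

```yaml
# Sketch: LimitRange with min/max plus defaults.
# The default/defaultRequest values get injected into
# containers that omit the resource, which is the
# behavior observed above for non-GPU pods.
apiVersion: v1
kind: LimitRange
metadata:
  name: i915-limit-with-defaults  # hypothetical name
spec:
  limits:
  - type: Container
    max:
      gpu.intel.com/i915: "1"
    min:
      gpu.intel.com/i915: "0"
    default:               # default limit, injected if omitted
      gpu.intel.com/i915: "0"
    defaultRequest:        # default request, injected if omitted
      gpu.intel.com/i915: "0"
```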
Do you mean that even if you specify …? That sounds like a bug which should be filed to upstream Kubernetes. Either as "LimitRange does not respect specified default value", or "LimitRange adds resource request even when default and min resource requests are specified as zero".
If default values are added to the LimitRange, they are added to non-GPU pods. If no default values are added to the LimitRange, it allocates the values below to non-GPU pods. Is this expected?
So if you specify default as …?
Yes, works as expected. If default is …
With max and min in the LimitRange, GPU pods requesting more than 1 of the resource get a Forbidden error.
@uniemimu You've been looking more into the scheduler. Do you see any practical problem with a zero extended resource request being added to non-GPU pods? I.e. does it just look funny, but work fine in practice?
Also, as LimitRange is namespace-scoped, we only add it in the namespaces where workloads are deployed, right?
Yes, to namespaces where your cluster's RBAC rules allow the given k8s users access to GPU resources.
Got it, thanks.
Apart from this, can we consider LimitRange a sufficient solution? Then we could add the YAML to the project's 1.0.0 GA as a deployment step after the GPU device plugin is created, for publishing the certified operator on OCP 4.12.
I'm also assuming this would be an operator configuration option for whether multi-GPU jobs are allowed. But does the operator component know which namespaces in a given cluster are allowed to access GPU resources? Aren't such RBAC rules rather cluster-specific?
For OpenShift, there might be some roles to not deploy workloads/objects in …
On OpenShift these namespaces: …
Any other possible solutions you would suggest trying, apart from LimitRange?
While writing a separate webhook for this is a possibility, they are nasty, and LimitRange already seems to be explicitly designed for this. It just needs to support a zero minimum value better (i.e. not add a request for a zero resource). Or do you think it should also have an option for the limit being cluster-wide instead of namespace-specific?
Thanks, agree with LimitRange. Workloads can be deployed to a specific namespace for individual GPU access. @uMartinXu any thoughts?
The limit of 1 i915 should be enforced on the whole cluster scope, not only applied to a specific namespace.
There are a few alternatives for achieving cluster-wide GPU count limits:
GAS would be a possible place for limiting the i915 resource requests, but that would then require using GAS in general.
I doubt that this is an option as it's quite late in the Pod scheduling flow. I tried returning an error for the …
According to Stack Overflow, this can already be done by using Kyverno: https://stackoverflow.com/questions/73488971/how-can-i-apply-limit-range-to-all-namespaces-in-kubernetes which is "a policy engine to validate, mutate, generate, and cleanup Kubernetes resources, and verify image signatures and artifacts to help secure the software supply chain". (LimitRange should really first be improved not to add zero resource requests to every pod, though.)
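A hedged sketch of the Kyverno approach from that Stack Overflow answer: a ClusterPolicy with a `generate` rule that stamps a LimitRange into every namespace as it is created. The policy and object names are hypothetical:

```yaml
# Sketch: generate a per-namespace LimitRange cluster-wide
# via Kyverno. Assumes Kyverno is installed in the cluster.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-i915-limitrange   # hypothetical name
spec:
  rules:
  - name: generate-i915-limitrange
    match:
      any:
      - resources:
          kinds:
          - Namespace
    generate:
      synchronize: true       # keep generated objects in sync
      apiVersion: v1
      kind: LimitRange
      name: i915-limit        # hypothetical name
      namespace: "{{request.object.metadata.name}}"
      data:
        spec:
          limits:
          - type: Container
            max:
              gpu.intel.com/i915: "1"
```

This trades the namespace-scoped limitation of LimitRange for a dependency on an extra policy engine, so it is more of a workaround than a native cluster-wide limit.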
I just noticed … That could also be experimented with, to see whether it works any better than …
In #1377 it was identified that some clusters would need to limit a Pod's i915 resource count to 1 (or some other value). The idea is to allow setting shared-dev-num to >1 while preventing users from accessing more GPU resources than intended.
A webhook might be a good way to implement this, but it would be good to study other solutions as well.